#users-public
Ajay Pilaniya

11/01/2022, 3:36 AM
Hi... We have ingested 20 million rows into one of our tables. One column in this table is user_id and another is request_id. Distinct count queries on user_id are taking around 20 seconds, and to create an index on these columns I need to define them as SYMBOLs, but as per the documentation I need to define a capacity for them. My question is: how can I define a capacity for keys like user_id? Today I can define the capacity at 1 million, but what if tomorrow we get 5 million more new users? In that case, how would I redefine the capacity? Would I need to monitor it actively and, when I see a surge, drop the older index and create a new one?
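
For reference, a symbol capacity is normally declared when the column is created. A minimal sketch, assuming a hypothetical table layout (the table name, column set and capacity below are illustrative, not taken from this thread):

-- Hypothetical schema: symbol capacity and index declared up front
CREATE TABLE user_activity (
    ts TIMESTAMP,
    user_id SYMBOL CAPACITY 1000000 CACHE INDEX,  -- dictionary-encoded and indexed
    request_id STRING,
    segment_count LONG
) TIMESTAMP(ts) PARTITION BY DAY;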
Andrey Pechkurov

11/01/2022, 6:43 AM
Hi Ajay. Do you have a distinct user id per row, or do multiple rows belong to the same user?
Ajay Pilaniya

11/01/2022, 6:45 AM
@Andrey Pechkurov Multiple rows can have the same user. Basically we are storing user activities, so for 100k users there can be 1 million rows but only 100k distinct user ids.
Andrey Pechkurov

11/01/2022, 6:47 AM
Are those ids strings or integers?
6:47 AM
If they are integers, you could use the int or long type instead of symbol
6:48 AM
If they're strings, you should configure your symbol capacity with some extra headroom for future growth
6:48 AM
In general, if you want to increase symbol column capacity, you should do the following: https://questdb.io/docs/troubleshooting/faq/#how-do-i-convert-a-string-column-to-a-symbol-or-vice-versa
6:49 AM
The only thing to keep in mind is that UPDATE may take a while on large tables
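
In practice, the procedure from that FAQ entry boils down to something like the sketch below, using the table and column names that come up later in this thread; the temporary column name and the new capacity value are only illustrative:

-- 1. Add a replacement column with a larger capacity
ALTER TABLE content_analyticsv2 ADD COLUMN user_id_new SYMBOL CAPACITY 5000000;
-- 2. Copy the data over (this UPDATE is the slow part on large tables)
UPDATE content_analyticsv2 SET user_id_new = user_id;
-- 3. Swap the columns
ALTER TABLE content_analyticsv2 DROP COLUMN user_id;
ALTER TABLE content_analyticsv2 RENAME COLUMN user_id_new TO user_id;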
Ajay Pilaniya

11/01/2022, 6:53 AM
Ok so from this link I understand that to increase symbol capacity I need to add a new column with new capacity. Am I right?
6:57 AM
@Andrey Pechkurov Also, we have another column, request_id, but this is a high cardinality column. For 1 million rows there will be around 800-900k unique request ids. We also need to find the distinct count on this column. The request id value is a string. Since this column's distinct values keep growing, how can we handle distinct queries? We cannot define a capacity for it correctly, and without symbol we cannot create an index. How do we handle this scenario?
Andrey Pechkurov

11/01/2022, 7:19 AM
Ok so from this link I understand that to increase symbol capacity I need to add a new column with new capacity. Am I right?
Yes, that's correct
7:21 AM
Also we have another column request_id but this is high cardinality column.
Using symbol for a high cardinality column isn't the best choice. I'd suggest using string or any other suitable type. As for the distinct queries, that's tricky - the query execution time will grow with the volume of your data. You could try running the query over a fixed time interval (say, the last week) if that's acceptable. This way the execution time won't grow
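
One way to read that suggestion as a query, using the table name that appears later in the thread and assuming the designated timestamp column is called ts (the thread never names it, so that part is a guess):

SELECT count_distinct(request_id)
FROM 'content_analyticsv2'
WHERE ts > dateadd('d', -7, now());  -- only scan the last week of data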
Ajay Pilaniya

11/01/2022, 8:17 AM
@Andrey Pechkurov One more doubt: how does QuestDB handle multiple filters? As far as I can see in the documentation we can create an index on a single symbol column, but what if I have to filter queries on multiple columns? Most relational databases have composite indexes to handle such scenarios. Does QuestDB support composite indices?
Andrey Pechkurov

11/01/2022, 8:35 AM
No support for composite indexes, at least for now. In most cases, the database picks one of the indexes and applies the filter to the rows returned from the index. @Bolek Ziobrowski is working on EXPLAIN, which should let you see the decisions made by the query engine.
8:36 AM
If you have enough RAM to fit the data working set, you should consider dropping the indexes. Once the data is in the page cache, our multi-threaded SQL filter will be faster than an index-based execution plan.
8:37 AM
If the RAM isn't big enough, indexes make sense as long as they're selective.
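
If you do go the index-free route, dropping (and later restoring) an index is a column-level ALTER. A small sketch against the table and indexed column mentioned elsewhere in this thread:

-- Drop the index; filters fall back to the multi-threaded SQL filter
ALTER TABLE content_analyticsv2 ALTER COLUMN user_id DROP INDEX;
-- Re-create it later if the working set stops fitting in RAM
ALTER TABLE content_analyticsv2 ALTER COLUMN user_id ADD INDEX;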
Ajay Pilaniya

11/01/2022, 8:43 AM
@Andrey Pechkurov The query below is taking more than 2 seconds, and since there are no composite indices I am not sure if further optimisation is possible. Currently I have around 8 million records in my table and I have an index on user_id.
SELECT user_device, user_device_brand, useros, sum(segment_count) AS views, count_distinct(user_id)
FROM 'content_analyticsv2'
GROUP BY user_device, user_device_brand, useros
ORDER BY views DESC LIMIT 10;
Is there any way to improve the response time of such GROUP BY queries? Most of our use cases are built around GROUP BY queries.
Andrey Pechkurov

11/01/2022, 8:50 AM
Your query doesn't have a filter, so an index wouldn't help. What is the HW you're using?
Ajay Pilaniya

11/01/2022, 8:51 AM
@Andrey Pechkurov We are hosting it on AWS r5.2xlarge server.
Andrey Pechkurov

11/01/2022, 9:11 AM
So, you have 64GB RAM. How big is the table, in GB?
Ajay Pilaniya

11/01/2022, 9:12 AM
@Andrey Pechkurov Yes. How can I find the table size? Are there any queries that can help find it?
Andrey Pechkurov

11/01/2022, 9:15 AM
Just open the database root directory and check db/<your_table_name>
9:15 AM
The root dir is usually at ~/.questdb
Ajay Pilaniya

11/01/2022, 9:22 AM
@Andrey Pechkurov It's around 4.5 GB
Andrey Pechkurov

11/01/2022, 9:29 AM
I'd certainly suggest removing the indexes
Ajay Pilaniya

11/01/2022, 9:30 AM
@Andrey Pechkurov Ok. What about the query taking 2+ seconds when using GROUP BY? All those columns are of SYMBOL type. Is there any way we can improve that?
Andrey Pechkurov

11/01/2022, 9:40 AM
Do you have any indexes on the table?
9:42 AM
I was wrong when I mentioned that we don't use indexes on such queries. In fact, we may use them
Ajay Pilaniya

11/01/2022, 9:42 AM
@Andrey Pechkurov Currently I have one index on partner_id. When I query with a partner_id filter I get a response in 500ms.
9:43 AM
But since I have 4 partners and the data is evenly divided, a partner_id filter only queries 1/4th of the total data, so that may be the reason for the better performance.
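
For reference, the filtered variant being described would look roughly like the sketch below; the partner value is hypothetical:

SELECT user_device, user_device_brand, useros, sum(segment_count) AS views, count_distinct(user_id)
FROM 'content_analyticsv2'
WHERE partner_id = 'partner_1'  -- uses the partner_id index, so only ~1/4 of the rows are scanned
ORDER BY views DESC LIMIT 10;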
Andrey Pechkurov

11/01/2022, 9:57 AM
Which QuestDB version are you using BTW?
Ajay Pilaniya

11/01/2022, 9:57 AM
@Andrey Pechkurov 6.5.4
Andrey Pechkurov

11/01/2022, 10:03 AM
So, the latest one. All good
10:04 AM
I've asked since there were some improvements in GROUP BY performance in 6.5.4
10:05 AM
Could you try running
SELECT user_device,user_device_brand,useros, sum(segment_count) as views
FROM 'content_analyticsv2'
ORDER BY views DESC LIMIT 10;
and share the execution time?
10:06 AM
BTW you don't have to define an explicit GROUP BY clause. The database is smart enough to understand which columns should be grouped
10:07 AM
The difference from your query is that there's no count_distinct aggregate
Ajay Pilaniya

11/01/2022, 10:11 AM
@Andrey Pechkurov I am getting inconsistent behaviour with this query. On the first run the execution time was 900ms, then on the second run it was 1.8 seconds. I ran the query 15 times; 2 times the response was below 1 second, the rest were around 1.8 seconds.
Andrey Pechkurov

11/01/2022, 10:16 AM
then on second run it was 1.8 seconds
This is very weird. Are you running any other workloads or SW on the server?
Ajay Pilaniya

11/01/2022, 10:16 AM
@Andrey Pechkurov Some inserts are being performed. Should I stop the ingestion job and try again?
Andrey Pechkurov

11/01/2022, 10:17 AM
INSERTs via PGWire?
Ajay Pilaniya

11/01/2022, 10:17 AM
@Andrey Pechkurov Yes
Andrey Pechkurov

11/01/2022, 10:17 AM
How many connections are you using for the INSERTs?
Ajay Pilaniya

11/01/2022, 10:17 AM
One Connection Only. Every 10 minutes I am inserting 200k records
Andrey Pechkurov

11/01/2022, 10:18 AM
Long-running INSERTs, or lots of them, may lead to query slowdown
10:18 AM
Every 10 minutes I am inserting 200k records
Is it in a single transaction?
10:18 AM
Are rows in-order or out-of-order?
Ajay Pilaniya

11/01/2022, 10:20 AM
@Andrey Pechkurov I am inserting in batches of 10k, so 15 transactions. Between every transaction I have a delay of 5 seconds. Since this is dummy data, I am inserting rows that are one week old (basically now() - oneWeek).
Andrey Pechkurov

11/01/2022, 10:23 AM
If you have lots of O3 (out-of-order) inserts, this may keep all worker threads busy
10:23 AM
Please try stopping the ingestion and then re-run the queries
10:23 AM
If this helps, you should consider using an ILP client to ingest the data. It's more efficient than INSERT over PGWire
Ajay Pilaniya

11/01/2022, 10:24 AM
@Andrey Pechkurov I think the heavy insertions were causing the delay. I just stopped the ingestion and am now getting results in 500ms consistently.
Andrey Pechkurov

11/01/2022, 10:25 AM
Could you also check your original query?
Ajay Pilaniya

11/01/2022, 10:25 AM
Yes, we will use ILP in the longer run, but for load testing we used PGWire since we already had it integrated in our codebase
10:25 AM
@Andrey Pechkurov For the original query I'm also getting a response in 550ms. Sometimes even below 500.
Andrey Pechkurov

11/01/2022, 10:35 AM
That's much closer to what I'd expect to see
10:35 AM
Are you happy with this execution time? If not, what're your expectations?
Ajay Pilaniya

11/01/2022, 10:37 AM
@Andrey Pechkurov For now it seems fine to me. By tomorrow evening I should have around 50 million rows and then I can say for sure whether it meets expectations.
Andrey Pechkurov

11/01/2022, 10:47 AM
OK, sounds good. Please let me know how it goes
Ajay Pilaniya

11/02/2022, 3:22 PM
@Andrey Pechkurov We have ingested around 50 million rows into the table. Most of the queries seem fine to me, but one thing that is bugging me is the response time of count_distinct on non-symbol columns. I have a high cardinality column and count_distinct is taking around 12 seconds. Is there any way we can improve this?
Andrey Pechkurov

11/02/2022, 3:39 PM
Hi Ajay
3:39 PM
Is it a string column?
3:40 PM
count_distinct also supports int/long types, and it should work faster than on a string column
3:41 PM
If the string type is a must-have, could you create a GH issue with a description of your scenario and the execution times? Maybe that's something we could optimize in one of the future releases
Ajay Pilaniya

11/03/2022, 4:25 AM
@Andrey Pechkurov It is a string column (we generate a unique string (UUID) for every request). I will log it as a GH issue. Also, I had one more question: is there any limit on symbol column capacity? And how do I decide which columns should be symbols? I have some columns with 10k distinct values, some with 100k distinct values and some with ever-growing distinct values. How do I decide which column can be treated as symbol?
Andrey Pechkurov

11/03/2022, 6:38 AM
No enforced limit
6:40 AM
How do I decide which column can be treated as symbol?
The rule of thumb is the following: if you know that column values are repeated across multiple rows, use symbol. Symbol is a dictionary type, so you save disk space - internal symbol codes (dictionary ids) are stored for each row instead of the repeated strings
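
Applied to the kinds of columns discussed in this thread, that rule of thumb might translate into something like this; the column names below are hypothetical and the capacity is only illustrative:

-- Repeated, bounded value sets -> symbol; near-unique, ever-growing values -> string
ALTER TABLE content_analyticsv2 ADD COLUMN device_model SYMBOL CAPACITY 20000;  -- ~10k distinct values, heavily repeated
ALTER TABLE content_analyticsv2 ADD COLUMN session_id STRING;                   -- almost one distinct value per row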
Ajay Pilaniya

11/03/2022, 8:00 AM
@Andrey Pechkurov One more thing... Is there any documentation on scaling strategies around QuestDB? Can we have multi node setup in it?
Andrey Pechkurov

11/03/2022, 8:34 AM
Can we have multi node setup in it?
Not yet, but we're working on replication now.
8:35 AM
Until it's released, your only option is application-level sharding/replication.
Ajay Pilaniya

11/03/2022, 8:39 AM
@Andrey Pechkurov Okay got it