
Alex Lourie

05/21/2023, 10:56 AM
I'm looking for guides to optimise QuestDB settings for reading - so far almost all the blogs/tech articles I can find are about ingesting rather than reading. Our dataset is roughly 2M devices with yearly data at 30m intervals, so it's about 17k data entries per device. We're fetching this data, but it seems slow: about 30 seconds per device (those 17k rows). The query is straightforward:
SELECT column1, column2 WHERE id=X ORDER BY TIMESTAMP
The ID is an indexed symbol and TIMESTAMP is the designated timestamp. The data is partitioned by DAY. We use a system with 128 GB of memory and 16 vCPUs, with 10 workers. What can we do to improve the performance significantly?
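For reference, the table looks roughly like this - the table name, column names, types and capacity below are placeholders, not our real schema:
CREATE TABLE readings (
    id SYMBOL CAPACITY 2000000 INDEX,   -- ~2M distinct devices (placeholder capacity)
    column1 DOUBLE,
    column2 DOUBLE,
    ts TIMESTAMP
) TIMESTAMP(ts) PARTITION BY DAY;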

Gabriel Mocan

05/21/2023, 11:20 AM
I would suggest a couple of things:
• Make sure your tables are created with the expected symbol capacities defined, and follow the other good practices at https://questdb.io/docs/operations/design-for-performance/
• Try your query like:
SELECT column1, column2 WHERE id=X SAMPLE BY 30m
If you’re only interested on latest values, check out
LATEST ON
keyword.
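For example, something along these lines (assuming a hypothetical readings table with designated timestamp ts, as sketched above):
SELECT column1, column2, ts
FROM readings
LATEST ON ts PARTITION BY id;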

Alex Lourie

05/21/2023, 11:23 AM
Thanks, @Gabriel Mocan. I have tried, I think, everything on that page. Additionally, I run the DB with capacities set up at runtime. However, I wonder: would SAMPLE BY provide any significant benefit compared to not using it at all?

Bolek Ziobrowski

05/22/2023, 7:27 AM
@Alex Lourie Selecting all rows by an indexed id means doing ~17k * number_of_columns random reads on the data. If a significant chunk of the data is not in RAM and the disk is e.g. EBS, things are going to slow down - EBS is quite slow compared to NVMe SSDs. I wouldn't expect replacing ORDER BY with SAMPLE BY to have a visible impact because the number of rows to sort is small. I think you can try the following:
• switch to a device that's better at random IO
• use a compressed filesystem to fit more data in RAM
• manually shard the table by id (e.g. id % 100) and remove the index

Alex Lourie

05/22/2023, 7:30 AM
@Bolek Ziobrowski Thank you for that. The disks we're using are "Premium SSD", whatever that means in an Azure context, with a max IOPS of 7500. Is that a good indication of anything?
Also, our dataset is huge, we can't possibly have it all in RAM, so we'll have to read off the disks

Bolek Ziobrowski

05/22/2023, 7:48 AM
Due to disk latency, the random IOPS a single thread can do is usually way below the max. For example, with Amazon EBS it's possible to set the max to 16k, but a single thread won't be able to do more than a few thousand per second. With filesystems like ZFS it's possible to shrink the data size by 2-3 times. Only the columns used (id, timestamp, column1, column2) matter for this query. If there are more columns in the table, it might turn out you need just a small piece of the dataset in RAM. Is that the main query in your workload?
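As a rough back-of-envelope to illustrate the per-thread limit (an estimate only): 17k rows * 3 data columns (timestamp, column1, column2) is roughly 51k random reads per device, and at ~1k random reads per second per thread that's on the order of tens of seconds - the same ballpark as the ~30 seconds per device you're seeing.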

Alex Lourie

05/22/2023, 8:02 AM
No... Basically we have this huge dataset that we need to fetch from a 3rd party application. That application will use a single thread to fetch the data for ~1000 devices, with about 17k entries per device. And then we can run multiple threads, each doing their own set of 1k devices / 17k entries.
so we're trying to find the best way to optimise running QuestDB for this scenario.
and we're looking for a way to scale it
Right now we can run 10 sets in about 4.5 minutes, which is OK, but not great. It feels like the more threads we run on a client, the more they get in each other's way rather than help.

Bolek Ziobrowski

05/22/2023, 8:24 AM
Does your application only need to process the new data fetched from the 3rd party app, or the whole data set all the time?

Alex Lourie

05/22/2023, 8:24 AM
Only the data it fetched. The dataset is not changing for most of the app run
it's just that there's a lot of it
Or, sorry, maybe I misread that. The 3rd party app is the application
it's 3rd party to QuestDB

Bolek Ziobrowski

05/22/2023, 8:26 AM
Hmm, could you maybe fetch it into a temporary table that doesn't have an index? Querying 17M rows (if I get the number right) should be much faster than with the whole table.

Alex Lourie

05/22/2023, 8:26 AM
Well, let me just rewind a little then.
We have a big data set of 1.7M devices with 17k time-series entries each. Our application needs to fetch all of that from the DB.
The application runs threads; each such thread will fetch the data for about 1000 of those devices
each such thread is a set
So the question is how we optimise QuestDB to serve this use case as fast as possible. We are keen to scale this out to many threads/clients if it helps
but from our experience so far, QuestDB gets saturated with IO pretty fast

Bolek Ziobrowski

05/22/2023, 8:30 AM
Do you fetch all columns from the table?

Alex Lourie

05/22/2023, 8:31 AM
no, only the timestamp and 2 other columns

Bolek Ziobrowski

05/22/2023, 9:04 AM
All right. Reading 17k entries for a single device through the index means potentially doing 17k * 3 columns random reads (ignoring index overhead). QuestDB being a column store means that:
• the query doesn't need to fetch data for columns it doesn't use
• column data for a given row is not physically close on disk or in RAM, so reading it requires more random reads
If the dataset is big then most of these will have to be read from disk. If you compare that to the disk limits (7.5k ops/s total, but likely ~1k per thread), it's not strange that it's slow. Reading this amount of data via random reads is a bad idea in general - it would be slow in most databases unless you could adjust the table row order relative to the column you use for querying. In QuestDB you can't cluster rows around a symbol value because table rows are ordered by timestamp (and partitioned). What could help is splitting the table into 100 or so tables by e.g. id % 100 and getting rid of the index. That would mean a full scan on 1/100th of the table and would likely be faster than the current approach because:
• indexes take a lot of space in cache that could be used by data
• sequential reads' throughput is much bigger than random reads'
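A rough sketch of that layout, with made-up table names and capacities (the shard number, e.g. hash(id) % 100, would be computed by the application when writing and reading):
-- one table per shard, same shape as the original but without the symbol index
CREATE TABLE readings_shard_07 (
    id SYMBOL CAPACITY 20000,   -- ~2M devices / 100 shards; note: no INDEX
    column1 DOUBLE,
    column2 DOUBLE,
    ts TIMESTAMP
) TIMESTAMP(ts) PARTITION BY DAY;
-- the per-device query then scans only that one shard
SELECT column1, column2 FROM readings_shard_07 WHERE id = 'X' ORDER BY ts;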
Another approach would be to periodically 'merge' rows per id to minimize the IO required for old/static data. You'd have two (or more) tables, e.g.:
• data_old - a non-partitioned table with no designated timestamp, storing data older than 1 month, with rows ordered by id (you'd need to insert the data in this order)
• data_latest - a table storing the last month's worth of data (can be partitioned)
The query would look like:
SELECT * FROM (
  SELECT column1, column2, timestamp FROM data_old WHERE id = X
  UNION ALL
  SELECT column1, column2, timestamp FROM data_latest WHERE id = X
) ORDER BY timestamp;
Fetching data from the old table would be much faster because the IOs wouldn't really be random but sequential. Fetching data from the new table would be random, but the scope would be much smaller. Every month you'd move data from data_latest to data_old, ordered by device id. This way you'd get sequential reads for old data and far fewer random reads for new data, as it would be stored in batches of 30 days * 48 = 1440 rows. This approach assumes this query is the most important in the workload and that you don't need time-series optimisations/partitioning for data_old.
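A sketch of what the monthly move could look like, reusing the data_old / data_latest names from above (the cutoff expression and column list are illustrative, not a tested recipe):
-- append rows older than one month to data_old, ordered by id so old data stays clustered per device
INSERT INTO data_old
SELECT id, column1, column2, timestamp
FROM data_latest
WHERE timestamp < dateadd('M', -1, now())
ORDER BY id, timestamp;
-- then drop the moved partitions from data_latest
ALTER TABLE data_latest DROP PARTITION WHERE timestamp < dateadd('M', -1, now());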

Alex Lourie

05/22/2023, 12:41 PM
this is very helpful, thanks @Bolek Ziobrowski
I'm not sure if it's practical for us to split the data into 100 tables, but otherwise, good hints