Alex Lourie
05/21/2023, 10:56 AM
SELECT column1, column2 WHERE id=X ORDER BY TIMESTAMP
The ID is an indexed symbol and TIMESTAMP is a dedicated timestamp. The data is partitioned by DAY. We use a system with 128 GB of memory and 16 vCPUs, with 10 workers.
What can we do to improve the performance significantly?

Gabriel Mocan
05/21/2023, 11:20 AM
SELECT column1, column2 WHERE id=X SAMPLE BY 30m
If you're only interested in the latest values, check out the LATEST ON keyword.

Alex Lourie
05/21/2023, 11:23 AM
Does SAMPLE BY provide any significant benefits compared to not using it at all?

Bolek Ziobrowski
05/22/2023, 7:27 AM
I wouldn't expect order by with sample by to have a visible impact, because the number of rows to sort is small.
I think you can try the following:
• switch to a device that's better at random IOs
• use a compressed filesystem to fit more data in RAM
• manually shard the table by id (e.g. id % 100) and remove the index
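To make the sharding idea in the last bullet concrete, here is a rough sketch in generic SQL. All names (data, data_shard_42) are assumptions, and the exact DDL would depend on the database; QuestDB's own syntax may differ:

```sql
-- Sketch: split the table into 100 shards by id so that all rows for
-- a given id live together. A per-id scan then becomes mostly
-- sequential IO instead of index-driven random IO.
-- Table and column names here are hypothetical.
CREATE TABLE data_shard_42 AS (
    SELECT * FROM data WHERE id % 100 = 42
);

-- A lookup for one id now touches a single, much smaller table,
-- and no index is needed:
SELECT column1, column2
FROM data_shard_42
WHERE id = X
ORDER BY timestamp;
```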

05/22/2023, 7:30 AM – 8:31 AM
[Alex Lourie and Bolek Ziobrowski exchanged several messages here whose contents were not captured in this transcript.]

Bolek Ziobrowski
05/22/2023, 9:04 AM
select * from (
  SELECT column1, column2 FROM data_old WHERE id=X
  union all
  SELECT column1, column2 FROM data_latest WHERE id=X
) order by timestamp
Fetching data from the old table would be much faster because the IOs wouldn't really be random but sequential.
Fetching data from the new table would still be random, but the scope would be much smaller.
Every month you'd move data from data_latest to data_old, ordered by device_id. This way you'd get sequential reads for old data, and far fewer random reads for new data, as it would be stored in batches of 30 days * 48 = 1440 rows (48 half-hour samples per day).
This approach assumes this query is the most important one in the workload and that you don't need timeseries optimisations/partitioning for data_old.

Alex Lourie
05/22/2023, 12:41 PM
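The two-table scheme Bolek describes could be sketched as below. This is generic SQL, not a definitive implementation: the table names come from the thread, the cutoff date is a placeholder, and the maintenance syntax varies by database.

```sql
-- Monthly maintenance (sketch): append last month's rows to data_old,
-- physically ordered by id, so per-id reads on old data are sequential.
INSERT INTO data_old
SELECT * FROM data_latest
WHERE timestamp < '2023-05-01'   -- placeholder cutoff
ORDER BY id, timestamp;

-- Then drop the moved rows from data_latest (syntax varies by database).
DELETE FROM data_latest WHERE timestamp < '2023-05-01';

-- Serving query: union both tables, then sort the small per-id result.
SELECT * FROM (
    SELECT column1, column2, timestamp FROM data_old    WHERE id = X
    UNION ALL
    SELECT column1, column2, timestamp FROM data_latest WHERE id = X
) ORDER BY timestamp;
```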