#users-market-data
R D

08/25/2022, 5:04 AM
I have a question (not exclusively applicable to QuestDB) that I'm hoping I can get some help on. I have a process that receives tick data from a websocket and writes it to QuestDB. Sometimes this process needs to be updated (e.g. to start tracking new exchange listings). What can I do to ensure that there are both no gaps and no duplicates in my data? I'd guess a first step would be to spin up another instance of this producer process before shutting down the first one. The problem is that there will be some time during which both processes are active, meaning the second process cannot simply write to the same db table as the first, as that would result in duplicated data. I also can't just hash each row and check whether that hash already exists, since two trades with the same price and size can occur at the same time, producing identical hashes without being identical entries.
Zahlii

08/25/2022, 6:10 AM
Can't you solve this with some kind of shared lock or message queue? Depending on what OS and setup you use, multiple processes could share e.g. a file lock, which you could use to synchronize/notify the second process when the first is about to go down?
R D

08/25/2022, 8:41 PM
On shutdown, in the SIGTERM handler, each process writes the timestamp of the last tick it sent to the database to the lockfile and releases the lock. On startup, each writer process begins storing market data in a buffer and checks the lock at fixed intervals (say, every 100 ticks added to the buffer). When it can acquire the lock, it writes any ticks in the buffer with a timestamp after the one in the lockfile to QuestDB and clears the buffer. If it already holds the lock, it just writes all ticks in the buffer to QuestDB and clears the buffer. Once it holds the lock, the process does not need to release it until it receives SIGTERM, at which point it writes the timestamp of the last tick it sent to QuestDB to the lockfile, releases the lock, and terminates. So I could start process 1, start process 2, kill process 1, and theoretically there should not be any gaps in the data, since process 2 will only write ticks with a timestamp after the last one process 1 already wrote. However, this does not cover the edge cases where two ticks occur at the same time, or where ticks arrive out of order from the exchange.
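A minimal sketch of this lockfile handoff in Python, assuming POSIX flock semantics; LOCK_PATH, write_to_questdb(), and the tick layout are placeholder names for illustration, not part of the original discussion:

```python
import fcntl
import signal
import sys

LOCK_PATH = "/var/run/tick_writer.lock"  # hypothetical lockfile path


def write_to_questdb(ticks):
    """Hypothetical sink; in practice e.g. a QuestDB ILP sender."""
    ...


class TickWriter:
    def __init__(self):
        self.fd = open(LOCK_PATH, "a+")  # create the lockfile if missing
        self.have_lock = False
        self.cutoff = None        # last timestamp the old process persisted
        self.buffer = []          # ticks received while waiting for the lock
        self.last_written = None  # last timestamp this process persisted
        signal.signal(signal.SIGTERM, self.on_sigterm)

    def try_acquire(self):
        try:
            fcntl.flock(self.fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return                # old process still holds the lock
        self.have_lock = True
        self.fd.seek(0)
        raw = self.fd.read().strip()
        self.cutoff = float(raw) if raw else None

    def on_tick(self, tick):      # tick assumed to be (timestamp, price, size)
        self.buffer.append(tick)
        if not self.have_lock and len(self.buffer) % 100 == 0:
            self.try_acquire()    # poll the lock every 100 buffered ticks
        if self.have_lock:
            # drop anything at or before the old process's cutoff
            pending = [t for t in self.buffer
                       if self.cutoff is None or t[0] > self.cutoff]
            if pending:
                write_to_questdb(pending)
                self.last_written = pending[-1][0]
            self.buffer.clear()

    def on_sigterm(self, signum, frame):
        # Record the last persisted timestamp for the next process,
        # then release the lock and exit.
        if self.have_lock and self.last_written is not None:
            self.fd.seek(0)
            self.fd.truncate()
            self.fd.write(str(self.last_written))
            self.fd.flush()
            fcntl.flock(self.fd, fcntl.LOCK_UN)
        sys.exit(0)
```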
Zahlii

08/26/2022, 7:44 AM
I don't think you need to actually track the timestamp as part of the lock, or am I seeing this too simply? Process 1 will happily send stuff and at some point release the lock. As soon as the lock is acquired by the next process, it can start streaming packets that arrived after it acquired the lock, independent of the timestamps? This should work as long as your lock acquisition is faster than the update interval of the inbound data. Alternatively, you could store raw inbound data in another service/message queue and fetch from there?
R D

08/27/2022, 6:29 PM
Ideally I would not rely on performance characteristics. Can you elaborate on using a message queue to deal with this?
Zahlii

08/27/2022, 9:22 PM
You write an extremely thin client which just fetches raw data and stores that into a queue, and all clients can fetch from that queue instead
9:22 PM
Then your queue service just needs to track which messages have been fetched
9:22 PM
But that doesn't address your main concern
9:22 PM
How many messages are you receiving?
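A minimal sketch of the thin-client-plus-queue setup Zahlii describes, assuming a Redis list stands in for the message queue; the queue key and handler names are placeholders:

```python
import json
import redis  # assumption: redis-py with a local Redis as the queue

r = redis.Redis()
QUEUE_KEY = "ticks:raw"  # hypothetical queue name


def on_raw_message(msg: bytes) -> None:
    # The thin ingestion client: no parsing, no DB logic, just enqueue
    # the raw websocket frame as-is.
    r.rpush(QUEUE_KEY, msg)


def writer_loop() -> None:
    # Downstream writers pop from the same list; BLPOP delivers each
    # message to exactly one consumer, so restarting a writer does not
    # duplicate rows on the consumer side.
    while True:
        _, raw = r.blpop(QUEUE_KEY)
        tick = json.loads(raw)
        ...  # parse and write to QuestDB here
```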
R D

08/28/2022, 3:32 AM
But wouldn't this just shift the responsibility to the client that fetches raw data from the exchange and writes it to the queue? There still needs to be a way for that client to be restarted while another client picks up the slack in some sort of handoff, and for their writes to the queue to be synchronized such that they catch everything sent from the exchange exactly once.
Zahlii

08/28/2022, 3:12 PM
Yes, that's what I meant with "it doesn't solve the main problem" 😄 Maybe you can instead have a shared Unix socket between these processes: once the first is about to go down, it writes a TERMINATING to it, which the second process receives and answers with an ACK. Immediately after sending the ACK, process 2 starts transmitting live data, while process 1 stops sending updates once it receives the ACK. I'm not sure about the exact specs, but I think communication via socket should be near-instant, so no messages are lost (unless you are receiving billions of rows each second).
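A minimal sketch of that TERMINATING/ACK handoff over a Unix domain socket; SOCK_PATH and the function names are placeholders. Note that in the instant between the ACK being sent and received, both processes may briefly write, so this scheme trades gaps for possible momentary duplicates:

```python
import os
import socket

SOCK_PATH = "/tmp/tick-handoff.sock"  # hypothetical rendezvous path


def wait_for_handoff() -> None:
    """Run by the NEW process at startup: listen until the old process
    announces shutdown, then take over the write path."""
    if os.path.exists(SOCK_PATH):
        os.unlink(SOCK_PATH)
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    srv.bind(SOCK_PATH)
    srv.listen(1)
    conn, _ = srv.accept()       # old process connects from its SIGTERM handler
    # Sketch assumes the short message arrives in one read on a local socket.
    assert conn.recv(64) == b"TERMINATING"
    conn.sendall(b"ACK")         # from here on, this process writes live data
    conn.close()
    srv.close()


def hand_off() -> None:
    """Run by the OLD process inside its SIGTERM handler."""
    c = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    c.connect(SOCK_PATH)
    c.sendall(b"TERMINATING")
    c.recv(8)                    # block until ACK, then stop writing and exit
    c.close()
```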