Gabriel Mocan

05/18/2023, 1:02 PM
Hello everybody, as noted on the repo, should we expect a hotfix (7.1.3) soon, even before 7.2? I’m having ingestion deadlock issues at a couple of customers where I’ve bumped the version to 7.1.2.
Nicolas Hourcard

05/18/2023, 1:03 PM
is this behaviour something you did not see on 7.1.1 and then saw on 7.1.2?
Gabriel Mocan

05/18/2023, 1:04 PM
Vlad

05/18/2023, 1:12 PM
hi Gabriel, this particular fix is for a long-standing issue where a countdown latch may get stuck in case of a race. It is unlikely to have manifested only in 7.1.2; it could be something else. That said, it would be good to clarify what exactly you bumped: the database or the database clients?
Gabriel Mocan

05/18/2023, 1:15 PM
Well, that might be a coincidence then. I have some charts:
During this IO overload the writer got completely stuck.
No error logs on the DB, except for those I’ve mentioned before.
These below. They keep flooding the log and the database goes into a spiral.
In this case, it recovered by itself without having to revert to 7.1.1.
Pretty much constant workload for both cases in this timeframe.
It’s important to note that I’ve been using QDB for less than a month, so maybe this issue was about to happen to me anyway; the 7.1.2 bump just triggered it.
And @Vlad, sorry, when I said “clients” I meant customers.
Vlad

05/18/2023, 1:54 PM
ah, got it. We will cut a patch today. How are you deploying this?
Gabriel Mocan

05/18/2023, 1:54 PM
Deploying on-premise using Docker (Swarm for orchestration).
Currently this is running in a dark-launch pool, so I have more freedom to test on this small set of customers.
Vlad

05/18/2023, 4:59 PM
cool, we have been digging into it and so far we cannot explain why your CPU shot up. Can I ask you to share the database logs that cover the CPU spike period, please? You can DM them to me.
Imre

05/18/2023, 10:50 PM
hi @Gabriel Mocan, have you tweaked any pool-size-related config by any chance? have you changed `cairo.reader.pool.max.segments`?
the `cannot close segment lock fd` errors, you said they are flooding the log. aren’t they just printed when a client connects, once for each walId?
i am asking these questions because, based on the screenshot you sent, you seem to have 900-1000 WalWriters.
how many client connections do you have? i know you open a new connection for each ILP message, but can you guess how many of these connections are concurrently open?
do you see these lines in QuestDB logs when running on 7.1.1?
2023-05-18T23:28:29.098914Z I i.q.c.p.WalWriterPool could not get, busy [table=`mytable~130`, thread=28, retries=2]
2023-05-18T23:28:30.215658Z I i.q.c.l.t.LineTcpMeasurementScheduler could not get table writer [tableName=mytable, ex=`table busy [reason=unknown]`]
Gabriel Mocan

05/19/2023, 12:21 PM
Hi @Imre, thanks for replying. Let’s go through your points:
• I didn’t touch `cairo.reader.pool.max.segments` at any point;
• the `cannot close segment lock fd` errors happen more frequently when IO overload is taking place;
• regarding the 900-1000 WalWriters, that most likely reveals that the ILP golang client may not be working properly (poor control over batch flush);
• I have no more than (number of exporters × 8) senders. In that particular case I have 6 exporters, meaning I create at most 48 concurrent senders (I can redesign that);
• I do not open a new connection for every single ILP message; I buffer them in a go channel and process them every minute;
• logs were clean on 7.1.1.
I have a question: can I use the Postgres COPY protocol to insert bulk data? That would actually solve all my problems, as it performs close to ILP but is much safer.
Imre

05/19/2023, 12:54 PM
hi @Gabriel Mocan, thank you. i had a theory which could have explained the behaviour described above, but the information you provided rules it out. it must be something else.
probably the best thing is for you to provide the logs Vlad asked for earlier, and we will take a look.
re COPY FROM, i think the closest we have is this: https://questdb.io/docs/reference/sql/copy/
Gabriel Mocan

05/19/2023, 12:55 PM
I’m working on a PoC here, at first I got:
ERROR: COPY is disabled ['cairo.sql.copy.root' is not set?] (SQLSTATE 00000)
Imre

05/19/2023, 12:56 PM
yes, it is disabled by default. you should set `cairo.sql.copy.root` in the config and restart.
files need to be in delimited text (CSV) format and placed into that directory. then you can send the COPY SQL command to QuestDB via the Postgres wire protocol or the REST API.
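For example, in the server config (the directory path here is only an illustration):

```ini
# server.conf: enable COPY by pointing it at a staging directory
cairo.sql.copy.root=/var/lib/questdb/copy-root
```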
Gabriel Mocan

05/19/2023, 12:59 PM
Right
Imre

05/19/2023, 12:59 PM
i guess you have the files on the client
Gabriel Mocan

05/19/2023, 1:04 PM
No luck.
2023-05-19T13:03:41.192231Z E i.q.c.p.PGConnectionContext error [pos=79, msg=`'with' expected`]
2023-05-19T13:03:41.192436Z E i.q.c.p.PGConnectionContext unknown message [type=99]
2023-05-19T13:03:41.541469Z E i.q.c.p.PGConnectionContext [-1] wrong circuit breaker secret [idx=61]
2023-05-19T13:03:42.069462Z E i.q.c.p.PGConnectionContext error [pos=93, msg=`'with' expected`]
2023-05-19T13:03:42.069619Z E i.q.c.p.PGConnectionContext unknown message [type=99]
@Imre my idea was to use https://github.com/jackc/pgx/issues/102; I was using this method to write to TimescaleDB with great performance.
Imre

05/19/2023, 1:05 PM
error [pos=79, msg=`'with' expected`]
looks like bad syntax
Gabriel Mocan

05/19/2023, 1:06 PM
That’s most likely the reason.
Imre

05/19/2023, 1:08 PM
if you have the file in the specified dir and the file is in the right format, COPY should work. it is very likely that you missed the `WITH` part of the `COPY` command. an example from the doc:
COPY weather FROM 'weather.csv' WITH HEADER true FORMAT 'yyyy-MM-ddTHH:mm:ss.SSSUUUZ' ON ERROR SKIP_ROW;
remote file upload is not supported, that is true
Gabriel Mocan

05/19/2023, 1:09 PM
I’m not building the query by hand, pgx lib is doing it
Works with vanilla postgres
Imre

05/19/2023, 1:09 PM
i do not think that syntax is supported
Gabriel Mocan

05/19/2023, 1:10 PM
No problem. I’ll fiddle more with ILP.
Imre

05/19/2023, 1:10 PM
can you send it as a plain SQL command?
not familiar with pgx
maybe something like this:
commandTag, err := conn.Exec("delete from widgets where id=$1", 42)
if err != nil {
    return err
}
conn.Exec or conn.Query
Gabriel Mocan

05/19/2023, 1:16 PM
This will work naturally, but it’s not ideal for bulk inserts. The idea behind using the COPY protocol for regular ingestion is to get performance on par with ILP.
If you read the issue I sent, you’ll get the idea even if you’re not familiar with golang.
Imre

05/19/2023, 1:28 PM
just read it through; unfortunately QuestDB does not support this functionality, and i am not sure it is even planned. there is a plan to implement a binary protocol for QuestDB clients, instead of the text-based ILP messages, but that is again something else. for now, ILP and the clients are the fastest way to ingest.
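For context, an ILP message is just one line of InfluxDB line protocol text: `table[,symbol=value...] field=value[,...] timestampNanos`. A minimal serializer sketch, with hypothetical table and column names (real clients also escape special characters in names and values, which this sketch omits):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// kv is an ordered key/value pair; ILP is order-preserving text, so we
// avoid maps and their random iteration order.
type kv struct {
	key string
	val string
}

// ilpLine renders one InfluxDB-line-protocol message:
//   table[,symbol=value...] field=value[,field=value...] timestampNanos
func ilpLine(table string, symbols []kv, fields []kv, tsNanos int64) string {
	var b strings.Builder
	b.WriteString(table)
	for _, s := range symbols {
		b.WriteString("," + s.key + "=" + s.val)
	}
	b.WriteByte(' ')
	for i, f := range fields {
		if i > 0 {
			b.WriteByte(',')
		}
		b.WriteString(f.key + "=" + f.val)
	}
	b.WriteString(" " + strconv.FormatInt(tsNanos, 10) + "\n")
	return b.String()
}

func main() {
	line := ilpLine("trades",
		[]kv{{"sym", "BTCUSD"}},
		[]kv{{"price", "26100.5"}},
		1684497600000000000)
	fmt.Print(line)
	// prints: trades,sym=BTCUSD price=26100.5 1684497600000000000
}
```

In practice the official clients handle this formatting (and escaping) for you; the sketch only shows why ILP ingestion is cheap: it is plain text appended to a socket.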
Gabriel Mocan

05/19/2023, 1:30 PM
Nice, no problem, I’ll make adjustments to my write logic.
@Imre I’ve redone the whole write logic, focusing on reducing concurrent table access and also avoiding O3 writes… It’s running like silk now.
I’m sorry for titling this thread the way I did, but this issue really did show up for me only after 7.1.2. It’s better now because I have better sending logic.
Thanks @Vlad for being available to look at my logs and such. You guys rock!
Eventually these still pop up though, but they’re not flooding or looping like before.
Vlad

05/19/2023, 8:08 PM
hey Gabriel, this should be fixed in 7.1.3, which we just released, along with the fix for the “stuck” bug.
Gabriel Mocan

05/19/2023, 8:19 PM
Nice @Vlad, will try right away.
Good thing I optimized my code though.
Things are running very smoothly now!!! 7.1.3 deployed.
Logs are clean as well.