a

Ajay Pilaniya

12/26/2022, 7:28 PM
Hi… So we went live with QuestDB in production but we are facing one issue with UPDATE queries. Basically we run jobs that ingest rows into a QuestDB table, but sometimes these jobs fail due to an I/O error or some other error. In these cases we have a retry mechanism to re-ingest the data. The problem is: what if during the first run we ingested 5k rows out of 10k? We don't want to re-ingest the same rows during the retry. To handle this we have a boolean flag isValid and we only consider rows where isValid = true. So during a retry we basically run the query
UPDATE isValid = false where JobId = #{currentJobID}
so basically we mark all rows with the current job Id as false and re-ingest the data. This way, even if we have duplicate rows, the older rows are marked false and will not be considered further down our pipeline. The problem is that when one table reached 2 million rows I started getting a timeout error on this UPDATE query. (
PSQLException: ERROR: Timeout expired on waiting for the async command execution result
) I am using PGWire. I am not sure how I can handle this error, or whether there is a better approach to this scenario. Any help is appreciated.
n

Nicolas Hourcard

12/26/2022, 9:15 PM
hey @Ajay Pilaniya we will look into this, thanks for reporting the behaviour
a

Andrey Pechkurov

12/26/2022, 9:16 PM
Hi Ajay, Do you have an index on the JobId column?
That could help. One more thing that could speed up the UPDATE a lot is an additional filter on the designated timestamp, if that applies to your batches. Something like:
UPDATE my_table SET isValid = false where JobId = #{currentJobID} and ts in '2022-12-26';
BTW how many rows do you have per partition? UPDATE re-writes only the column specified in SET, but if the partition is very big, it may become a problem. In this case, changing partitioning strategy to something more granular might be a good idea, e.g. PARTITION BY DAY instead of PARTITION BY MONTH.
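For readers following along, a minimal sketch of such a time-bounded UPDATE issued from Java over PGWire via JDBC. The table and column names follow the ones that appear later in this thread (cdn_logs, JobId, ts, isValid); the job id and time range are placeholders, and the to_timestamp-style binding mirrors the query Ajay shares further down:
```
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class InvalidateJobRows {
    public static void main(String[] args) throws Exception {
        // QuestDB PGWire endpoint with the default credentials; adjust as needed
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:8812/qdb", "admin", "quest")) {
            // Restricting the UPDATE to the batch's time range keeps the number
            // of affected partitions (and the rewrite work) small
            String sql = "UPDATE cdn_logs SET isValid = false "
                    + "WHERE JobId = ? "
                    + "AND ts >= to_timestamp(?, 'dd-MM-yyyy-HH:mm:ss') "
                    + "AND ts < to_timestamp(?, 'dd-MM-yyyy-HH:mm:ss')";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, "job-42");               // hypothetical job id
                ps.setString(2, "26-12-2022-00:00:00");  // batch start
                ps.setString(3, "27-12-2022-00:00:00");  // batch end
                int updated = ps.executeUpdate();
                System.out.println("rows marked invalid: " + updated);
            }
        }
    }
}
```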
a

Alex Pelagenko

12/26/2022, 9:30 PM
Agree, 2m rows doesn't sound like so much that you need to worry about updates. Please share more details: what's the table partitioning, what are the column types, the row count per partition and the QuestDB version?
i

Imre

12/26/2022, 10:48 PM
Hi, the error message
Timeout expired on waiting for the async command execution result
suggests that the UPDATE statement could not run immediately and was waiting in the async execution queue. Are the jobs ingesting data via ILP while you are running the UPDATE via PGWire? In that case ILP could hold the table and prevent the UPDATE statement from running. It would be put on a queue and executed asynchronously when ILP releases the table. However, if ILP does not release the table within a set timeout, the UPDATE will be dropped and the above error will be seen.
As Alex said, the amount of data is not too much to result in such slow processing of the UPDATE.
Also it would be good to know which version of QuestDB you are using.
a

Ajay Pilaniya

12/27/2022, 4:07 AM
@Andrey Pechkurov The JobId column is of type string currently - should I update it to SYMBOL?
Just now I got the same error on another table. That table only contains 300k rows, so now I am thinking this error is related to something else rather than high row count.
Full error :
org.postgresql.util.PSQLException: ERROR: Timeout expired on waiting for the async command execution result [instance=12]Timeout expired on waiting for the async command execution result [instance=13]Timeout expired on waiting for the async command execution result [instance=33]Timeout expired on waiting for the async command execution result [instance=36]Timeout expired on waiting for the async command execution result [instance=39]Timeout expired on waiting for the async command execution result [instance=41]Timeout expired on waiting for the async command execution result [instance=48]Timeout expired on waiting for the async command execution result [instance=62]Timeout expired on waiting for the async command execution result [instance=105]Timeout expired on waiting for the async command execution result [instance=125]Timeout expired on waiting for the async command execution result [instance=130]Timeout expired on waiting for the async command execution result [instance=131]Timeout expired on waiting for the async command execution result [instance=136]Timeout expired on waiting for the async command execution result [instance=139]Timeout expired on waiting for the async command execution result [instance=140]Timeout expired on waiting for the async command execution result [instance=151]
a

Andrey Pechkurov

12/27/2022, 7:29 AM
JobId column is of type string currently? Should I update it to SYMBOL?
Yes, it might be a good idea in any case. Symbol is more efficient than repeated string values.
Indeed, it looks unrelated to the data volume. Could you check Imre's questions on ILP?
a

Ajay Pilaniya

12/27/2022, 7:53 AM
@Andrey Pechkurov I am not using ILP. Both ingestion and updates are being done via PGWire. Also, during the UPDATE query no ingestion is running; ingestion only runs after the UPDATE query returns an OK response. I have also tried changing JobId to SYMBOL and added an index on it. Now I am trying to run the UPDATE query for a one-hour range only, but the query is still failing. So this might not be related to data volume at all.
a

Andrey Pechkurov

12/27/2022, 8:05 AM
Does it still fail with a timeout after a database restart? If so, could you share the log after you restart the database and run the UPDATE?
a

Ajay Pilaniya

12/27/2022, 8:28 AM
@Andrey Pechkurov Will share the logs shortly. Also how many rows should I have in a partition ideally?
a

Andrey Pechkurov

12/27/2022, 8:34 AM
I'd say each partition should occupy up to 1GB of disk space or so. Hard to say in terms of row count since it hugely depends on the number of columns and their types. Say, if you have 10 LONG columns, that means up to 13.5M rows per partition.
a

Ajay Pilaniya

12/27/2022, 9:47 AM
@Andrey Pechkurov getting below errors after update timeout on server side -
2022-12-27T09:44:06.106944Z E i.q.g.UpdateOperator could not update
io.questdb.cairo.CairoException: [-1] timeout, query aborted [fd=211]
	at io.questdb.cairo.CairoException.critical(CairoException.java:89)
	at io.questdb.cairo.CairoException.nonCritical(CairoException.java:83)
	at io.questdb.griffin.engine.ops.UpdateOperation.forceTestTimeout(UpdateOperation.java:105)
	at io.questdb.griffin.UpdateOperator.executeUpdate(UpdateOperator.java:119)
	at io.questdb.griffin.engine.ops.UpdateOperation.apply(UpdateOperation.java:65)
	at io.questdb.cairo.TableWriter.processAsyncWriterCommand(TableWriter.java:4460)
	at io.questdb.cairo.TableWriter.processCommandQueue(TableWriter.java:1215)
	at io.questdb.cairo.TableWriter.processCommandQueue(TableWriter.java:4499)
	at io.questdb.cairo.TableWriter.tick(TableWriter.java:1833)
	at io.questdb.cairo.pool.WriterPool.returnToPool(WriterPool.java:555)
	at io.questdb.cairo.pool.WriterPool$Entry.close(WriterPool.java:608)
	at io.questdb.cairo.TableWriter.close(TableWriter.java:756)
	at io.questdb.std.Misc.free(Misc.java:45)
	at io.questdb.cutlass.pgwire.PGConnectionContext.executeTag0(PGConnectionContext.java:1320)
	at io.questdb.cutlass.pgwire.PGConnectionContext.executeTag(PGConnectionContext.java:1309)
	at io.questdb.cutlass.pgwire.PGConnectionContext.processExecute(PGConnectionContext.java:2147)
	at io.questdb.cutlass.pgwire.PGConnectionContext.processExec(PGConnectionContext.java:2131)
	at io.questdb.cutlass.pgwire.PGConnectionContext.parse(PGConnectionContext.java:1568)
	at io.questdb.cutlass.pgwire.PGConnectionContext.handleClientOperation(PGConnectionContext.java:413)
	at io.questdb.cutlass.pgwire.PGJobContext.handleClientOperation(PGJobContext.java:84)
	at io.questdb.cutlass.pgwire.PGWireServer$1.lambda$$0(PGWireServer.java:83)
	at io.questdb.network.AbstractIODispatcher.processIOQueue(AbstractIODispatcher.java:164)
	at io.questdb.cutlass.pgwire.PGWireServer$1.run(PGWireServer.java:110)
	at io.questdb.mp.Worker.run(Worker.java:118)
a

Andrey Pechkurov

12/27/2022, 9:50 AM
Any log messages before this one?
a

Ajay Pilaniya

12/27/2022, 9:50 AM
2022-12-27T09:46:23.453311Z I i.q.c.TableWriter published async command complete event [type=3,tableName=cdn_logs,tableId=4,correlationId=25,errorCode=-3,errorMsg=timeout, query aborted [fd=652]]
2022-12-27T09:46:23.453319Z I i.q.c.TableWriter received async cmd [type=3, tableName=cdn_logs, tableId=4, correlationId=26, cursor=25]
this one was right before it… before that the other logs seem normal and don't seem to contain any error entries
@Andrey Pechkurov Found this one also -
2022-12-27T09:46:12.805132Z I i.q.g.e.o.OperationFutureImpl published ASYNC writer command [name=UPDATE,tableName=cdn_logs,instance=27]
2022-12-27T09:46:13.306167Z E i.q.c.p.PGConnectionContext error [pos=0, msg=`Timeout expired on waiting for the async command execution result [instance=6]Timeout expired on waiting for the async command execution result [instance=7]Timeout expired on waiting for the async command execution result [instance=8]Timeout expired on waiting for the async command execution result [instance=9]Timeout expired on waiting for the async command execution result [instance=13]Timeout expired on waiting for the async command execution result [instance=16]Timeout expired on waiting for the async command execution result [instance=20]Timeout expired on waiting for the async command execution result [instance=21]Timeout expired on waiting for the async command execution result [instance=26]Timeout expired on waiting for the async command execution result [instance=27]`]
a

Andrey Pechkurov

12/27/2022, 10:14 AM
It looks like the table writer is unavailable when you're executing the UPDATE, so there should be ingestion going on against the table. You could try increasing the wait timeout:
cairo.writer.alter.busy.wait.timeout=5000
(the default value is 500ms). This might help to get rid of the timeout errors in situations when another connection is holding the table writer.
a

Ajay Pilaniya

12/27/2022, 10:16 AM
@Andrey Pechkurov But this seems unlikely, as I am running all queries related to this table on a single thread. So unless and until the previous ingestion is finished I will not fire any further queries. Nevertheless, let me try increasing the timeout and report back.
a

Andrey Pechkurov

12/27/2022, 10:18 AM
Could you share full server logs after a restart and before you run an UPDATE? They should contain all information about table writer acquire/release
a

Ajay Pilaniya

12/27/2022, 10:19 AM
Yeah sure. In which file will I find the cairo.writer.alter.busy.wait.timeout property, by the way?
a

Andrey Pechkurov

12/27/2022, 10:21 AM
<qdb_root>/conf/server.conf
Thanks. Going to take a look a bit later
You seem to have ingestion going on into the cdn_logs_ENTERR10 table:
2022-12-27T10:26:17.406887Z I i.q.c.p.PGConnectionContext parse [fd=117, q=UPDATE 'cdn_logs_ENTERR10' SET isValid = false where ts >= to_timestamp($1, 'dd-MM-yyyy-HH:mm:ss')
AND ts < to_timestamp($2, 'dd-MM-yyyy-HH:mm:ss') AND JobId = $3]
2022-12-27T10:26:17.410195Z I i.q.c.p.WriterPool busy [table=`cdn_logs_ENTERR10`, owner=16, thread=20, reason=insert]
Notice busy + reason=insert
I also noticed that your PGWire ingestion generates lots of out-of-order writes, which is probably the reason for the constantly busy table writer
a

Ajay Pilaniya

12/27/2022, 11:02 AM
Yes, we are backfilling data so out-of-order writes are there. If this is the reason I might need to put some timeout in between the jobs? Is there any query I can use to find out whether a table is busy or not?
a

Andrey Pechkurov

12/27/2022, 11:03 AM
In general, ILP is more efficient than PGWire. With PGWire you could try to accumulate data in large batches and write them in one go.
a

Ajay Pilaniya

12/27/2022, 11:05 AM
I am doing bulk writes only, and this was the main reason I used a single thread - so that the table is never blocked, as I don't need to write much data. But if there is a way to check whether the table is busy or not, then I think I can put in a timeout
n

Nicolas Hourcard

12/27/2022, 11:06 AM
hey Ajay, is there a reason you picked pgwire vs ILP for inserts?
a

Ajay Pilaniya

12/27/2022, 11:07 AM
@Nicolas Hourcard We already had a PGWire implementation in our application and we only needed to write 20-30 million records every day, so we thought PGWire could handle this
a

Andrey Pechkurov

12/27/2022, 11:07 AM
Is there any query using which I can find out if table is busy or not?
I don't think we have such a function. You could try handling the UPDATE timed-out error and re-scheduling your job, or doing something like an exponential backoff.
BTW is it possible to run all INSERTs and UPDATEs on the same thread/connection?
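A rough sketch of the retry idea above (not an official pattern): catch the PGWire timeout error seen earlier in this thread and retry the UPDATE with an exponentially growing delay. The table name, attempt limit and delays are illustrative assumptions:
```
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

final class UpdateWithRetry {
    // Retries the UPDATE a few times, doubling the wait after each timeout.
    // The attempt limit and initial delay are arbitrary illustration values.
    static int markJobInvalid(Connection conn, String jobId) throws Exception {
        long delayMs = 1_000;
        for (int attempt = 1; ; attempt++) {
            try (PreparedStatement ps = conn.prepareStatement(
                    "UPDATE cdn_logs SET isValid = false WHERE JobId = ?")) {
                ps.setString(1, jobId);
                return ps.executeUpdate();
            } catch (SQLException e) {
                boolean asyncTimeout = e.getMessage() != null
                        && e.getMessage().contains("Timeout expired on waiting for the async command");
                if (!asyncTimeout || attempt == 5) {
                    throw e; // not the async-timeout case, or out of attempts
                }
                Thread.sleep(delayMs); // back off before retrying
                delayMs *= 2;          // exponential growth of the wait
            }
        }
    }
}
```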
n

Nicolas Hourcard

12/27/2022, 11:17 AM
gotcha, thanks Ajay - sounds good
a

Ajay Pilaniya

12/27/2022, 11:22 AM
Oh, I think I found the gap. I am running insertion on a different thread, but before submitting the insertion to that thread I am running the update on a common thread pool, due to which it was trying to update in parallel. Let me just move the update to the same thread and report back if the issue persists.
a

Andrey Pechkurov

12/27/2022, 11:23 AM
Makes sense. Let us know if moving everything on a single thread solves the problem
n

Nicolas Hourcard

12/27/2022, 11:34 AM
👍
a

Ajay Pilaniya

12/27/2022, 1:07 PM
@Andrey Pechkurov I moved the update to the same thread and now everything is working fine. No issues in the last one hour. 🎉 Also, is there any documentation regarding integration of ILP with Java? I am seeing that some jobs are taking 2+ minutes to finish the ingestion. Can out-of-order commits be the reason here? Currently I am ingesting data of 26th December and earlier
a

Andrey Pechkurov

12/27/2022, 1:09 PM
No issues in last one hour. 🎉
Awesome to hear that!
Yes, O3 commits are very likely to be the reason. We have the Java client docs here: https://questdb.io/docs/reference/clients/java_ilp/
It's very simple to use
a

Ajay Pilaniya

12/27/2022, 1:13 PM
Oh, now I remember. I saw these docs earlier, but as per the docs I need to map each column manually. Over PGWire I had an ORM library which made it easier to handle objects directly.
a

Andrey Pechkurov

12/27/2022, 1:17 PM
I guess, you could use reflection to make the client more generic, but if you have only a few entity classes, there won't be a lot of boilerplate code, so there is not much need for reflection
a

Ajay Pilaniya

12/27/2022, 1:20 PM
Are multi-threaded writes allowed over ILP?
And the given document only shows single-row insertion; if I have 100k rows to insert, do I also need to insert these one by one? Wouldn't that slow down insertions?
a

Andrey Pechkurov

12/27/2022, 1:24 PM
Yes, you can write concurrently over multiple connections
Each connection corresponds to a single Sender instance
if I have 100k rows to insert, do I also need to insert these one by one? Wouldn't that slow down insertions?
The client writes rows to the TCP socket in batches, so the performance is good.
a

Ajay Pilaniya

12/27/2022, 1:27 PM
Okay… Got it… Will run some tests and if the performance is good I will implement ILP ingestion. Will report back tomorrow regarding the performance difference. And thank you so much for all the help throughout the integration.
One more thing: can ILP be used only for insertions? Do updates need to be done via PGWire only?
a

Andrey Pechkurov

12/27/2022, 1:28 PM
No problem at all. Let us know if you need some help with further performance tuning
n

Nicolas Hourcard

12/27/2022, 1:29 PM
ILP can be used for inserts as well as updates
a

Andrey Pechkurov

12/27/2022, 1:30 PM
Yes, UPDATEs are only supported over PGWire. We have de-duplication feature on the roadmap, but it's hard to give any ETAs on that one
This is the roadmap issue to follow: https://github.com/questdb/roadmap/issues/3
a

Ajay Pilaniya

12/27/2022, 1:32 PM
``Yes, UPDATEs are only supported over PGWire.`` In that case I am back to square one, I think, because I will need to run ingestions on the same thread even with ILP: if two updates ran concurrently I would get the same error
a

Andrey Pechkurov

12/27/2022, 1:36 PM
If you run the UPDATEs on application start and only then start ILP ingestion threads, it should work.
AFAIU you only need UPDATEs to mark the rows from a failed batch, so it's probably a good idea to do that on application start. But I may be missing something.
a

Ajay Pilaniya

12/27/2022, 1:38 PM
These batches keep coming every few seconds. So I will need to make sure that no UPDATE query is running while ILP ingestion is running. So in the end I will have improved ingestion performance, but I still need to manage this single-threaded behaviour manually.
a

Andrey Pechkurov

12/27/2022, 1:48 PM
In theory you could use something like a j.u.c.l.ReadWriteLock. Your ILP threads would acquire it as readers, while the thread that does the UPDATE would act as a writer, periodically trying to acquire the lock and then executing the UPDATE, if necessary.
This way you won't need any complex logic to suspend/restart ILP threads. The only thing to keep in mind is that the UPDATE thread shouldn't run very frequently since it would block ILP ingestion.
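A minimal sketch of that locking scheme, assuming the ILP writers and the UPDATE job run in the same JVM; the class and method names here are made up for illustration:
```
import java.util.concurrent.locks.ReentrantReadWriteLock;

final class IngestionCoordinator {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    // Called by each ILP ingestion thread around one batch write.
    void runIlpBatch(Runnable writeBatch) {
        lock.readLock().lock();       // many ILP writers may hold the read lock at once
        try {
            writeBatch.run();         // send the batch via the ILP Sender
        } finally {
            lock.readLock().unlock();
        }
    }

    // Called by the single thread that issues UPDATEs over PGWire.
    void runUpdate(Runnable update) {
        lock.writeLock().lock();      // waits until all in-flight ILP batches finish
        try {
            update.run();             // run the UPDATE while no ILP batch is active
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```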
a

Ajay Pilaniya

12/27/2022, 1:51 PM
yeah seems like a good approach. I will try running some different approaches around this and report back
@Andrey Pechkurov I was trying the following on a test db:
```
try (Sender sender = Sender.builder().address("10.41.12.239:8812").build()) {
    sender.table("cdn_logs_QA_Manual")
            .symbol("ContentId", "ABC")
            .symbol("RequestId", "ZZZ")
            .symbol("DistributionType", "VOD")
            .symbol("FileType", "X")
            .symbol("Method", "POST")
            .symbol("CdnType", "CF")
            .symbol("DistributionId", "ABC")
            .boolColumn("isValid", false)
            .stringColumn("ts", DateUtils.getQuestDbFormat(System.currentTimeMillis()))
            .atNow();
    sender.flush();
}
```
But this is not inserting any data. I am not getting any errors either, just the log below -
2022-12-27T14:30:21.743003Z I i.q.c.l.t.PlainTcpLineChannel Send buffer size change from 131614 to 131072
Am I missing something?
a

Andrey Pechkurov

12/27/2022, 3:02 PM
No, your snippet seems fine
Do you see any related server-side logs?
Oh, wait. Is ts your designated timestamp column? If so, the snippet should be:
try (Sender sender = Sender.builder().address("10.41.X.X:8812").build()) {
            sender.table("cdn_logs_QA_Manual")
                    .symbol("ContentId", "ABC")
                    .symbol("RequestId", "ZZZ")
                    .symbol("DistributionType", "VOD")
                    .symbol("FileType", "X")
                    .symbol("Method", "POST")
                    .symbol("CdnType", "CF")
                    .symbol("DistributionId", "ABC")
                    // .timestampColumn("ts", System.currentTimeMillis()) - this value is set via atNow() call
                    .boolColumn("isValid", false).atNow();
            sender.flush(); // additional note this call is optional; sender flushes the data on close
        }
One more thing to mention: I've noticed that you're using 6.5.4. You should upgrade to 6.6.1 since it improves ILP by getting rid of having to set commit lag manually. The database now automatically fine-tunes the commit lag when you send ILP data
a

Ajay Pilaniya

12/28/2022, 3:56 AM
@Andrey Pechkurov I created a test table with only four columns (timestamp included) and tried the below code, but same issue. It is not inserting any data. Even if there is something wrong, at least it should give some error, but there is no error.
a

Andrey Pechkurov

12/28/2022, 7:22 AM
Also I have a password setup on my db
Do you mean ILP password? If so, you should use enableAuth() and authToken() calls: https://questdb.io/docs/reference/clients/java_ilp/#example-with-tls-and-authentication-enabled
As for no data inserted into the table, server logs should contain the reason for it
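Going by the linked docs page, enabling ILP authentication from the Java client could look roughly like this; the key id and token below are placeholders, and the import path may vary with the client version:
```
import io.questdb.client.Sender;

public class AuthenticatedIlpExample {
    public static void main(String[] args) {
        // 9009 is the default ILP port; enableAuth() takes the key id
        // configured on the server, authToken() the matching secret
        try (Sender sender = Sender.builder()
                .address("10.41.12.239:9009")
                .enableAuth("myKeyId")                  // placeholder key id
                .authToken("replace-with-your-token")   // placeholder token
                .build()) {
            sender.table("cdn_logs_QA_Manual")
                    .symbol("ContentId", "ABC")
                    .boolColumn("isValid", true)
                    .atNow();
        }
    }
}
```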
a

Ajay Pilaniya

12/28/2022, 8:39 AM
As for no data inserted into the table, server logs should contain the reason for it
Checked the server logs and found nothing related to this. So I guess this data is not being sent to the server. But without any error logs how do I find the exact issue?
Do you mean ILP password?
In my Java app I am using the default QuestDB password. Not sure how to use that with ILP @Andrey Pechkurov
I am using port 8812 to connect my Java application via PGWire. Do I use the same port for ILP ingestion?
a

Andrey Pechkurov

12/28/2022, 8:59 AM
The default password is only for PGWire. For ILP you need to enable authentication explicitly: https://questdb.io/docs/reference/api/ilp/authenticate/
Do I use the same port for ILP ingestion?
No, the ILP port is 9009. Probably that's the reason - try changing the port.
a

Ajay Pilaniya

12/28/2022, 9:02 AM
@Andrey Pechkurov Yeah, now I got logs on the server. But still an error:
could not process line data [table=content_details_test, msg=cast error for line protocol tag [columnWriterIndex=3, columnType=STRING], errno=0]
2022-12-28T09:00:53.827517Z E i.q.c.l.t.LineTcpConnectionContext [175] could not parse measurement, NONE at 59, line (may be mangled due to partial parsing): 'content_details_test,ContentId=ABC,PartnerId=ZZZ,Title=Test'
And one potential issue I see here: if my Java app cannot catch such errors and they are only logged on the server side, then how will I know if some insertions failed?
This error was due to an incorrect data type - for a string column I used the symbol method. It is now inserted, but my question remains: how would I know about such failed cases in real time?
a

Andrey Pechkurov

12/28/2022, 9:20 AM
If you have tests around your code, type mismatch errors shouldn't be a problem. In many other cases the client throws an exception.
a

Ajay Pilaniya

12/28/2022, 9:45 AM
Ok, let me run some more tests around it and hopefully it will behave better
@Andrey Pechkurov When I use epoch milliseconds on timestamp columns, the value gets inserted as 1970-01-01T00:27:52.231208Z. This seems random (I think it is taking the server time and not the time provided in the input). What could be the possible reason here?
a

Andrey Pechkurov

12/28/2022, 12:54 PM
Non-designated timestamp columns are expected to be in microseconds: https://questdb.io/docs/reference/api/ilp/columnset-types/#timestamp
Notice that the designated timestamp column should be in nanoseconds (that's due to the Influx Line Protocol compatibility requirements)
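For example, starting from an epoch-milliseconds value, the unit conversions described above would look roughly like this with the Java client; table and column names reuse the ones from the snippets above, and the exact import path depends on the client version:
```
import io.questdb.client.Sender;

public class TimestampUnitsExample {
    public static void main(String[] args) {
        long nowMillis = System.currentTimeMillis();
        try (Sender sender = Sender.builder().address("10.41.12.239:9009").build()) {
            sender.table("cdn_logs_QA_Manual")
                    .symbol("ContentId", "ABC")
                    // non-designated TIMESTAMP column: microseconds since epoch
                    .timestampColumn("ua", nowMillis * 1_000L)
                    .boolColumn("isValid", true)
                    // designated timestamp: nanoseconds since epoch
                    .at(nowMillis * 1_000_000L);
        }
    }
}
```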
a

Ajay Pilaniya

12/28/2022, 1:56 PM
Ok got it.. Just tested with provided information and it is working fine. Will test production load soon
a

Andrey Pechkurov

12/28/2022, 2:03 PM
Awesome to hear that!
a

Ajay Pilaniya

12/30/2022, 6:47 AM
@Andrey Pechkurov we just integrated this and tested it with 100k records, but I am getting on average 30-40 seconds of latency on insertions. Is that normal?
a

Andrey Pechkurov

12/30/2022, 7:32 AM
30-40 seconds per 100k rows isn't normal. That's on ILP, right? What kind of disk/volume are you using? Also, I remember that you have around 30-40M rows per daily partition and you have lots of out-of-order writes that are inevitable. If the writes from each of your batches end up in one or a few hours per day, I'd suggest trying hourly partitions - with them partition re-writes will be less costly.
a

Ajay Pilaniya

12/30/2022, 7:35 AM
@Andrey Pechkurov We are using EBS volume of gp3 type. Also our datapoints are mostly evenly distributed throughout the day.
a

Andrey Pechkurov

12/30/2022, 7:53 AM
Makes sense. Is it possible for you to increase the batch size or keep the connection open between the batches? This way there would be fewer commits, hence fewer partition re-writes
Also, what are your EBS volume settings (IOPS and throughput)? Default values or something above?
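A sketch of the "larger batches, fewer commits" idea with the ILP client: keep one Sender open across batches and flush once per batch rather than opening a connection per job. The row holder and names below are made up for illustration:
```
import io.questdb.client.Sender;
import java.util.List;

final class BatchedIlpWriter implements AutoCloseable {
    // one long-lived connection instead of a new one per job
    private final Sender sender = Sender.builder().address("10.41.12.239:9009").build();

    // Rows of one batch are written back-to-back and flushed once,
    // so the server can commit them together.
    void writeBatch(List<LogRow> rows) {
        for (LogRow row : rows) {
            sender.table("cdn_logs")
                    .symbol("JobId", row.jobId)
                    .boolColumn("isValid", true)
                    .at(row.tsNanos);   // designated timestamp, epoch nanos
        }
        sender.flush();
    }

    @Override
    public void close() {
        sender.close();
    }

    // hypothetical row holder used only for this sketch
    static final class LogRow {
        String jobId;
        long tsNanos;
    }
}
```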
a

Ajay Pilaniya

12/30/2022, 7:55 AM
I can increase batch sizes to 30k but keeping connection open will be difficult between batches.
Also, what are your EBS volume settings (IOPS and throughput)? Default values or something above?
I am using IOPS 3000 and Throughput 125
a

Andrey Pechkurov

12/30/2022, 7:59 AM
I can increase batch sizes to 30k
You should increase the batch size to 30k then
I am using IOPS 3000 and Throughput 125
That's your main bottleneck. Please try increasing the settings to, say, 8000 IOPS and 500MB/s throughput. You can increase them even more if the increased volume cost is OK for you. The higher the settings, the faster writes will be
a

Ajay Pilaniya

12/30/2022, 8:00 AM
Ok let me change the setting and retry
@Andrey Pechkurov I am seeing that once I ingest around 1 million records into a table, writes start taking more time after that. At any given time one batch will only contain records from a single hour. So if my ingestion job has 500k rows to insert, all these rows will be from a single hour. Should I change my partitioning to HOUR so that ingestion speed can be improved during O3 writes?
a

Andrey Pechkurov

12/30/2022, 9:30 AM
So if my ingestion job has 500k rows to insert, all these rows will be from a single hour. Should I change my partition to HOUR so that during O3 writes, ingestion speed can be improved?
Yes, this way partitions to be re-written will get much smaller.
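As far as I know, partition granularity can't be altered in place in this QuestDB version, so switching to hourly partitions generally means creating a new table and copying the data over. A hedged sketch of the steps issued over PGWire, with the column list heavily abbreviated and the table names illustrative:
```
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class RepartitionByHour {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:8812/qdb", "admin", "quest");
             Statement st = conn.createStatement()) {
            // 1. create a copy of the table partitioned by HOUR (column list shortened here)
            st.execute("CREATE TABLE cdn_logs_hourly (ts TIMESTAMP, JobId SYMBOL, isValid BOOLEAN) "
                    + "timestamp(ts) PARTITION BY HOUR");
            // 2. copy the existing rows
            st.execute("INSERT INTO cdn_logs_hourly SELECT ts, JobId, isValid FROM cdn_logs");
            // 3. swap the tables only after the copy has been verified
            st.execute("DROP TABLE cdn_logs");
            st.execute("RENAME TABLE cdn_logs_hourly TO cdn_logs");
        }
    }
}
```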
a

Ajay Pilaniya

12/30/2022, 11:03 AM
@Andrey Pechkurov Yeah changed to HOUR and now it is handling our production load pretty smoothly
a

Andrey Pechkurov

12/30/2022, 11:07 AM
Awesome to hear that. Still, improving EBS volume settings is a good idea. The defaults are pretty slow
a

Ajay Pilaniya

12/30/2022, 11:08 AM
@Andrey Pechkurov Yes we are doing a cost analysis on that
@Andrey Pechkurov Out of the blue the database crashed with the error below. Can you suggest what's wrong here?
2022-12-30T16:05:08.953802Z E i.q.c.l.t.LineTcpConnectionContext [37227] could not parse measurement, NONE at 1282, line (may be mangled due to partial parsing): 'cdn_logs,ContentId=6346ba504da63b20baf6ecb0,RequestId=F77E08D57C5646C798ED3A53FC33C4AB,JobId=E1KUOC0ENJS6P5,DistributionType=VOD,FileType=DASH_VIDEO_SEGMENT,Method=GET,CdnType=CloudFront,DistributionId=JJJJC0ENJS6P5,StreamingPartnerId=ABC-ABC,UserCity=London,UserCountry=United Kingdomm,UserDevice=Mobile,UserBrowser=-,UserDeviceBrand=Dangalplay,UserDeviceName=Dangalplay Android Mobilele,UserId=iWGYSC6ZAOR414EBX20,Agent=Special,AgentName=<http://Com.dangalplay.tv|Com.dangalplay.tv>,UserOs=Android,OsClass=Mobile CdnRequestId="-IxVGYSFfWfi_bnXatuj0CLrQ277PqG5ShOSPfLkwFyt3GPmcVwIsw==",UserAgent="<http://com.dangalplay.tv/1.0.10%20(Linux;Android%2010)%20ExoPlayerLib/2.16.1|com.dangalplay.tv/1.0.10%20(Linux;Android%2010)%20ExoPlayerLib/2.16.1>",Referer="",Domain="<http://streams-msp.test.in|streams-msp.test.in>",Ip="1.2.3.4",Extras="",FileName="chunk_ctvideo_ridp0va0br2500000_cs50400000_w1550299147_mpd.m4s",CacheStat="Miss",Resolution="854x480",EdgeLocation="BOM78-P4",FilePath="/ENTERR10/EPISODE/6346ba504da63b20baf6ecb0/dash/854x480/chunk_ctvideo_ridp0va0br2500000_cs50400000_w1550299147_mpd.m4s",AgentVersion="1.0.10",AgentVersionName="<http://Com.dangalplay.tv|Com.dangalplay.tv> 1.0.10",OsVersionName="Android 10",OsVersion="10",isValid=t,SegmentCount=1i,ViewCount=0i,FileSize=263306i,ResponseTime=0.106,ua=1672436108953000t,ca=1672436108953000t 1670508923000000000'^M
2022-12-30T16:05:08.953813Z I tcp-line-server scheduling disconnect [fd=37227, reason=0]^M
2022-12-30T16:05:08.953815Z I tcp-line-server disconnected [ip=10.60.13.143, fd=37227, src=queue]^M
2022-12-30T16:05:08.953969Z I tcp-line-server connected [ip=10.60.13.143, fd=37227]^M
2022-12-30T16:05:08.954072Z I i.q.c.p.PGConnectionContext query cache used [fd=23691]^M
2022-12-30T16:05:08.954122Z I tcp-line-server connected [ip=10.60.13.143, fd=37601]^M
2022-12-30T16:05:08.954343Z I i.q.c.p.PGConnectionContext query cache used [fd=23320]^M
2022-12-30T16:05:08.954628Z I tcp-line-server scheduling disconnect [fd=24098, reason=0]^M
2022-12-30T16:05:08.954630Z I tcp-line-server disconnected [ip=10.60.13.143, fd=24098, src=queue]^M
2022-12-30T16:05:08.954675Z I tcp-line-server connected [ip=10.60.13.143, fd=24098]^M
2022-12-30T16:05:08.954697Z I i.q.c.p.PGConnectionContext query cache used [fd=63343]^M
2022-12-30T16:05:08.955128Z I tcp-line-server connected [ip=10.60.13.143, fd=39064]^M
[thread 10751 also had an error][thread 10750 also had an error]

Compiled method (c1) 209430797 6875       3       io.questdb.cairo.ConcurrentBitmapIndexFwdReader$Cursor::of (368 bytes)
 total in heap  [0x00007f34122a6810,0x00007f34122a9660] = 11856
 relocation     [0x00007f34122a6988,0x00007f34122a6c38] = 688
 main code      [0x00007f34122a6c40,0x00007f34122a87c0] = 7040
 stub code      [0x00007f34122a87c0,0x00007f34122a8970] = 432
 oops           [0x00007f34122a8970,0x00007f34122a8990] = 32
 metadata       [0x00007f34122a8990,0x00007f34122a8a08] = 120
 scopes data    [0x00007f34122a8a08,0x00007f34122a9030] = 1576
 scopes pcs     [0x00007f34122a9030,0x00007f34122a9550] = 1312
 dependencies   [0x00007f34122a9550,0x00007f34122a9560] = 16
 nul chk table  [0x00007f34122a9560,0x00007f34122a9660] = 256
Could not load hsdis-amd64.so; library not loadable; PrintAssembly is disabled
#
# If you would like to submit a bug report, please visit:
#   <https://github.com/AdoptOpenJDK/openjdk-support/issues>
a

Andrey Pechkurov

12/30/2022, 5:42 PM
Can you reproduce it? Also, what's your table schema? I'd like to check what's wrong with the ILP message
Could you also share the JVM crash dump file?
a

Ajay Pilaniya

12/30/2022, 5:46 PM
@Andrey Pechkurov The same job ran successfully after a restart, so I am unable to reproduce it. Below is the table schema (although I have already inserted millions of such records, so I am not sure why this particular record failed)
CREATE TABLE IF NOT EXISTS cdn_logs_ (ts TIMESTAMP, ua TIMESTAMP, ca TIMESTAMP,
CdnRequestId string not null,
ContentId SYMBOL CAPACITY 50000 CACHE, RequestId SYMBOL CAPACITY 50000 CACHE,
UserAgent string, JobId SYMBOL CAPACITY 20000 CACHE, Referer string,
Domain string, Ip string, Extras string, FileName string not null, CacheStat string, Resolution string, EdgeLocation string,
FilePath string not null, DistributionType SYMBOL CAPACITY 10 CACHE,
StatusCode int, FileType SYMBOL CAPACITY 10 CACHE, Method SYMBOL CAPACITY 10 CACHE,
CdnType SYMBOL CAPACITY 10 CACHE, DistributionId SYMBOL CAPACITY 100 NOCACHE, StreamingPartnerId SYMBOL CAPACITY 500 CACHE,
UserCity SYMBOL CAPACITY 5000 CACHE, UserCountry SYMBOL CAPACITY 300 CACHE, UserDevice SYMBOL CAPACITY 50 CACHE,
UserBrowser SYMBOL CAPACITY 500 CACHE,
UserDeviceBrand SYMBOL CAPACITY 500 CACHE, UserDeviceName SYMBOL CAPACITY 500 CACHE, UserId SYMBOL CAPACITY 500000 NOCACHE,
Agent SYMBOL CAPACITY 5000 NOCACHE, AgentName SYMBOL CAPACITY 500000 NOCACHE, AgentVersion string, AgentVersionName string,
OsVersionName string, UserOs SYMBOL CAPACITY 30 CACHE, isValid boolean, OsClass SYMBOL CAPACITY 30 CACHE, OsVersion string,
ResponseTime double, SegmentCount long, ViewCount long, FileSize long) timestamp(ts) PARTITION BY HOUR;
Could you also share the JVM crash dump file?
At which path can I find this?
@Andrey Pechkurov is it possible that this particular row might have crashed the JVM?
Again DB crashed with below logs - (I am attaching exact logs below without any modification) @Andrey Pechkurov
2022-12-30T16:05:08.953802Z E i.q.c.l.t.LineTcpConnectionContext [37227] could not parse measurement, NONE at 1282, line (may be mangled due to partial parsing): 'cdn_logs_ENTERR10,ContentId=6346ba504da63b20baf6ecb0,RequestId=F77E08D57C5646C798ED3A53FC33C4AB,JobId=E1KUOC0ENJS6P5.2022-12-08-14.6f1b4e6f.gz,DistributionType=VOD,FileType=DASH_VIDEO_SEGMENT,Method=GET,CdnType=CloudFront,DistributionId=E1KUOC0ENJS6P5,StreamingPartnerId=ENTERR10-ENTERR10,UserCity=London,UserCountry=United Kingdomm,UserDevice=Mobile,UserBrowser=-,UserDeviceBrand=Dangalplay,UserDeviceName=Dangalplay Android Mobilele,UserId=iWGYSC6ZAOR414EBX20,Agent=Special,AgentName=<http://Com.dangalplay.tv|Com.dangalplay.tv>,UserOs=Android,OsClass=Mobile CdnRequestId="-IxVGYSFfWfi_bnXatuj0CLrQ277PqG5ShOSPfLkwFyt3GPmcVwIsw==",UserAgent="<http://com.dangalplay.tv/1.0.10%20(Linux;Android%2010)%20ExoPlayerLib/2.16.1|com.dangalplay.tv/1.0.10%20(Linux;Android%2010)%20ExoPlayerLib/2.16.1>",Referer="",Domain="<http://enterr10-streams-msp.streamready.in|enterr10-streams-msp.streamready.in>",Ip="45.250.66.25",Extras="",FileName="chunk_ctvideo_ridp0va0br2500000_cs50400000_w1550299147_mpd.m4s",CacheStat="Miss",Resolution="854x480",EdgeLocation="BOM78-P4",FilePath="/ENTERR10/EPISODE/6346ba504da63b20baf6ecb0/dash/854x480/chunk_ctvideo_ridp0va0br2500000_cs50400000_w1550299147_mpd.m4s",AgentVersion="1.0.10",AgentVersionName="<http://Com.dangalplay.tv|Com.dangalplay.tv> 1.0.10",OsVersionName="Android 10",OsVersion="10",isValid=t,SegmentCount=1i,ViewCount=0i,FileSize=263306i,ResponseTime=0.106,ua=1672436108953000t,ca=1672436108953000t 1670508923000000000'^M
2022-12-30T16:05:08.953813Z I tcp-line-server scheduling disconnect [fd=37227, reason=0]^M
2022-12-30T16:05:08.953815Z I tcp-line-server disconnected [ip=10.60.13.143, fd=37227, src=queue]^M
2022-12-30T16:05:08.953969Z I tcp-line-server connected [ip=10.60.13.143, fd=37227]^M
2022-12-30T16:05:08.954072Z I i.q.c.p.PGConnectionContext query cache used [fd=23691]^M
2022-12-30T16:05:08.954122Z I tcp-line-server connected [ip=10.60.13.143, fd=37601]^M
2022-12-30T16:05:08.954343Z I i.q.c.p.PGConnectionContext query cache used [fd=23320]^M
2022-12-30T16:05:08.954628Z I tcp-line-server scheduling disconnect [fd=24098, reason=0]^M
2022-12-30T16:05:08.954630Z I tcp-line-server disconnected [ip=10.60.13.143, fd=24098, src=queue]^M
2022-12-30T16:05:08.954675Z I tcp-line-server connected [ip=10.60.13.143, fd=24098]^M
2022-12-30T16:05:08.954697Z I i.q.c.p.PGConnectionContext query cache used [fd=63343]^M
2022-12-30T16:05:08.955128Z I tcp-line-server connected [ip=10.60.13.143, fd=39064]^M
[thread 10751 also had an error][thread 10750 also had an error]

Compiled method (c1) 209430797 6875       3       io.questdb.cairo.ConcurrentBitmapIndexFwdReader$Cursor::of (368 bytes)
 total in heap  [0x00007f34122a6810,0x00007f34122a9660] = 11856
 relocation     [0x00007f34122a6988,0x00007f34122a6c38] = 688
 main code      [0x00007f34122a6c40,0x00007f34122a87c0] = 7040
 stub code      [0x00007f34122a87c0,0x00007f34122a8970] = 432
 oops           [0x00007f34122a8970,0x00007f34122a8990] = 32
 metadata       [0x00007f34122a8990,0x00007f34122a8a08] = 120
 scopes data    [0x00007f34122a8a08,0x00007f34122a9030] = 1576
 scopes pcs     [0x00007f34122a9030,0x00007f34122a9550] = 1312
 dependencies   [0x00007f34122a9550,0x00007f34122a9560] = 16
 nul chk table  [0x00007f34122a9560,0x00007f34122a9660] = 256
Could not load hsdis-amd64.so; library not loadable; PrintAssembly is disabled
#
# If you would like to submit a bug report, please visit:
#   <https://github.com/AdoptOpenJDK/openjdk-support/issues>
a

Andrey Pechkurov

12/30/2022, 5:56 PM
CREATE TABLE IF NOT EXISTS cdn_logs_
I guess table name should be cdn_logs, not cdn_logs_
Are you using Java ILP client?
a

Ajay Pilaniya

12/30/2022, 5:57 PM
The table name is cdn_logs_ENTERR10 as shared in the latest logs; the earlier logs were modified. Yes, I am using the Java ILP client @Andrey Pechkurov
a

Andrey Pechkurov

12/30/2022, 6:17 PM
Which version of JDK are you using?
Could not load hsdis-amd64.so doesn't seem ok. Looks like JVM failed to write a crash dump
I've also checked the ILP message from the log and it seems to be a valid one. Is it the very first message your client tries writing on application start or there are some prior ILP messages?
a

Ajay Pilaniya

12/30/2022, 6:21 PM
Is it the very first message your client tries writing on application start or there are some prior ILP messages?
No.. It is part of a bigger batch. Since in first crash this batch failed then my code retriggered same batch. Before this row already 100k+ logs were inserted. The logs which I provided are the last few lines of given log file. Before that everything seems to be normal
Which version of JDK are you using? Could not load hsdis-amd64.so doesn't seem ok. Looks like JVM failed to write a crash dump
a

Andrey Pechkurov

12/30/2022, 6:25 PM
When JVM crashes, it creates a hs_err_pidXXXX.log crash dump file. But in this case it failed to prepare the crash dump
There seems to be a problem in the JVM preventing it writing the crash dump
Could you also check a few ILP messages before this one and share them with me?
They might be the problematic ones
a

Ajay Pilaniya

12/30/2022, 6:30 PM
@Andrey Pechkurov I also found below log during latest crash. Although I have reserved 30GB for JVM
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 16384 bytes for committing reserved memory.
[thread 25374 also had an error]
a

Andrey Pechkurov

12/30/2022, 6:31 PM
Oh, that makes sense
Seems that you're running out of memory for JVM
How much RAM do you have on the machine?
a

Ajay Pilaniya

12/30/2022, 6:32 PM
32 GB @Andrey Pechkurov. After the first crash we also changed the config to allow the JVM to use 30GB
a

Andrey Pechkurov

12/30/2022, 6:33 PM
30GB for JVM is way too much
It should be fine with 2-3GB. The rest should be kept for OS purposes (page cache mainly)
Why did you change it to 30GB?
a

Ajay Pilaniya

12/30/2022, 6:34 PM
Why did you change it to 30GB?
@Andrey Pechkurov after first crash I assumed that JVM is running out of memory so was experimenting around it
a

Andrey Pechkurov

12/30/2022, 6:34 PM
While we're on it, please make sure that you have file and mmapped areas limits increased in the kernel settings: https://questdb.io/docs/operations/capacity-planning/#os-configuration
a

Ajay Pilaniya

12/30/2022, 6:38 PM
@Andrey Pechkurov I have 65535 as open file limit. Do I need to change it? And 65530 for virtual memory area limit
a

Andrey Pechkurov

12/30/2022, 6:38 PM
My bet is that the symbol cache requires a lot of heap space, so that's why the JVM consumes a lot of memory. Still, 30GB is way too high a setting. Could you try disabling symbol caches on all columns?
I have 65535 as open file limit. Do I need to change it? And 65530 for virtual memory area limit
I'd increase both of these settings at least 4x to avoid problems later, when your DB grows
a

Ajay Pilaniya

12/30/2022, 6:39 PM
Could you try disabling symbol caches on all columns?
How can I disable symbol cache in existing schema?
If you also have other tables with symbol cache enabled, disable those as well
You can keep the cache enabled on columns on which you're certain to have only a few unique symbol values
BTW if some of the symbol columns are expected to have unique values per row (like ContentId or RequestId or anything else), you should use the STRING type for them instead of SYMBOL. The SYMBOL type doesn't make sense if each row gets a unique value - it only adds disk space, CPU and RAM overhead in such a situation
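To the earlier question about existing schemas: the symbol cache can be toggled per column with ALTER TABLE over PGWire. A small sketch, using column names from the schema above; verify the exact statement against the QuestDB docs for your version:
```
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class ToggleSymbolCache {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:8812/qdb", "admin", "quest");
             Statement st = conn.createStatement()) {
            // turn the symbol cache off for a high-cardinality column
            st.execute("ALTER TABLE cdn_logs_ENTERR10 ALTER COLUMN UserId NOCACHE");
            // ...and keep (or re-enable) it for a low-cardinality one
            st.execute("ALTER TABLE cdn_logs_ENTERR10 ALTER COLUMN UserOs CACHE");
        }
    }
}
```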
a

Ajay Pilaniya

12/30/2022, 6:52 PM
@Andrey Pechkurov Have disabled caching for high-cardinality columns. RequestId is in a ratio of about 1:100 rows, so for 100k records there will be on average 1k request ids - that's why I used SYMBOL. Also, during the testing phase we found out that string columns had high latency on distinct count queries. Have disabled the cache for now and will monitor whether it is working fine.
@Andrey Pechkurov Now I am getting the error below; no other information is available in the logs…
2022-12-31T03:06:03.379553Z E i.q.c.p.PGConnectionContext error [pos=21, msg=`[24]: could not open read-only [file=/root/.questdb/db/cdn_logs_ENTERR10/ContentId.o]`]
a

Andrey Pechkurov

12/31/2022, 10:11 AM
2022-12-31T03:06:03.379553Z E i.q.c.p.PGConnectionContext error [pos=21, msg=`[24]: could not open read-only [file=/root/.questdb/db/cdn_logs_ENTERR10/ContentId.o]`]
Error code 24 means "Too many open files". Please check that fs.file-max is applied to the database process
a

Ajay Pilaniya

12/31/2022, 10:13 AM
Why is it opening too many files? Is it because I am trying to backfill data of the last two months, which leads to creating more partitions? (I have HOURLY partitions) @Andrey Pechkurov
a

Andrey Pechkurov

12/31/2022, 10:14 AM
Yes, the database can open additional column files on writes to older partitions
a

Ajay Pilaniya

12/31/2022, 10:19 AM
Yes, the database can open additional column files on writes to older partitions
How can I avoid this? Currently I am writing mostly out of order so I am accessing partitions in last two months.
a

Andrey Pechkurov

12/31/2022, 10:26 AM
There is no need to avoid it, it's enough to increase the limits. Table writers close older partitions occasionally, so the open file growth is not unbounded.
Could you check cat /proc/<pid>/limits where <pid> is the database process id?
Also, Max open files is 65535, which is the default. Seems that your kernel setting isn't increased.
If you're running the database under root, changing /etc/sysctl.conf may not be enough. Try adding the following lines to `/etc/security/limits.conf`:
root soft     nofile         262144
root hard     nofile         262144
a

Ajay Pilaniya

12/31/2022, 10:42 AM
Could you check cat /proc/<pid>/limits where <pid> is the database process id?
Screenshot I attached is of database process itself
a

Andrey Pechkurov

12/31/2022, 10:43 AM
Your screenshot shows the default limit
a

Ajay Pilaniya

12/31/2022, 10:44 AM
@Andrey Pechkurov I have modified /etc/security/limits.conf but cat /proc/<pid>/limits is still showing the same number, 65535
a

Andrey Pechkurov

12/31/2022, 10:46 AM
Did you try restarting the machine to have the new limits applied?
a

Ajay Pilaniya

12/31/2022, 10:49 AM
@Andrey Pechkurov Yes limits applied after restart.
Also, I had one more doubt: in the error logs I attached there are some parsing errors, but I could not find anything suspicious there. These errors were appearing yesterday when the database was crashing, and also today before this file limit was breached. @Andrey Pechkurov
a

Andrey Pechkurov

12/31/2022, 10:51 AM
Could you restart the ingestion and check the logs after a while?
Any log records that start with E or C stand for errors
a

Ajay Pilaniya

12/31/2022, 10:52 AM
Okay let me restart ingestion and monitor
@Andrey Pechkurov I have been monitoring the logs since the ingestion restart and everything seems to be fine. But I am observing a weird behaviour: suddenly database ingestion performance is dropping. I am logging insertion times in my application; attaching some logs below. As you can see, even at 10k-20k rows it is taking more than the usual time. Even with default IOPS and throughput this seems unusually high
INFO 2022-12-31 19:15:39,940 [analytics-job-thread-6] [] [AbstractQuestDao.java:84] ILP :: Inserted total records : 8903 in table : cdn_logs_ENTERR10, timeTaken : 1 minute 21 seconds
INFO 2022-12-31 19:16:17,542 [analytics-job-thread-2] [] [AbstractQuestDao.java:84] ILP :: Inserted total records : 13159 in table : cdn_logs_ENTERR10, timeTaken : 1 minute 58 seconds
INFO 2022-12-31 19:16:17,557 [analytics-job-thread-1] [] [AbstractQuestDao.java:84] ILP :: Inserted total records : 11216 in table : cdn_logs_ENTERR10, timeTaken : 1 minute 55 seconds
INFO 2022-12-31 19:16:17,588 [analytics-job-thread-9] [] [AbstractQuestDao.java:84] ILP :: Inserted total records : 10190 in table : cdn_logs_ENTERR10, timeTaken : 1 minute 59 seconds
INFO 2022-12-31 19:16:17,912 [analytics-job-thread-5] [] [AbstractQuestDao.java:84] ILP :: Inserted total records : 13614 in table : cdn_logs_ENTERR10, timeTaken : 2 minutes 37 seconds
INFO 2022-12-31 19:16:18,042 [analytics-job-thread-7] [] [AbstractQuestDao.java:84] ILP :: Inserted total records : 10729 in table : cdn_logs_ENTERR10, timeTaken : 1 minute 56 seconds
INFO 2022-12-31 19:17:23,715 [analytics-job-thread-1] [] [AbstractQuestDao.java:84] ILP :: Inserted total records : 1228 in table : cdn_logs_ENTERR10, timeTaken : 1 minute 5 seconds
INFO 2022-12-31 19:17:23,731 [analytics-job-thread-8] [] [AbstractQuestDao.java:84] ILP :: Inserted total records : 12250 in table : cdn_logs_ENTERR10, timeTaken : 3 minutes 1 second
INFO 2022-12-31 19:17:23,917 [analytics-job-thread-2] [] [AbstractQuestDao.java:84] ILP :: Inserted total records : 1657 in table : cdn_logs_ENTERR10, timeTaken : 1 minute 6 seconds
INFO 2022-12-31 19:17:24,031 [analytics-job-thread-7] [] [AbstractQuestDao.java:84] ILP :: Inserted total records : 3119 in table : cdn_logs_ENTERR10, timeTaken : 1 minute 5 seconds
INFO 2022-12-31 19:17:24,115 [analytics-job-thread-5] [] [AbstractQuestDao.java:84] ILP :: Inserted total records : 3509 in table : cdn_logs_ENTERR10, timeTaken : 1 minute 5 seconds
INFO 2022-12-31 19:17:24,234 [analytics-job-thread-3] [] [AbstractQuestDao.java:84] ILP :: Inserted total records : 11325 in table : cdn_logs_ENTERR10, timeTaken : 3 minutes 2 seconds
a

Andrey Pechkurov

12/31/2022, 1:55 PM
Might it be the case that those batches contain rows that belong to many partitions (hours) while the "good" ones only include a few hours?
a

Ajay Pilaniya

12/31/2022, 1:56 PM
@Andrey Pechkurov Each batch only contains rows from same hour. So basically all these batches in logs contain rows in same hour.
For example :
ILP :: Inserted total records : 10190 in table : cdn_logs_ENTERR10, timeTaken : 1 minute 59 seconds (SO THIS WILL HAVE ROWS OF ANY PARTICULAR HOUR, 1PM to 2PM for example)
ILP :: Inserted total records : 10190 in table : cdn_logs_ENTERR10, timeTaken : 2 minute 9 seconds (Same as above, THIS WILL HAVE ROWS OF ANY PARTICULAR HOUR, 2PM to 3PM for example)
a

Andrey Pechkurov

12/31/2022, 1:58 PM
That's good. What's the memory consumption of the database process? Did you decrease xms and xmx settings down from 30GB?
One more thing to check is the network and your client application: 3.5K rows is nothing data-wise, so 65 seconds for such a batch might mean that either the network is slow or there is a bottleneck in your Java application
a

Ajay Pilaniya

12/31/2022, 2:03 PM
Did you decrease xms and xmx settings down from 30GB?
Not yet.
What’s the memory consumption of the database process?
It is showing 98% which doesn’t seem right to me
It might be the case that either the network is slow or there is a bottleneck in your Java application
Could be but earlier it was giving good performance and out of nowhere performance started dropping
@Andrey Pechkurov Currently i have these settings -
JAVA_OPTS="
    -D$QDB_PROCESS_LABEL
    -ea -Dnoebug
    -XX:ErrorFile=${QDB_ROOT}/db/hs_err_pid+%p.log
    -XX:+UnlockExperimentalVMOptions
    -XX:+AlwaysPreTouch
    -XX:+UseParallelGC
    -Xms28g 
    -Xmx28g
    "
a

Andrey Pechkurov

12/31/2022, 2:08 PM
You should reduce xms and xmx down to 3g or so
Otherwise the OS won't have enough memory for the page cache and may have to flush it too frequently
a

Ajay Pilaniya

12/31/2022, 2:15 PM
Okay I have set it to 3gb and restarted the db and ingestion. Let’s see if it makes any change
@Andrey Pechkurov So far everything seems to be working fine. Ingestion has been running successfully for the past four hours and hopefully there are no more hurdles 🤞 One more question: I attached a 500 GB EBS volume, and currently we have backfilled 10% of the data, which took around 250GB of storage. So I am assuming around 3TB of space is required for all the data to backfill. Please also note that this backfilled data will be used to calculate further analysis, which will take only 5% of the original data. So once backfilled, the calculated analytics will take around 150GB only and I can drop this 3TB of data. But my question is: with this much data, will performance suffer? Ideally all partitions are managed separately, so I assume this data will not affect performance? To calculate the further analysis I will be querying in 24-hour time ranges: current data flows in every second and I aggregate it per content on a daily basis. My main concern is regarding aggregation queries - aggregating over so many rows can time out queries on most databases.
a

Andrey Pechkurov

12/31/2022, 5:18 PM
Glad to hear that. As for 3TB, you may need to increase the file and mmap limits even higher. Other than that, if you don't actively query older partitions, they don't consume memory and OS resources.
a

Ajay Pilaniya

12/31/2022, 5:20 PM
As for 3TB, you may need to increase the file and mmap limits even higher.
Can you suggest optimal settings for these?
Other than that, if you don’t actively query older partitions, they don’t consume memory and OS resources.
I will be querying this data only one time to calculate aggregated analytics. After that I can drop this data.
a

Andrey Pechkurov

12/31/2022, 5:21 PM
We have a user who also has big volume of data and here are their setting:
fs.nr_open=134217728
fs.file-max=134217728
vm.max_map_count=134217728
a

Ajay Pilaniya

12/31/2022, 5:22 PM
Ok.. will change these settings when extending storage
And what would be the implications of these settings even if I keep them after dropping data and reducing the storage back to 500GB?
a

Andrey Pechkurov

12/31/2022, 5:24 PM
There should be no implications. OS reserves the memory on-demand, so there should be no memory overhead
Happy New Year!
Sounds like the disk (EBS volume) is still a bottleneck. Check disk metrics with something like iotop or iostat and, if you observe high IOPS and throughput values, you should increase the volume settings.
It's better to watch the stats over a minute or so. They'll be going up when the OS tries to flush the page cache. If they're high most of the time, it's an indication of the disk being the bottleneck
a

Ajay Pilaniya

01/02/2023, 1:57 PM
@Andrey Pechkurov I am running a query which is failing and found the logs below on the server - I think the JVM is running out of memory. The table in question only has 181k records
2023-01-02T13:54:24.486790Z I i.q.c.h.p.JsonQueryProcessorState [69] exec [q='SELECT ts, RequestContentId as ContentId, RequestUserId as UserId, count_distinct(RequestId) as TotalRequests FROM 'request_analytics_ENTERR10'']
2023-01-02T13:54:24.486805Z I i.q.c.h.p.QueryCache miss [thread=questdb-shared-7, sql=SELECT ts, RequestContentId as ContentId, RequestUserId as UserId, count_distinct(RequestId) as TotalRequests FROM 'request_analytics_ENTERR10']
2023-01-02T13:54:24.487023Z I i.q.g.SqlCompiler plan [q=`select-group-by ts, ContentId, UserId, count_distinct(RequestId) TotalRequests from (select-choose [ts, RequestContentId ContentId, RequestUserId UserId, RequestId] ts, RequestContentId ContentId, RequestUserId UserId, RequestId from (select [ts, RequestContentId, RequestUserId, RequestId] from request_analytics_ENTERR10 timestamp (ts)))`, fd=69]
2023-01-02T13:54:24.488273Z I i.q.c.h.p.JsonQueryProcessorState [69] execute-new [skip: 0, stop: 1000]
2023-01-02T13:54:25.131864Z I i.q.c.p.PGConnectionContext parse [fd=34, q=BEGIN]
2023-01-02T13:54:25.131890Z I i.q.c.p.PGConnectionContext parse [fd=34, q=SELECT 1]
2023-01-02T13:54:25.131892Z I i.q.c.p.PGConnectionContext query cache used [fd=34]
2023-01-02T13:54:25.132416Z I i.q.c.p.PGConnectionContext parse [fd=34, q=BEGIN]
2023-01-02T13:54:25.132419Z I i.q.c.p.PGConnectionContext parse [fd=34, q=SELECT 1]
2023-01-02T13:54:25.132420Z I i.q.c.p.PGConnectionContext query cache used [fd=34]
2023-01-02T13:54:25.374327Z I i.q.c.p.PGConnectionContext parse [fd=34, q=BEGIN]
2023-01-02T13:54:25.374353Z I i.q.c.p.PGConnectionContext parse [fd=34, q=SELECT 1]
2023-01-02T13:54:25.374354Z I i.q.c.p.PGConnectionContext query cache used [fd=34]
2023-01-02T13:54:25.746249Z C i.q.c.h.p.JsonQueryProcessorState [69] Uh-oh. Error!
java.lang.OutOfMemoryError: Java heap space
a

Andrey Pechkurov

01/02/2023, 1:59 PM
java.lang.OutOfMemoryError: Java heap space
Seems that you have lots of RequestId values
a

Ajay Pilaniya

01/02/2023, 2:00 PM
Yes… All 181k rows have different request ids in this table
a

Andrey Pechkurov

01/02/2023, 2:01 PM
The JVM's heap isn't enough to fit them. You could either try increasing xms and xmx or try to split your query into multiple ones, if that's possible
BTW are you sure you need to group by the ts column?
a

Ajay Pilaniya

01/02/2023, 2:07 PM
I am storing data of each request in this table day-wise. So all these requests currently have a ts value of 2023-01-01T00:00:00.000000Z. Tomorrow there will be more data with 2023-01-02T00:00:00.000000Z, so I need to group by ts as I don't want to mix up data of two dates. Also, since I need at least a full day's data at a time, I am unable to break this query down further.
Even at 16GB xmx this query is failing
@Andrey Pechkurov I am also performing similar heavy queries but there I don't see this error. Could this be an issue due to having the same timestamp in all 181k rows?
Also should I try HOURLY partition for this?
a

Andrey Pechkurov

01/02/2023, 2:29 PM
The thing is that the database has to store 181k rows per each of the returned rows, i.e. each unique ts, ContentId, UserId tuple
You should split your query into multiple ones:
SELECT ts, RequestContentId as ContentId, RequestUserId as UserId, count_distinct(RequestId) as TotalRequests FROM 'request_analytics_ENTERR10' where ts in '2022-01'
SELECT ts, RequestContentId as ContentId, RequestUserId as UserId, count_distinct(RequestId) as TotalRequests FROM 'request_analytics_ENTERR10' where ts in '2022-02'
...
Note the ts filter: I've used months just as an example here
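A sketch of driving those split queries from Java over JDBC, iterating over smaller (here, day-sized) ts intervals instead of scanning everything at once; the date range and connection details are placeholder assumptions:
```
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.time.LocalDate;

public class SplitCountDistinct {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:8812/qdb", "admin", "quest");
             Statement st = conn.createStatement()) {
            // one query per day keeps the count_distinct state small
            for (LocalDate day = LocalDate.of(2023, 1, 1);
                 !day.isAfter(LocalDate.of(2023, 1, 31));
                 day = day.plusDays(1)) {
                String sql = "SELECT ts, RequestContentId as ContentId, RequestUserId as UserId, "
                        + "count_distinct(RequestId) as TotalRequests "
                        + "FROM 'request_analytics_ENTERR10' WHERE ts in '" + day + "'";
                try (ResultSet rs = st.executeQuery(sql)) {
                    while (rs.next()) {
                        // aggregate or persist the per-day results here
                    }
                }
            }
        }
    }
}
```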
a

Ajay Pilaniya

01/02/2023, 2:34 PM
You should split your query into multiple ones:
Currently I have only one day of data, i.e. 2023-01. So even if I use a filter it will still need to store all rows, as currently all data belongs to 2023-01
The thing is that the database has to store 181k rows per each of the returned rows, i.e. each unique ts, ContentId, UserId tuple
Even if I use LIMIT 0,10 it still fails. Shouldn’t pagination help here?
a

Andrey Pechkurov

01/02/2023, 2:37 PM
No, pagination won't help since the database has to materialize full result set before applying the limit
Currently I have only one day data i.e 2023-01
You could use hour filters then
In general, in this particular case the database could be smarter and it could release the memory as soon as it iterates over the next ts value. Feel free to open a GH issue and describe the problem if such group by ts + count_distinct queries are important for you. Group by ts is important since that's the reason why there are so many rows in the result set
All request which landed today are floored to start of the day
Is it on purpose?
a

Ajay Pilaniya

01/02/2023, 2:43 PM
And I am confused here: if I do a SELECT * query, all data is returned without any issue. But when I do the distinct query it runs out of memory
Is it on purpose?
Yes, because for each day we aggregate this data. At the end of any day we calculate data of past day and that’s why we need to query on whole DAY partition
a

Andrey Pechkurov

01/02/2023, 2:48 PM
How many unique RequestUserId, RequestContentId do you have per day?
a

Ajay Pilaniya

01/02/2023, 2:50 PM
200k request ids maximum and 50k user id max
a

Andrey Pechkurov

01/02/2023, 3:10 PM
so, 200K request ids x 50K user ids? that's 10B
a

Ajay Pilaniya

01/02/2023, 3:30 PM
so, 200K request ids x 50K user ids? that’s 10B
No, basically whenever we get a hit on a content it generates a unique request id. A user can generate multiple requests on multiple contents. So basically let's say UserA generates two requests on ContentA, and UserB generates four requests on ContentB. Then I will have data like this
Request1, UserA, ContentA
Request2, UserA, ContentA
Request3, UserB, ContentB
Request4, UserB, ContentB
Request5, UserB, ContentB
Request6, UserB, ContentB
As you can see, RequestId will have unique values but UserId and ContentId will have repetitive values. So my query is only trying to find the distinct request count per User and Content combination, like below -
UserA, ContentA - Distinct Requests = 2
UserB, ContentB - Distinct Requests = 4
So only two rows are generated. The rows generated by this query will be the unique pairs of UserId + ContentId, which will definitely be fewer than the total rows in this table. So for a 181k-row table this combination count will be less than 100k @Andrey Pechkurov
a

Andrey Pechkurov

01/02/2023, 3:34 PM
But how many combinations do you currently have per day? I don't think it's 181k
a

Ajay Pilaniya

01/02/2023, 3:40 PM
But how many combinations do you currently have per day? I don’t think it’s 181k
It can never go beyond the unique count of request ids, which is 180k
a

Andrey Pechkurov

01/03/2023, 7:01 AM
Could you try to count distinct values in your table? Say with a query like this one
select count() from (SELECT distinct RequestContentId, RequestUserId, RequestId FROM 'request_analytics_ENTERR10')
It still sounds like you have lots of distinct values and it's necessary to understand the data volume before suggesting any optimizations
a

Ajay Pilaniya

01/03/2023, 8:42 AM
Could you try to count distinct values in your table? Say with a query like this one
This query returns 181k
a

Andrey Pechkurov

01/03/2023, 12:08 PM
Thanks. Let me try to reproduce your table locally and see how much memory the query needs
I've created a PR that should fix the issue for you: https://github.com/questdb/questdb/pull/2897
I can build a Linux artifact for you, so you could test it without having to wait for the next release. That's, of course, if you're using one of our binary artifacts. WDYT?
a

Ajay Pilaniya

01/03/2023, 3:27 PM
Currently we have changed the query to avoid this, but since our data will increase day by day I will monitor whether I face this issue again. In that case we can try the binary artifacts; otherwise I think we are good for now.
a

Andrey Pechkurov

01/03/2023, 3:32 PM
Good to know. The fix should be included in the next release.
a

Ajay Pilaniya

01/03/2023, 3:33 PM
Okay, got it. This is a great help for us, as our API layer relies heavily on count_distinct queries. Thank you 🙂
@Andrey Pechkurov Also got this error recently. What does this mean?
2023-01-04T05:18:15.043223Z E i.q.c.l.t.LineTcpConnectionContext [261610] could not process line data [table=cdn_logs_RAJTV, msg=could not mmap  [size=91392, offset=0, fd=261886, memUsed=181485689701, fileLen=1048576], errno=12]
a

Ajay Pilaniya

01/04/2023, 7:00 AM
@Andrey Pechkurov How would I decide optimal setting for max map count?
a

Andrey Pechkurov

01/04/2023, 7:01 AM
What's your current setting?
Setting it to a value higher than necessary won't have any significant downsides, so it's better to err on the high side
a

Ajay Pilaniya

01/04/2023, 7:02 AM
Current setting : 262144
I have updated the limit to a higher number. Do I need to restart the database process?
a

Andrey Pechkurov

01/04/2023, 7:10 AM
Let's set it to a 512x value:
134217728
. Also, please update the max open file limit to the same value
Do I need to restart database process?
Yes, you need to restart it
a

Ajay Pilaniya

01/04/2023, 7:11 AM
Let’s set it to a 512x value:
134217728
. Also, please update the max open file limit to the same value
Updated both.. Restarting the process now…
@Andrey Pechkurov We had increased throughput to 500 and IOPS to 8000 six hours ago, but we are still seeing some latency during ingestion, and it happens periodically when we try to backfill data. Today we didn’t even write very heavy batches (just a few 100k rows), but ingestion performance still dropped
a

Andrey Pechkurov

01/04/2023, 11:53 AM
Could you collect server logs around this slow batch, as well as the disk metrics?
O3 commits on
cdn_logs_ENTERR10
table are pretty consistent - they take around 16-20 seconds:
2023-01-04T12:37:14.619642Z I i.q.c.TableWriter o3 commit [table=cdn_logs_ENTERR10, maxUncommittedRows=500000, o3TimestampMin=2023-01-03T18:37:08.000000Z, o3TimestampMax=2023-01-04T12:28:16.000000Z, o3MaxLagUs=600000000, o3EffectiveLagUs=600000000, lagError=63670000000, o3SpreadUs=64268000000, lagThresholdTimestamp=2023-01-04T12:18:16.000000Z, o3LagRowCount=161304, srcOooMax=45391, o3RowCount=206695]
2023-01-04T12:37:14.754463Z I i.q.c.TableWriter sorted [table=cdn_logs_ENTERR10, o3RowCount=206695]
2023-01-04T12:37:14.754484Z I i.q.c.TableWriter o3 partition task [table=cdn_logs_ENTERR10, srcOooLo=0, srcOooHi=6562, srcOooMax=45391, o3RowCount=206695, o3LagRowCount=161304, srcDataMax=2062592, o3TimestampMin=2023-01-03T18:37:08.000000Z, o3Timestamp=2023-01-03T18:37:08.000000Z, o3TimestampMax=2023-01-04T07:56:09.000000Z, partitionTimestamp=2023-01-03T18:00:00.000000Z, partitionIndex=264, partitionSize=2069155, maxTimestamp=2023-01-04T12:18:18.000000Z, last=false, append=false, pCount=1, flattenTimestamp=true, memUsed=65644850618]
2023-01-04T12:37:14.754742Z I i.q.c.TableWriter o3 partition task [table=cdn_logs_ENTERR10, srcOooLo=6563, srcOooHi=28056, srcOooMax=45391, o3RowCount=206695, o3LagRowCount=161304, srcDataMax=1078712, o3TimestampMin=2023-01-03T18:37:08.000000Z, o3Timestamp=2023-01-04T05:12:38.000000Z, o3TimestampMax=2023-01-04T07:56:09.000000Z, partitionTimestamp=2023-01-04T05:00:00.000000Z, partitionIndex=308, partitionSize=1100206, maxTimestamp=2023-01-04T12:18:18.000000Z, last=false, append=false, pCount=2, flattenTimestamp=false, memUsed=65644850618]
2023-01-04T12:37:14.754755Z I i.q.c.TableWriter o3 partition task [table=cdn_logs_ENTERR10, srcOooLo=28057, srcOooHi=45390, srcOooMax=45391, o3RowCount=206695, o3LagRowCount=161304, srcDataMax=891340, o3TimestampMin=2023-01-03T18:37:08.000000Z, o3Timestamp=2023-01-04T07:54:24.000000Z, o3TimestampMax=2023-01-04T07:56:09.000000Z, partitionTimestamp=2023-01-04T07:00:00.000000Z, partitionIndex=316, partitionSize=908674, maxTimestamp=2023-01-04T12:18:18.000000Z, last=false, append=false, pCount=3, flattenTimestamp=false, memUsed=65644850618]
2023-01-04T12:37:31.526618Z I i.q.c.TableWriter merged partition [table=`cdn_logs_ENTERR10`, ts=2023-01-03T18:00:00.000000Z, txn=3448]
2023-01-04T12:37:31.526629Z I i.q.c.TableWriter merged partition [table=`cdn_logs_ENTERR10`, ts=2023-01-04T05:00:00.000000Z, txn=3448]
2023-01-04T12:37:31.526631Z I i.q.c.TableWriter merged partition [table=`cdn_logs_ENTERR10`, ts=2023-01-04T07:00:00.000000Z, txn=3448]
2023-01-04T12:37:31.688634Z I i.q.c.TableWriter purged [path=/root/.questdb/db/cdn_logs_ENTERR10/2023-01-03T18.3447]
2023-01-04T12:37:31.777367Z I i.q.c.TableWriter purged [path=/root/.questdb/db/cdn_logs_ENTERR10/2023-01-04T05.3447]
2023-01-04T12:37:31.839572Z I i.q.c.TableWriter purged [path=/root/.questdb/db/cdn_logs_ENTERR10/2023-01-04T07.3447]

2023-01-04T12:37:32.341342Z I i.q.c.TableWriter o3 commit [table=cdn_logs_ENTERR10, maxUncommittedRows=500000, o3TimestampMin=2023-01-03T18:37:44.000000Z, o3TimestampMax=2023-01-04T12:28:16.000000Z, o3MaxLagUs=600000000, o3EffectiveLagUs=600000000, lagError=63634000000, o3SpreadUs=64232000000, lagThresholdTimestamp=2023-01-04T12:18:16.000000Z, o3LagRowCount=161304, srcOooMax=51701, o3RowCount=213005]
2023-01-04T12:37:32.390143Z I i.q.c.TableWriter sorted [table=cdn_logs_ENTERR10, o3RowCount=213005]
2023-01-04T12:37:32.390154Z I i.q.c.TableWriter o3 partition task [table=cdn_logs_ENTERR10, srcOooLo=0, srcOooHi=1248, srcOooMax=51701, o3RowCount=213005, o3LagRowCount=161304, srcDataMax=2069155, o3TimestampMin=2023-01-03T18:37:44.000000Z, o3Timestamp=2023-01-03T18:37:44.000000Z, o3TimestampMax=2023-01-04T07:56:24.000000Z, partitionTimestamp=2023-01-03T18:00:00.000000Z, partitionIndex=264, partitionSize=2070404, maxTimestamp=2023-01-04T12:18:18.000000Z, last=false, append=false, pCount=1, flattenTimestamp=true, memUsed=65989374589]
2023-01-04T12:37:32.390244Z I i.q.c.TableWriter o3 partition task [table=cdn_logs_ENTERR10, srcOooLo=1249, srcOooHi=25825, srcOooMax=51701, o3RowCount=213005, o3LagRowCount=161304, srcDataMax=1100206, o3TimestampMin=2023-01-03T18:37:44.000000Z, o3Timestamp=2023-01-04T05:12:41.000000Z, o3TimestampMax=2023-01-04T07:56:24.000000Z, partitionTimestamp=2023-01-04T05:00:00.000000Z, partitionIndex=308, partitionSize=1124783, maxTimestamp=2023-01-04T12:18:18.000000Z, last=false, append=false, pCount=2, flattenTimestamp=false, memUsed=65989374589]
2023-01-04T12:37:32.390248Z I i.q.c.TableWriter o3 partition task [table=cdn_logs_ENTERR10, srcOooLo=25826, srcOooHi=51700, srcOooMax=51701, o3RowCount=213005, o3LagRowCount=161304, srcDataMax=908674, o3TimestampMin=2023-01-03T18:37:44.000000Z, o3Timestamp=2023-01-04T07:54:36.000000Z, o3TimestampMax=2023-01-04T07:56:24.000000Z, partitionTimestamp=2023-01-04T07:00:00.000000Z, partitionIndex=316, partitionSize=934549, maxTimestamp=2023-01-04T12:18:18.000000Z, last=false, append=false, pCount=3, flattenTimestamp=false, memUsed=65989374589]
2023-01-04T12:37:48.002811Z I i.q.c.TableWriter merged partition [table=`cdn_logs_ENTERR10`, ts=2023-01-03T18:00:00.000000Z, txn=3449]
2023-01-04T12:37:48.002823Z I i.q.c.TableWriter merged partition [table=`cdn_logs_ENTERR10`, ts=2023-01-04T05:00:00.000000Z, txn=3449]
2023-01-04T12:37:48.002825Z I i.q.c.TableWriter merged partition [table=`cdn_logs_ENTERR10`, ts=2023-01-04T07:00:00.000000Z, txn=3449]
2023-01-04T12:37:48.024692Z I i.q.c.TableWriter purged [path=/root/.questdb/db/cdn_logs_ENTERR10/2023-01-03T18.3448]
2023-01-04T12:37:48.025186Z I i.q.c.TableWriter purged [path=/root/.questdb/db/cdn_logs_ENTERR10/2023-01-04T05.3448]
2023-01-04T12:37:48.025720Z I i.q.c.TableWriter purged [path=/root/.questdb/db/cdn_logs_ENTERR10/2023-01-04T07.3448]

2023-01-04T12:37:48.529259Z I i.q.c.TableWriter o3 commit [table=cdn_logs_ENTERR10, maxUncommittedRows=500000, o3TimestampMin=2023-01-04T05:16:41.000000Z, o3TimestampMax=2023-01-04T12:28:16.000000Z, o3MaxLagUs=600000000, o3EffectiveLagUs=600000000, lagError=25297000000, o3SpreadUs=25895000000, lagThresholdTimestamp=2023-01-04T12:18:16.000000Z, o3LagRowCount=161304, srcOooMax=28759, o3RowCount=190063]
2023-01-04T12:37:48.588145Z I i.q.c.TableWriter sorted [table=cdn_logs_ENTERR10, o3RowCount=190063]
2023-01-04T12:37:48.588165Z I i.q.c.TableWriter o3 partition task [table=cdn_logs_ENTERR10, srcOooLo=0, srcOooHi=12105, srcOooMax=28759, o3RowCount=190063, o3LagRowCount=161304, srcDataMax=1124783, o3TimestampMin=2023-01-04T05:16:41.000000Z, o3Timestamp=2023-01-04T05:16:41.000000Z, o3TimestampMax=2023-01-04T08:42:53.000000Z, partitionTimestamp=2023-01-04T05:00:00.000000Z, partitionIndex=308, partitionSize=1136889, maxTimestamp=2023-01-04T12:18:18.000000Z, last=false, append=false, pCount=1, flattenTimestamp=true, memUsed=65757715604]
2023-01-04T12:37:48.588259Z I i.q.c.TableWriter o3 partition task [table=cdn_logs_ENTERR10, srcOooLo=12106, srcOooHi=26088, srcOooMax=28759, o3RowCount=190063, o3LagRowCount=161304, srcDataMax=934549, o3TimestampMin=2023-01-04T05:16:41.000000Z, o3Timestamp=2023-01-04T07:54:54.000000Z, o3TimestampMax=2023-01-04T08:42:53.000000Z, partitionTimestamp=2023-01-04T07:00:00.000000Z, partitionIndex=316, partitionSize=948532, maxTimestamp=2023-01-04T12:18:18.000000Z, last=false, append=false, pCount=2, flattenTimestamp=false, memUsed=65757715604]
2023-01-04T12:37:48.588265Z I i.q.c.TableWriter o3 partition task [table=cdn_logs_ENTERR10, srcOooLo=26089, srcOooHi=28758, srcOooMax=28759, o3RowCount=190063, o3LagRowCount=161304, srcDataMax=351048, o3TimestampMin=2023-01-04T05:16:41.000000Z, o3Timestamp=2023-01-04T08:39:05.000000Z, o3TimestampMax=2023-01-04T08:42:53.000000Z, partitionTimestamp=2023-01-04T08:00:00.000000Z, partitionIndex=320, partitionSize=353718, maxTimestamp=2023-01-04T12:18:18.000000Z, last=false, append=false, pCount=3, flattenTimestamp=false, memUsed=65757715604]
2023-01-04T12:37:57.686194Z I i.q.c.TableWriter merged partition [table=`cdn_logs_ENTERR10`, ts=2023-01-04T05:00:00.000000Z, txn=3450]
2023-01-04T12:37:57.686206Z I i.q.c.TableWriter merged partition [table=`cdn_logs_ENTERR10`, ts=2023-01-04T07:00:00.000000Z, txn=3450]
2023-01-04T12:37:57.686207Z I i.q.c.TableWriter merged partition [table=`cdn_logs_ENTERR10`, ts=2023-01-04T08:00:00.000000Z, txn=3450]
2023-01-04T12:37:57.700192Z I i.q.c.TableWriter purged [path=/root/.questdb/db/cdn_logs_ENTERR10/2023-01-04T05.3449]
2023-01-04T12:37:57.700526Z I i.q.c.TableWriter purged [path=/root/.questdb/db/cdn_logs_ENTERR10/2023-01-04T07.3449]
2023-01-04T12:37:57.700862Z I i.q.c.TableWriter purged [path=/root/.questdb/db/cdn_logs_ENTERR10/2023-01-04T08.2766]
What does
totalWriteCounts
column stand for in your table? The number of sent rows? Also, how do you measure
durationPretty
?
One more thing: could you try setting
cairo.o3.min.lag=120000
server config property and gather the logs once again? This setting should help with making commits less frequent. It should be set in
<qdb_root>/conf/server.conf
or via
QDB_CAIRO_O3_MIN_LAG=120000
environment variable
a

Ajay Pilaniya

01/04/2023, 4:15 PM
What does
totalWriteCounts
column stand for in your table? The number of sent rows? Also, how do you measure
durationPretty
?
totalWriteCounts
= total rows sent by my Java app; duration is the time elapsed between the first row and the last row of the batch. Sample code below:
long startTime = System.currentTimeMillis();
try (Sender sender = /* ... ILP sender built here ... */) {
    // sender writes the 1st row, 2nd row, ... for the whole batch
}
long endTime = System.currentTimeMillis();

long totalDuration = endTime - startTime;
a

Andrey Pechkurov

01/04/2023, 5:12 PM
Could you try setting `cairo.o3.min.lag`?
a

Ajay Pilaniya

01/04/2023, 5:15 PM
Yeah changing the setting
@Andrey Pechkurov We are getting the below error occasionally when ingesting data; what is the cause of this?
2023-01-06T04:19:24.222354Z E i.q.c.l.t.LineTcpConnectionContext [670] could not process line data [table=cdn_logs_ENTERR10, msg=Partition '2023-01-06T03.4978' does not exist in table 'cdn_logs_ENTERR10' directory. Run [ALTER TABLE cdn_logs_ENTERR10 DROP PARTITION LIST '2023-01-06T03.4978'] to repair the table or restore the partition directory., errno=0]
2023-01-06T04:19:24.222393Z E i.q.c.l.t.LineTcpConnectionContext [670] could not parse measurement, NONE at 1216, line (may be mangled due to partial parsing): 'cdn_logs_ENTERR10,ContentId=6332fb57abe8c9102e3719ee,RequestId=Unknown,JobId=E1KUOC0ENJS6P5.2022-12-22-09.add14fbb.gz,DistributionType=VOD,FileType=DASH_VIDEO_SEGMENT,Method=GET,CdnType=CloudFront,DistributionId=E1KUOC0ENJS6P5,StreamingPartnerId=Unknown,UserCity=Ranchi,UserCountry=India,UserDevice=Mobile,UserBrowser=-,UserDeviceBrand=Unknown,UserDeviceName=Unknown,UserId=Unknown,Agent=Special,AgentName=ENTERR10,UserOs=Unknown,OsClass=Mobile CdnRequestId="SUNA5XLg8VXgNZA07kivMD66-46EHW0EugzNHbDsnGeTiLOwPG2Frw==",UserAgent="enterr10/112(Mobile;Android30;realme;Realme%20RMX3231)ExoPlayerLib/2.16.0",Referer="",Domain="enterr10-streams-msp.streamready.in",Ip="2401:4900:3842:2b12:1:2:f580:4621",Extras="",FileName="chunk_ctvideo_ridp0va0br2500000_cs4680000_w1160222018_mpd.m4s",CacheStat="Hit",Resolution="854x480",EdgeLocation="MAA50-C2",FilePath="/ENTERR10/EPISODE/6332fb57abe8c9102e3719ee/dash/854x480/chunk_ctvideo_ridp0va0br2500000_cs4680000_w1160222018_mpd.m4s",AgentVersion="112",AgentVersionName="ENTERR10 112",OsVersionName="Unknown ??",OsVersion="??",isValid=t,SegmentCount=1i,ViewCount=0i,FileSize=351796i,StatusCode=200i,ResponseTime=0.001,ua=1672998564216000t,ca=1672998564216000t 1671702572000000000'
a

Andrey Pechkurov

01/06/2023, 6:20 AM
Do you drop partitions on
cdn_logs_ENTERR10
table before this error occurs? If so, it's probably the following bug. We're going to release the fix soon, but before that the workaround is to restart the database https://github.com/questdb/questdb/pull/2864
a

Ajay Pilaniya

01/06/2023, 6:26 AM
@Andrey Pechkurov I am not dropping any partitions by the way. Also this occurs randomly. So far I have seen it three times in two days
a

Andrey Pechkurov

01/06/2023, 6:48 AM
Could you share the logs before the above lines? I wonder if they contain more errors
a

Ajay Pilaniya

01/06/2023, 6:54 AM
@Andrey Pechkurov No other errors. This is the only error log in whole file
a

Andrey Pechkurov

01/06/2023, 6:57 AM
Did you try executing the suggested command? It could be the case that the partition wasn't created due to the mmapped area limit error you faced before, and now every time you try to insert data into that partition you get this error
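For reference, the repair command suggested in that error message is:
ALTER TABLE cdn_logs_ENTERR10 DROP PARTITION LIST '2023-01-06T03.4978';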
a

Ajay Pilaniya

01/06/2023, 7:04 AM
Ok let me try that
a

Alex Pelagenko

01/06/2023, 9:35 AM
It could be connected to the bug that was fixed, even without dropping partitions. A full log for the few hours around the time it happened would help us be more certain
a

Ajay Pilaniya

01/07/2023, 5:02 AM
@Andrey Pechkurov I am facing a weird situation during query execution. Query 1 -
SELECT count_distinct(UserId) FROM (SELECT timestamp_floor('d', ts) as ts, ContentId, UserId, UserDevice, StreamingPartnerId, UserDeviceBrand, RequestId, UserBrowser, UserOs, UserCountry, UserCity, sum(SegmentCount) as SegmentCount, sum(FileSize) as TotalFileSize FROM 'cdn_logs_ENTERR10' where ts IN '2023-01-01');
This query returns a distinct user count of 20k. Query 2 -
SELECT timestamp_floor('d', ts) as ts, ContentId, UserId, UserDevice, StreamingPartnerId, UserDeviceBrand, RequestId, UserBrowser, UserOs, UserCountry, UserCity, sum(SegmentCount) as SegmentCount, sum(FileSize) as TotalFileSize FROM 'cdn_logs_ENTERR10' where ts IN '2023-01-01' ORDER BY SegmentCount DESC LIMIT 0,20000
Using Query 2 I insert this data into a second table. I am using pagination to fetch the data in batches of 20k rows each time. Query 1 is returning 321k rows in total, while using pagination I am getting only 317k rows. That’s the first gap. Secondly, the data inserted using Query 2 only returns a 4k distinct user count. I feel like pagination is not working properly here and might return duplicate rows, due to which not all user IDs are present when using Query 2 with pagination.
Just ingested the data using Query 2 in a single batch (without the LIMIT clause) and now the data is fine. What could be the reason for incorrect data during pagination?
a

Andrey Pechkurov

01/07/2023, 7:46 AM
Hi Ajay, The limit in query 2 applies to the group by data - that's because you have aggregation functions (
sum(SegmentCount)
and
sum(FileSize)
) in the select. Did you insert all pages or only the first one (
LIMIT 0,20000
)? A side note: group by queries are materialized fully and only then LIMIT applies. So, from the performance standpoint the most efficient way to insert the data is to do something like this:
INSERT INTO my_new_table 
SELECT timestamp_floor('d', ts) as ts, ContentId, UserId, UserDevice, StreamingPartnerId, UserDeviceBrand, RequestId, UserBrowser, UserOs, UserCountry, UserCity, sum(SegmentCount) as SegmentCount, sum(FileSize) as TotalFileSize FROM 'cdn_logs_ENTERR10' where ts IN '2023-01-01' ORDER BY SegmentCount DESC; -- note: you may want to sort the data via timestamps if my_new_table has a designated timestamp
a

Ajay Pilaniya

01/07/2023, 11:23 PM
Did you insert all pages or only the first one (
LIMIT 0,20000
)?
I inserted all pages BTW but somehow results were not correct
a

Andrey Pechkurov

01/09/2023, 7:35 AM
I tried to reproduce this on 6.6.1 with no success. Would be great if you could help me with a reproducer
a

Ajay Pilaniya

01/09/2023, 7:43 AM
Strange. Let me try to reproduce it with my data and report back
@Andrey Pechkurov Can you also share the binary with the fix you made for the out-of-memory issue when using the distinct count query? I am facing a similar issue on a 4-million-row table
a

Andrey Pechkurov

01/09/2023, 4:36 PM
Here is the latest no-jre snapshot
a

Ajay Pilaniya

01/10/2023, 7:15 AM
@Andrey Pechkurov Thank you.. Just updated and now it’s working fine 🙂
a

Andrey Pechkurov

01/10/2023, 7:18 AM
Glad to hear that Ajay 👍