
WCKD

10/01/2022, 8:07 PM
Hello, guys! Quick question: can you limit the number of rows fetched per batch after cursor.execute(big_query), similar to what cursor.itersize = value does for a PostgreSQL server-side cursor? Is there a setting for this in the server.conf file? For example, if I run a big query returning 300M rows and loop over it with for row in cursor, I would like to be able to set mini-batches of 100M rows, so that I can loop through the first 100M in RAM, then through the next batch of 100M, and so on until the entire result is consumed. At the moment it seems like all 300M rows are loaded into RAM at once, which my machine can't handle, so I was wondering if there's a way to cap the maximum number of rows per batch, just like setting the itersize value on a PostgreSQL server-side cursor.

Jaromir Hamala

10/01/2022, 10:24 PM
Hello, this should be handled by your Postgres driver. For example, the Postgres JDBC driver exposes a fetchSize property which does exactly what you want: it splits the overall result into mini-batches and fetches these mini-batches one at a time. In fact, this is part of the Postgres wire protocol. What client/driver do you use?

WCKD

10/02/2022, 7:30 AM
Hello, @Jaromir Hamala! Thanks for replying! I'm using the psycopg2 package in Python and I'm aware of the fetchall(), fetchmany(x) and fetchone() methods, but I just want to loop over the cursor without using any of those. It turns out that psycopg2 fetches the entire result set to the client by default, whether you use fetchall(), fetchmany(x) or fetchone(), unless you use a named cursor (a server-side cursor). With a named cursor the result set is maintained on the server, allowing you to fetch rows as necessary, and you can control the number of records fetched over the network in each round trip by setting the cursor.itersize property to a reasonable value (the default is 2000). However, QuestDB SQL does not support named cursors, as I have already tried them. That's why I asked if there is a possibility of setting this itersize-like value somewhere for the client-side cursor, without using an unsupported server-side cursor. Or, perhaps, you could add named cursors as a feature in QuestDB?
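This is essentially the named-cursor attempt that fails against QuestDB (a sketch; the connection string and query are placeholders, and the failure is simply the DECLARE ... CURSOR statement psycopg2 issues being rejected):

import psycopg2

# Placeholder DSN for QuestDB's pgwire endpoint (default port 8812).
conn = psycopg2.connect("postgresql://admin:quest@localhost:8812/qdb")
cur = conn.cursor(name="server_side")  # named => server-side cursor
cur.itersize = 100_000_000
# For a named cursor psycopg2 wraps the query in DECLARE ... CURSOR,
# which QuestDB does not accept, so this raises an error.
cur.execute("SELECT * FROM big_table")
conn.close()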
I created a feature request - #2570

Jaromir Hamala

10/03/2022, 7:20 AM
Hello @WCKD, thanks for documenting the feature request! I think exposing server-side cursors is likely non-trivial, especially when taking the REST interface into consideration, as it brings up questions like: when should a cursor be cleaned up? What if a client dies? This is relatively easy to answer with pgwire: a client disconnects -> you clean up all associated resources, including cursors. But it's harder with REST, as HTTP clients can create a new TCP connection for each request. I reckon introducing a property limiting the maximum batch size could be much simpler (= faster) to implement.

WCKD

10/03/2022, 7:28 AM
Hi, @Jaromir Hamala! Thanks for your reply! Indeed, I wasn't taking the REST API into consideration when I created the feature request, you're right. However, I don't know how you could build such a limiting feature, as the psycopg driver itself states that the only way to retrieve chunk-sized data from a big query is with a server-side cursor. Do you have any idea how to get around this and add a maximum batch/chunk size between QuestDB and the Postgres wire protocol?
I have literally tried every possible database system these past weeks and QuestDB is by far the fastest, nicest and easiest to use, congratulations on that. However, this batch-size issue keeps me from using it at the moment, as my query literally fills 24GB of RAM, haha. 🥲

Jaromir Hamala

10/03/2022, 7:32 AM
Thanks for the very nice words! Isn't fetchmany() a good enough workaround until an implicit fetch limit / server-side cursors are implemented?

WCKD

10/03/2022, 7:38 AM
Unfortunately, with a client-side cursor in Postgres, once the query is executed with cursor.execute() the entire result ends up in RAM no matter what, at least that's the case with the psycopg driver. It doesn't even matter if you then fetch the result with fetchone() row by row; the RAM still fills up, because the query execution itself transfers the whole result to the client. I have compared three scenarios: fetchone(), fetchmany(30M) (which fits in RAM perfectly), and just looping through the cursor with a for loop, and in every scenario the RAM gets filled up.
import psycopg as pg  # psycopg 3; CONNECTION is the pgwire connection string

def query_database(query, fetchmany, fetchone):
    with pg.connect(CONNECTION) as conn:
        with conn.cursor(row_factory=pg.rows.dict_row) as cursor:
            # The whole result set is buffered on the client at this point.
            cursor.execute(query)
            nr = 0
            if fetchone:
                # Fetch one row at a time.
                while cursor.fetchone() is not None:
                    nr += 1
            elif fetchmany:
                # Fetch in chunks of 30M rows.
                while True:
                    result = cursor.fetchmany(30_000_000)
                    if not result:
                        break
                    for row in result:
                        nr += 1
            else:
                # Plain iteration over the cursor.
                for row in cursor:
                    nr += 1
    print(nr)
I mean, the problem is that with a client-side cursor, however you loop through the query result, the entire result set is held in RAM and never released until the connection closes. 😑

javier ramirez

10/03/2022, 8:24 AM
A not-so-nice, but probably working, solution would be to write a function that paginates over the results using LIMIT bounds: https://questdb.io/docs/reference/sql/limit/
Notice the lower bound is exclusive but the upper is inclusive, so you paginate like this: LIMIT 0,1000 - LIMIT 1000,2000 - LIMIT 2000,3000…
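Something along these lines (a sketch with psycopg; the connection string, table name and page size are placeholders):

import psycopg as pg

PAGE_SIZE = 1_000_000  # placeholder page size, tune to the RAM you have

def paginate(connection_string, base_query):
    # Fetch the result of base_query page by page using LIMIT lo,hi
    # (lower bound exclusive, upper inclusive, as described above);
    # the loop stops when a page comes back empty.
    with pg.connect(connection_string) as conn:
        with conn.cursor() as cursor:
            lo = 0
            while True:
                cursor.execute(f"{base_query} LIMIT {lo},{lo + PAGE_SIZE}")
                rows = cursor.fetchall()
                if not rows:
                    break
                for row in rows:
                    yield row
                lo += PAGE_SIZE

# usage: for row in paginate(CONNECTION, "SELECT * FROM big_table"): ...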

WCKD

10/03/2022, 8:43 AM
I have thought about this and will give it a try. However, issuing a lot of queries instead of one query processed batch by batch feels like a waste of resources.

javier ramirez

10/03/2022, 8:47 AM
Agreed it is not ideal, but if your queries are light enough, this might work. Regarding wasting resources: since the ingestion and reader thread pools are separate, unless you have many concurrent queries, those reader threads are likely idle anyway.

WCKD

10/03/2022, 10:58 AM
Thank you so much, @javier ramirez, will definitely give it a try soon!
I'm back with feedback, @javier ramirez: your solution works as intended, thank you so much! Also, I have noticed that doing this via the REST API with the LIMIT trick is even faster than the Postgres cursors! Can you suggest some server.conf settings I could tune from the defaults so that the JSON responses are computed and returned even faster? Thank you again!
I spoke too soon, haha, they are pretty much equal. However, the batch size for the JSON has to be significantly smaller than the cursor one; I guess the JSON takes up a lot more space in RAM, check this out:
So, I was able to fill up my RAM with a 150M batch size for the cursor but only a 20M batch size for the REST API JSON. Even so, the times for querying and looping through 206M rows from the db are similar. I don't know how I could improve this further; I think this is the bottleneck of my hardware, haha.
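For reference, this is roughly how I page through the REST API (a sketch; the host, table name and page size are placeholders, and I just embed the LIMIT bounds in the query itself):

import requests

QUESTDB_HTTP = "http://localhost:9000/exec"  # default REST endpoint
PAGE_SIZE = 20_000_000  # the largest JSON page my RAM tolerates

def count_rows(base_query):
    total = 0
    lo = 0
    while True:
        sql = f"{base_query} LIMIT {lo},{lo + PAGE_SIZE}"
        resp = requests.get(QUESTDB_HTTP, params={"query": sql})
        resp.raise_for_status()
        rows = resp.json().get("dataset", [])  # rows come back in the "dataset" array
        if not rows:
            break
        total += len(rows)
        lo += PAGE_SIZE
    return total

print(count_rows("SELECT * FROM big_table"))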