Does anyone have experience with next.jdbc/plan + Postgres + vast results? I'm asking because it looks like JDBC driver for Postgres doesn't support lazy reading or pagination; instead, the whole data comes into memory. Here is how to reproduce. The plan:
(def -p (jdbc/plan {:host "127.0.0.1" :dbtype "postgres" :dbname "test" :port 5432 :user "test" :password "test"}
["select jsonb_build_object('x', x) from generate_series(1, 999999999) as foo(x)"]
))
Now I take only the first item:
(reduce (fn [acc row]
(println row)
(reduced 42) ;; or (/ 0 0) to throw
)
0
-p)
It hangs for a long time, and then:
Execution error (OutOfMemoryError) at org.postgresql.core.v3.QueryExecutorImpl/processResults
(QueryExecutorImpl.java:2390).
Java heap space
My guess is JDBC Postgres driver just doesn't support plan . Any positive experience?I just created a table in postgres with 10 million rows (an actual table, not a virtual one with select x from generate_series...). And run the code posted by @chris.klaou, changing the SQL query to read from the table I created. I run the code from a Leiningen REPL inside a Docker container (to have a rough number of the consumed amount of RAM, without having to use VisualVM or similar tools).
It read the 10 million rows and wrote the myfile.edn file to completion, with the Docker container never consuming more than 530 MB total. The actual number is not that important. The important thing is that once it reached the 520-530 MB number (in the first 5-10 seconds), it stayed there for the whole execution duration (between 85-90 seconds on my not-that-fast laptop).
By the way, I get the same results if I don't add the :read-only true option to the jdbc/get-connection call.
Just for reference:
β’ Clojure version: 1.12.3
β’ JDK version: OpenJDK 64-Bit Server VM Temurin-21.0.8+9 (build 21.0.8+9-LTS, mixed mode, sharing)
β’ next.jdbc version: com.github.seancorfield/next.jdbc "1.3.1070"
β’ Postgresql driver version: org.postgresql/postgresql "42.7.8"
Thank you, @iarenaza! I didn't post my updates because, if briefly, they're are the same. The memory consumption is too high and I don't understand why does it go up to half a gigabyte. Even if it doesn't trigger OOM Exception, that's still too high.
I have artificially restricted the amount of memory available both for the container, and for the JVM. I have restricted the container RAM size to 512 MB. And configured the JVM to not use more than 70% of the container memory. Launching the REPL (with Leiningen) and evaluating the namespace (except for the query itself, to have a base number for comparisons) consumes 200-205 MB of memory (according to docker-stats). I have waited a bit, to let the GC reclaim any unneeded memory, but the amount of consumed memory has stayed in those numbers.
Then I've run the query, using the same code except for two details:
β’ the SQL query itself, that I have changed to select number from foo limit 1. That is, I have requested a single row.
β’ I have added the :raw true option to the plan-options map. This tells next.jdbc to not build any Clojure maps for the rows. Which probably churns a lot of memory. But if you donΒ΄t need the Clojure maps because you are accessing the row columns by name, it probably both reduces the total memory consumption, and the memory churn that the GC has to deal with.
With those two changes, when I've run the (single row return) query, the memory consumption has gone up to 208 MB. So those 3-5 extra MB probably are some kind of "base number" that you cannot reduce.
Then I've run the query for the 10 million rows (still with the :raw true plan option). And this time the memory consumption has gone up to 229 MB. Which is another 21 MB. Which it's not nothing, but it doesn't look like a huge number to me.
Looking at the new numbers, I suspect that if the JVM sees that you have a lot of available memory to burn, it doesn't try to be too conservative on memory usage and doesn't trigger the GC as often. That, and the fact that I told next.jdbc not to build 10 million Clojure maps that I wasn't going to use at all.
Iβve seen this that if you have auto commit on it does not support pg fetch size
https://stackoverflow.com/questions/3682614/how-to-read-all-rows-from-huge-table/47517489#47517489
I also tried it in a transaction but nothing has changed:
(jdbc/with-transaction [tx {:host "127.0.0.1" :dbtype "postgres" :dbname "test" :port 15432 :user "test" :password "test"}]
(let [plan (jdbc/plan tx ["select jsonb_build_object('x', x) from generate_series(1, 999999999) as foo(x)"])]
(reduce (fn [acc row]
(println row)
(/ 0 0)
)
0
plan
)))Actually, this page brings some details: https://jdbc.postgresql.org/documentation/query/
Did you see this part of the docs? Several options need to be passed to get PG to stream properly: https://cljdoc.org/d/com.github.seancorfield/next.jdbc/1.3.1070/doc/getting-started/tips-tricks#streaming-result-sets-1
Yeah I saw it recently but still, I cannot reproduce the proper behaviour:
(with-open [conn (jdbc/get-connection
{:host "127.0.0.1"
:dbtype "postgres"
:dbname "test"
:port 15432
:user "test"
:password "test"
:auto-commit false})]
(let [plan (jdbc/plan conn
["select x from generate_series(1, 999999999) as foo(x)"]
{:fetch-size 20,
:concurrency :read-only,
:cursors :close,
:result-type :forward-only})]
(reduce (fn [acc row]
(println row)
(/ 0 0))
0
plan)))
It hangs for a very long timeHmm, well, I don't use PG so that section is based on what the community told me worked so... π€·π» If you figure out the right incantation, LMK and I can update the docs.
It might be your fetch-size is too small there (`20`) and PG ignores it?
hm maybe, I recall I got something when setting it to 4000, let me try
I tried 500 but it still fails with OOM. That's a bit sad there is so much workarounds with plan + Postgres.
Every DB has different streaming behavior / configuration. There's nothing special about plan here except that it provides a reducible. Under the hood, it just runs the SQL you provide but the rows are consumed eagerly while the Connection is active.
If you look at that Tips & Tricks page, you'll see a Streaming Result Sets section for several DBs.
Yes sure I know, I didn't mean to blame neither plan nor next.jdbc.
Like I say, if you can figure out a working set of options for PostgreSQL, I'm happy to update the docs... What you showed above looks "right" based on what PG users have told me to add to the docs π
Yes I'll spend some time on trying to find a proper configuration. I know that plan just relies on ResultSet, and this ResultSet comes from the depths of the underlying driver. Maybe it's worth reading Java sources of PG driver
If you can find a Java language example of streaming with PG, that might be a good starting point...
Hi this snippet worked for me (in the sense that it didn't fail with OOM):
(def db-spec
{:host "127.0.0.1"
:dbtype "postgres"
:dbname "postgres"
:port 5432
:user "chris"})
(def plan-options
{:concurrency :read-only
:cursors :close
:result-type :forward-only
:fetch-size 4000})
(with-open [conn (-> db-spec
jdbc/get-datasource
(jdbc/get-connection {:auto-commit false
:read-only true}))
w (
"myfile.edn"
:append true)]
(.write w "[")
(reduce
(fn [acc row]
(.write w (str row))
(inc acc))
0
(jdbc/plan conn
["select x from generate_series(1, 999999999) as foo(x)"]
plan-options))
(.write w "]")) Thank you! I see you pass :read-only true into the connection, let me check this out
No worries, please do let me know how it went. My local machine is beefy on ram (32g) though my repl process didn't go more than 500 megs, and it took quite a while for it to finish
@igrishaev If that solves the issue, LMK and I'll update the docs...
Yep, going to test it soon