#sql
2020-12-09
nikp14:12:10

Hi there, I'm doing a basic read of all rows into memory as hash maps with next.jdbc, and I saw there's support for clojure.core.reducers functions to reduce the rows across multiple threads when using plan. Most examples I saw in the Tips & Tricks doc look at reducing early or using a subset of the data, so for the use case of reading all rows into memory as Clojure hash maps, would this even be beneficial? For what it's worth, here is the code I've been looking at, though I'm probably doing something terribly wrong since it doesn't seem to run any faster than a basic execute! call for ~30k db rows:

(clojure.core.reducers/foldcat
  (next-jdbc/plan (:datasource *db*)
                 ["SELECT * FROM data WHERE dataset_id = ?" dataset_id]
                 {:builder-fn result-set/as-unqualified-maps}))
Has anyone else dealt with something like this or has any advice? Thanks!

dharrigan14:12:07

I'm sure Sean may weigh in with something, but have you told your db driver to handle large fetch sizes?

dharrigan14:12:00

(in addition to setting this particular connection to be read-only etc..)

nikp14:12:19

Re fetch size: I attempted setting the :fetch-size opt to plan to both 512 (the default fold partition size) and 10000, with no change one way or the other

nikp14:12:27

Could try a :concurrency :read-only Statement later. I've stepped out for a bit right now, but I can give that a swing, though it can't be done with plan, so it won't be done with reducers

nikp14:12:14

(though tbf, I'm not entirely sure I understand what the concurrency option does)

nikp17:12:57

Okay, tried with

:concurrency :read-only
:cursors     :close
:result-type :forward-only
in the opts as well, but no change
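
Putting the suggestions from this thread together, the combined call looks roughly like this (same query and *db* as in the original snippet; whether :fetch-size actually streams rows depends on the JDBC driver and connection settings):

```clojure
(require '[clojure.core.reducers :as r]
         '[next.jdbc :as next-jdbc]
         '[next.jdbc.result-set :as result-set])

(r/foldcat
  (next-jdbc/plan (:datasource *db*)
                  ["SELECT * FROM data WHERE dataset_id = ?" dataset_id]
                  {:builder-fn  result-set/as-unqualified-maps
                   :fetch-size  512
                   :concurrency :read-only
                   :cursors     :close
                   :result-type :forward-only}))
```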

seancorfield22:12:22

@nikp I think you're misunderstanding what reducer support in plan enables: the underlying query still has to run sequentially. What reducer support allows is for chunks of the result set to be processed "in parallel" and then combined, so you will only see a performance increase if your reducing function does enough work that it can be overlapped in parallel. foldcat does no "work" other than appending the chunks, so you can't possibly get any overlapping work and therefore no speedup.
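
As an illustrative sketch of that difference (ds, sqlvec, and opts are placeholders for the values above, and expensive-transform is a hypothetical stand-in for real per-row work; only that kind of per-row cost can overlap across fold's worker threads):

```clojure
(require '[clojure.core.reducers :as r]
         '[next.jdbc :as next-jdbc])

;; foldcat over plan: the reducing step only appends rows, so the whole
;; thing is effectively the strictly sequential combining phase -- no speedup.
(r/foldcat (next-jdbc/plan ds sqlvec opts))

;; fold with a reducing function that does real per-row work: the
;; expensive-transform calls on different chunks can run in parallel.
(r/fold 512      ; partition size
        r/cat    ; combinef (its zero arity supplies the init value)
        (fn [acc row]
          ;; (into {} row) realizes the plan row into an actual hash map
          (r/append! acc (expensive-transform (into {} row))))
        (next-jdbc/plan ds sqlvec opts))
```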

seancorfield22:12:11

Two things from the docs to be aware of with folding:

> The combining operations are forked and interleaved with the reducing operations, so the order (of forked tasks) is batch-1, batch-2, combine-1-2, batch-3, combine-1&2-3, batch-4, combine-1&2&3-4, etc.

With foldcat, there's really only "combining" going on, so you can see that it is a strictly sequential operation. And then:

> The amount of parallelization you get will depend on many factors including the number of processors, the speed of your reducing function, the speed of your combining function, and the speed with which result sets can actually be streamed from your database.

The slower your reducing function, the more parallel work that can be done. The slower your combining function, the less parallel work that can be done. And even then there is still a limit on how slow your reducing function can be while still allowing lazy consumption of the result set:

> There is no back pressure here so if your reducing function is slow, you may end up with more of the realized result set in memory than your system can cope with.