#sql
2020-06-26
antonmos19:06:49

hi, @seancorfield I am trying to stream process a largish dataset like this

;; assumes (require '[next.jdbc :as jdbc] '[clojure.core.reducers :as r])
(let [res (jdbc/plan (db/connection) params db-opts)]
  (r/foldcat
    (r/map #(do (println (.getId (Thread/currentThread)))
                (do-stuff %))
           res)))
But it seems to be getting processed in a single thread. Any idea what’s going on? Thanks!

seancorfield19:06:51

That's nothing to do with next.jdbc. It's just about clojure.core.reducers.

antonmos20:06:15

it seems that r/fold requires the collection to be foldable

antonmos20:06:45

is there any reason why that idea is not compatible with next.jdbc?

antonmos20:06:05

i am trying to understand if i am thinking about this wrong 🙂

seancorfield20:06:12

next.jdbc/plan returns an IReduceInit like lots of things.

antonmos20:06:52

do you know of any other way to do streaming parallel processing?

antonmos20:06:06

of the next.jdbc/plan result

seancorfield20:06:43

Streaming data onto a channel and processing it in parallel as it is taken off?
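
A rough core.async sketch of that channel idea (ds, process-row, and the query are placeholders; each row is realized with rs/datafiable-row before it leaves the reducing thread, for reasons discussed further down):

(require '[clojure.core.async :as a]
         '[next.jdbc :as jdbc]
         '[next.jdbc.result-set :as rs])

(let [rows    (a/chan 100)
      results (a/chan 100)]
  ;; producer: reduce over plan on its own thread, realizing each row
  ;; before it crosses the thread boundary
  (a/thread
    (reduce (fn [_ row] (a/>!! rows (rs/datafiable-row row ds {})))
            nil
            (jdbc/plan ds ["select * from big_table"]))
    (a/close! rows))
  ;; consumers: 8 threads taking rows off the channel in parallel
  (a/pipeline-blocking 8 results (map process-row) rows)
  (a/<!! (a/into [] results)))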

antonmos20:06:43

i can certainly try to implement foldable on top of IReduceInit, but not sure if that’s conceptually possible or not

seancorfield20:06:59

No idea. I've never used clojure.core.reducers stuff.

antonmos20:06:31

thank you for the channel idea

seancorfield20:06:28

(of course, now you have me curious -- which is dangerous on a Friday afternoon!)

antonmos20:06:09

muahaha 🙂

antonmos20:06:44

@dpassen1 and I are trying to grok what it takes to implement CollFold from clojure.core.reducers

seancorfield20:06:49

Looking at clojure.core.reducers, the answer to your original question is: a plain old IReduceInit is just reduce'd directly by the reducers stuff, so that's a single thread.
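
That fallback is visible in the reducers source itself: only vectors and hash maps get a parallel implementation of CollFold, and everything else hits the Object case, which is a plain sequential reduce (excerpt, lightly elided):

(extend-protocol CollFold
  Object
  (coll-fold
   [coll n combinef reducef]
   ;; can't fold, single reduce
   (reduce reducef (combinef) coll))
  clojure.lang.IPersistentVector
  (coll-fold
   [v n combinef reducef]
   (foldvec v n combinef reducef))
  ...)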

antonmos20:06:32

btw, i am curious, how come plan returns an IReduceInit and not a LazySeq?

seancorfield20:06:04

Resource management: you can't let an open connection escape in a lazy sequence.

seancorfield20:06:55

It was one of the big problems people naively fell into with clojure.java.jdbc whenever they provided a :result-set-fn that was lazy. And, earlier on, it was even easier to make that mistake with the old API.
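
For reference, the clojure.java.jdbc version of that trap looked something like this (db-spec is a placeholder):

(require '[clojure.java.jdbc :as old-jdbc])

;; :result-set-fn receives the row seq; a lazy one (like identity) lets
;; the seq escape the scope of the connection, which is closed on return
(old-jdbc/query db-spec ["select * from big_table"]
                {:result-set-fn identity})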

antonmos20:06:59

ah yes, that makes sense

seancorfield20:06:45

I believe you could write a reducing function over an IReduceInit that conj'd rows into vectors of size N and dispatched a reduce over each of those via fork-join. At the end of that process you'd have a set of reduces happening in threads plus a final vector of size < N, and you'd then combine all of those results (and reduce over the last piece).

seancorfield20:06:03

Essentially implementing the foldvec helper from clojure.core.reducers to work inside an into-like reduction...

antonmos20:06:00

we just wrote something similar

(->> (reduce (fn [{:keys [batch futures] :as acc} row]
               (if (= (count batch) 100)
                 (assoc acc :batch []
                            :futures (conj futures (future (map db/ybr-loan-fix-base batch))))
                 (update acc :batch conj row)))
             {} res)
     (mapcat deref))
where res is the result of plan

seancorfield22:06:42

@antonmos This seems to ignore the row when the batch has 100 elements in it -- and I assume there's some additional step needed to mapv over the last batch if there aren't exactly a multiple of 100 elements in the result set?

antonmos14:06:55

yup, that was a todo 🙂

seancorfield18:06:25

Let me know if you start using the built-in fold support and how that goes @antonmos

seancorfield20:06:22

So that's "just" parallel map with an unbounded thread pool I think?

antonmos20:06:07

kinda like pmap + batching + over a reducible

seancorfield20:06:07

That's certainly an easier thing to implement than a fork-join-based parallel fold, but it's good to see it can be done "easily".

antonmos20:06:19

ill let you know if it works 🙂

antonmos20:06:25

testing now

seancorfield20:06:57

I think it would be very interesting to have plan's result implement CollFold...

seancorfield20:06:42

Note that it would have to fully realize each row (because of the mutable result set under the hood).

antonmos20:06:11

that’s fine with me, i am not going after navigation

seancorfield20:06:47

It's not an issue of navigation, it's about the performance boost you can get by not realizing rows when reducing.

seancorfield20:06:15

Currently, reducing over plan can completely avoid converting ResultSet "rows" into Clojure data structures if you access columns by name (label) in your reducing function and don't do anything that requires a sequence.
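
For example (column name assumed), this sums a column without ever building a hash map per row; each (:amount row) is a direct column read by label:

(reduce (fn [total row] (+ total (:amount row)))
        0
        (jdbc/plan ds ["select amount from invoices"]))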

seancorfield20:06:40

It can't avoid that if there's going to be any parallel evaluation (because mutability).

seancorfield20:06:56

Which also means that the reducing function you used in the fold would have to expect to operate on the fully built hash maps and therefore the final column names, rather than being able to use the short names (labels) as in a simple reduce over plan.

antonmos21:06:45

when we ran the code above, row objects came out as #object[next.jdbc.result_set$mapify_result_set$reify__40720 0x76d61149 "{row} from `plan` -- missing `map` or `reduce`?"]

antonmos21:06:01

what’s the best way to realize it?

hiredman21:06:56

the big thing is fold needs some way to split the results, building a tree of parallel computations, and it is hard to see how you could do that generically on top of a sql query

👍 3
antonmos21:06:36

should i be using rs/datafiable-row?

hiredman21:06:18

if you assume some common things, like ids and getting the highest and lowest ids being cheap, you can build something like that for scanning whole tables, or ranges of tables, but that breaks down for complex queries

hiredman21:06:27

(I built something like this a long time ago against s3 if I recall, not sql, where getting the highest id for a split was not cheap, but I did a binary search kind of probe for it)

hiredman21:06:49

something to bear in mind is that it is kind of frowned on to do IO in a fork-join threadpool (which is what fold puts tasks on)

hiredman21:06:34

the fork-join pools have this notion of managed blockers where you can tell the pool you are doing a blocking task, so it spins up another thread to avoid the threadpool doing nothing while waiting for IO
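
That mechanism is java.util.concurrent.ForkJoinPool.managedBlock; a rough interop sketch of wrapping a blocking call in it:

(import '(java.util.concurrent ForkJoinPool ForkJoinPool$ManagedBlocker))

(defn blocking-io
  "Run io-fn on a fork-join thread, telling the pool it may block so it
   can compensate by activating another worker. (illustrative sketch)"
  [io-fn]
  (let [result (promise)]
    (ForkJoinPool/managedBlock
     (reify ForkJoinPool$ManagedBlocker
       (block [_] (deliver result (io-fn)) true)
       (isReleasable [_] (realized? result))))
    @result))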

seancorfield21:06:36

@antonmos Yes, I missed that you need to realize the rows in your code.

antonmos21:06:26

could you help me understand why next.jdbc.result-set/mapify-result-set does not have a path for just calling

(row-builder @builder)

seancorfield21:06:44

Because it wants to discourage you from doing it -- because you are losing the performance benefit of what plan tries to do.

antonmos21:06:22

hmmm i feel like i am missing something.

antonmos21:06:00

what i am trying to do is to start processing the result set on all my cpus while the record set is being paged in from the db

antonmos21:06:42

which means that i need to realize each row as it gets downloaded (as governed by :fetch-size)

seancorfield21:06:06

So you need to call rs/datafiable-row on each row before you decide whether to put it into a batch...

antonmos21:06:41

yup, that’s what i am trying now. it’s just odd that i have to do that even though i have no intention of using the navigized version of it

antonmos21:06:07

as i understand it, the value of plan is that i can get the data as it comes in, in fetch-size batches

antonmos21:06:14

that’s the primary value

antonmos21:06:38

secondary value is the ability to get it in an even thinner fashion and delay realization of rows until navigation occurs

seancorfield21:06:51

No, the primary value is not realizing ResultSet data as Clojure data structures.

antonmos21:06:36

is there another way to get the first thing only, without the second thing?

antonmos21:06:12

btw, i am just trying to understand if i am using the library correctly, not arguing about what’s more useful or not

seancorfield21:06:09

You have to realize the row from the ResultSet data as a Clojure data structure if you are going to defer processing (beyond that row).

seancorfield21:06:39

Realizing that data is expensive -- hence the very first bullet of the README motivation for next.jdbc: "Performance: there's a surprising amount of overhead in how ResultSet objects are converted to sequences of hash maps in clojure.java.jdbc – which can be really noticeable for large result sets – so I wanted a better way to handle that."

seancorfield21:06:13

The datafiable-row call is deliberately ugly/complex (requiring the connectable etc) so that you are very clear in your code that you are indeed eating that cost.

seancorfield21:06:39

You can use select-keys on the row to pull out a subset of the data -- and that will produce just a bare hash map with the simple keys that you asked for. That's probably the fastest way to produce a simple hash map containing just the data you want.

seancorfield21:06:33

(it relies on the get-column-by-label machinery to extract the column value without otherwise realizing any data)

antonmos21:06:29

oh, i can call select-keys on the row produced by plan?

antonmos21:06:21

“extract column value without otherwise realizing any data” -- do you mean without pulling data for columns that i don’t want?

seancorfield21:06:17

It's about more than just that but, yes, it would not pull columns you did not specify.

seancorfield21:06:16

(`select-keys` is mentioned in the Getting Started page, BTW)

seancorfield21:06:27

"This means that select-keys can be used to create regular Clojure hash map from (a subset of) columns in the row, without realizing the row, and it will not implement Datafiable or Navigable."

antonmos21:06:40

oh i missed that!

seancorfield21:06:20

The important thing to bear in mind is that the builder fn is not used in that case.

antonmos21:06:53

gotcha. so, in my use case, is it better to use select-keys or rs/datafiable-row?

seancorfield21:06:51

If you know the keys you want to operate on in your db/ybr-loan-fix-base function, use select-keys -- although it ties the logic together a bit.

seancorfield21:06:17

Also, in your code, you're going to want to use mapv inside the future I think, not map.

antonmos21:06:33

i do know the columns, but ideally i wouldn’t have to hardcode them. is there a difference in performance between select-keys and rs/datafiable-row?

antonmos21:06:56

(btw, i really appreciate your time on this friday afternoon)

antonmos21:06:26

btw, looks like rs/datafiable-row approach is working with our parallel reduction

seancorfield21:06:24

The difference in performance is that select-keys will only cause a column read by label from the current result set row for each key (plus however select-keys builds the hash map), whereas datafiable-row will construct the result set builder and then call row-builder on it, which in turn does this:

(->> (reduce (fn [r i] (with-column builder r i))
             (->row builder)
             (range 1 (inc (column-count builder))))
     (row! builder))

seancorfield21:06:03

->row just creates a transient hash map and row! just makes it persistent.

seancorfield21:06:57

If you are using a builder adapter with your own choice of label-fn (and maybe qualifier-fn) that could be quite a bit of code being executed.

seancorfield21:06:01

(on the other hand, select-keys doesn't use a transient hash map to build the result -- probably because it assumes you're building small hash maps?)

noisesmith21:06:13

it might be as simple as "it was written before transients existed and nobody prioritized improving its performance"

antonmos21:06:56

wow! i had no idea

antonmos21:06:00

good to know!

seancorfield21:06:25

For what it's worth, the difference between calling (row-builder @builder) and (datafiable-row row connectable opts) is almost nothing.

seancorfield22:06:50

A call to ResultSet.getRow() (fast), two field accesses on @builder (fast), a call to with-meta to add a hash map with four keys that have function literals as values (pretty fast).

seancorfield22:06:53

@antonmos I realized I misspoke on this. There's a datafy call in there on the result set metadata but that won't do anything unless you've also required the next.jdbc.datafy namespace.

seancorfield22:06:27

So datafiable-row is still very cheap -- unless you require next.jdbc.datafy and even then it's not horribly expensive.

seancorfield22:06:26

(note, in particular, that the call to navize-row simply returns a function literal!)

antonmos22:06:22

totally, very minor overhead. It’s just a bit confusing to the reader why the caller must provide a connection when all the data has already been queried

antonmos22:06:40

i had to read the source to understand that it’s required for navigation only

seancorfield22:06:57

Connectable, not connection. I would suggest passing in a datasource 🙂

antonmos22:06:26

our (db/connection) is a datasource, poorly named

seancorfield22:06:14

Naming is hard! 🙂

seancorfield22:06:30

(of course, if you want to avoid any attempt at navigation, you could pass nil into it)

fabrao22:06:48

@seancorfield Hello Sean, I have a T-SQL script whose last command is a SELECT, but the result is [{:next.jdbc/update-count 85}], and I'm using (jdbc/execute! (buscar-datasource) [query]) to execute it. How does it work?

fabrao22:06:16

If I execute (jdbc/execute! (buscar-datasource) ["BEGIN SELECT 1 AS VALOR END"]) it returns a result set

antonmos22:06:54

@seancorfield thank you for pointing out that we have to use mapv in the future!!
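
For the record, a sketch that folds in the fixes from this thread -- realize each row, don't drop the row that fills a batch, flush the final partial batch, and use mapv inside the future (ds and the batch size of 100 as before):

(let [{:keys [batch futures]}
      (reduce (fn [{:keys [batch futures] :as acc} row]
                (let [batch (conj batch (rs/datafiable-row row ds {}))]
                  (if (= (count batch) 100)
                    (assoc acc :batch []
                               :futures (conj futures (future (mapv db/ybr-loan-fix-base batch))))
                    (assoc acc :batch batch))))
              {:batch [] :futures []}
              res)
      ;; flush whatever is left over after the reduce
      futures (cond-> futures
                (seq batch) (conj (future (mapv db/ybr-loan-fix-base batch))))]
  (mapcat deref futures))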

seancorfield22:06:59

The behavior of trying to run multiple SQL commands in a single call is database-specific @fabrao

seancorfield22:06:41

You may also need to use the (work-in-progress) multiple result set support on the multi-rs branch and pass :multi true in as an option.

fabrao22:06:25

but I think this is only one resultset

seancorfield22:06:52

You said the script has select as the last command -- which implies it has other commands before that?

fabrao22:06:54

that’s the result I get in Management Studio

fabrao22:06:12

yes, updates, inserts ...

seancorfield22:06:25

Are you using deps.edn?

fabrao22:06:56

is it different using it?

seancorfield22:06:05

Can you change your next.jdbc dependency to use git and this SHA d974ed23139331015fda3ac281336eccccfe3ff2

seancorfield22:06:47

{:git/url "https://github.com/seancorfield/next-jdbc" :sha "d974ed23139331015fda3ac281336eccccfe3ff2"} instead of {:mvn/version "..."}

seancorfield22:06:12

and then in your execute! call, pass {:multi true} as options (a third argument).

fabrao22:06:01

let me try

seancorfield22:06:25

Instead of just one result, you will get a vector of all available results.

fabrao22:06:13

the same result

fabrao22:06:41

(jdbc/execute! (buscar-datasource) [query] {:multi true})

fabrao22:06:49

[{:next.jdbc/update-count 85}]

seancorfield22:06:24

I don't think you got the updated version.

seancorfield22:06:49

You restarted your REPL or process or whatever? To make sure you pick up the different dependency.

fabrao22:06:49

:deps {org.clojure/clojure {:mvn/version "1.10.1"}
       seancorfield/next.jdbc {:git/url "https://github.com/seancorfield/next-jdbc"
                               :sha "d974ed23139331015fda3ac281336eccccfe3ff2"}
       com.microsoft.sqlserver/mssql-jdbc {:mvn/version "7.0.0.jre8"}}

fabrao22:06:30

last query is select DISTINCT * from #TAB_RESULTADO

fabrao22:06:43

I have one insert before

seancorfield22:06:00

Oh, my bad! Should be {:multi-rs true}

seancorfield22:06:17

Sorry, I was working from memory as I had a different branch checked out locally.
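
With that option the call becomes the following; on the multi-rs branch, execute! returns a vector of all available results rather than just the first one (the exact result shape shown is an assumption):

(jdbc/execute! (buscar-datasource) [query] {:multi-rs true})
;; => [{:next.jdbc/update-count 85}
;;     [{:TAB_RESULTADO/VALOR 1} ...]]
;; one entry per update count / result set in the script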

fabrao22:06:06

going to try now

fabrao22:06:15

you are the SQLMAN !!!

seancorfield22:06:43

You are the first user of the "coming soon" multiple result set functionality! 🙂

antonmos22:06:47

@seancorfield again, many thanks for your help!! it seems that we are seeing the improvement that we were looking for! needs a bit more testing but looking pretty good.

seancorfield22:06:18

@antonmos That's great to hear!

fabrao22:06:13

So, you can commit it !!!

seancorfield22:06:56

@fabrao Haha... it's committed... it's just not merged to develop yet because I haven't written the documentation 🙂

seancorfield22:06:06

But you've given me an idea for something I can use to test it further, without trying to wrangle stored procedures (which are all over the map across different DBs!).

seancorfield22:06:48

Turns out begin statement; statement end seems to be supported only by SQL Server...

fabrao22:06:38

yes, I don't want to use an SP, just a script

fabrao22:06:47

and it works

fabrao22:06:17

many thanks

fabrao22:06:38

do I have to use this git for good?

seancorfield22:06:24

Once I've written the docs for multiple result set support, I'll merge it to develop and cut a new release. That's the only thing holding it up.