2020-06-26
Channels
- # aleph (5)
- # announcements (16)
- # babashka (36)
- # beginners (161)
- # calva (24)
- # cider (8)
- # circleci (45)
- # clj-kondo (5)
- # cljs-dev (25)
- # cljsrn (5)
- # clojure (116)
- # clojure-europe (10)
- # clojure-nl (18)
- # clojure-uk (14)
- # clojuredesign-podcast (6)
- # clojurescript (50)
- # cursive (12)
- # data-science (8)
- # datomic (8)
- # duct (39)
- # emacs (6)
- # fulcro (21)
- # graalvm (12)
- # kaocha (17)
- # off-topic (184)
- # pathom (1)
- # pedestal (2)
- # re-frame (31)
- # reagent (24)
- # reitit (1)
- # sci (1)
- # shadow-cljs (23)
- # sql (147)
- # tools-deps (8)
- # vrac (3)
- # xtdb (35)
hi, @seancorfield I am trying to stream process a largish dataset like this
(let [res (jdbc/plan (db/connection) params db-opts)]
  (r/foldcat
    (r/map #(do (println (.getId (Thread/currentThread)))
                (do-stuff %))
           res)))
But it seems to be getting processed in a single thread.
Any idea what’s going on? Thanks!
That's nothing to do with next.jdbc. It's just about clojure.core.reducers.
next.jdbc/plan returns an IReduceInit, like lots of things.
Streaming data onto a channel and processing it in parallel as it is taken off?
i can certainly try to implement foldable on top of IReduceInit, but not sure if that’s conceptually possible or not
No idea. I've never used clojure.core.reducers stuff.
(of course, now you have me curious -- which is dangerous on a Friday afternoon!)
@dpassen1 and I are trying to grok what it takes to implement CollFold from clojure.core.reducers
Looking at clojure.core.reducers, the answer to your original question is: a plain old IReduceInit is just reduce'd directly by the reducers stuff, so that's a single thread.
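(For anyone reading along, a minimal illustration of that fallback -- reducers only has real parallel CollFold implementations for persistent vectors and hash maps; anything else, including the IReduceInit that plan returns, goes through a plain single-threaded reduce:)
(require '[clojure.core.reducers :as r])

;; vectors implement CollFold, so this fold runs on fork-join in parallel
(r/fold + (vec (range 1000000)))

;; a lazy seq (or an IReduceInit) does not, so this is a plain reduce
;; on the calling thread -- same result, one thread
(r/fold + (range 1000000))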
Resource management: you can't let an open connection escape in a lazy sequence.
It was one of the big problems people naively fell into with clojure.java.jdbc whenever they provided a :result-set-fn that was lazy. And, earlier on, it was even easier to make that mistake with the old API.
Here's an example of that "bug" from 2017 https://stackoverflow.com/questions/46886158/result-set-fn-in-clojure-jdbc-return-error-the-result-set-is-closed-why
I believe you could write a reducing function over an IReduceInit that conj'd rows into vectors of size N and dispatched reduce over each of those via fork-join and then at the end of that process you'd have a set of reduces happening in threads and a final vector of < size N and you'd then combine all of those results (and reduce over the last piece).
Essentially implementing the foldvec helper from clojure.core.reducers to work inside an into-like reduction...
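(A rough sketch of that idea -- not next.jdbc or reducers code; the batch size, the helper name, and the use of futures plus r/fold per batch instead of raw fork-join tasks are all assumptions. Each row has to be realized with rs/datafiable-row so it can safely leave the reduction:)
(require '[clojure.core.reducers :as r]
         '[next.jdbc :as jdbc]
         '[next.jdbc.result-set :as rs])

(defn fold-plan
  "Reduce over plan's IReduceInit, realizing rows into batches of n,
   folding each full batch on another thread, and combining at the end."
  [ds sqlvec n combinef reducef]
  (let [{:keys [batch results]}
        (reduce (fn [{:keys [batch results]} row]
                  (let [m     (rs/datafiable-row row ds {})
                        batch (conj batch m)]
                    (if (= n (count batch))
                      {:batch   []
                       :results (conj results
                                      (future (r/fold combinef reducef batch)))}
                      {:batch batch :results results})))
                {:batch [] :results []}
                (jdbc/plan ds sqlvec))]
    ;; reduce the final (short) batch on this thread, then combine everything
    (reduce combinef
            (reduce reducef (combinef) batch)
            (map deref results))))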
we just wrote something similar
(->>
 (reduce (fn [{:keys [batch futures] :as acc} row]
           (if (= (count batch) 100)
             (assoc acc :batch []
                    :futures (conj futures (future (map db/ybr-loan-fix-base batch))))
             (update acc :batch conj row)))
         {} res)
 (mapcat deref))
where res is result of plan
@antonmos This seems to ignore the row when the batch has 100 elements in it -- and I assume there's some additional step needed to mapv over the last batch if there aren't exactly a multiple of 100 elements in the result set?
Let me know if you start using the built-in fold support and how that goes @antonmos
So that's "just" parallel map
with an unbounded thread pool I think?
That's certainly an easier thing to implement than a fork-join-based parallel fold, but it's good to see it can be done "easily".
I think it would be very interesting to have plan's result implement CollFold...
Note that it would have to fully realize each row (because of the mutable result set under the hood).
It's not an issue of navigation, it's about the performance boost you can get by not realizing rows when reducing.
Currently, reducing over plan can completely avoid converting ResultSet "rows" into Clojure data structures if you access columns by name (label) in your reducing function and don't do anything that requires a sequence.
It can't avoid that if there's going to be any parallel evaluation (because mutability).
Which also means that the reducing function you used in the fold would have to expect to operate on the fully built hash maps and therefore the final column names, rather than being able to use the short names (labels) as in a simple reduce over plan.
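(For example, something like this stays entirely on that fast path -- the datasource, table, and column names here are made up:)
(require '[next.jdbc :as jdbc])

;; each `row` is a thin wrapper over the live ResultSet row:
;; (:amount row) reads the column by label and no hash map is ever built
(reduce (fn [total row] (+ total (:amount row)))
        0
        (jdbc/plan ds ["select amount from payments"]))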
when we ran the code above, row objects came out as #object[next.jdbc.result_set$mapify_result_set$reify__40720 0x76d61149 "{row} from `plan` -- missing `map` or `reduce`?"]
the big thing is fold needs some way to split the results, building a tree of parallel computations, and it is hard to see how you could do that generically on top of a sql query
if you assume some common things like ids and getting the highest and lowest ids being cheap you can build something like that for scanning whole tables, or ranges of tables, but that breaks down for complex queries
(I built something like this a long time ago against s3 if I recall, not sql, where getting the highest id for a split was not cheap, but I did a binary search kind of probe for it)
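(A rough sketch of that id-range splitting idea, assuming a numeric primary key and cheap min/max -- the table, column, and process-row are all hypothetical:)
(let [lo     (:lo (jdbc/execute-one! ds ["select min(id) as lo from big_table"]))
      hi     (:hi (jdbc/execute-one! ds ["select max(id) as hi from big_table"]))
      step   50000
      bounds (concat (range lo (inc hi) step) [(inc hi)])
      splits (filter second (partition-all 2 1 bounds))]
  (->> (for [[from to] splits]
         (future
           ;; process-row (hypothetical) must realize whatever it needs,
           ;; e.g. via select-keys or rs/datafiable-row
           (into [] (map process-row)
                 (jdbc/plan ds ["select * from big_table where id >= ? and id < ?" from to]))))
       doall
       (mapcat deref)))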
something to bear in mind is it is kind of frowned on to do IO in a forkjoin threadpool (which is what fold puts tasks on)
the forkjoin pools have this notion of managed blockers where you can tell the pool you are doing a blocking task, so it spins up another thread to avoid the threadpool doing nothing while waiting for io
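(For illustration, a minimal sketch of using a managed blocker from Clojure -- the names and structure are mine, not anything from next.jdbc or reducers:)
(import '(java.util.concurrent ForkJoinPool ForkJoinPool$ManagedBlocker))

(defn call-blocking
  "Run the blocking function f on a fork-join thread, telling the pool it
   may block so it can spin up a compensating thread in the meantime."
  [f]
  (let [result (promise)]
    (ForkJoinPool/managedBlock
     (reify ForkJoinPool$ManagedBlocker
       (block [_]
         (deliver result (f))
         true)                          ; true = no more blocking needed
       (isReleasable [_]
         (realized? result))))
    @result))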
@antonmos Yes, I missed that you need to realize the rows in your code.
could you help me understand why http://jdbc.next.re/mapify-result-set does not have a path for just calling (row-builder @builder)?
Because it wants to discourage you from doing it -- because you are losing the performance benefit of what plan tries to do.
what i am trying to do is to start processing the result set on all my cpus while the record set is being paged in from the db
which means that i need to realize each row as it gets downloaded (as governed by :fetch-size)
So you need to call rs/datafiable-row on each row before you decide whether to put it into a batch...
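(Something along these lines, perhaps -- reusing res and db/ybr-loan-fix-base from the snippet above, with ds standing in for whatever connectable is around; purely an illustrative sketch, not library code:)
(require '[next.jdbc.result-set :as rs])

(let [{:keys [batch futures]}
      (reduce (fn [{:keys [batch futures] :as acc} row]
                ;; realize the row first so it can safely leave the reduction
                (let [batch (conj batch (rs/datafiable-row row ds {}))]
                  (if (= 100 (count batch))
                    {:batch   []
                     :futures (conj futures (future (mapv db/ybr-loan-fix-base batch)))}
                    (assoc acc :batch batch))))
              {:batch [] :futures []}
              res)
      ;; flush the final (possibly short) batch
      futures (cond-> futures
                (seq batch) (conj (future (mapv db/ybr-loan-fix-base batch))))]
  (mapcat deref futures))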
yup, that’s what i am trying now. it’s just odd that i have to do that even though i have no intention of using the navigized version of it
as i understand it, the value of plan is that i can get the data as it comes in fetch-size batches
a secondary value is the ability to get it in an even thinner fashion and delay realization of rows until navigation occurs
No, the primary value is not realizing ResultSet data as Clojure data structures.
btw, i am just trying to understand if i am using the library correctly, not arguing about what’s more useful or not
You have to realize the row from the ResultSet data as a Clojure data structure if you are going to defer processing (beyond that row).
Realizing that data is expensive -- hence the very first bullet of the README motivation for next.jdbc: "Performance: there's a surprising amount of overhead in how ResultSet objects are converted to sequences of hash maps in clojure.java.jdbc – which can be really noticeable for large result sets – so I wanted a better way to handle that."
The datafiable-row call is deliberately ugly/complex (requiring the connectable etc) so that you are very clear in your code that you are indeed eating that cost.
You can use select-keys on the row to pull out a subset of the data -- and that will produce just a bare hash map with the simple keys that you asked for. That's probably the fastest way to produce a simple hash map containing just the data you want.
(it relies on the get-column-by-label machinery to extract the column value without otherwise realizing any data)
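(e.g., something like this -- the column names are made up:)
(into []
      (map #(select-keys % [:id :status]))   ; plain hash maps, simple unqualified keys
      (jdbc/plan ds ["select * from loans"]))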
“extract column value without otherwise realizing any data” -- do you mean without pulling data for columns that i don't want?
It's about more than just that but, yes, it would not pull columns you did not specify.
(`select-keys` is mentioned in the Getting Started page, BTW)
"This means that select-keys
can be used to create regular Clojure hash map from (a subset of) columns in the row, without realizing the row, and it will not implement Datafiable
or Navigable
."
The important thing to bear in mind is that the builder fn is not used in that case.
If you know the keys you want to operate on in your db/ybr-loan-fix-base function, use select-keys -- although it ties the logic together a bit.
Also, in your code, you're going to want to use mapv inside the future I think, not map.
i do know the columns, but ideally i wouldn't have to hardcode them. is there a difference in performance between select-keys and rs/datafiable-row?
The difference in performance is that select-keys will only cause a column read by label from the current result set row for each key, and then however select-keys builds the hash map, whereas datafiable-row will construct the result set builder and will then call row-builder on it, which in turn does this
(->> (reduce (fn [r i] (with-column builder r i))
             (->row builder)
             (range 1 (inc (column-count builder))))
     (row! builder)))
->row just creates a transient hash map and row! just makes it persistent.
If you are using a builder adapter with your own choice of label-fn (and maybe qualifier-fn) that could be quite a bit of code being executed.
(on the other hand, select-keys doesn't use a transient hash map to build the result -- probably because it assumes you're building small hash maps?)
it might be as simple as "it was written before transients existed and nobody prioritized improving its performance"
Good point.
For what it's worth, the difference between calling (row-builder @builder) and (datafiable-row row connectable opts) is almost nothing.
A call to ResultSet.getRow() (fast), two field accesses on @builder (fast), a call to with-meta to add a hash map with four keys that have function literals as values (pretty fast).
@antonmos I realized I misspoke on this. There's a datafy call in there on the result set metadata but that won't do anything unless you've also required the next.jdbc.datafy namespace.
So datafiable-row is still very cheap -- unless you require next.jdbc.datafy and even then it's not horribly expensive.
(note, in particular, that the call to navize-row simply returns a function literal!)
totally, very minor overhead. It’s just a bit confusing to the reader why the caller must provide a connection when all the data has already been queried
Connectable, not connection. I would suggest passing in a datasource 🙂
Naming is hard! 🙂
(of course, if you want to avoid any attempt at navigation, you could pass nil into it)
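(i.e., something like this, purely to show the shape of the call:)
(rs/datafiable-row row nil {})   ; realize the row; no connectable is kept for later navigation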
@seancorfield Hello Sean, I have a T-SQL script whose last command is a SELECT, but the result is [{:next.jdbc/update-count 85}], and I’m using (jdbc/execute! (buscar-datasource) [query]) to execute it. How does it work?
If I execute (jdbc/execute! (buscar-datasource) ["BEGIN SELECT 1 AS VALOR END"]) it returns a resultset
The behavior of trying to run multiple SQL commands in a single call is database-specific @fabrao
You may also need to use the (work-in-progress) multiple result set support on the multi-rs branch and pass :multi true in as an option.
You said the script has select as the last command -- which implies it has other commands before that?
Are you using deps.edn?
Can you change your next.jdbc dependency to use git and this SHA d974ed23139331015fda3ac281336eccccfe3ff2, i.e. {:git/url "..."} instead of {:mvn/version "..."}, and then in your execute! call, pass {:multi true} as options (a third argument)?
Instead of just one result, you will get a vector of all available results.
I don't think you got the updated version.
You restarted your REPL or process or whatever? To make sure you pick up the different dependency.
:deps {org.clojure/clojure {:mvn/version "1.10.1"}
       seancorfield/next.jdbc {:git/url "" :sha "d974ed23139331015fda3ac281336eccccfe3ff2"}
       com.microsoft.sqlserver/mssql-jdbc {:mvn/version "7.0.0.jre8"}}
Oh, my bad! Should be {:multi-rs true}
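(So the earlier suggestion becomes something like this -- the SQL here is just an illustration:)
(jdbc/execute! (buscar-datasource)
               ["BEGIN SELECT 1 AS VALOR; SELECT 2 AS OUTRO END"]
               {:multi-rs true})
;; => a vector of result sets, one per SELECT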
Sorry, I was working from memory as I had a different branch checked out locally.
You are the first user of the "coming soon" multiple result set functionality! 🙂
@seancorfield again, many thanks for your help!! it seems that we are seeing the improvement that we were looking for! needs a bit more testing but looking pretty good.
@antonmos That's great to hear!
@fabrao Haha... it's committed... it's just not merged to develop yet because I haven't written the documentation 🙂
But you've given me an idea for something I can use to test it further, without trying to wrangle stored procedures (which are all over the map across different DBs!).
Turns out begin statement; statement end seems to be supported only by SQL Server...
Once I've written the docs for multiple result set support, I'll merge it to develop and cut a new release. That's the only thing holding it up.