2021-01-19
Channels
- # adventofcode (7)
- # aws (6)
- # babashka (7)
- # beginners (200)
- # calva (49)
- # chlorine-clover (3)
- # cider (24)
- # clj-kondo (115)
- # cljs-dev (5)
- # cljsrn (16)
- # clojure (44)
- # clojure-australia (9)
- # clojure-czech (1)
- # clojure-dev (1)
- # clojure-europe (63)
- # clojure-france (6)
- # clojure-losangeles (1)
- # clojure-nl (2)
- # clojure-spec (27)
- # clojure-uk (77)
- # clojurescript (45)
- # clojurewerkz (3)
- # conjure (5)
- # cryogen (1)
- # cursive (2)
- # datahike (6)
- # datascript (3)
- # datomic (18)
- # fulcro (5)
- # graalvm (55)
- # jobs (3)
- # luminus (4)
- # malli (1)
- # pathom (1)
- # reagent (16)
- # shadow-cljs (67)
- # spacemacs (18)
- # sql (57)
- # testing (6)
- # tools-deps (9)
I'm using next.jdbc to do a simple select query, and it's awfully slow compared to the same query being run in C#. I'm finding Clojure to be around 10x slower. I'm using the exact same select query, what am I doing wrong?
(defn query-sentinel [computer-name sql-query]
  (let [db {:dbtype   "sqlserver"
            :dbname   dbname
            :host     host
            :user     user
            :password password}
        ds (jdbc/get-datasource db)]
    (jdbc/execute! ds [sql-query])))
sql-query is just a string that holds my select query.
The query returns one field. In Clojure it's taking around 600ms, in C# around 50ms.
var query = GetRole(computerName, databaseName, instanceName);
using (var sqlConnection = new SqlConnection(ConnectionString))
using (var sqlCommand = new SqlCommand(query, sqlConnection))
{
    sqlConnection.Open();
    var result = sqlCommand.ExecuteReader();
    return !result.Read() ? null : result["RoleSummary"].ToString();
}
Is there something I can do to speed up the Clojure version? The difference over five hundred to a thousand queries really adds up. Sometimes the Clojure version is as high as 750ms per query.
You are creating a new db connection each time you call the function, which is expensive.
@qmstuart As it says in the docs, it's a good idea to use a connection pooling library. Support for c3p0 and HikariCP is built into next.jdbc -- see next.jdbc.connection/->pool.
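A minimal sketch of the HikariCP route (assuming com.zaxxer.hikari/HikariCP is on the classpath and reusing the db spec values from the original snippet):

(require '[next.jdbc :as jdbc]
         '[next.jdbc.connection :as connection])
(import 'com.zaxxer.hikari.HikariDataSource)

;; sketch: build the pooled datasource once at startup and reuse it everywhere;
;; note that HikariCP may expect :username rather than :user in the spec
(def ^HikariDataSource ds
  (connection/->pool HikariDataSource
                     {:dbtype "sqlserver" :dbname dbname :host host
                      :username user :password password}))

(defn query-sentinel [sql-query]
  (jdbc/execute! ds [sql-query]))

;; close the pool on shutdown: (.close ds)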
For simple use cases one could also just use get-connection (to reuse the connection object) and close it on shutdown?
Yup, calling get-connection on the db spec hash map and reusing that across multiple SQL operations would also work -- but it would not necessarily be robust enough for production code (if the connection is broken for whatever reason, your app is hosed, whereas with a connection pool, bad connections can usually be detected and thrown away, and then you get a new, valid connection from the pool on the next call).
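A minimal sketch of that approach, with the robustness caveat above in mind (db is the spec map from the original snippet; query-1 and query-2 are hypothetical SQL strings):

(require '[next.jdbc :as jdbc])

;; sketch: open one connection, run every query over it, close it at the end
(with-open [conn (jdbc/get-connection db)]
  (jdbc/execute! conn [query-1])
  (jdbc/execute! conn [query-2]))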
How expensive it is to stand up a brand new connection depends very much on the DB and whatever network you have between the app and the actual DB. In production, we use a c3p0 connection pool and we use ProxySQL in front of MySQL (Percona), so our app->db hop is via loopback (to the proxy), and that makes things pretty robust in terms of bad connections / connection loss.
Thanks, I'll have a look into connection pooling. I didn't bother with that since it's just a console application that needs to make some db queries. It's not long-running, but it does hit the db a lot. I was just surprised by the time difference, as even the C# version is creating a new connection each time, yet it's still really fast.
What database? FWIW, with postgres the overhead of creating a new connection each time (after the first one) is ~25ms for me
@kaxaw75836 According to the db spec, it's MS SQL Server.
It's also hard to do timing/benchmarking "in the small" so you may not be comparing like with like.
The first time you open a connection in a Java app, it's doing a lot of work including loading and initializing the JDBC driver, but subsequent connections should be much faster.
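A rough way to see that difference from the REPL (not a proper benchmark; db is the spec map from the snippet above):

;; the first call also pays for loading/initializing the JDBC driver
(time (.close (jdbc/get-connection db)))
;; subsequent connections should be noticeably cheaper, but still not free
(time (.close (jdbc/get-connection db)))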
I know if I run the C# version of the program, it runs in about 2 mins, and that's also writing the result of each db call to an Excel spreadsheet. In Clojure I'm just making the db calls with no file IO and it's taking closer to 10-12 mins.
OK, I'll look into creating the connection once and reusing it, and see how it goes. Thanks guys!
First time poster here! I'm using next.jdbc and trying to get my results as kebab-case maps when using plan for a streamable result set. I have two approaches in the works.
Approach 1 = use :builder-fn
(let [db' (jdbc/with-options db {:auto-commit false})
      q   (-> (honey/select etc...))]
  (jdbc/plan db' (hsql/format q) {:fetch-size  5000
                                  :concurrency :read-only
                                  :cursors     :close
                                  :result-type :forward-only
                                  :builder-fn  rs/as-kebab-maps})) ;;<===
However, when I try to reduce over the resulting ResultSet and access the kebab-case keys, I can't. I still have to use snake_case to pull keys from the rows. If I explicitly convert the rows to hash maps using next.jdbc.result-set/datafiable-row, that works, but I'm wondering if I can tweak something to be able to access my kebab-case keys right away, since I don't love having to use next.jdbc functions outside of my database namespaces. I also noticed certain functions like println will print the kebab-case keys with no prior datafication. Is the best solution to use next.jdbc.result-set/datafiable-row? Or is there a better way to do this that keeps the next.jdbc-specific logic within the function?
Approach 2 = wrap the ResultSet to create a new reducible
(let [db' (jdbc/with-options db {:auto-commit false})
      q   (-> (honey/select etc...))]
  (eduction (map my-kebab-maps-fn) ;;<===
            (jdbc/plan db' (hsql/format q) {:fetch-size  5000
                                            :concurrency :read-only
                                            :cursors     :close
                                            :result-type :forward-only})))
My issue with this is I've never used eduction before and I'm not sure I'm using it correctly here. I really want to make sure I'm not accidentally evaluating the result set too greedily or in a way that would cause a memory error on large result sets. Does using eduction here preserve the memory-efficient, streamable nature of jdbc/plan?
Thanks!
I've been doing something like this to test:
(run! #(-> % :ns/my-field println) result-set)
Inside the reducing function over plan, the builder function is irrelevant -- you are working with column labels because there is no Clojure data structure in play. That's the whole point of using plan: to avoid that overhead.
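For example (a sketch; ds, the table, and the column are hypothetical), inside a reduction over plan you read columns by the label the database returns, and :builder-fn has no effect at this point:

;; sketch: pull a single column straight off plan's row abstraction;
;; the keyword matches the column label as the database returns it
(into []
      (map :RoleSummary)
      (jdbc/plan ds ["select RoleSummary from Roles"]))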
If you just want a fully realized vector of hash maps back, use execute!, not plan.
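A sketch of that, reusing the names from Approach 1 above (db', q, hsql): with execute! the builder function is applied, so the realized maps come back with kebab-case keys.

(require '[next.jdbc :as jdbc]
         '[next.jdbc.result-set :as rs])

;; execute! realizes the whole result set as a vector of hash maps,
;; so only do this when the result fits comfortably in memory
(jdbc/execute! db' (hsql/format q) {:builder-fn rs/as-kebab-maps})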
I see, that makes sense. I was under the impression that execute! would not work well for streaming large result sets though, because it fully realizes the entire thing (I'm using Pedestal's http://pedestal.io/reference/streaming by passing a fn in my response body). So should I stick with plan if I want to handle the results as a stream, or am I misunderstanding how execute! can be used?
It's important to understand that plan lets you stream-from-the-DB but you must reduce it eagerly. Resource management is why: when you reduce over plan, it opens a connection, reads and processes the data, and then closes the connection.
Thanks for the clarification. So if I understand, does that mean I don't need to use plan here, since I plan to materialize every row? Or should I still use plan because I can materialize one row at a time, which will save memory?
And from what I understand, the use of eduction around the result set should also be fine, as in Approach 2.
@aidan Sorry, stepped away for lunch... If you use execute! the entire result set is realized and returned. If you can fit the result easily in memory, that's fine. If you can't fit the result in memory -- and so you need streaming-results-from-the-DB -- or you want to avoid the overhead associated with creating a full Clojure data structure from the result set object, then you need plan.
If you use execute!, the builder function is used -- to build the Clojure data structure.
If you use plan, the intent is that you can avoid the overhead of building the Clojure data structure by accessing columns by their label (a simple string or keyword). You are responsible for creating whatever data structure you need from the row abstraction. There are many ways to approach that, depending on exactly what you want to do.
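One way to do that (a sketch; ds, the columns, and the key names are hypothetical) is to build a small map per row yourself, picking your own key names while still only touching the columns you need:

;; sketch: hand-roll the per-row shape instead of relying on a builder fn
(into []
      (map (fn [row]
             {:computer-name (:computer_name row)   ; hypothetical columns
              :role-summary  (:role_summary row)}))
      (jdbc/plan ds ["select computer_name, role_summary from roles"]))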
I see. So while I may like my idiomatic kebab-case (I recently became aware there is a spirited online debate about this in Clojure), it may be most efficient to use snake_case with plan.
It is most efficient inside the reducing function over plan to use the simple column names.
It really depends what you want to do with the data from each row.
I'm streaming a CSV file to the client with the results, so just accessing certain fields of the row in a given order. Seems like the best bet is to leave it as snake_case.
Streaming the result of a plan call is problematic because plan expects the result to be consumed eagerly.
eduction returns a reducible so you're in the same situation as plan: you are expected to eagerly consume the result, as I understand it.
Ah, I see. I think even though I'm streaming to the client, the result set may be consumed eagerly, since I just have a call to run! that puts all the results onto the IO stream.
Is it still better than execute! even if I realize/materialize the maps as I reduce over the plan result set? As I reduce using run!, only one row is realized at a time, right? So while the time complexity would be worse, I was hoping the space complexity would not be too much worse.
(run! <something> (plan ...))
You control that <something> so you can determine what gets realized. If you can send snake_case to the IO stream, then you could use #(select-keys % [:the :set_of :cols :you_need]) and it will only create a simple hash map with just those keys, rather than realize the entire row into a hash map with transformed keys.
execute! will materialize the entire result set into memory, yes. With plan you can choose how much or how little you materialize. If you are realizing each row completely, you'll use about the same amount of memory overall but it can be GC'd as you walk the result set.
(unless you are reducing each row into a data structure, in which case you're just doing what execute! would do anyway!)
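A minimal sketch of that CSV pattern (ds, the writer, the query, and the column names are placeholders): every row is written inside the run! over plan, so the consumption stays eager and only one row is in play at a time.

(require '[next.jdbc :as jdbc])

(defn write-roles-csv! [^java.io.Writer out ds]
  ;; `out` is assumed to be the response writer; columns are hypothetical
  (run! (fn [row]
          (.write out (str (:computer_name row) ","
                           (:role_summary row) "\n")))
        (jdbc/plan ds ["select computer_name, role_summary from roles"]
                   {:fetch-size 5000 :concurrency :read-only
                    :cursors :close :result-type :forward-only})))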
First off, thanks so much for taking the time to respond to my million questions, it's a great help.
When you say "If you are realizing each row completely, you'll use the same amount of memory overall", do you mean the same amount of memory as execute! or plan? I just figured garbage collection was quick enough that the memory used for the materialized row would be freed up with each iteration...
Realizing a row to a (qualified/transformed) hash map takes X amount of memory and Y amount of time. If your result set has 1,000 rows, you'll use 1,000X memory and 1,000Y time -- in both execute! and plan (if you are realizing rows).
It is possible to overrun GC even with plan but relatively unlikely (although it depends what else your system is doing).
I'm really just pointing out that plan offers the ability to avoid using X amount of memory per row if you can use just the column labels -- but it's going to depend on what you do with those column values.
There's no simple nor single answer here. "It. Depends."
Hmm, ok, thanks for spelling it out for me! So I guess most of the memory improvement I was seeing was from following https://github.com/seancorfield/next-jdbc/blob/develop/doc/tips-and-tricks.md#streaming-result-sets-1 in your docs, which mitigated the fact that I was fully realizing the rows.
I'm not sure what "memory improvement" you're talking about there.
If you're saying "with execute! I got an Out Of Memory error and when I used plan I did not" -- well, yes, your throughput on processing the rows allowed the GC to avoid the OOM error.
If I recall, we were getting an OOM before we added those options to plan, but I'm confirming now with someone who made the PR.
The "problem" here is that SQL and databases are fussy things and every one is different, so there's no single, "true" answer for a variety of problems. Which means it's important to understand what's really going on in your code and what trade-offs you are making. When I reach for plan, I'm generally also trying to avoid the overhead of realizing a full hash map from a row in the first place -- more so than trying to stream a very large result set that wouldn't fit in memory.
(which is why I may seem to be a) laboring the point b) being obtuse c) not answering the question definitively)
Ok, that's very useful to know. I think given this discussion I'll move forward and not attempt to kebabify the maps, as it would interfere with the gains from plan.
I realize most issues like this don't have a definitive answer, so I appreciate the thoughtful responses!
You could call datafiable-result-set on each row if you want streaming but also want a fully-realized hash map for each row that uses your builder function.
But if your data volumes are large enough to require streaming, you probably don't want to fully-realize a Clojure hash map for each row!
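With that caveat in mind, a minimal sketch of the per-row realization using next.jdbc.result-set/datafiable-row (the per-row function mentioned earlier in the thread; ds, the query, and the column are placeholders): each row is realized with the builder function inside the eager reduction over plan.

(require '[next.jdbc :as jdbc]
         '[next.jdbc.result-set :as rs])

(let [opts {:builder-fn rs/as-unqualified-kebab-maps}]
  (run! (fn [row]
          ;; realizes just this one row as a kebab-case hash map
          (let [m (rs/datafiable-row row ds opts)]
            (println (:role-summary m))))
        (jdbc/plan ds ["select role_summary from roles"] opts)))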