2021-01-19
Channels
- # adventofcode (7)
- # aws (6)
- # babashka (7)
- # beginners (200)
- # calva (49)
- # chlorine-clover (3)
- # cider (24)
- # clj-kondo (115)
- # cljs-dev (5)
- # cljsrn (16)
- # clojure (44)
- # clojure-australia (9)
- # clojure-czech (1)
- # clojure-dev (1)
- # clojure-europe (63)
- # clojure-france (6)
- # clojure-losangeles (1)
- # clojure-nl (2)
- # clojure-spec (27)
- # clojure-uk (77)
- # clojurescript (45)
- # clojurewerkz (3)
- # conjure (5)
- # cryogen (1)
- # cursive (2)
- # datahike (6)
- # datascript (3)
- # datomic (18)
- # fulcro (5)
- # graalvm (55)
- # jobs (3)
- # luminus (4)
- # malli (1)
- # pathom (1)
- # reagent (16)
- # shadow-cljs (67)
- # spacemacs (18)
- # sql (57)
- # testing (6)
- # tools-deps (9)
I'm using next.jdbc to do a simple select query, and it's awfully slow compared to the same query being run in C#. I'm finding Clojure to be around 10x slower. I'm using the exact same select query, what am I doing wrong?
(defn query-sentinel [computer-name sql-query]
  (let [db {:dbtype   "sqlserver"
            :dbname   dbname
            :host     host
            :user     user
            :password password}
        ds (jdbc/get-datasource db)]
    (jdbc/execute! ds [sql-query])))
sql-query is just a string that holds my select query.
The query returns one field. In Clojure it's taking around 600ms, in C# around 50ms.
var query = GetRole(computerName, databaseName, instanceName);
using (var sqlConnection = new SqlConnection(ConnectionString))
using (var sqlCommand = new SqlCommand(query, sqlConnection))
{
    sqlConnection.Open();
    var result = sqlCommand.ExecuteReader();
    return !result.Read() ? null : result["RoleSummary"].ToString();
}
Is there something I can do to speed up the Clojure version? The difference over five hundred to a thousand queries really adds up. Sometimes the Clojure version is as high as 750ms per query.
You are creating a new db connection each time you call the function, which is expensive.
@qmstuart As it says in the docs, it's a good idea to use a connection pooling library. Support for c3p0 and HikariCP is built into next.jdbc -- see next.jdbc.connection/->pool.
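A minimal sketch of the HikariCP route (assuming com.zaxxer.hikari/HikariCP is on the classpath and reusing the db spec values from the original snippet):

(require '[next.jdbc :as jdbc]
         '[next.jdbc.connection :as connection])
(import 'com.zaxxer.hikari.HikariDataSource)

;; sketch: build the pooled datasource once at startup and reuse it everywhere;
;; note that HikariCP may expect :username rather than :user in the spec
(def ^HikariDataSource ds
  (connection/->pool HikariDataSource
                     {:dbtype "sqlserver" :dbname dbname :host host
                      :username user :password password}))

(defn query-sentinel [sql-query]
  (jdbc/execute! ds [sql-query]))

;; close the pool on shutdown: (.close ds)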
For simple use cases one could also just use get-connection (to reuse the connection object) and close it on shutdown?
Yup, calling get-connection on the db spec hash map and reusing that across multiple SQL operations would also work -- but it would not necessarily be robust enough for production code (if the connection is broken for whatever reason, your app is hosed, whereas with a connection pool, bad connections can usually be detected and thrown away, and then you get a new, valid connection from the pool on the next call).
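A minimal sketch of that approach, with the robustness caveat above in mind (db is the spec map from the original snippet; query-1 and query-2 are hypothetical SQL strings):

(require '[next.jdbc :as jdbc])

;; sketch: open one connection, run every query over it, close it at the end
(with-open [conn (jdbc/get-connection db)]
  (jdbc/execute! conn [query-1])
  (jdbc/execute! conn [query-2]))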
How expensive it is to stand up a brand new connection depends very much on the DB and whatever network you have between the app and the actual DB. In production, we use a c3p0 connection pool and we use ProxySQL in front of MySQL (Percona), so our app->db hop is via loopback (to the proxy), and that makes things pretty robust in terms of bad connections / connection loss.
Thanks, I'll have a look into connection pooling. I didn't bother with that since it's just a console application that needs to make some db queries. It's not long-running, but it does hit the db a lot. I was just surprised by the time difference, as even the C# version is creating a new connection each time, yet it's still really fast.
What database? FWIW, with postgres the overhead of creating a new connection each time (after the first one) is ~25ms for me
@kaxaw75836 According to the db spec, it's MS SQL Server.
It's also hard to do timing/benchmarking "in the small" so you may not be comparing like with like.
The first time you open a connection in a Java app, it's doing a lot of work including loading and initializing the JDBC driver, but subsequent connections should be much faster.
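A rough way to see that difference from the REPL (not a proper benchmark; db is the spec map from the snippet above):

;; the first call also pays for loading/initializing the JDBC driver
(time (.close (jdbc/get-connection db)))
;; subsequent connections should be noticeably cheaper, but still not free
(time (.close (jdbc/get-connection db)))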
I know if I run the C# version of the program, it runs in about 2 mins, and that's also writing the result of each db call to an Excel spreadsheet. In Clojure I'm just making the db calls with no file IO and it's taking closer to 10-12 mins.
OK, I'll look into creating the connection once and reusing it, and see how it goes. Thanks guys!
First time poster here! I'm using next.jdbc and trying to get my results as kebab-case maps when using plan for a streamable result set. I have two approaches in the works.
Approach 1 = use :builder-fn
(let [db' (jdbc/with-options db {:auto-commit false})
      q   (-> (honey/select etc...))]
  (jdbc/plan db' (hsql/format q) {:fetch-size  5000
                                  :concurrency :read-only
                                  :cursors     :close
                                  :result-type :forward-only
                                  :builder-fn  rs/as-kebab-maps})) ;;<===
However, when I try to reduce over the resulting ResultSet and access the kebab-case keys, I can't. I still have to use snake_case to pull keys from the rows. If I explicitly convert the rows to hash maps using next.jdbc.result-set/datafiable-row, that works, but I'm wondering if I can tweak something to be able to access my kebab-case keys right away, since I don't love having to use next.jdbc functions outside of my database namespaces. I also noticed certain functions like println will print the kebab-case keys with no prior datafication. Is the best solution to use next.jdbc.result-set/datafiable-row? Or is there a better way to do this that keeps the next.jdbc-specific logic within the function?
Approach 2 = wrap the ResultSet to create a new reducible
(let [db' (jdbc/with-options db {:auto-commit false})
      q   (-> (honey/select etc...))]
  (eduction (map my-kebab-maps-fn) ;;<===
            (jdbc/plan db' (hsql/format q) {:fetch-size  5000
                                            :concurrency :read-only
                                            :cursors     :close
                                            :result-type :forward-only})))
My issue with this is I've never used eduction before and I'm not sure I'm using it correctly here. I really want to make sure I'm not accidentally evaluating the result set too greedily or in a way that would cause a memory error on large result sets. Does using eduction here preserve the memory-efficient, streamable nature of jdbc/plan?
Thanks!
I've been doing something like this to test:
(run! #(-> % :ns/my-field println) result-set)
Inside the reducing function over plan, the builder function is irrelevant -- you are working with column labels because there is no Clojure data structure in play. That's the whole point of using plan: to avoid that overhead.
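For example (a sketch; ds, the table, and the column are hypothetical), inside a reduction over plan you read columns by the label the database returns, and :builder-fn has no effect at this point:

;; sketch: pull a single column straight off plan's row abstraction;
;; the keyword matches the column label as the database returns it
(into []
      (map :RoleSummary)
      (jdbc/plan ds ["select RoleSummary from Roles"]))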
If you just want a fully realized vector of hash maps back, use execute!, not plan.
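A sketch of that, reusing the names from Approach 1 above (db', q, hsql): with execute! the builder function is applied, so the realized maps come back with kebab-case keys.

(require '[next.jdbc :as jdbc]
         '[next.jdbc.result-set :as rs])

;; execute! realizes the whole result set as a vector of hash maps,
;; so only do this when the result fits comfortably in memory
(jdbc/execute! db' (hsql/format q) {:builder-fn rs/as-kebab-maps})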
I see, that makes sense. I was under the impression that execute! would not work well for streaming large result sets though, because it fully realizes the entire thing (I'm using Pedestal's http://pedestal.io/reference/streaming by passing a fn in my response body). So should I stick with plan if I want to handle the results as a stream, or am I misunderstanding how execute! can be used?
It's important to understand that plan lets you stream-from-the-DB but you must reduce it eagerly. Resource management is why: when you reduce over plan, it opens a connection, reads and processes the data, and then closes the connection.
Thanks for the clarification. So if I understand, does that mean I don't need to use plan here, since I plan to materialize every row? Or should I still use plan because I can materialize one row at a time, which will save memory?
And from what I understand, the use of eduction around the result set should also be fine, as in Approach 2.
@aidan Sorry, stepped away for lunch... If you use execute! the entire result set is realized and returned. If you can fit the result easily in memory, that's fine. If you can't fit the result in memory -- and so you need streaming-results-from-the-DB -- or you want to avoid the overhead associated with creating a full Clojure data structure from the result set object, then you need plan.
If you use execute!, the builder function is used -- to build the Clojure data structure.
If you use plan, the intent is that you can avoid the overhead of building the Clojure data structure by accessing columns by their label (a simple string or keyword). You are responsible for creating whatever data structure you need from the row abstraction. There are many ways to approach that, depending on exactly what you want to do.
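One way to do that (a sketch; ds, the columns, and the key names are hypothetical) is to build a small map per row yourself, picking your own key names while still only touching the columns you need:

;; sketch: hand-roll the per-row shape instead of relying on a builder fn
(into []
      (map (fn [row]
             {:computer-name (:computer_name row)   ; hypothetical columns
              :role-summary  (:role_summary row)}))
      (jdbc/plan ds ["select computer_name, role_summary from roles"]))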
I see. So while I may like my idiomatic kebab-case (I recently became aware there is a spirited online debate about this in Clojure), it may be most efficient to use snake_case with plan.
It is most efficient inside the reducing function over plan to use the simple column names.
It really depends what you want to do with the data from each row.
I'm streaming a CSV file to the client with the results, so just accessing certain fields of the row in a given order. Seems like the best bet is to leave it as snake_case.
Streaming the result of a plan call is problematic because plan expects the result to be consumed eagerly.
eduction returns a reducible so you're in the same situation as plan: you are expected to eagerly consume the result, as I understand it.
Ah, I see. I think even though I'm streaming to the client, the result set may be consumed eagerly, since I just have a call to run! that puts all the results onto the IO stream.
Is it still better than execute! even if I realize/materialize the maps as I reduce over the plan result set? As I reduce using run!, only one row is realized at a time, right? So while the time complexity would be worse, I was hoping the space complexity would not be too much worse.
(run! <something> (plan ...))
You control that <something> so you can determine what gets realized. If you can send snake_case to the IO stream, then you could use #(select-keys % [:the :set_of :cols :you_need]) and it will only create a simple hash map with just those keys, rather than realize the entire row into a hash map with transformed keys.
execute! will materialize the entire result set into memory, yes. With plan you can choose how much or how little you materialize. If you are realizing each row completely, you'll use about the same amount of memory overall but it can be GC'd as you walk the result set.
(unless you are reducing each row into a data structure, in which case you're just doing what execute! would do anyway!)
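A minimal sketch of that CSV pattern (ds, the writer, the query, and the column names are placeholders): every row is written inside the run! over plan, so the consumption stays eager and only one row is in play at a time.

(require '[next.jdbc :as jdbc])

(defn write-roles-csv! [^java.io.Writer out ds]
  ;; `out` is assumed to be the response writer; columns are hypothetical
  (run! (fn [row]
          (.write out (str (:computer_name row) ","
                           (:role_summary row) "\n")))
        (jdbc/plan ds ["select computer_name, role_summary from roles"]
                   {:fetch-size 5000 :concurrency :read-only
                    :cursors :close :result-type :forward-only})))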
First off, thanks so much for taking the time to respond to my million questions, it's a great help.
When you say "If you are realizing each row completely, you'll use the same amount of memory overall", do you mean the same amount of memory as execute! or plan? I just figured garbage collection was quick enough that the memory used for the materialized row would be freed up with each iteration...
Realizing a row to a (qualified/transformed) hash map takes X amount of memory and Y amount of time. If your result set has 1,000 rows, you'll use 1,000X memory and 1,000Y time -- in both execute! and plan (if you are realizing rows).
It is possible to overrun GC even with plan but relatively unlikely (although it depends what else your system is doing).
I'm really just pointing out that plan offers the ability to avoid using X amount of memory per row if you can use just the column labels -- but it's going to depend on what you do with those column values.
There's no simple nor single answer here. "It. Depends."
Hmm, ok, thanks for spelling it out for me! So I guess most of the memory improvement I was seeing was from following https://github.com/seancorfield/next-jdbc/blob/develop/doc/tips-and-tricks.md#streaming-result-sets-1 in your docs, which mitigated the fact that I was fully realizing the rows.
I'm not sure what "memory improvement" you're talking about there.
If you're saying "with execute! I got an Out Of Memory error and when I used plan I did not" -- well, yes, your throughput on processing the rows allowed the GC to avoid the OOM error.
If I recall, we were getting an OOM before we added those options to plan, but I'm confirming now with someone who made the PR.
The "problem" here is that SQL and databases are fussy things and every one is different, so there's no single, "true" answer for a variety of problems. Which means it's important to understand what's really going on in your code and what trade-offs you are making. When I reach for plan, I'm generally also trying to avoid the overhead of realizing a full hash map from a row in the first place -- more so than trying to stream a very large result set that wouldn't fit in memory.
(which is why I may seem to be a) laboring the point b) being obtuse c) not answering the question definitively)
Ok, that's very useful to know. I think given this discussion I'll move forward and not attempt to kebabify the maps, as it would interfere with the gains from plan.
I realize most issues like this don't have a definitive answer, so I appreciate the thoughtful responses!
You could call datafiable-result-set on each row if you want streaming but also want a fully-realized hash map for each row that uses your builder function.
But if your data volumes are large enough to require streaming, you probably don't want to fully-realize a Clojure hash map for each row!
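With that caveat in mind, a minimal sketch of the per-row realization using next.jdbc.result-set/datafiable-row (the per-row function mentioned earlier in the thread; ds, the query, and the column are placeholders): each row is realized with the builder function inside the eager reduction over plan.

(require '[next.jdbc :as jdbc]
         '[next.jdbc.result-set :as rs])

(let [opts {:builder-fn rs/as-unqualified-kebab-maps}]
  (run! (fn [row]
          ;; realizes just this one row as a kebab-case hash map
          (let [m (rs/datafiable-row row ds opts)]
            (println (:role-summary m))))
        (jdbc/plan ds ["select role_summary from roles"] opts)))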