#sql
2022-07-14
Benjamin09:07:55

related to next.jdbc plan - can I make a lazy-seq from an IReduceInit? The use case is to feed it into https://techascent.github.io/tech.ml.dataset/tech.v3.libs.parquet.html

seancorfield14:07:16

The short answer is "no". Reduction is inherently eager. next.jdbc deliberately avoids lazy seqs because of the resource management issue. The Parquet support would need to be enhanced to accept an IReduceInit, I think.

Benjamin15:07:11

yes this makes sense :thumbsup:

Benjamin15:07:52

@UDRJMEFSN I don't know if I'm an outlier but making a parquet file via reduce instead of doseq would help me

chrisn17:07:40

So you want a sink of some sort (a function that takes 1 argument and returns nothing) that can be closed - i.e. implements java.lang.AutoCloseable? In Java land you want a Consumer that also implements AutoCloseable, which takes datasets and writes them to a parquet file.
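
[Editor's note: a minimal Clojure sketch of the sink shape described above. The write-batch! and finish! functions are hypothetical placeholders for whatever the underlying Parquet writer would expose; they are passed in as parameters so the sketch stays self-contained.]

```clojure
(defn parquet-sink
  "Sketch only: an object you can push TMD datasets into and then close.
   write-batch! and finish! are hypothetical stand-ins for the real writer."
  [write-batch! finish!]
  (reify
    java.util.function.Consumer
    (accept [_ dataset] (write-batch! dataset))   ; push one dataset/batch
    java.lang.AutoCloseable
    (close [_] (finish!))))                       ; terminate the parquet stream

;; Usage shape (write-dataset-batch! / close-parquet-stream! are hypothetical):
(comment
  (with-open [sink (parquet-sink write-dataset-batch! close-parquet-stream!)]
    (.accept sink dataset-1)
    (.accept sink dataset-2)))
```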

chrisn17:07:53

Can you get a ResultSet from a plan?

chrisn17:07:12

If you can get a java.sql.ResultSet from a plan then you can get a sequence of datasets from that.

seancorfield17:07:29

The thinking was for a Clojure-style API that could accept an IReduceInit (reducible) and reduce it into Parquet files, in addition to being able to pass a dataset in as a (lazy) sequence.
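
[Editor's note: a sketch of what "reduce a reducible into Parquet" could mean mechanically; this is not an existing TMD function. Rows are accumulated into fixed-size batches and each batch is handed to a write-batch! function (hypothetical). The row-fn parameter realizes each row, e.g. via select-keys for a next.jdbc plan, whose row views are only valid during the reduction.]

```clojure
(require '[next.jdbc :as jdbc])

(defn reducible->batches!
  "Sketch only: reduce any IReduceInit into fixed-size batches and hand each
   batch to write-batch! (a hypothetical function that writes one batch)."
  [reducible {:keys [batch-size row-fn write-batch!]
              :or   {batch-size 10000 row-fn identity}}]
  (let [leftover (reduce (fn [batch row]
                           (let [batch (conj batch (row-fn row))]
                             (if (>= (count batch) batch-size)
                               (do (write-batch! batch) [])
                               batch)))
                         []
                         reducible)]
    (when (seq leftover)
      (write-batch! leftover))))

;; Usage shape against a next.jdbc plan (db and table are placeholders):
(comment
  (reducible->batches!
    (jdbc/plan db ["select id, amount from orders"])
    {:batch-size    50000
     :row-fn        #(select-keys % [:id :amount])
     :write-batch!  (fn [rows] (println "would write" (count rows) "rows"))}))
```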

seancorfield17:07:43

It isn't about being "closeable" specifically.

Benjamin17:07:20

Yeah, currently I don't think I really get the ResultSet when using next.jdbc's plan. And if I did, I think I'd have to realize the whole result.

seancorfield17:07:50

next.jdbc's machinery (for plan) is based on not actually running the query until you reduce the result of plan -- at which point it sets up the Connection (from whatever "connectable" you passed in), runs the query, runs the reducing function over the ResultSet (in a lightly-wrapped form), and then cleans up whatever resource allocation it had to do to get the Connection in the first place. The ResultSet itself is not directly exposed (because it is managed by next.jdbc under the hood).
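
[Editor's note: a minimal sketch of the only supported way to consume a plan, as described above. Nothing runs until the reduce, and column values must be read from the row during the reduction, since the row view is backed by the live ResultSet. The db spec, table, and column are placeholders.]

```clojure
(require '[next.jdbc :as jdbc])

(def db {:dbtype "postgresql" :dbname "example"})   ; placeholder connectable

;; The query runs only when the reduction runs; next.jdbc opens the
;; Connection, reduces over the (lightly wrapped) ResultSet, then cleans up.
(reduce (fn [total row]
          (+ total (:amount row)))   ; read column values inside the reduction
        0
        (jdbc/plan db ["select amount from orders"]))
```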

chrisn19:07:02

OK - resultset->dataset-seq is a more efficient pathway than resultset->seq-of-maps->partition-all->dataset-seq. TMD's SQL bindings already have result-set->dataset-seq (https://techascent.github.io/tech.ml.dataset.sql/tech.v3.dataset.sql.html#var-result-set-.3Edataset-seq) for that purpose. It may also have type fidelity differences depending on date types and things like that - it always standardizes on java.time types, as those have a compressed representation in TMD land. If I want to flip it around to a push model, then the thing that makes the most sense to me is what I said - something that has both accept and close, since you have to pass it TMD datasets and then terminate the parquet stream. For your use case, if you don't want to use the TMD machinery, I think what you want is to build a record-oriented interface directly on Parquet - something the Parquet libraries support independently of TMD via their generic object pathways. That seems to match your processing style more directly, but then you would have to rebuild the type mapping that maps TMD to Parquet. Stepping away - kid is upset 🙂.

👀 1
seancorfield18:07:46

Given that API, I'd say the best approach is to use next.jdbc to produce a PreparedStatement (via the prepare call) and then use Java interop to .execute() it and .getResultSet() and pass that to TMD -- but beware of the caveat in the TMD docs about closing the statement etc when you're done, esp. if you terminate processing before consuming all the data (liberal use of with-open) @U02CV2P4J6S

❤️ 1
✔️ 1
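
[Editor's note: a sketch of the approach Sean suggests, combining next.jdbc's prepare with TMD's result-set->dataset-seq (the function linked earlier, from the separate tech.ml.dataset.sql library; exact options may differ from this sketch). The db spec and table name are placeholders. The dataset seq must be consumed fully inside the with-open, since it is backed by the live ResultSet.]

```clojure
(require '[next.jdbc :as jdbc]
         '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.sql :as ds-sql])   ; tech.ml.dataset.sql dependency

(def db {:dbtype "postgresql" :dbname "example"})   ; placeholder connectable

(with-open [conn (jdbc/get-connection db)
            stmt (jdbc/prepare conn ["select * from big_table"])
            rs   (.executeQuery stmt)]
  ;; Consume the dataset seq fully inside the with-open. Here we only count
  ;; rows; this is the point where the seq would be handed to TMD's Parquet
  ;; writer (see the tech.v3.libs.parquet docs linked at the top).
  (reduce + 0 (map ds/row-count (ds-sql/result-set->dataset-seq rs))))
```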
aaelony16:07:41

hello - does anyone have an example project that can connect over JDBC to SQL Server with MFA (`[com.microsoft.azure/msal4j "1.13.0"]` and Authentication=ActiveDirectoryInteractive)? much appreciated
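
[Editor's note: no example project was posted in the thread. The following is only an untested sketch of the connection shape the question describes, using the Microsoft JDBC driver's ActiveDirectoryInteractive authentication; server, database, and dependency versions are placeholders, and msal4j must be on the classpath alongside mssql-jdbc.]

```clojure
(require '[next.jdbc :as jdbc])

;; deps (versions are placeholders): com.microsoft.sqlserver/mssql-jdbc
;; plus com.microsoft.azure/msal4j
(def db
  {:jdbcUrl (str "jdbc:sqlserver://yourserver.database.windows.net:1433;"
                 "database=yourdb;"
                 "authentication=ActiveDirectoryInteractive;"
                 "loginTimeout=30")})

(comment
  ;; Triggers the interactive AAD login flow when the connection is opened.
  (jdbc/execute! (jdbc/get-datasource db) ["select 1 as ok"]))
```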

chrisn21:07:21

For any SQL people interested in cutting-edge stuff, I have basic bindings - https://github.com/cnuernber/tmducken - to https://duckdb.org/. We were talking in another thread about writing some SQL and saving the results as one or more parquet files, so I thought maybe I would bring this up here. DuckDB can run SQL queries directly against parquet and arrow data on disk and is extremely highly optimized - more so, by a bit, than what is possible on the JVM. The results can be realized in memory and converted in place to a TMD dataset (https://cnuernber.github.io/tmducken/tmducken.duckdb.html#var-sql-.3Edataset), where you can do more custom work, or work that is irritating to translate into SQL. tmducken is not a finished concept and duckdb is a moving target, but I thought perhaps some people in this channel may be interested in what I consider next-gen SQL systems for high-performance processing. The basics of running SQL and visualizing the result from the REPL (via conversion to a TMD dataset) are there.

🙌 1
👀 2
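
[Editor's note: a rough usage sketch. Only sql->dataset is confirmed by the docs linked above; the setup calls (initialize!, open-db, connect) and the parquet file path are assumptions based on the tmducken README, so check that README before relying on this.]

```clojure
(require '[tmducken.duckdb :as duckdb])

(duckdb/initialize!)                ; assumed: loads the duckdb shared library
(def db   (duckdb/open-db))         ; assumed: in-memory database
(def conn (duckdb/connect db))      ; assumed

;; DuckDB can query parquet files on disk directly; the result comes back
;; as a TMD dataset you can work with at the REPL.
(duckdb/sql->dataset conn "select count(*) from 'events.parquet'")
```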
chrisn21:07:50

One thing I will caution - the DuckDB JDBC driver is not an efficient (or robust) way to access the API - hence tmducken's C bindings.