2024-04-24 data-science | Clojure Slack Archive

data-science 2024-04-24

Sagar Vrajalal 2024-04-24T11:23:54.343589Z

This might be a bit of a broad and subjective question, but I have some data on Postgres on which I need to run an expensive SQL query, whose results I need to paginate, sort and filter. The resulting data is deeply nested and the schema is dynamic. This query is slow and I'd like to save it to disk periodically so it can be served quicker (for reporting purposes). I also want to do the pagination, sorting and filtering in memory (I know, bad idea!), because there are various combinations that can be applied and persisting them all to disk seems inefficient. A couple of questions: - Can I use tech.ml.dataset to do these operations efficiently in memory? (It looks like the purpose of the library, but I'm not sure if my use case fits) - What is the best data format for me to use? After experimentation, nippy works out of the box (as expected), whereas parquet and arrow keep giving me various errors (Unsupported types like :decimal and :object, doesn't handle nil values, etc). My first thought was a simple CSV but I ruled it out because of the nested data. Perhaps I'm in the wrong space and I shouldn't be doing any of this at all 🙂

bocaj 2024-04-26T02:04:39.463399Z

Try https://github.com/techascent/tech.ml.dataset.sql Tech ml dataset will fit very large datasets in memory, so I’m always surprised how far in memory will take me.

🙏 1

Daniel Slutsky 2024-04-24T13:14:02.125209Z

Hi! For this kind of discussion, the best space is the Clojurians Zulip chat. https://scicloj.github.io/docs/community/chat/ There are a few people there who are experienced with tech.ml.dataset, combining it with nippy, etc.

respatialized 2024-04-24T15:14:24.689099Z

On the “don’t do this all in memory” side of the equation, you should look into reducing over your result set rather than loading it all into memory with t.m.d. It can be very efficient to use transducers if you know you’re going to be filtering and transforming on a row by row basis; that way you can push other transformations to the end after your data has been preprocessed by the transducer pipeline. https://cljdoc.org/d/com.github.seancorfield/next.jdbc/1.3.925/doc/getting-started#plan--reducing-result-sets

Sagar Vrajalal 2024-04-25T06:07:59.788929Z

Reducing over the result set definitely helps making things more efficient, but I guess my bigger concern is around pagination combined with sorting all the rows. That's where the query just slows down (obviously). Hence why I wanted to just persist the raw query results to disk and apply any sorting/transforms when reading it back

Clojurians Log v2

data-science 2024-04-24