2024-10-17 data-science | Clojure Slack Archive

data-science 2024-10-17

ag 2024-10-17T14:52:36.589389Z

Can someone please comment on this notion of someone's:
> Clojure doesn't have a standard, well-maintained dataframe library - so it is not suitable for any medium to large data science.
I don't do much data science and don't want to reply with anything wrong or misleading.

teodorlu 2024-10-18T10:37:08.574139Z

I'd also point to Tablecloth instead of TMD. Tablecloth is a bit higher level, but uses TMD under the hood.

Ludger Solbach 2024-10-18T12:14:17.285189Z

With noj, you get a curated set of dependencies.

Rupert (Sevva/All Street) 2024-10-18T13:48:05.714919Z

> Can someone please comment on this notion of someone's:

I firmly disagree with it. A huge amount of data science is just processing TSV/CSV/JSON files - which Clojure does great at. Clojure has access to every library that Java has (including Parquet, Arrow, Spark, etc.), and Java absolutely is a data science language.
> well-maintained dataframe library
Data frame libraries are a crutch for slow and non-expressive programming languages - Clojure on the other hand is fast and a joy to use directly to manipulate data - with Clojure you don't have to rely on a dataframe library unless you really want to.

👍 2

Ludger Solbach 2024-10-18T16:00:26.365199Z

I have used plain Clojure and Incanter for data science and data engineering tasks working with CSV, JSON and Avro files. Plain Clojure with transducers (and the https://github.com/cgrand/xforms library) gets you very far. But it's good that we have a data science story for Clojure, too.
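
For illustration, a minimal sketch of the plain-Clojure transducer style mentioned above. The column layout (id,country,amount), the naive comma split, and the names `de-amounts` and `total-amount` are assumptions made up for this example, not anything from the thread; real CSV with quoted fields would need a proper parser such as clojure.data.csv.

```clojure
(require '[clojure.string :as str])

;; Hypothetical input: lines of "id,country,amount" with one header line.
(def de-amounts
  (comp (drop 1)                                  ; skip the header line
        (map #(str/split % #","))                 ; naive split, no quoting support
        (filter #(= "DE" (nth % 1)))              ; keep rows whose country is DE
        (map #(Double/parseDouble (nth % 2)))))   ; parse the amount column

(defn total-amount
  "Sums the amount column of the DE rows in a single pass,
  without building intermediate sequences."
  [lines]
  (transduce de-amounts + 0.0 lines))
```

The same function should work on a file by passing `(line-seq reader)` inside `with-open`, so rows are read and reduced one at a time.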

2024-10-19T09:34:20.689619Z

I am also using https://github.com/techascent/tech.ml.dataset extensively. It is both well maintained and very performant and space efficient, even for very large datasets with hundreds of millions of rows, as it uses primitive arrays as its backing store. Using it via https://scicloj.github.io/tablecloth adds a very carefully designed API on top of it for easier use. Of course you need to be careful not to "copy" your whole data into a Clojure sequence, which will eventually explode your heap. For me it plays absolutely in the same "quality" field as pandas for Python or dplyr for R.

Rupert (Sevva/All Street) 2024-10-19T09:38:41.868359Z

> Of course you need to be careful to not "copy" your whole data into a Clojure sequence, which will eventually explode your heap.
You can absolutely copy your whole data into a ‘lazy sequence’ though. Then process it with functions like map and filter no matter how big it is.

2024-10-19T09:43:18.237699Z

Right, I meant "be careful when doing it". For a beginner it is not obvious when this will explode, when combining "laziness" and "big memory needs". Laziness can give the (sometimes nice) illusion that you have a low memory / fast situation, until you aggregate and "realize the sequences" and it "suddenly" fails.
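
A small sketch of the difference being described, using only clojure.core. The row shape and the names `parse-row`, `sum-even-values` and `count-and-sum` are hypothetical, chosen just for this illustration.

```clojure
;; Hypothetical row shape, standing in for a parsed record.
(defn parse-row [n] {:id n :value (* n 2)})

(defn sum-even-values
  "Streams n rows through a lazy pipeline. No reference to the head is
  kept, so consumed rows can be garbage collected as reduce advances."
  [n]
  (->> (range n)
       (map parse-row)
       (filter #(even? (:id %)))
       (map :value)
       (reduce +)))

(defn count-and-sum
  "Looks just as lazy, but `rows` is used twice: `count` realizes every
  row, and the fully realized, cached sequence must stay in memory while
  `count` runs because it is needed again for the sum."
  [n]
  (let [rows (map parse-row (range n))]
    [(count rows) (reduce + (map :value rows))]))
```

Both functions are fine for small `n`; with enough rows only the second one is likely to hit the heap limit.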

👍 2

2024-10-19T09:44:47.122649Z

Especially as we also have "caching" as a feature of the standard Clojure lazy idioms.

Rupert (Sevva/All Street) 2024-10-19T10:56:16.678189Z

Yeah, sometimes I will build/run algorithms with a low memory setting (eg -Xmx128m) so that things ‘fail fast’ if they are not correctly lazy.

Harold 2024-10-17T14:58:51.500329Z

TMD: https://github.com/techascent/tech.ml.dataset Both clauses of that someone's notion are false, proceed with caution. Unless it's someone you care about personally, it may be better to just ignore them. TMD is a great library for datasets, and with or without it Clojure is very well suited for data science of all shapes and sizes.

➕ 3

ag 2024-10-17T15:10:53.232679Z

Also, anything used in Java/Javascript can be used as well, right? Spark, Tablesaw, Joinery, Morpheus, Dataframes. On JS side - Danfo, Apache Arrow, Data-forge, Jsstat, Arquero. Any of these have any relevance here?

fedreg 2024-10-17T15:50:55.859089Z

Hi Ag! FWIW, we use java interop directly with native spark. We ship clojure jars that spit out dataframes and call those from scala jobs. works great!! It was a bit tricky to set up but has worked wonderfully so I wouldn't say the lack of a native clojure DF lib has hurt us in any way

ag 2024-10-17T16:06:33.990359Z

Hey, hi, man, great to see you. This sounds very cool. Thank you.

Ludger Solbach 2024-10-17T16:15:35.017759Z

As Harold said, there is no lack of a native clojure DF library. Plus you can use anything on JVM/JS, and even Python, if you're inclined to do so.

➕ 1

respatialized 2024-10-17T19:22:01.515709Z

Also, using next.jdbc with DuckDB can get you quite far for quite a lot of workaday data wrangling tasks that you might use pandas or dplyr for (and may be faster than either of them in many cases). We don’t lack for options. I wonder if this is just based on the fact that one of the only published books on data science in Clojure recommends Incanter, which has indeed largely fallen out of favor. It’s probably not as fast as TMD/tablecloth, but it’s stable and usable as of last year.

👍 1

Harold 2024-10-17T20:44:49.665259Z

Yes. Leveraging DuckDB is smart.
TMD connector: https://github.com/techascent/tmducken
Blog post: https://techascent.com/blog/just-ducking-around.html
---
Over the next few years I would not be surprised at all to see a second generation of such books - daslu's tireless work + kira's, and many others. The book you mention was bold, but unfortunately plays into stereotypes like the one shared by OP. This is all temporary, though. Practitioners are already discovering the benefits of functional data science.

respatialized 2024-10-17T20:59:43.990449Z

also, just because it hasn’t been directly mentioned: tablecloth. https://scicloj.github.io/tablecloth/ It’s a wrapper library that brings the ergonomics of something like dplyr to tech.ml.dataset and has quickly become my favorite API for data manipulation. Joins, aggregations, columnar operations, windows, you name it. Fantastic to work with.

👍 1

➕ 2

Ludger Solbach 2024-10-18T06:56:20.873319Z

And to get you started, just include a https://github.com/scicloj/noj dependency, which contains the dependencies for the data science stack.
