Can someone please comment on this notion of someone's:
> Clojure doesn't have a standard, well-maintained dataframe library - so it is not suitable for any medium to large data science.
I don't do much data science and don't want to reply with anything wrong or misleading.
I'd also point to Tablecloth instead of TMD. Tablecloth is a bit higher level, but uses TMD under the hood.
With noj, you get the curated set of dependencies.
I firmly disagree with it. A huge amount of data science is just processing TSV/CSV/JSON files - which Clojure does great at. Clojure has access to every library that Java has (including Parquet/Arrow/Spark etc.), and Java absolutely is a data science language.
> well-maintained dataframe library
Data frame libraries are a crutch for slow and non-expressive programming languages - Clojure on the other hand is fast and a joy to use directly to manipulate data - with Clojure you don't have to rely on a dataframe library unless you really want to.
I have used plain Clojure and Incanter for data science and data engineering tasks working with CSV, JSON and Avro files. Plain Clojure with transducers (and the https://github.com/cgrand/xforms library) gets you very far. But it's good that we have a data science story for Clojure, too.
I am also using https://github.com/techascent/tech.ml.dataset extensively. It is both well maintained and very performant and space efficient, even for very large datasets with hundreds of millions of rows, as it uses primitive arrays as its backing store. Using it via https://scicloj.github.io/tablecloth adds a very carefully designed API on top of it for easier use. Of course you need to be careful not to "copy" your whole data into a Clojure sequence, which will eventually explode your heap. For me it plays absolutely in the same "quality" league as pandas for Python or dplyr for R.
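To make the "plain Clojure with transducers" point concrete, here is a minimal sketch using only clojure.core and clojure.string. The sample lines, the `parse-row` helper and the threshold are made up for illustration; real CSV with quoted fields would need a proper parser.

```clojure
(ns example.plain-csv
  (:require [clojure.string :as str]))

;; Hypothetical sample data; in practice these lines could come from
;; (line-seq (clojure.java.io/reader "some-file.csv")).
(def lines
  ["city,amount"
   "Oslo,10"
   "Bergen,5"
   "Oslo,7"
   "Bergen,1"])

(defn parse-row
  "Naive split on commas; does not handle quoted fields."
  [line]
  (let [[city amount] (str/split line #",")]
    {:city city :amount (Long/parseLong amount)}))

;; Transducer pipeline: skip the header, parse, keep rows with amount > 2.
(def xf
  (comp (drop 1)
        (map parse-row)
        (filter #(> (:amount %) 2))))

;; Sum amounts per city in a single pass, without intermediate sequences.
(def totals
  (transduce xf
             (completing
               (fn [acc {:keys [city amount]}]
                 (update acc city (fnil + 0) amount)))
             {}
             lines))
;; totals => {"Oslo" 17, "Bergen" 5}
```

The same `xf` can be reused with `into`, `sequence` or `eduction`, which is a large part of why this style scales to bigger files.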
> Of course you need to be careful to not "copy" your whole data into a Clojure sequence, which will eventually explode your heap. You can absolutely copy your whole data into a ‘lazy sequence’ though. Then process it with functions like map and filter no matter how big it is.
Right, I meant "be careful when doing it". For a beginner it is not obvious when this will explode, once "laziness" and "big memory needs" are combined. Laziness can give the (sometimes nice) illusion that you are in a low-memory / fast situation, until you aggregate and "realize the sequences" and it "suddenly" fails.
Especially since the standard Clojure lazy idioms also have "caching" as a feature.
Yeah, sometimes I will build/run algorithms with a low memory setting (eg -Xmx128m) so that things ‘fail fast’ if they are not correctly lazy.
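A small sketch of the pitfall discussed above, in plain Clojure. The `records` function and its fields are hypothetical; the point is only the difference between streaming through a lazy sequence and keeping its head reachable while it is realized.

```clojure
;; A lazy sequence of hypothetical records; nothing is computed until consumed.
(defn records [n]
  (map (fn [i] {:id i :value (mod i 7)}) (range n)))

;; Streams: reduce consumes the elements as it goes and nothing holds
;; the head of the sequence, so consumed elements can be garbage collected.
(defn total-value [n]
  (reduce + 0 (map :value (records n))))

;; Risky for large n: `rs` is still needed for `count` after the first
;; reduce, so the head stays reachable and, because lazy seqs cache what
;; they realize, every record ends up in memory at once.
(defn total-and-count [n]
  (let [rs (records n)]
    [(reduce + 0 (map :value rs)) (count rs)]))

;; Same result in one pass, without retaining the sequence.
(defn total-and-count-one-pass [n]
  (reduce (fn [[sum cnt] r] [(+ sum (:value r)) (inc cnt)])
          [0 0]
          (records n)))
```

Running the second version under a small heap (the `-Xmx128m` trick above) with a large `n` is a quick way to see the difference.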
TMD: https://github.com/techascent/tech.ml.dataset Both clauses of that someone's notion are false, proceed with caution. Unless it's someone you care about personally, it may be better to just ignore them. TMD is a great library for datasets, and with or without it Clojure is very well suited for data science of all shapes and sizes.
Also, anything used in Java/Javascript can be used as well, right? Spark, Tablesaw, Joinery, Morpheus, Dataframes. On JS side - Danfo, Apache Arrow, Data-forge, Jsstat, Arquero. Any of these have any relevance here?
Hi Ag! FWIW, we use java interop directly with native spark. We ship clojure jars that spit out dataframes and call those from scala jobs. works great!! It was a bit tricky to set up but has worked wonderfully so I wouldn't say the lack of a native clojure DF lib has hurt us in any way
Hey, hi, man, great to see you. This sounds very cool. Thank you.
As Harold said, there is no lack of a native clojure DF library. Plus you can use anything on JVM/JS, and even Python, if you're inclined to do so.
Also, using next.jdbc with DuckDB can get you quite far for quite a lot of workaday data wrangling tasks that you might use pandas or dplyr for (and may be faster than either of them in many cases). We don’t lack for options. I wonder if this is just based on the fact that one of the only published books on data science in Clojure recommends Incanter, which has indeed largely fallen out of favor. It’s probably not as fast as TMD/tablecloth, but it’s stable and usable as of last year.
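For the next.jdbc + DuckDB route, a rough sketch might look like the following. It assumes the next.jdbc and DuckDB JDBC driver dependencies are on the classpath; the `sales` table and its contents are invented for illustration, and the exact shape of the result keys depends on the driver.

```clojure
(ns example.duckdb-sketch
  (:require [next.jdbc :as jdbc]))

(defn city-totals
  "Builds a tiny hypothetical table in an in-memory DuckDB and aggregates it."
  []
  ;; "jdbc:duckdb:" with no path opens an in-memory database.
  (with-open [conn (jdbc/get-connection {:jdbcUrl "jdbc:duckdb:"})]
    (jdbc/execute! conn ["CREATE TABLE sales (city VARCHAR, amount INTEGER)"])
    (jdbc/execute! conn ["INSERT INTO sales VALUES ('Oslo', 10), ('Bergen', 5), ('Oslo', 7)"])
    ;; Returns a vector of row maps, one per city.
    (jdbc/execute! conn ["SELECT city, SUM(amount) AS total
                          FROM sales GROUP BY city ORDER BY total DESC"])))
```

In real use the `CREATE`/`INSERT` steps would typically be replaced by a query that reads CSV or Parquet files directly from DuckDB.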
Yes. Leveraging DuckDB is smart. TMD connector: https://github.com/techascent/tmducken Blog post: https://techascent.com/blog/just-ducking-around.html --- Over the next few years I would not be surprised at all to see a second generation of such books - daslu's tireless work + kira's, and many others. The book you mention was bold, but unfortunately plays into stereotypes like the one shared by OP. This is all temporary, though. Practitioners are already discovering the benefits of functional data science.
also, just because it hasn’t been directly mentioned: tablecloth. https://scicloj.github.io/tablecloth/ It’s a wrapper library that brings the ergonomics of something like dplyr to tech.ml.dataset and has quickly become my favorite API for data manipulation. Joins, aggregations, columnar operations, windows, you name it. Fantastic to work with.
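To give a feel for the API, here is a minimal sketch of a grouped aggregation. It assumes tablecloth is on the classpath (for example via noj); the `sales` data and column names are made up.

```clojure
(ns example.tablecloth-sketch
  (:require [tablecloth.api :as tc]))

;; Hypothetical data, built from a map of column name -> values.
(def sales
  (tc/dataset {:city   ["Oslo" "Bergen" "Oslo" "Bergen"]
               :amount [10 5 7 1]}))

;; Total amount per city, largest first. Each aggregation function
;; receives the sub-dataset of one group; (ds :amount) is that group's column.
(def totals
  (-> sales
      (tc/group-by [:city])
      (tc/aggregate {:total (fn [ds] (reduce + (ds :amount)))})
      (tc/order-by :total :desc)))
```

Printing `totals` at the REPL shows a two-row dataset with `:city` and `:total` columns.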
And to get started, just include the https://github.com/scicloj/noj dependency, which pulls in the dependencies of the data science stack.