This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2023-10-11
Hi all, the recording of this event is now available on YouTube: https://www.youtube.com/watch?v=WA5O7jNoNGE A link to the slides is in the video description.
Fantastic talk! @UDRJMEFSN that talk idea you mention at the end about the commonalities between the carbon cycle and the metacircular interpreter sounds like an ideal topic for a Strange Loop type of event – hope you have an audience for that some day.
On the "Persistence - a cautionary tale" slide, it says that the pandas options are 1) obliterate your data or 2) run out of memory. I get that immutable data means your data doesn't get obliterated (i.e. #1). To prevent excessive memory usage (#2), is the idea that tmd uses structural sharing, or that transformations can be independently specified and composed without generating extra intermediate copies? In other words, intermediate copies aren't required to extract, transform, and then load data to some destination.
Or is there some other trick for getting the advantages of immutable data without excessive memory usage?
@UFTRLDZEW - thanks! I agree that Strange Loop would be a forum for just such a musing 🙂. @U7RJTCH6J - Both of those. That specific operation creates a map of categorical value to integer and then creates a https://github.com/techascent/tech.ml.dataset/blob/master/src/tech/v3/dataset/categorical.clj#L122. So, at a per-element-access cost that is minimized as much as possible, a virtualized definition is created on top of the original. Without writing back to the original, and while still using the original column, a new column that reflects the transformation is created. When data is copied only deliberately (a clone call), we can trade space for performance, and for O(N) transformations that tends to be the ideal tradeoff. That was the original theory anyway, and it worked fine, but we found times where it was just expedient to copy the data, for instance when we need to process it by applying a generic transformation that has no information associated with it. In that case you are still copying only the specific column you care about, and the dataset itself is persistent, so it gets persistently modified and all your older variables remain untouched. It's just generally more correct, at no meaningful performance cost and often with a memory savings. When we were working in pandas we often got burned by not copying when doing an operation. Sometimes, but not always, there are options that make an operation create a new dataframe, but these aren't the default, which leads to yet more chaos and, most often, more copies of the entire dataframe than are necessary. More or less standard functional reasoning, but applied in a domain where the assumption is that functional can't perform well...
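To make the idea above concrete, here is a minimal sketch in plain Python of a categorical-to-integer mapping exposed as a lazy view over an untouched original column. This is not tech.ml.dataset's implementation; `MappedColumn`, `categorical_to_int`, and the sample data are hypothetical names invented for illustration, and a plain dict stands in for a persistent dataset.

```python
from collections.abc import Sequence


class MappedColumn(Sequence):
    """Hypothetical read-only view: applies a lookup on each element access."""

    def __init__(self, source, lookup):
        self._source = source   # original column, neither modified nor copied
        self._lookup = lookup   # categorical value -> integer

    def __len__(self):
        return len(self._source)

    def __getitem__(self, index):
        # Integer indices only; the per-access cost is one dict lookup.
        return self._lookup[self._source[index]]


def categorical_to_int(column):
    """Build the value -> integer map and return it with a lazy view."""
    lookup = {}
    for value in column:
        lookup.setdefault(value, len(lookup))
    return lookup, MappedColumn(column, lookup)


species = ("cat", "dog", "cat", "bird")   # original column (illustrative data)
lookup, encoded = categorical_to_int(species)

# Adding a column builds a new dict that shares the existing column
# objects; the older `dataset` variable is left untouched.
dataset = {"species": species}
dataset_v2 = {**dataset, "species_id": encoded}

print(list(encoded))                       # [0, 1, 0, 2]
print(dataset_v2["species"] is species)    # True
print("species_id" in dataset)             # False
```

The only new allocation is the small lookup map; the original column is shared by both dataset versions, which is the space-for-access-time trade described in the message above.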
this is incredible work. I’ve started moving some pandas code over to tech.ml.dataset and while I’m still sorting through cases where clojure idioms are good calls and where hugging the library routines close makes more sense, I’m very happy to omit all the df.copy(deep=True)
😆
Due to social reasons (use by other data scientists, pointy-haired boss opinions, etc.) I’ve had to write the preprocessing scripts at the front of candel ingest in R and Python (via pret, the tables -> datomic tx tool I built at PICI that got open sourced). I’m happy to have been able to move a lot of that to clojure for my personal projects. Pandas sharp edges aside, just to be rid of the nightmare that is python dependency management. And the evolving catastrophe that is the colonization of the entire python data science ecosystem by type hinting zealots, big company enterprise OO, and people reacting to the fact that notebooks suck by naively trying to forgo interactive dev entirely and embrace TDD, etc. 😛
awesome, awesome stuff.
@U06GLTD17 - Man, it is great to hear that from you! For those of you who don't know, Ben and I worked together years ago, and our discussions and sometimes arguments really helped clarify in my mind why something like TMD was required on the JVM. So this is sort of like closing the loop - very satisfying 🙂.
Devastated I couldn’t make this one. I’ll be catching up on YouTube soon. Thanks for organising, @U0LCHMJTA and to @UDRJMEFSN for speaking. 🙏