This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2020-11-13
re: https://clojurians.slack.com/archives/CBJ5CGE0G/p1605207260374000 I had a play and you are of course correct
for the stuff I'm doing, passing around eductions works, as they all end up in transduce or into in the end
I can use it in run!
tho, which is handy. I wonder about adding an ISeq interface to the things in reducibles
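To illustrate the run! point: run! reduces its collection argument directly, so it accepts an eduction without ever building a seq (toy xform and input, not the real pipeline):

```clojure
;; run! walks the eduction via reduce, so no intermediate seq is built
;; (the xform and data here are stand-ins)
(let [acc (volatile! [])]
  (run! #(vswap! acc conj %) (eduction (map inc) (range 3)))
  @acc)
;; => [1 2 3]
```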
I once had a fun time discovering this exact problem in the code from a highly-paid consultant, which left me a little sensitive to it
eduction is going to recalculate things each time you run through it, so it is cheap in memory, but expensive in CPU
sequence realises things one at a time like eduction, but keeps the results in memory, so if you pass it around to other things they get to use the cached values. It will only realise as much of the underlying thing as you ask for tho, so if you don't need all the data then it won't get it all
into will greedily realise everything at the beginning, so if you are always going to want all of it then it is a good replacement for sequence
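A toy illustration of the three trade-offs just described (names and data are made up):

```clojure
(def xf (map inc))

(def ed (eduction xf (range 5))) ; recomputed on every traversal: cheap in memory, pays CPU each time
(def sq (sequence xf (range 5))) ; lazily realised once, then cached for later consumers
(def vc (into [] xf (range 5)))  ; eagerly realised up front, all in memory

;; all three traverse to the same answer
(reduce + 0 ed) ;; => 15
(reduce + 0 sq) ;; => 15
(reduce + 0 vc) ;; => 15
```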
@ben.hammond so calling
(sequence (take 10) (seq eduction-thingie))
works. I'm trying to figure out the downside (other than seq realising things in 32 element chunks I think)
I would question what the eduction
is actually buying you
(sequence (comp (take 10) xform-previously-hidden-inside-eduction) coll)
may work just as well
and it will be massively larger if the reducible transit on top of Fressian works and is performant enough
so those two things are fundamentally in tension because you don't know when a sequence's resources may be disposed of
as eduction returns an iterable thing that can be handed to seq, I thought this might be my escape hatch
reducibles know exactly when they are no longer required
sequences do not
it might just be a dumb idea and the reason to stick to eduction and reducible is to close things down ASAP
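To make the "reducibles know exactly when they are no longer required" point concrete, here is a common sketch (a hypothetical helper, not from the thread) of a reducible that closes its underlying resource the moment the reduce finishes:

```clojure
(require '[clojure.java.io :as io])

;; hypothetical helper: a reducible over the lines of a file that
;; closes the reader as soon as the reduction completes (or throws)
(defn lines-reducible [path]
  (reify clojure.lang.IReduceInit
    (reduce [_ f init]
      (with-open [rdr (io/reader path)]
        (reduce f init (line-seq rdr))))))
```

Handing the same lines out as a lazy seq gives you no such point at which the reader can safely be closed.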
so you could end up contorting yourself into things like
(defn lazywalk-reducible
  "walks the reducible in chunks of size n,
  returns an iterable that permits access"
  [n reducible]
  (reify java.lang.Iterable
    (iterator [_]
      (let [bq (java.util.concurrent.ArrayBlockingQueue. n)
            finished? (volatile! false)
            traverser (future
                        (reduce (fn [_ v] (.put bq v)) nil reducible)
                        (vreset! finished? true))]
        (reify java.util.Iterator
          (hasNext [_] (or (false? @finished?) (false? (.isEmpty bq))))
          (next [_] (.take bq)))))))
I appreciate that code can be made more concise... finished?
is superfluous
but if you want to end up with a sequence, and you only have a reducible to plug into it
I don't see how you can avoid this
and it has the downside that if you don't walk the entire Iterator, then it leaks resources
I suppose the eduction makes it pretty obvious that the thing on disk is fundamentally mutable too
er, does it? I don't follow
every time you query the csv file you get a different sequence of lines?
not usually in practice, but fundamentally. Another process could be writing things into that file (which would probably mess things up, but is what the OS allows)
I've always ended up with a single
(transduce
do-my-inputdata-transformations-on-lines-of-csv
write-my-outputs-insomeway-that-can-handle-it
initialise-my-outputs-somehow
dredge-my-enormous-csvfile-for-its-current-status)
kind of thing
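A toy instance of that transduce shape, with the placeholder names swapped for trivial stand-ins:

```clojure
(require '[clojure.string :as str])

;; toy stand-ins for the placeholder names above
(transduce
  (comp (map #(str/split % #","))  ; transform lines of "csv"
        (map first))               ; keep the first column
  conj                             ; write outputs by conj-ing
  []                               ; initialise outputs as a vector
  ["a,1" "b,2" "c,3"])             ; stand-in for the enormous csv
;; => ["a" "b" "c"]
```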
when I've had to do this kind of processing that's what my stuff looks like, but I've got lots of different cuts I need to do-my-inputdata-transformations-on-lines-of-csv
and write-my-outputs-insomeway-that-can-handle-it
is the file always available in its entirety?
but being able to reuse reducing step functions and transducer pipelines for one off things using eduction is handy while developing
do you have to wait for it to arrive in dribs and drabs?
so if you compose the xform functions doesn't that give you the same thing as the eduction? just not tied to a specific input
(which is probably a good thing)
you can compose infinitely deeply
so I'll do basic "read it and clean it" and then I'll have others that add particular derived fields or do some filtering or reduce things and then spit out a large reduced thing
composition all the way down
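For instance, a "read it and clean it" stage and a parsing stage composed into one reusable pipeline (all names and data hypothetical):

```clojure
(require '[clojure.string :as str])

;; hypothetical stages: clean raw lines, then parse them into fields
(def clean-lines (comp (remove str/blank?)
                       (map str/trim)))
(def parse-fields (map #(str/split % #",")))

;; stages compose into one transducer, reusable with into/transduce/eduction
(def pipeline (comp clean-lines parse-fields))

(into [] pipeline ["a,1" "   " " b,2"])
;; => [["a" "1"] ["b" "2"]]
```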
but being able to cache the intermediate results w/o having to go back to the original files usually gives me a good speed boost
but an eduction does not do caching?
so you are back to writing intermediate results into postgres? 8P
I've avoided that so far. Creating postgres tables on the fly and introducing a big external dependency for something batch based like this feels like a big pain.
are these intermediate results infinitely long?
are they manageable
ah right, so you want to have a lot of simultaneous calculations so that you can leverage
and then a moving window of intermediate calcs
hence the core.async
on a 16GB machine a -Xmx of 12GB is about the most I'm happy with to consistently avoid the OOM Killer
could be a job for https://clojure.github.io/clojure/clojure.core-api.html#clojure.core/agent ?
there's a thing I've never said before...
each agent running a transducer to process its own bit
and punting its results out to other agents in the reducing function
maybe?
kinda thing
what you gain on the swings...
at least w/core.async other people are looking at the guts of the pipeline too and I can always ask Alex for design advice 😄
interesting that there is another data engineering/data science DAG out there: https://clojurians.slack.com/archives/C0BQDEJ8M/p1605250177093100
feels like everyone is doing one. I quite like being able to do it in just a library and have things like transducers and reducing step functions work in lots of different ways
They are very useful, but it feels a bit like systems that abstract out network calls. They are very easy, but they are hiding a lot of things underneath that you might want control over or visibility into, or that will fail in ways you wouldn't expect
I've had some problems with things like x/by-key in transducers as there is a bug in core.async(?) that means the completing arity of the reducing bit is getting called multiple times
and obviously it changes how I need to reason about things moving through a core.async system, as some channels will be holding large amounts of data in memory rather than passing it downstream