Good Morning!


God morgen!


Friday, yo!


for the stuff I'm doing, passing around eductions works, as they all end up in transduce or into in the end


tho having them as a sequence would be handy as they just fit into memory


I can use it in run! tho, which is handy. I wonder about adding an ISeq interface to the things in reducibles

Ben Hammond10:11:57

I once had a fun time discovering this exact problem in the code from a highly-paid consultant which left me a little sensitive to it


That is a fair enough reason to be touchy about it


just thinking about the trade offs between eduction/sequence/into


eduction is going to recalculate things each time you run through it, so it is cheap in memory, but expensive in CPU


sequence realises things one at a time like eduction, but keeps the results in memory, so if you pass it around to other things they get to use the cached values. It will only realise as much of the underlying thing as you ask for tho, so if you don't need all the data then it won't get it all


AFAIU it also doesn't do chunks the way seq does


into will greedily realise everything at the beginning, so if you are always going to want all of it then it is a good replacement for sequence


are those good rules of thumb?


sounds good to me
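
A small REPL sketch of those rules of thumb, counting how often the mapping function actually runs when each flavour is reduced twice. The names `calls`, `counting-xf` and `run-counts` are made up for illustration; only `eduction`, `sequence` and `into` come from the discussion.

```clojure
;; Hypothetical counter: how many times the mapped fn has been invoked.
(def calls (atom 0))

(def counting-xf
  (map (fn [x] (swap! calls inc) (* x x))))

(defn run-counts
  "Builds each flavour, reduces it twice, and returns how many times
  the mapping fn ran in total."
  [input]
  (let [count-calls (fn [make]
                      (reset! calls 0)
                      (let [c (make)]
                        (reduce + 0 c)
                        (reduce + 0 c)
                        @calls))]
    {:eduction (count-calls #(eduction counting-xf input))
     :sequence (count-calls #(sequence counting-xf input))
     :into     (count-calls #(into [] counting-xf input))}))

(run-counts (range 5))
;; => {:eduction 10, :sequence 5, :into 5}
```

The eduction redoes the work on every pass, while `sequence` and `into` each pay once; the difference between those two is when they pay, not how much.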


@ben.hammond so calling (sequence (take 10) (seq eduction-thingie)) works. I'm trying to figure out the downside (other than seq realising things in 32 element chunks I think)

Ben Hammond11:11:37

I would question what the eduction is actually buying you


atm, eduction is wrapping up some IO on a csv

Ben Hammond11:11:46

(sequence (comp (take 10) xform-previously-hidden-inside-eduction) coll) may work just as well
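
One thing to watch when moving `take` into the `comp`: its position decides whether it limits inputs or outputs. A sketch, where `xf` is a hypothetical stand-in for whatever was hidden inside the eduction:

```clojure
;; Hypothetical stand-in for the transformation inside the eduction.
(def xf (comp (filter even?) (map #(* 10 %))))

(def source (range 100))

;; take applied to the eduction's output: the first 3 results
(def from-eduction (sequence (take 3) (eduction xf source)))
;; => (0 20 40)

;; take placed first in comp: the first 3 inputs, then filtered and mapped
(def limit-inputs (sequence (comp (take 3) xf) source))
;; => (0 20)

;; take placed last in comp: same result as taking from the eduction
(def limit-outputs (sequence (comp xf (take 3)) source))
;; => (0 20 40)
```

If the hidden transformation is strictly one-in-one-out the two orderings agree; with a `filter` or `mapcat` in there they do not.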


the InputStream is pointing at something largish


and it will be massively larger if the reducible transit on top of Fressian works and is performant enough


so I want a sequence that manages the file handle using reducible (I think)

Ben Hammond11:11:16

so those two things are fundamentally in tension because you don't know when a sequence's resources may be disposed of


as eduction returns an iterable thing that can be handed to seq I thought this might be my escape hatch


yeah, I agree that they are in tension

Ben Hammond11:11:02

reducibles know exactly when they are no longer required

Ben Hammond11:11:07

sequences do not


it might just be a dumb idea and the reason to stick to eduction and reducible is to close things down ASAP


(and I'm happy for that to be the answer)
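
For reference, a minimal sketch of a reducible that owns its file handle, which is why it can close things down as soon as the reduction ends (including early termination via `reduced`). `lines-reducible` is a hypothetical name, not something from the conversation, and it assumes plain line-oriented text rather than real CSV parsing.

```clojure
(require '[clojure.java.io :as io])

(defn lines-reducible
  "Returns a reducible over the lines of source (anything io/reader accepts).
  The reader is opened when a reduction starts and closed when it ends."
  [source]
  (reify clojure.lang.IReduceInit
    (reduce [_ f init]
      (with-open [^java.io.BufferedReader rdr (io/reader source)]
        (loop [acc init]
          (if (reduced? acc)
            @acc                                  ; early exit, e.g. from (take n)
            (if-let [line (.readLine rdr)]
              (recur (f acc line))
              acc)))))))                          ; end of input

;; Only two lines are read before the reader is closed:
(into [] (take 2) (lines-reducible (java.io.StringReader. "a\nb\nc")))
;; => ["a" "b"]
```

Wrapping it as `(eduction some-xform (lines-reducible f))` keeps the same property, since the eduction just reduces over it each time.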

Ben Hammond11:11:59

so you could end up contorting yourself into things like

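(defn lazywalk-reducible
  "Walks the reducible on a background thread, buffering up to n items,
  and returns an Iterable that permits access to them."
  [n reducible]
  (reify java.lang.Iterable
    (iterator [_]
      (let [bq (java.util.concurrent.ArrayBlockingQueue. n)
            finished? (volatile! false)
            ;; the future blocks on .put whenever the queue is full
            ;; (note that ArrayBlockingQueue rejects nil items)
            traverser (future (reduce (fn [_ v] (.put bq v)) nil reducible)
                              (vreset! finished? true))]
        (reify java.util.Iterator
          ;; can report true just before the traverser finishes,
          ;; in which case the following .take blocks
          (hasNext [_] (or (false? @finished?) (false? (.isEmpty bq))))
          (next [_] (.take bq)))))))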

Ben Hammond11:11:34

I appreciate that code can be made more concise... finished? is superfluous, but if you want to end up with a sequence and you only have a reducible to plug into it, I don't see how you can avoid this

Ben Hammond11:11:35

and it has the downside that if you don't walk the entire Iterator, then it leaks resources
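
A possible variant of `lazywalk-reducible`, sketched under a few assumptions: it marks the end of the stream with a sentinel object and keeps one item of lookahead, so `hasNext` never has to guess whether the producer is about to finish. The name `reducible->iterable` is hypothetical. It still cannot handle nil items, and it still has the leak described above: if the consumer walks away early, the producer stays blocked on `.put` (here on a daemon thread, so at least it does not hold up JVM exit).

```clojure
(defn reducible->iterable
  "Walks reducible on a daemon thread through a bounded queue of size n.
  Assumes a single consumer and no nil items."
  [n reducible]
  (reify java.lang.Iterable
    (iterator [_]
      (let [done     (Object.)                     ; end-of-stream sentinel
            bq       (java.util.concurrent.ArrayBlockingQueue. (int n))
            producer (doto (Thread.
                            ^Runnable
                            (fn []
                              (try
                                (reduce (fn [_ v] (.put bq v) nil) nil reducible)
                                (finally (.put bq done)))))
                       (.setDaemon true)
                       (.start))
            ;; one item of lookahead; blocks until the first item (or done) arrives
            pending  (volatile! (.take bq))]
        (reify java.util.Iterator
          (hasNext [_] (not (identical? @pending done)))
          (next [_]
            (let [v @pending]
              (when (identical? v done)
                (throw (java.util.NoSuchElementException.)))
              (vreset! pending (.take bq))
              v)))))))

(vec (seq (reducible->iterable 2 (eduction (map inc) (range 5)))))
;; => [1 2 3 4 5]
```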


I suppose the eduction makes it pretty obvious that the thing on disk is fundamentally mutable too

Ben Hammond11:11:12

er, does it? I don't follow

Ben Hammond11:11:33

every time you query the csv file you get a different sequence of lines?


not usually in practice, but fundamentally. Another process could be writing things into that file (which would probably mess things up, but is what the OS allows)

Ben Hammond11:11:56

I've always ended up with a single

kind of thing when I've had to do this kind of processing


that's what my stuff looks like, but I've got lots of different cuts I need to do-my-inputdata-transformations-on-lines-of-csv and write-my-outputs-insomeway-that-can-handle-it


up to a few hundred atm of the same files


which is why I keep coming back to core.async to do it

Ben Hammond11:11:14

is the file always available in its entirety?


but being able to reuse reducing step functions and transducer pipelines for one off things using eduction is handy while developing

Ben Hammond11:11:26

do you have to wait for it to arrive in dribs and drabs?


yeah, it is always available. It is all pretty batchy


and it is usually files rather than file


at least 2, often 30ish

Ben Hammond11:11:38

so if you compose the xform functions doesn't that give you the same thing as the eduction? just not tied to a specific input

Ben Hammond11:11:57

(which is probably a good thing)


it does, I just usually need to compose some on top of others

Ben Hammond11:11:35

you can compose infinitely deeply


so I'll do basic "read it and clean it" and then I'll have others that add particular derived fields or do some filtering or reduce things and then spit out a large reduced thing

Ben Hammond11:11:43

composure all the way down


but being able to cache the intermediate results w/o having to go back to the original files usually gives me a good speed boost


if I have enough RAM I can do that with into

Ben Hammond11:11:35

but an eduction does not do caching?


no, eduction doesn't which is why I was looking at putting it into a sequence

Ben Hammond11:11:07

so you are back to writing intermediate results into postgres? 8P


I've avoided that so far. Creating postgres tables on the fly and introducing a big external dependency for something batch based like this feels like a big pain.

Ben Hammond11:11:08

are these intermediate results infinitely long?


often the intermediate results are bigger than is easily handled in RAM

Ben Hammond11:11:13

are they manageable


not infinite


but just big enough that I worry about -Xmx and the OOM Killer

Ben Hammond11:11:46

ah right so you want to have a lot of simultaneous calculations so that you can leverage


(at least if I'm doing it on my laptop)

Ben Hammond11:11:08

and then a moving window of intermediate calcs


yeah, I've got cores sitting idle (which is why I'd like to use core.async)

Ben Hammond11:11:13

hence the core.async


and the size of the data pushes at the edges of 8/12/16GB


on a 16GB machine a -Xmx of 12GB is about the most I'm happy with to consistently avoid the OOM Killer


and then I need to shut down all browsers/slack/etc


that is doing it single threaded and in memory


it got better when I moved from ->> to using transducers


lots of speed up and fewer OutOfMemoryErrors

Ben Hammond11:11:08

there's a thing I've never said before...

Ben Hammond11:11:34

each agent running a transducer to process its own bit

Ben Hammond11:11:56

and punting its results out to other agents in the reducing function


agent and send-off?


that's not a terrible idea


I'll think about that


feels a bit like I'm reimplementing a small buggy subset of core.async 😉

Ben Hammond11:11:07

what you gain on the swings...


at least w/core.async other people are looking at the guts of the pipeline too and I can always ask Alex for design advice 😄


core.async channel + mult = the shareable eduction ?


async/reduce is just transduce


just have to set up the mechanism before pushing the batch data through


and having easy parallelism in pipeline-blocking is nice


interesting that there is another data engineering/data science DAG out there:


feels like everyone is doing one. I quite like being able to do it in just a library and have things like transducers and reducing step functions work in lots of different ways


I wish transducers had figured out shared state


in what way?


Well, pipeline can't use any stateful transducers, like distinct.
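
A core-only sketch of why that bites. It assumes, as far as I understand `pipeline`, that each item is pushed through its own fresh application of the transducer, so any state never spans more than one item. `per-item` is a hypothetical helper that imitates this without core.async.

```clojure
;; One application of the stateful transducer over the whole stream:
(def whole-stream (into [] (distinct) [1 1 2 1 3 2]))
;; => [1 2 3]

;; Rough emulation (an assumption, not core.async itself): every item gets
;; a fresh application of xf, so `distinct` has nothing to remember.
(defn per-item [xf items]
  (into [] (mapcat #(into [] xf [%])) items))

(def per-item-result (per-item (distinct) [1 1 2 1 3 2]))
;; => [1 1 2 1 3 2]
```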


I'm a bit in two minds about stateful transducers.


They are very useful, but it feels a bit like systems that abstract out network calls. They are very easy, but they are hiding a lot of things underneath that you might want control over or view of or will fail in ways you wouldn't expect


I've had some problems with things like x/by-key in transducers as there is a bug in core.async(?) that means the completing arity of the reducing bit is getting called multiple times


and obviously it changes how I need to reason about things moving through a core.async system as some channels will be storing up big memory stores of data rather than passing it downstream


It might just be as simple as there's no clear visual indicator when you're dealing with a stateful rather than stateless transducer, and limited guidance on when to use what


distinct! ?


it is definitely becoming more embodied knowledge and lore