This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2020-11-13
re: https://clojurians.slack.com/archives/CBJ5CGE0G/p1605207260374000 I had a play and you are of course correct
for the stuff I'm doing, passing around eductions works, as they all end up in transduce or into in the end
I can use it in run!
tho, which is handy. I wonder about adding an ISeq interface to the things in reducibles
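To illustrate the run! point: run! reduces its collection argument directly, so it accepts an eduction without ever building a seq (toy xform and input, not the real pipeline):

```clojure
;; run! walks the eduction via reduce, so no intermediate seq is built
;; (the xform and data here are stand-ins)
(let [acc (volatile! [])]
  (run! #(vswap! acc conj %) (eduction (map inc) (range 3)))
  @acc)
;; => [1 2 3]
```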
I once had a fun time discovering this exact problem in the code from a highly-paid consultant, which left me a little sensitive to it
eduction is going to recalculate things each time you run through it, so it is cheap in memory, but expensive in CPU
sequence realises things one at a time like eduction, but keeps the results in memory, so if you pass it around to other things they get to use the cached values. It will only realise as much of the underlying thing as you ask for tho, so if you don't need all the data then it won't get it all
into will greedily realise everything at the beginning, so if you are always going to want all of it then it is a good replacement for sequence
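A toy illustration of the three trade-offs just described (names and data are made up):

```clojure
(def xf (map inc))

(def ed (eduction xf (range 5))) ; recomputed on every traversal: cheap in memory, pays CPU each time
(def sq (sequence xf (range 5))) ; lazily realised once, then cached for later consumers
(def vc (into [] xf (range 5)))  ; eagerly realised up front, all in memory

;; all three traverse to the same answer
(reduce + 0 ed) ;; => 15
(reduce + 0 sq) ;; => 15
(reduce + 0 vc) ;; => 15
```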
@ben.hammond so calling
(sequence (take 10) (seq eduction-thingie))
works. I'm trying to figure out the downside (other than seq realising things in 32 element chunks I think)
I would question what the eduction
is actually buying you
(sequence (comp (take 10) xform-previously-hidden-inside-eduction) coll)
may work just as well
and it will be massively larger if the reducible transit on top of Fressian works and is performant enough
so those two things are fundamentally in tension because you don't know when a sequence's resources may be disposed of
as eduction returns an iterable thing that can be handed to seq, I thought this might be my escape hatch
reducibles know exactly when they are no longer required
sequences do not
it might just be a dumb idea and the reason to stick to eduction and reducible is to close things down ASAP
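To make the "reducibles know exactly when they are no longer required" point concrete, here is a common sketch (a hypothetical helper, not from the thread) of a reducible that closes its underlying resource the moment the reduce finishes:

```clojure
(require '[clojure.java.io :as io])

;; hypothetical helper: a reducible over the lines of a file that
;; closes the reader as soon as the reduction completes (or throws)
(defn lines-reducible [path]
  (reify clojure.lang.IReduceInit
    (reduce [_ f init]
      (with-open [rdr (io/reader path)]
        (reduce f init (line-seq rdr))))))
```

Handing the same lines out as a lazy seq gives you no such point at which the reader can safely be closed.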
so you could end up contorting yourself into things like
(defn lazywalk-reducible
  "walks the reducible in chunks of size n,
  returns an iterable that permits access"
  [n reducible]
  (reify java.lang.Iterable
    (iterator [_]
      (let [bq (java.util.concurrent.ArrayBlockingQueue. n)
            finished? (volatile! false)
            traverser (future
                        (reduce (fn [_ v] (.put bq v)) nil reducible)
                        (vreset! finished? true))]
        (reify java.util.Iterator
          (hasNext [_] (or (false? @finished?) (false? (.isEmpty bq))))
          (next [_] (.take bq)))))))
I appreciate that code can be made more concise... finished?
is superfluous
but if you want to end up with a sequence, and you only have a reducible to plug into it
I don't see how you can avoid this
and it has the downside that if you don't walk the entire Iterator, then it leaks resources
I suppose the eduction makes it pretty obvious that the thing on disk is fundamentally mutable too
er, does it? I don't follow
every time you query the csv file you get a different sequence of lines?
not usually in practice, but fundamentally. Another process could be writing things into that file (which would probably mess things up, but is what the OS allows)
I've always ended up with a single
(transduce
do-my-inputdata-transformations-on-lines-of-csv
write-my-outputs-insomeway-that-can-handle-it
initialise-my-outputs-somehow
dredge-my-enormous-csvfile-for-its-current-status)
kind of thing
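A toy instance of that transduce shape, with the placeholder names swapped for trivial stand-ins:

```clojure
(require '[clojure.string :as str])

;; toy stand-ins for the placeholder names above
(transduce
  (comp (map #(str/split % #","))  ; transform lines of "csv"
        (map first))               ; keep the first column
  conj                             ; write outputs by conj-ing
  []                               ; initialise outputs as a vector
  ["a,1" "b,2" "c,3"])             ; stand-in for the enormous csv
;; => ["a" "b" "c"]
```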
when I've had to do this kind of processing that's what my stuff looks like, but I've got lots of different cuts I need to do-my-inputdata-transformations-on-lines-of-csv
and write-my-outputs-insomeway-that-can-handle-it
is the file always available in its entirety?
but being able to reuse reducing step functions and transducer pipelines for one off things using eduction is handy while developing
do you have to wait for it to arrive in dribs and drabs?
so if you compose the xform functions doesn't that give you the same thing as the eduction? just not tied to a specific input
(which is probably a good thing)
you can compose infinitely deeply
so I'll do basic "read it and clean it" and then I'll have others that add particular derived fields or do some filtering or reduce things and then spit out a large reduced thing
composition all the way down
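For instance, a "read it and clean it" stage and a parsing stage composed into one reusable pipeline (all names and data hypothetical):

```clojure
(require '[clojure.string :as str])

;; hypothetical stages: clean raw lines, then parse them into fields
(def clean-lines (comp (remove str/blank?)
                       (map str/trim)))
(def parse-fields (map #(str/split % #",")))

;; stages compose into one transducer, reusable with into/transduce/eduction
(def pipeline (comp clean-lines parse-fields))

(into [] pipeline ["a,1" "   " " b,2"])
;; => [["a" "1"] ["b" "2"]]
```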
but being able to cache the intermediate results w/o having to go back to the original files usually gives me a good speed boost
but an eduction does not do caching?
so you are back to writing intermediate results into postgres? 8P
I've avoided that so far. Creating postgres tables on the fly and introducing a big external dependency for something batch based like this feels like a big pain.
are these intermediate results infinitely long?
are they manageable
ah right, so you want to have a lot of simultaneous calculations so that you can leverage
and then a moving window of intermediate calcs
hence the core.async
on a 16GB machine a -Xmx of 12GB is about the most I'm happy with to consistently avoid the OOM Killer
could be a job for https://clojure.github.io/clojure/clojure.core-api.html#clojure.core/agent ?
there's a thing I've never said before...
each agent running a transducer to process its own bit
and punting its results out to other agents in the reducing function
maybe?
kinda thing
what you gain on the swings...
at least w/core.async other people are looking at the guts of the pipeline too and I can always ask Alex for design advice 😄
interesting that there is another data engineering/data science DAG out there: https://clojurians.slack.com/archives/C0BQDEJ8M/p1605250177093100
feels like everyone is doing one. I quite like being able to do it in just a library and have things like transducers and reducing step functions work in lots of different ways
They are very useful, but it feels a bit like systems that abstract out network calls. They are very easy, but they are hiding a lot of things underneath that you might want control over or visibility into, or that will fail in ways you wouldn't expect
I've had some problems with things like x/by-key in transducers as there is a bug in core.async(?) that means the completing arity of the reducing bit is getting called multiple times
and obviously it changes how I need to reason about things moving through a core.async system, as some channels will be holding large amounts of data in memory rather than passing it downstream