
Good morning!


Morning! This weekend I tried to make the finishing touches on

🎉 6

Nifty feature:

👍 3

what do people reach for when they want to do lots of streaming calcs on the fly, but share the early stages of it (does this make sense?). I'm doing things where I'm parsing quite large files and then doing some basic cleaning of what comes out of the files and then I'm doing different things based on that processed input. Usually I reach for core.async for this and a mult on the core bit of processing with some kind of reduce on the last channels. I'd like if possible to use as many cores as possible given that the jobs are parallel, but I'd also like to keep the amount of memory down. Any thoughts?

Ben Hammond 10:10:17

so you have separate functions for
• scrubbing data
• computing some kind of dispatching value
• multimethods to separately compute the different streams
Glued together using a transducing map and parallelised using something like that blogpost I linked to

Ben Hammond 10:10:26

I'd use transduce personally


It's a shame that parallel transducers never happened.

Ben Hammond 10:10:41

oh hang on, there's a bit of a blog post on that

Ben Hammond 10:10:07

essentially it funnels each value into its own future, then runs a few entries behind in order to give the futures a chance to execute before you deref them

Ben Hammond 10:10:37

(you have to 'close off' the transducer by calling the arity-1 reducing function; if you forget to do that then you might lose some data)
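A minimal sketch of the "futures that run a few entries behind" idea, using only clojure.core. This is not the code from the blog post; `parallel-map` and its `n` parameter are hypothetical names chosen for illustration. The arity-1 (completion) branch is the 'close off' step: it drains the futures that are still queued, which is where data would be lost if it were skipped.

```clojure
(defn parallel-map
  "Hypothetical helper: like (map f), but each call to f runs in a future
  and the transducer stays up to n items behind before deref-ing."
  [f n]
  (fn [rf]
    (let [pending (volatile! clojure.lang.PersistentQueue/EMPTY)]
      (fn
        ([] (rf))
        ;; Completion: drain whatever is still queued, then complete rf.
        ([result]
         (let [result (reduce (fn [acc fut] (rf acc @fut)) result @pending)]
           (vreset! pending clojure.lang.PersistentQueue/EMPTY)
           (rf result)))
        ;; Step: start a future for this input, emit the oldest one
        ;; only once more than n futures are in flight.
        ([result input]
         (vswap! pending conj (future (f input)))
         (if (> (count @pending) n)
           (let [fut (peek @pending)
                 _   (vswap! pending pop)
                 ret (rf result @fut)]
             ;; Downstream asked to stop: forget the queued futures
             ;; (they still run, but their results are ignored).
             (when (reduced? ret)
               (vreset! pending clojure.lang.PersistentQueue/EMPTY))
             ret)
           result))))))

;; Usage: output order matches input order because the queue is FIFO.
(into [] (parallel-map inc 4) (range 10))
```

Futures run on an unbounded thread pool, so `n` is what bounds the work in flight here; a script that uses this may need `(shutdown-agents)` to exit promptly.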


There's the core.async parallel transducer context :)
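For reference, a rough sketch of what the core.async route can look like: `pipeline` runs the shared transducer on several threads, and a `mult` fans the scrubbed values out to two consumers. It assumes org.clojure/core.async is on the classpath (a version recent enough to have `onto-chan!`); `run-pipeline`, the buffer sizes, and the even/odd branches are made up for illustration.

```clojure
(require '[clojure.core.async :as a])

(defn run-pipeline
  "Hypothetical example: scrub inputs in parallel, then share the
  scrubbed stream between two independent reductions."
  [inputs]
  (let [in       (a/chan 64)
        scrubbed (a/chan 64)
        m        (a/mult scrubbed)
        ;; Taps are attached before any value is put on the source channel.
        evens    (a/tap m (a/chan 64 (filter even?)))
        odds     (a/tap m (a/chan 64 (filter odd?)))
        sum-even (a/reduce + 0 evens)
        odd-vec  (a/into [] odds)]
    ;; 4 worker threads apply the shared early stage; order is preserved.
    (a/pipeline 4 scrubbed (map inc) in)
    ;; onto-chan! closes `in` when the inputs run out, which closes
    ;; `scrubbed`, the mult's taps, and finally the two result channels.
    (a/onto-chan! in inputs)
    {:sum-even (a/<!! sum-even)
     :odds     (a/<!! odd-vec)}))

(run-pipeline (range 10))
```

The fixed-size buffers give backpressure: the mult waits for every tap to accept a value before taking the next one, so memory stays bounded by the buffer sizes.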


I think what I mean is that I like composing my code, but I don't have good ways of composing my results as they flow


so I can do

  (comp
    (map this-fn)
    (filter other-fn))


@otfrom is sequence the right thing here? it caches results so then you will share the early results?


but I don't see a good way of using the results of that w/o doing transduce/into/reduce multiple times or going to core.async/mult


@borkdude sequence is right if I was just after caching, but the data is large enough that I don't want all of the intermediate values in memory if I can avoid it
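A small sketch of the trade-off being discussed, using only clojure.core: `sequence` caches the transformed values (so they stay in memory while the head is held), while `eduction` keeps nothing and re-runs the transducer on every reduce. The `calls` counter and `count-calls` are hypothetical, there only to make the difference visible.

```clojure
(defn count-calls
  "Hypothetical demo: returns how many times the mapping fn ran after
  reducing twice over a `sequence` and twice over an `eduction`."
  []
  (let [calls  (atom 0)
        xf     (map (fn [x] (swap! calls inc) (* x x)))
        cached (sequence xf (range 5))   ; lazy, results are cached
        recipe (eduction xf (range 5))]  ; no cache, recomputed per reduce
    (reduce + cached)
    (reduce + cached)
    (let [after-sequence @calls]
      (reduce + recipe)
      (reduce + recipe)
      {:after-sequence after-sequence
       :after-eduction @calls})))

(count-calls)
```

The second reduce over `cached` costs no extra calls, while each reduce over `recipe` pays for the whole pipeline again.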


and at the end of my pipes I do a lot of side effecting (writing to files for me atm, tho it has been a database or msg stream/q in the past)


maybe use a database ;)


it might come to that


atm I can just run it multiple times and wait


or go for core.async which is my usual fave way of doing this kind of thing

Ben Hammond 11:10:20

I don't see a problem with (a small number) of nested transduce/into/reduce as long as you don't let it get out of control


I don't either. I was just getting to the tipping point on that and was wondering how others were solving the problem

Ben Hammond 11:10:14

I find core.async to be awkward to debug from the repl

Ben Hammond 11:10:31

but perhaps that is because I haven't tried hard enough


core.async is optimized for correct programs

😆 6
😅 3
Ben Hammond 11:10:31

all programs are wrong, some are useful

👍 3

(btw I find x/reduce and x/by-key to make my xf pipelines much more composable as it means I need the results of a transduce a lot less often)


I've gotten reasonably good at debugging core.async issues, but I mostly do it by using the xfs in a non-async way to make sure they are right and then building them up piece by piece

Ben Hammond 11:10:56

but I find the happy-hacking carefree nature of the REPL to be Clojure's killer feature certainly in terms of maintaining productivity and focus


@otfrom core.reducers is probably what you really want.


But no transducers there.
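For anyone who hasn't used it, a minimal sketch of what clojure.core.reducers looks like. `r/fold` splits a vector into chunks, reduces them on a fork/join pool, and combines the partial results; the steps are built with `r/map` and `r/filter` rather than transducers. `sum-even-incs` is a hypothetical name for illustration.

```clojure
(require '[clojure.core.reducers :as r])

(defn sum-even-incs
  "Hypothetical example: increment 0..n-1, keep the evens, and sum them
  with a parallel fold."
  [n]
  ;; fold only parallelises foldable collections (vectors, maps);
  ;; for anything else, such as a lazy seq, it falls back to a plain reduce.
  (r/fold + (r/filter even? (r/map inc (vec (range n))))))

(sum-even-incs 1000)
```

Note that the source has to be a fully realized vector to get any parallelism, which works against the goal of not holding everything in memory at once.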

Ben Hammond 11:10:28

have you found it to be useful, @dominicm? I've been underwhelmed

Ben Hammond 11:10:25

I thought the most useful things were fjfork and fjjoin, and they're both declared as private

Ben Hammond 11:10:30

er, so I mean I did not find clojure.core.reducers to add much value on top of the raw Java Fork/Join mechanism


I've never used the fork join mechanism. It gives me a pretty easy api to just merge together some bits.


@dominicm it still doesn't give me a way to share my intermediate results. I've had good fun using them with memory mapped files though

Ben Hammond 11:10:41

when you say 'share', how do you expect to retrieve the intermediate results?


I'm not entirely sure about the mechanism, but I'd like to avoid re-calculating everything each time and I'd like to avoid having everything in memory all at once

Ben Hammond 11:10:59

wouldn't you just have a big map that you assoc interesting things on to

Ben Hammond 11:10:30

oh you'd write to a temporary file then

Ben Hammond 11:10:47

and assoc its filename into your big map


> oh you'd write to a temporary file then
I said, use database.


sqlite has also decent json support if you want something more lightweight


Postgres is indeed cool.


Like “I can declare indexes on paths in json”-cool.


There's that cool library from Riverford that does that for maps


@otfrom maybe titanoboa?


hmm... my intermediate results don't have to go into a database tho. If I'm reading from a file a lot of times what I'll end up doing is reading line by line, then splitting the lines to multiple records using mapcat and turning things into maps with useful keys and tweaked types at that point


so it is really just a stream by that point and not something I want to write into a database as I'd just dump the intermediate result anyway


most of the work is pretty boring by the time it gets to something that fits into a database


@otfrom maybe redis? ;) - neh, temp file sounds fine then


I don't want to write out to files either. The computation usually fits into memory if I'm reasonably thoughtful about it


it is just coming up today b/c I have 2 charts I need to make that do different filtering on the same scrubbed source data, which is reasonably big (500MB of csv) so takes about 1 minute to run. I'd just like it to stay around that number rather than creep up as I add new things that hang off the back of the reading and scrubbing
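One way to keep a second chart from doubling the run time, sketched with clojure.core only: read and scrub once, and feed every scrubbed value to several reducing functions in the same pass. This is not something anyone in the thread said they use; `fan-out` is a hypothetical helper, and the numbers stand in for the real records.

```clojure
(defn fan-out
  "Hypothetical helper: combine several reducing functions into one whose
  accumulator is a vector with one slot per branch."
  [& rfs]
  (fn
    ;; Init: one initial accumulator per branch.
    ([] (mapv (fn [rf] (rf)) rfs))
    ;; Completion: complete each branch, unwrapping any that stopped early.
    ([accs] (mapv (fn [rf acc] (rf (unreduced acc))) rfs accs))
    ;; Step: hand x to every branch that has not already finished.
    ([accs x]
     (let [accs' (mapv (fn [rf acc] (if (reduced? acc) acc (rf acc x)))
                       rfs accs)]
       (if (every? reduced? accs') (reduced accs') accs')))))

(defn two-charts
  "Hypothetical example: one shared scrubbing stage, two different cuts."
  [source]
  (transduce (map inc)                               ; shared early stage
             (fan-out ((filter even?) +)             ; cut 1: sum of evens
                      ((filter odd?) conj))          ; cut 2: collect odds
             source))

(two-charts (range 10))
```

Each branch is built inside `two-charts` so that stateful transducers get fresh state on every run. Everything here is single-threaded; it shares the early stage but does not use more cores.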


@dominicm I've been keeping an eye on titanoboa. That might be the way to go, but I've not looked into it enough


Me neither. I'd love to hear how you get on, if it isn't/is a fit, why that is, etc. I'd love to know the fit.


@dominicm ah, looking at this I know why I'm not going to use it: it is waaaay more than I need

Ben Hammond 13:10:24

@otfrom so you return a map that contains a bunch of atom values; as your code processes stuff, it swap!s intermediate values into that map. Isn't that the kind of thing that you are talking yourself into?


no, b/c most of what I want to do is before I'd turn anything into a map


hmm... I think an eduction with a reducible that handles the lines/records from the file sounds like the way to go atm. I get the code composition if not the result composition
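A rough sketch of that shape, assuming a plain CSV-ish text file: a reducible that opens the file on each reduce and closes it afterwards, wrapped in an `eduction` that carries the scrubbing steps. `lines-reducible`, `scrub`, and `scrubbed` are hypothetical names, and the naive comma split is only for illustration (real CSV needs a proper parser).

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

(defn lines-reducible
  "Hypothetical helper: a reducible over the lines of the file at path.
  Nothing is read until it is reduced, and lines are never cached."
  [path]
  (reify clojure.lang.IReduceInit
    (reduce [_ rf init]
      (with-open [rdr (io/reader path)]
        (loop [acc init]
          (if (reduced? acc)
            @acc
            (if-let [line (.readLine ^java.io.BufferedReader rdr)]
              (recur (rf acc line))
              acc)))))))

;; The shared cleaning steps, composed once.
(def scrub
  (comp (map str/trim)
        (remove str/blank?)
        (map #(str/split % #","))))

(defn scrubbed
  "Scrubbed records as an eduction: the file is re-read on every reduce."
  [path]
  (eduction scrub (lines-reducible path)))

;; Usage: each downstream cut adds its own steps on top, e.g.
;; (into [] (filter #(= "a" (first %))) (scrubbed "data.csv"))
```

This gives the code composition mentioned above; each `into`/`transduce` over `(scrubbed path)` still re-reads and re-scrubs the file.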


and then if I want to re-wire it later into core.async then I can as it doesn't sound like there is an alternative for this bit of it


once it is a map then going in and out of a database sounds reasonable


I suppose my issue is that I don't want to keep a database around as it is all batch, so I'd be creating a whole bunch of things just to throw them away when I produce my results


(I go months between getting new bits of input)

Ben Hammond 14:10:00

You're worried about the expense of a database that you only use rarely?


@otfrom use redis + memory limit, if you're going to throw away the results afterwards?


from a dependency and remembering how to do it pov


if the results are bigger than memory


the final results for the things I do are usually quite small. I'm almost always rolling up a bunch of simulations and pushing them through a t-digest to create histograms (iqr, medians, that kind of thing). The problem is to get a new histogram for a different cut you have to push all the data through the t-digest engine again


so I end up calculating lots of different counts per simulation and then pushing them through


so, results tiny, input large-ish, but not large enough to need something like spark/titanoboa/other


I'm wondering if my route is eventually going to be and lots of things on top of that as it seems to have lots of ways of doing fast memory mapped access


I wonder if onyx in single node mode would be good for this


I bet there's a java-y thing that's not too bad either.


No, not kafka :)


Kafka streams-alike though


But for a single instance, and presumably with some kind of web interface or something for intermediates.


ChronicleQueue is a decent embedded solution for queue persistence with kafka-ish semantics


@mpenet and @borkdude ChronicleQueue and tape are on my "will use at some point" list


I suppose they're more single threaded though, kafka streams is anyway iirc. But I'm no expert there.


I seem to be going down the core.async route