
@rickmoynihan I haven't managed to catch up with everything discussed yesterday about I/O, but I was interested to see that in your transducer-csv library your CSV type basically implements a similar pattern to my read-messages example, via the CollReduce protocol. BTW I don't see anything in there that's specifically about CSV? Couldn't it be generalised to be a FileLine type (naming is not my strong point!)




agile_geek: Yes, the pattern is similar - but the limitation with lazy-seq implementations like your example (btw I have a lot of code exactly like that - so not a criticism) is that the life cycle of the seq and the life cycle of the resource aren’t connected… So if your seq-consuming code only ever did (take 10 (read-messages ,,,)) and there were 33 items in the sequence then the resource would never be closed. This is the fundamental problem with lazy-seq based I/O, i.e. you can’t have both the flexibility of laziness (leaving consumption up to the consumer) and control of the resource life cycle. You have to trade one for the other; or be sure that consuming code always either consumes the whole sequence or manages the closing of the resource itself.
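A minimal sketch of the leak described above, assuming a read-messages along the lines of agile_geek's example (the file name is hypothetical):

```clojure
(require '[clojure.java.io :as io])

;; A lazy-seq over a reader: the seq's life cycle is decoupled from the
;; reader's, so nothing closes the reader unless the seq is fully realised.
(defn read-messages [path]
  (let [rdr (io/reader path)]
    ((fn step []
       (lazy-seq
        (if-let [line (.readLine rdr)]
          (cons line (step))
          (do (.close rdr) nil)))))))

;; Partial consumption never reaches the (.close rdr) branch:
(take 10 (read-messages "messages.log")) ; reader stays open - a resource leak
```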


> BTW I don't see anything in there that's specifically about CSV? Couldn't it be generalised to be a FileLine type (naming is not my strong point!)

Yes definitely. However this isn't supposed to be production code; it was just thrown together to quickly benchmark the performance differences between different approaches to CSV parsing… Basically I was looking at whether it might be possible to port the contrib project to use transducers, and whether there might be a performance benefit in doing so. I spoke a little with alex miller, and the maintainer jonase about the idea.


However I got sidetracked onto something else…


the only advantage of being a project I can see is that you’re likely to get found easily. Everything else appears to be downsides


Sounds like there's a need for a small library to provide this?


glenjamin: if you’re a user of the project though you know that it’s probably not going to change radically underneath you… but yeah I agree


agile_geek: yes, that’s part of my plan… but I have lots of other things that take priority


regarding what you were saying about having a FileLine type… I think actually you might try to be more general… i.e. you’d have a Reader type… and then just compose read-line into the xform. CSV is quite interesting - I was thinking the different flavours of CSV would be composed by different steps e.g. (def xf (comp (split-line \newline) (split-records \,) ->hash-map))
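The split-records and ->hash-map names above are just sketched in the chat; here is one plausible shape for them as composable transducer steps (the header fields and input strings are made up for illustration):

```clojure
(require '[clojure.string :as str])

;; Each flavour-specific step is an ordinary transducer, so different CSV
;; dialects become different comps of the same small building blocks.
(defn split-records [sep]
  (map (fn [^String line] (str/split line (re-pattern (java.util.regex.Pattern/quote (str sep)))))))

(defn ->hash-map [header]
  (map #(zipmap header %)))

(def xf (comp (split-records \,)
              (->hash-map [:id :name :email])))

(into [] xf ["1,jane,jane@example.com"
             "2,joe,joe@example.com"])
;; => [{:id "1", :name "jane", :email "jane@example.com"}
;;     {:id "2", :name "joe",  :email "joe@example.com"}]
```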


I also had quite a few ideas about how to parallelise the reads for more performance - but then discovered that the univocity parser already implements all of them in java, and is insanely fast… In my benchmarks it took just 4.7s to parse 1gb of CSV on my old macbook air… for comparison it would take 50s; and my crude transducer csv parser (not a proper one - just .readLine on the reader and (.split s ",") on the strings) took 13s.


as a baseline - I seem to recall you could .readLine over the whole 1gb file in about 400ms - although that might’ve been a BufferedInputStream… can’t quite recall


either way the biggest bottleneck is probably caused by 16 bit characters in java - which is kinda unavoidable unless you avoid strings altogether


Sounds like there are some definite advantages. Even though I've written Java for 19 years I spent most of that time just using what others provide rather than rolling my own solutions (excepting a few things that ended up in J2EE or Spring but weren't there when I started). I sometimes miss that in Clojure. I don't want to have to reimplement a function to safely read a large file every time I need to do so.


agile_geek: not sure it’s actually that much different to java… Normally the pattern is implemented for you somewhere in something like c.d.csv, which gives you a lazy-seq representation - which is awesome, and you bash away at the repl and it all works - but it leaks resources everywhere in dev… Then you get away with it for a while in production because you almost always process the whole file eventually, or error, in which case you (hopefully) close it on the exception. The problems arise when people assume they have a normal lazy-seq and can just (take 2) from it in production code without either consuming the whole thing or closing the resource. If you need to do it at a low level then the standard safe pattern for this in clojure is basically the same as in java, i.e. wrap it in a with-open / try-with-resources and eagerly consume it - which was your other approach. The problem with this option is that it’s a lot of boilerplate and doesn’t feel very general (or requires you to pass the function in to it)
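The with-open-and-eagerly-consume pattern mentioned above looks roughly like this (process-file and the file name are hypothetical; the point is the boilerplate of passing the function in):

```clojure
(require '[clojure.java.io :as io])

;; The standard safe pattern: tie the resource to a with-open scope and
;; consume eagerly before the reader closes. The caller must pass in the
;; processing function - the boilerplate complained about above.
(defn process-file [path f]
  (with-open [rdr (io/reader path)]
    (f (line-seq rdr)))) ; f must fully realise whatever it needs inside the scope

;; Safe: doall forces the work before with-open closes the reader.
(process-file "messages.log" #(doall (take 2 %)))
```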


but I agree that this is a pain too… and really you want to use a good standard transducible-io library.


@rickmoynihan thanks for the examples and the discussion. I'm going to take time at the weekend to write some notes to myself in the form of a blog post to consolidate my learning.


rickmoynihan I'm leaning towards a haskell-y approach w/channels atm (passing Either(?) monads along channels that are the lines or records) as opposed to an erlang style (let it crash and have a supervisor). spec makes me feel this is an option, tho I'm not sure I would have been happy that this would work robustly w/o spec


and then a function with a way of parsing records/lines out of a file and a channel to push them on to


tho I like your eager transducery style and I need to think more about that (I think transducers need a haskell-y style too as opposed to an erlang style)


agile_geek: a pleasure 🙂 — should probably point out that I don’t have any of this stuff in production yet… so I’m not entirely across the limitations of it; but I’m pretty sure I understand almost all of the problems/tradeoffs of using lazy-seqs with I/O. Will be nice to see your blog post.


agile_geek: btw CollReduce basically equates to the I of I/O, i.e. the read/input-stream side of the equation… it makes an input stream look like a collection… whilst OutputStreams are basically the reducing-function side.
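A sketch of the input side of that idea (not rickmoynihan's actual library - LineColl and the file name are made up): a type implementing CollReduce so a file behaves like a reducible collection, with the reduce itself owning the resource life cycle.

```clojure
(require '[clojure.java.io :as io])

;; reduce opens the reader, feeds each line to the reducing function,
;; honours `reduced` for early termination, and with-open guarantees the
;; reader is closed whether we finish, terminate early, or throw.
(deftype LineColl [path]
  clojure.core.protocols/CollReduce
  (coll-reduce [_ f init]
    (with-open [rdr (io/reader path)]
      (loop [acc init]
        (if-let [line (.readLine rdr)]
          (let [acc' (f acc line)]
            (if (reduced? acc')
              @acc'
              (recur acc')))
          acc)))))

;; (take 2) terminates the reduce early via `reduced`, and with-open still
;; closes the reader - laziness-like flexibility without the leak.
(into [] (take 2) (LineColl. "messages.log"))
```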


otfrom: yeah I’ve thought about similar approaches too… Are you using core.async channels? For a generic I/O library though I’m not sure you want to require core.async at the bottom… Though I think if you used transducers you could easily incorporate core.async too, i.e. I think transducible-io should in theory be able to work with both raw java, core.async etc… though I don’t think transducible-io should force a dependency on core.async


reborg: btw I meant to apologise for yesterday saying “not true…” I’d misread your message and had missed the “not” in it… So I think we were basically in agreement.


rickmoynihan ahh I like violent agreement 🙂 no pbs


@otfrom we're currently heavily using a (manifold) Deferred of a Stream of plain or Either values... wrapping the Stream in a promise makes it easy to derive new streams based on combining old streams with the results of some deferred computations


rickmoynihan the library bit of that might be quite small (something to take a record reading function and a seq of files or paths), but it would be eager and would mean that resources could be controlled in the single function


so it would just be publishing to a channel and then other things could consume/split/tap/other from it as needed


otfrom: yup… library would be super minimal… probably just implementing CollReduce and perhaps a few other pieces on Readers/InputStreams etc… You might also define a small library of composable xform functions for things like .readLine / (.read <n>), and similarly you might want a library of reducing functions for writing output streams - just to cover all the core java I/O objects
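One possible shape for the output side mentioned above - a reducing function over a Writer (writer-rf and the file names are hypothetical illustrations, not an existing API):

```clojure
(require '[clojure.java.io :as io])

;; A reducing function treating a Writer as the "O" side of I/O:
;; each step writes a line; the completion arity flushes.
(defn writer-rf [^java.io.Writer w]
  (fn
    ([] w)                        ; init: the writer itself is the accumulator
    ([w'] (.flush ^java.io.Writer w') w')  ; completion
    ([w' ^String line]
     (.write ^java.io.Writer w' line)
     (.write ^java.io.Writer w' "\n")
     w')))

;; Usage: copy the first 100 lines of one file to another via transduce,
;; with both resources scoped by with-open.
(with-open [rdr (io/reader "in.txt")
            w   (io/writer "out.txt")]
  (transduce (take 100) (writer-rf w) (line-seq rdr)))
```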


rickmoynihan hmm... I think you've pointed out where my approach would leak


otfrom: boundary between chans/java blocking I/O?


if I attempted to push a huge file onto the channel with a doall and then only took 2 the pushing function would just sit around waiting for more consuming


(given that it would block when the channel buffer was full)


and I'm probably not even thinking my way around this properly


I think that’s why mccraigmccraig(?) was suggesting you use manifold


otfrom: (I’ve never done any core.async for real) but from what I understand I think a transducer would solve it… transducers communicate when they’re reduced… and `reduced` tells the producer when to close


mccraigmccraig did I miss an example of you doing that w/manifold?


(or do you have one you can share)


so (take 2 chan) will trigger the close after 2 elements… as Rich said, transducers are fully decoupled… yet when they evaluate the transformation they know everything… both the read end, the write end and the glue between.
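A sketch of that in core.async, under the assumption of a hypothetical producer reading "messages.log": a channel whose transducer terminates (via `reduced`) closes itself, the producer's put then returns false, and with-open closes the reader.

```clojure
(require '[clojure.core.async :as a]
         '[clojure.java.io :as io])

(let [c (a/chan 1 (take 2))]     ; channel closes once (take 2) is satisfied
  (a/go
    (with-open [rdr (io/reader "messages.log")]
      (loop []
        (when-let [line (.readLine rdr)]
          (when (a/>! c line)    ; false once the transducer has closed the chan
            (recur))))))         ; with-open closes the reader either way
  [(a/<!! c) (a/<!! c)])         ; consumers only ever see two lines
```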


@otfrom right - you shouldn't try to push the whole file onto the channel with doall - you should put onto the channel asynchronously in a go block


rickmoynihan I properly need to read through the code examples you sent through


mccraigmccraig I was thinking about just putting lines on synchronously using a doall rather than the whole file (tho I think we are agreeing)


@otfrom here's a relatively self-contained manifold+cats streaming example ... it's about exporting user metadata from cassandra to a TSV


mccraigmccraig ace. thx!


and here's another example of one way of getting stuff from outside of manifold on to a stream (with callbacks in this case)


otfrom: the code was never really meant to be shared or read by others in that state… but hopefully illustrates how this kind of thing might work