This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
@rickmoynihan I haven't managed to catch up with everything discussed yesterday about I/O, but I was interested to see that in your transducer-csv library your CSV type basically implements a similar pattern to my
read-messages example via the CollReduce protocol. BTW I don't see anything in there that's specifically about CSV? Couldn't it be generalised to a FileLine type (naming is not my strong point!)
agile_geek: Yes, the pattern is similar - but the limitation with
lazy-seq implementations like your example (btw I have a lot of code exactly like that - so not a criticism) is that the life cycle of the seq and the life cycle of the resource aren’t connected… So if your seq-consuming code only ever did
(take 10 (read-messages ,,,)) and there were 33 items in the sequence then the resource would never be closed.
This is the fundamental problem with lazy-seq based I/O, i.e. you can’t have both the flexibility of laziness (leaving consumption up to the consumer) and control of the resource life cycle. You have to trade one for the other; or be sure that consuming code always either consumes the whole sequence or manages the closing of the resource itself.
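To make the leak concrete, here's a minimal sketch of a hypothetical `read-messages` in the lazy-seq style being discussed (the name and file are illustrative, not the actual code from the conversation):

```clojure
(require '[clojure.java.io :as io])

(defn read-messages
  "Hypothetical lazy-seq reader: opens the file eagerly but only
  closes it when the *whole* sequence is realised."
  [path]
  (let [rdr (io/reader path)]
    ((fn step []
       (lazy-seq
        (if-let [line (.readLine ^java.io.BufferedReader rdr)]
          (cons line (step))
          (do (.close rdr) nil)))))))

;; If messages.txt has 33 lines, only the first 10 are realised,
;; the (.close rdr) branch is never reached, and the Reader leaks:
(doall (take 10 (read-messages "messages.txt")))
```

The consumer has no way of signalling "I'm done" back to the producer, which is exactly the missing link between the seq's life cycle and the resource's.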
> BTW I don't see anything in there that's specifically about CSV? Couldn't it be generalised to be a FileLine type (naming is not my strong point!)
Yes definitely. However this isn't supposed to be production code; it was just thrown together to quickly benchmark the performance differences between different approaches to CSV parsing… Basically I was looking at whether it might be possible to port the contrib project
clojure.data.csv to use transducers, and whether there might be a performance benefit in doing so.
I spoke a little with alex miller, and the maintainer jonase about the idea.
the only advantage of being a
clojure.data project I can see is that you’re likely to get found easily. Everything else appears to be downsides
glenjamin: if you’re a user of the project though you know that it’s probably not going to change radically underneath you… but yeah I agree
agile_geek: yes, that’s part of my plan… but I have lots of other things that take priority
regarding what you were saying about having a
FileLine type… I think actually you might try to be more general… i.e. you’d have a
Reader type… and then just compose
read-line into the xform.
CSV is quite interesting I was thinking the different flavours of CSV would be composed by different steps e.g.
(def xf (comp (split-line \newline) (split-records \,) ->hash-map))
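As a rough illustration of what two of those composed steps might look like (these are hypothetical sketches, not the real transducer-csv API; `split-line` is omitted here by assuming the input is already split into lines):

```clojure
(require '[clojure.string :as str])

(defn split-records
  "Returns a transducer splitting each line on the given separator char."
  [sep]
  (map #(str/split % (re-pattern (java.util.regex.Pattern/quote (str sep))))))

(defn ->hash-map
  "Stateful transducer: treats the first record as the header row and
  turns each subsequent record into a map keyed by the header."
  [rf]
  (let [header (volatile! nil)]
    (fn
      ([] (rf))
      ([result] (rf result))
      ([result record]
       (if-let [h @header]
         (rf result (zipmap h record))
         (do (vreset! header record) result))))))

(def xf (comp (split-records \,) ->hash-map))

(into [] xf ["name,age" "ada,36" "alan,41"])
;; => [{"name" "ada", "age" "36"} {"name" "alan", "age" "41"}]
```

Different CSV flavours (separators, quoting rules, header handling) would then just be different compositions of steps like these.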
I also had quite a few ideas about how to parallelise the reads for more performance - but then discovered that the univocity parser already implements all of them in java, and is insanely fast… In my benchmarks it took just 4.7s to parse 1gb of CSV on my old macbook air… for comparison
clojure.data.csv would take 50s; and my crude transducer csv parser (not a proper one - just
.readLine on the reader and
(.split s ",") on the strings) took 13s.
as a baseline - I seem to recall java.io.Reader could
.readLine over the whole 1gb file in about 400ms - although that might’ve been a BufferedInputStream… can’t quite recall
either way the biggest bottleneck is probably caused by java’s 16-bit characters - which is kinda unavoidable unless you avoid strings altogether
Sounds like there are some definite advantages. Even though I've written Java for 19 years I spent most of that time just using what others provide rather than rolling my own solutions (excepting a few things that ended up in J2EE or Spring but weren't there when I started). I sometimes miss that in Clojure. I don't want to have to reimplement a function to safely read a large file every time I need to do so.
agile_geek: not sure it’s actually that much different from java… Normally the pattern is implemented for you somewhere in something like
c.d.csv which gives you a lazy-seq representation - which is awesome, and you bash away at the repl and it all works - but it leaks resources everywhere in dev… Then you get away with it for a while in production because you almost always process the whole file eventually, or error, in which case you (hopefully) close it on the exception.
The problems arise when people assume they have a normal lazy-seq and can just
(take 2) from it in production code and not either consume the whole thing or close the resource.
If you need to do it at a low-level then the standard safe pattern for this in clojure is basically the same as in java, i.e. wrap it in a
with-open / try-with-resources and eagerly consume it - which was your other approach. The problem with this option is that it’s a lot of boilerplate and doesn’t feel very general (or requires you to pass the function in to it)
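That safe-but-boilerplatey pattern looks something like this (a generic sketch, with the caller forced to pass the function in, as noted):

```clojure
(require '[clojure.java.io :as io])

(defn process-file
  "Eagerly reduces f over the lines of the file at path inside
  with-open, so the reader is always closed - but f must be
  passed in, and nothing here is lazy."
  [path f init]
  (with-open [rdr (io/reader path)]
    (reduce f init (line-seq rdr))))

;; usage: count the lines of a (hypothetical) file
(process-file "data.csv" (fn [n _line] (inc n)) 0)
```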
but I agree that this is a pain too… and really you want to use a good standard transducible-io library.
@rickmoynihan thanks for the examples and the discussion. I'm going to take time at the weekend to write some notes to myself in the form of a blog post to consolidate my learning.
rickmoynihan I'm leaning towards a haskell-y approach w/channels atm (passing through either(?) monads along channels that are the lines or records) as opposed to a erlang style (let it crash and have a supervisor). spec makes me feel this is an option, tho I'm not sure I would have been happy that this would work robustly w/o spec
and then a function with a way of parsing records/lines out of a file and a channel to push them on to
tho I like your eager transducery style and I need to think more about that (I think transducers need a haskell-y style too as opposed to an erlang style)
agile_geek: a pleasure 🙂 — should probably point out that I don’t have any of this stuff in production yet… so not entirely down with the limitations of it; but I’m pretty sure I understand almost all of the problems/tradeoffs to using lazy-seqs with I/O. Will be nice to see your blog post.
CollReduce basically equates to the I of I/O i.e. the read/input-stream side of the equation… it makes an input stream look like a collection… whilst OutputStreams are basically the reducing function side.
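A minimal sketch of that input side (an assumed shape, not rickmoynihan's actual library): the reader is opened when the reduce starts and closed when it finishes, including when the transducer terminates early via `reduced`.

```clojure
(require '[clojure.java.io :as io])

(deftype FileLines [path]
  clojure.core.protocols/CollReduce
  (coll-reduce [this rf]
    (clojure.core.protocols/coll-reduce this rf (rf)))
  (coll-reduce [_ rf init]
    ;; resource life cycle is tied to the reduce itself
    (with-open [rdr (io/reader path)]
      (loop [acc init]
        (if (reduced? acc)
          @acc
          (if-let [line (.readLine ^java.io.BufferedReader rdr)]
            (recur (rf acc line))
            acc))))))

;; transduce bottoms out in coll-reduce, so (take 2) stops the reduce
;; via `reduced` and with-open still closes the reader:
(transduce (take 2) conj [] (->FileLines "big.csv"))
```

This is the key contrast with the lazy-seq version: consumption is still up to the consumer (via the xform), but the resource life cycle lives inside the reduce.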
otfrom: yeah I’ve thought about similar approaches too… Are you using core.async channels? For a generic I/O library though I’m not sure you want to require core.async at the bottom… Though I think if you used transducers you could easily incorporate core.async too. i.e. I think transducible-io should in theory be able to work with both raw java, core.async etc… Though I don’t think transducible-io should force a dependency on core.async
reborg: btw I meant to apologise for yesterday saying “not true…” I’d misread your message and had missed the “not” in it… So I think we were basically in agreement.
@otfrom we're currently heavily using a (manifold) Deferred of a Stream of plain or Either values... wrapping the Stream in a promise makes it easy to derive new streams based on combining old streams with the results of some deferred computations
rickmoynihan the library bit of that might be quite small (something to take a record reading function and a seq of files or paths), but it would be eager and would mean that resources could be controlled in the single function
so it would just be publishing to a channel and then other things could consume/split/tap/other from it as needed
otfrom: yup… library would be super minimal… probably just implementing
CollReduce and perhaps a few other pieces on Readers/InputStreams etc…
You might also define a small library of composable
xform functions for things like
(.read <n>) and similarly you might want a library of reducing functions for writing output streams - just to cover all the core java I/O objects
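A hypothetical example of the "reducing function for output" idea, where a Writer plays the role of the accumulator (names and file are illustrative):

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

(defn write-lines-rf
  "Reducing function whose accumulator is a java.io.Writer:
  each step writes one line, completion returns the writer."
  ([w] w)
  ([^java.io.Writer w line]
   (doto w (.write (str line "\n")))))

;; the caller still owns the resource via with-open
(with-open [w (io/writer "out.txt")]
  (transduce (map str/upper-case) write-lines-rf w ["a" "b"]))
```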
if I attempted to push a huge file onto the channel with a doall and then only took 2 the pushing function would just sit around waiting for more consuming
otfrom: (I’ve never done any core.async for real) but from what I understand I think a transducer would solve it though… transducers communicate when they’re reduced… and reduced tells it when to close
a (take 2) transducer on the chan will trigger the close after 2 elements… as Rich said, transducers are fully decoupled… yet when they evaluate the transformation they know everything… both the read end, the write end and the glue between.
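A small sketch of that behaviour (assumes core.async is on the classpath): a channel with a `(take 2)` transducer closes itself after two values, so the producer's puts start returning false and it can stop rather than sit around waiting.

```clojure
(require '[clojure.core.async :as a])

(let [ch (a/chan 1 (take 2))]
  (a/go
    (loop [i 0]
      (when (a/>! ch i)        ; >! returns false once ch is closed
        (recur (inc i)))))
  [(a/<!! ch) (a/<!! ch) (a/<!! ch)])
;; => [0 1 nil]  (nil because the channel closed after two values)
```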
@otfrom right - you shouldn't try and push the whole file onto the channel with doall - you should put async on to the channel in a go block
mccraigmccraig was thinking about just putting lines on synchronously using a doall rather than the whole file (tho I think we are agreeing)
and here's another example of one way of getting stuff from outside of manifold on to a stream (with callbacks in this case) https://github.com/mpenet/alia/blob/fix/manifold-buffering/modules/alia-manifold/src/qbits/alia/manifold.clj#L67