#clojure-uk
2017-01-09
agile_geek07:01:32

@paulspencerwilliams is your talk going to be recorded?

agile_geek07:01:37

BTW, for those of you on Linux: if you ever want to do a screen recording, give a talk, or pair with someone and show your keypresses on screen, this is quite cool: https://github.com/wavexx/screenkey

agile_geek07:01:25

Looks like I'll be walking from KX to Camden this morning!

paulspencerwilliams08:01:46

@agile_geek not sure, but it's only an intro. Might try and run it at Brum FP in slightly more detail and should be able to get that one recorded...

yogidevbear09:01:51

Morning all šŸ‘‹

dominicm09:01:26

Suup šŸŒ“

agile_geek10:01:01

Here's a Clojure question for a change. Is it just me who finds file I/O in Clojure awkward? If I want to read a large file lazily I end up having to either:
1. use with-open somewhere at the top of a stack of calls that deal with each line of the file, and line-seq in a function to process each line, or
2. write a custom function that opens a reader and recursively calls a closed-over helper fn (with a lazy-seq wrapping the body) that manually calls .readLine, cons'ing each result onto its recursive call. E.g.

;; assumes (:require [clojure.java.io :as io]) in the ns declaration
(defn read-messages [filename]
  (let [reader (io/reader filename)]
    (letfn [(lines []
              (lazy-seq
               (try
                 (if-let [line (.readLine reader)]
                   (cons line (lines))
                   (.close reader))          ; end of file: close and end the seq
                 (catch Exception e
                   (when reader (.close reader))))))]
      (lines))))
The first option feels ugly as I have to wrap with-open around a stack of fns in some top-level function when the top-level function doesn't need to know about the file. The second option is much cleaner but I have to write it every time. I would have thought this pattern was so common that there would be a std abstraction/fn to do this? Or am I completely missing something obvious here?
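Something like this is what I mean by option 1 (just a sketch; assumes clojure.java.io is required as io, and messages.log/process-lines are made-up stand-ins for the real file and the rest of the app):
(defn process-lines [lines]
  ;; stands in for the whole application stack
  (map count lines))

(defn -main [& _]
  ;; with-open has to sit up here even though -main doesn't care about files,
  ;; and doall forces the work to finish before the reader is closed
  (with-open [rdr (io/reader "messages.log")]
    (doall (process-lines (line-seq rdr)))))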

agile_geek10:01:02

I end up with the with-open several levels higher up, OR I have to pass the bulk of the app in as a higher-order fn to a doseq within the fn that uses line-seq. That also works, but I struggle a bit with thinking that way, although maybe it's a better approach as it means the fn that reads the lines calls the fn stack that processes them (inversion of control). But when I start nesting writes inside the fn that processes each line it can get ugly.
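i.e. something of this shape (rough sketch; process-line is just whatever the app wants to do with each line):
(defn with-each-line [filename process-line]
  ;; the I/O fn owns the reader; the per-line work is passed in
  (with-open [rdr (io/reader filename)]
    (doseq [line (line-seq rdr)]
      (process-line line))))

;; e.g. (with-each-line "messages.log" println)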

agile_geek10:01:49

I don't think I'm expressing this issue well in here. I will prepare a gist or something tonight with examples of approaches I can think of and post it here to see what people think.

thomas10:01:50

I vaguely remember the one time I had to deal with bigger files it wasn't straightforward.

thomas10:01:52

I guess with small files you don't actually notice it when you do something wrong (e.g. not closing a file handle)

agile_geek10:01:06

I've done it a number of times but with many months in between, and it always takes me ages to figure out how to process them lazily in a sensible way. I seem to lose hours on this every time. But again I am a bit slow sometimes.

agile_geek10:01:42

@thomas with small files you can just process them eagerly and have the whole file in memory

reborg10:01:15

@agile_geek never had a particular problem with the with-open idiom. I see the point about being forced to have it at the top of the computation, but that can always be hidden inside the first layer that you consider "public interface".

reborg10:01:35

If you don't mind the library, the https://github.com/thebusby/iota approach is quite smart.

agile_geek10:01:01

@reborg hmm, I think it's maybe just a hang-up from my procedural programming days! The more I think about it the more I favour having a with-open and a line-seq in a doseq at the top of the computation and passing the body of the doseq the computation to perform on each line.

agile_geek10:01:09

It's just that I go through this thought process every time I need to process a file lazily and it never seems to come naturally to me.

rickmoynihan11:01:40

@agile_geek: I've done quite a lot of I/O on large files... including a lot of reifying files as lazy seqs... I've come to the conclusion though that the best way is to avoid combining laziness & I/O altogether... Best option in my mind is to use a transducer if you can.

agile_geek11:01:09

@rickmoynihan not sure I understand. How can you use a transducer to avoid I/O and laziness?

rickmoynihan11:01:56

agile_geek: I/O is one of the main things transducers/reducers were intended to support... it's just not that widely known.

agile_geek11:01:19

@rickmoynihan any links to examples?

rickmoynihan11:01:44

agile_geek: yeah hold on...

rickmoynihan11:01:50

can't remember the name of it

agile_geek11:01:16

I've used core.async pipelines with transducers, with a with-open at the top of the stack of fns, but I think I was just reading each line from the file in a go loop and putting it onto a channel
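roughly this shape, if I remember right (sketch only; assumes [clojure.core.async :as a] and [clojure.java.io :as io], and uses a/thread rather than a go block since .readLine blocks):
(defn lines-chan [filename xform]
  (let [ch (a/chan 32 xform)]
    (a/thread
      (with-open [^java.io.BufferedReader rdr (io/reader filename)]
        (loop []
          (if-let [line (.readLine rdr)]
            ;; >!! returns false if the channel was closed downstream
            (when (a/>!! ch line)
              (recur))
            (a/close! ch)))))
    ch))

;; e.g. (a/<!! (a/into [] (lines-chan "messages.log" (take 10))))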

rickmoynihan11:01:26

you don't need to use channels - you can use the underlying java io methods with them

agile_geek11:01:42

I'm still struggling to understand.

rickmoynihan11:01:13

sorry still trying to find an example

agile_geek11:01:33

OK - no hurry šŸ˜‰

rickmoynihan11:01:01

sorry I can't find a better open-source example, but you can take a look at this code where I was trying to benchmark whether it was worth writing a transducer-based CSV parser. The code was never meant to be used, but basically you just hook into CollReduce on the IO stream and write the rest as xforms, e.g. here's a really crude CSV parser: https://github.com/RickMoynihan/transducer-csv/blob/master/src/transduce_csv/bench.clj#L36
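the gist of it is something like this (very rough sketch, not that repo's code; it reifies IReduceInit rather than extending CollReduce, but the idea is the same; assumes clojure.java.io is required as io):
(defn reducible-lines [filename]
  ;; a reducible view over the lines of a file: the reader is opened when the
  ;; reduction starts and closed when it finishes or short-circuits
  (reify clojure.lang.IReduceInit
    (reduce [_ f init]
      (with-open [^java.io.BufferedReader rdr (io/reader filename)]
        (loop [acc init]
          (if (reduced? acc)
            @acc
            (if-let [line (.readLine rdr)]
              (recur (f acc line))
              acc)))))))

;; e.g. count the lines of a big CSV without holding it in memory:
;; (transduce (map (constantly 1)) + 0 (reducible-lines "big.csv"))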

rickmoynihan11:01:02

though I think ideally you'd move the .readLine out into a transducer function too

rickmoynihan11:01:44

there is another repo somewhere that does something similar... wracks brain

rickmoynihan11:01:54

@agile_geek: let me know if it still doesn't make sense

rickmoynihan11:01:47

@agile_geek: Ahhh here's the other repo... IIRC I discovered this after I'd done the above, but it's basically the same idea: https://github.com/pjstadig/reducible-stream I think it would benefit from being wrapped up at a lower level of abstraction though... more like

rickmoynihan11:01:10

the big problem with laziness + IO is that the lifecycle of the laziness is different from the resource lifecycle... i.e. how do you know when the consumer is done with the sequence? You don't, hence you have to wrap it in a with-open and consume it eagerly somewhere, which means somewhere you have to treat it like it's eager. Transducers solve this by actually being eager, but by separating the computational elements out (so it still feels lazy)... e.g. a transducer can know that it's the final (take 10 ,,,) and close the stream after the items have been taken; and as a user you don't need to bother wrapping it in with-open & doall anymore
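with the hypothetical reducible-lines sketched above, that looks like:
;; the reduction stops after 10 lines and the with-open inside reducible-lines
;; closes the reader right there - no with-open/doall needed at the call site
(into [] (take 10) (reducible-lines "big.csv"))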

reborg11:01:18

interesting rickmoynihan, although I don't think reducers/xducers solve all problems

rickmoynihan11:01:42

there's no such thing as a silver bullet šŸ™‚

rickmoynihan11:01:18

But they do solve the resource issue with I/O - whilst letting you map/filter/etc over files

reborg11:01:44

if you don't want/need to reduce but just process a lazy-seq, you still have the problem of closing the IO resource somewhere. If you reduce you have the possibility to hook up the .close at the end of the reduction, which I think is quite a fine idea

otfrom11:01:43

was wondering what the hashmap equivalent would be (other than a database (key value or otherwise))

rickmoynihan11:01:26

reborg: yeah but the problem with lazy-seqs and closing the resource is why I'm saying use a (trans|re)ducer.... I think if you want a lazy-seq coupled to I/O you're doing it wrong. lazy-seq + I/O has historically been used a lot, but Rich has been saying it's bad since basically Clojure 1.0

rickmoynihan11:01:53

@otfrom: thanks - I'd forgotten about iota.

glenjamin11:01:20

the general theme i'm seeing is passing the data processing function into the IO code, rather than passing the IO reference into some data processing code

rickmoynihan11:01:33

I think the biggest issue with this approach is how to use it with things like ring... e.g. to my knowledge you can't really return a transducer to ring

otfrom11:01:14

I'm wondering if konserve might be my huckleberry https://github.com/replikativ/konserve

reborg11:01:03

rickmoynihan not sure there is a right/wrong here, I suppose it depends on your requirements. Personally I'd say use lazy-seqs with IO if your app is fetching/processing/spitting out and you can't afford to bring the whole thing into memory

otfrom11:01:07

a lot of CRDT stuff there looks very interesting

glenjamin11:01:39

i think the key thing is that the lazy-seq should stay within a single function, and not get passed around

rickmoynihan12:01:30

glenjamin: agreed - but then it's not really like a normal lazy seq

agile_geek12:01:01

@glenjamin I agree about the "passing the data processing function into the IO code" but I need a way to remember this! I think I feel a blog post coming on. Usually helps me solidify my thinking

rickmoynihan12:01:02

reborg: You can totally do that... but I think the problem is that clojure.core doesn't have a transducible I/O library... Having done a huge amount of lazy-seq I/O I'm trying to slowly move away from it - because the costs of laziness are really high for large datasets and controlling the resource lifecycle is a pain. You can use transducer/reducer-based I/O and still not load everything into memory.

reborg12:01:07

As soon as you reduce (xform or not) you are not loading everything into memory by design šŸ™‚ It's only that not all problems can be solved with a reduce as the last form

rickmoynihan12:01:49

reborg: not true...

rickmoynihan12:01:18

You can e.g. reduce into a file or an outputstream

rickmoynihan12:01:36

just need to implement CollReduce and friends

rickmoynihan12:01:21

it basically entirely depends on the reducing function... e.g. here I transduce over a 1GB CSV file to count the lines... I don't hold more than a line of the file in memory at any one point in time... In fact it's actually much easier on memory/GC than the equivalent lazy-seq code: https://github.com/RickMoynihan/transducer-csv/blob/master/src/transduce_csv/bench.clj#L36

rickmoynihan12:01:55

the reducing function could equally just write it to an outputstream (to do a file copy)
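e.g. something along these lines (sketch; write-line-rf is made up, and it reuses the hypothetical reducible-lines from above):
(defn write-line-rf
  ;; a reducing fn whose accumulator is the Writer itself
  ([^java.io.Writer w] w)
  ([^java.io.Writer w ^String line]
   (doto w (.write line) (.write "\n"))))

;; copy big.csv to copy.csv one line at a time:
;; (with-open [w (io/writer "copy.csv")]
;;   (reduce write-line-rf w (reducible-lines "big.csv")))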

reborg12:01:03

What I mean is that (reduce + (range 1e9)) is not going to load all the 1e9 items in memory at any given time. The head is consumed and garbage collected and yes, depending on the reducing function.

reborg12:01:19

My point tho is that all suggestions in this thread are equally interesting.

rickmoynihan12:01:06

reborg: completely agree... and like I said you can use lazy-seqs for I/O - it's something I've done a huge amount of... however maintaining a code base with a lot of I/O done in that style makes me yearn for something better... e.g. hacks where you add (.close rdr) to the end of a lazy seq are far from ideal.

glenjamin12:01:34

I implemented a lexer+parser with transducers once

glenjamin12:01:46

It was a bit weird, but mostly worked

mccraigmccraig12:01:09

@agile_geek upgrade from laziness and go async - manifold streams have on-closed and on-drained callbacks which can be used for the resource close - i would certainly like to use a stream based file i/o lib, though i haven't cared enough to implement one yet. some stream info - https://github.com/ztellman/manifold/blob/master/docs/stream.md
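the shape of it would be roughly this (rough sketch, not a real lib; assumes [manifold.stream :as s] and clojure.java.io as io):
(defn line-stream [filename]
  (let [^java.io.BufferedReader rdr (io/reader filename)
        out (s/stream 16)]
    ;; close the reader once the stream is closed and fully drained
    (s/on-drained out (fn [] (.close rdr)))
    (future
      (try
        (loop []
          (when-let [line (.readLine rdr)]
            ;; put! derefs to false if a consumer closed the stream early
            (when @(s/put! out line)
              (recur))))
        (finally (s/close! out))))
    out))

;; e.g. (s/consume println (line-stream "big.txt"))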

rickmoynihan12:01:43

mccraigmccraig: it's definitely an option, and a lot of people seem to be moving that way... I do wonder if core.async chans could compete with the performance of a transducer-based I/O solution though... also transducers can work with core.async chans too, so they're not necessarily mutually exclusive solutions... e.g. you could probably wire a CollReduce reader through a transducer into a core.async channel pretty seamlessly

rickmoynihan12:01:53

bigger issues are, as you say, having to be aware of the differences between blocking/non-blocking I/O - I've been assuming blocking I/O so far, as nio is not supported by the libraries I need

rickmoynihan12:01:49

and whilst nio is trendy - I really don't typically need to handle 100k concurrent connections on a single server

mccraigmccraig12:01:29

yes, without nio you are stuffed for async models - manifold will cheat for you and use a threadpool to manage blocking actions, but there's not a lot of point to using that feature if all your i/o ops are blocking

rickmoynihan12:01:26

yeah - we're pretty tied to a suite of parsers built on blocking I/O - async doesn't make much sense for us

rickmoynihan12:01:12

I really do wish the clojure I/O ecosystem was a little less fractured

otfrom13:01:01

rickmoynihan mccraigmccraig who has the best I/O approach in your opinion or do we need to come up with one?

rickmoynihan13:01:14

otfrom: I think there are basically two approaches to I/O on the JVM: async & blocking, so clojure needs at a minimum two approaches also. I think in an ideal world clojure would provide an io mechanism that extended CollReduce etc to readers out of the box, and backing I/O and effects with lazy-seqs would be frowned upon or treated as deprecated/legacy. So yeah I think there's work needed to unify this stuff - and that is probably "a new way".

mccraigmccraig13:01:29

i/o covers a tonne of stuff @otfrom ... i've been happy with my network i/o options recently (via aleph), but i would have liked an async file i/o lib

mccraigmccraig13:01:48

converting async to blocking is easy though, so i think there only needs to be one async approach !

mccraigmccraig13:01:43

(with some blocking lip-gloss)

rickmoynihan13:01:36

there are some problems though... I had thought that because you can call sequence on a transducer to return a lazy seq, you could potentially interop with lazy-seq stuff whilst having a transducer-backed solution through this mechanism; but I'm not sure - it may have some problems, as Seqable etc aren't protocols - so might be doable but YMMV on that front. async -> blocking can be done as mccraigmccraig says; but if the abstraction has a cost (core.async chans seem to), I'm not sure you want to require everything to be expressed as an async thing... I think transducers might provide a way out though... as you can pick an async/blocking source/destination depending on your needs and, providing you mix in the appropriate xforms, write the bulk of the transducer with no knowledge of either.

rickmoynihan13:01:09

also there have been murmurings from Cognitect that transducers will eventually be made to support the parallel cases of reducers too

rickmoynihan13:01:40

you could e.g. do something like (def xf (comp (->AsyncFile "/blah/file.csv") async-split-lines record->hash-map)) or (def xf (comp (->BlockingFile "/blah/file.csv") split-lines record->hash-map)) But basically the user should decide.

rickmoynihan13:01:20

I've not thought too much about the async case though

rickmoynihan13:01:12

ok lunch time!

mccraigmccraig13:01:06

well, i'm all for abstractions which let the user decide

glenjamin13:01:52

I ended up with a function that took a reader and a transducer, and applied it: https://github.com/glenjamin/hand-parser/blob/master/src/hand/parser.clj#L5
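roughly this kind of thing (sketch from memory; the names here are made up, not the ones in that repo):
(defn reader-transduce
  ;; eagerly push each line of rdr through xform into the reducing fn f
  [xform f init ^java.io.BufferedReader rdr]
  (let [rf (xform f)
        result (loop [acc init]
                 (if (reduced? acc)
                   @acc
                   (if-let [line (.readLine rdr)]
                     (recur (rf acc line))
                     acc)))]
    (rf result)))

;; e.g. (with-open [rdr (io/reader "big.csv")]
;;        (reader-transduce (take 10) conj [] rdr))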

glenjamin13:01:30

similar effect to implementing CollReduce I think

rickmoynihan14:01:11

lol readuction :thumbsup:

rickmoynihan14:01:57

mccraigmccraig: totally agree that i/o covers a tonne of things though - so might not be possible to have one approach to rule them all

glenjamin14:01:00

i recall being rather happy with that name

paulspencerwilliams18:01:58

Anyone 'attending' Clojure Remote?

dominicm20:01:51

I thought the tickets were really expensive Tbh. Though there might be a good reason.

tcoupland21:01:51

not sure when you saw the prices, but they got dropped recently to 425. Need to start writing some slides really!

tcoupland21:01:06

thought i had a link for 10%, but not convinced it's working right