#clojure-uk
2017-09-27
yogidevbear07:09:27

How are things in the Netherlands, @thomas ?

thomas07:09:53

fine... and very misty at the moment @yogidevbear

thomas07:09:12

ooh and no government yet... but that is ok.

Rachel Westmacott08:09:31

I get the feeling that no government can be a good thing

sleepyfox09:09:56

less gov = better gov, for the current given value of gov

sundarj09:09:36

keynes would have something to say about that 😉

chrjs09:09:39

Morning all.

guy09:09:36

morning!

maleghast09:09:58

Morning All… 🙂

maleghast10:09:00

Anyone got 5 minutes to help me with something that I should know better about..? I am trying to consume a 2.5Mn row CSV file and the approach I have been using for sub 250K row files is no longer cutting the mustard. I have a sense that two things I don't really grok, laziness and streaming, are going to help me here, but I am not really sure where to start… 😞 (I am using clojure.data.csv)

maleghast10:09:58

Current ingest looks like this:

(defn ingest-csv-data
  "Read in a CSV file from the filesystem based on a configured path :external-files-path"
  [csv-resource]
  (with-open [in-file (io/reader (io/file (str (:external-files-path ingest-cfg) csv-resource)))]
    (doall
     (csv/read-csv in-file))))
and I pass the result to functions that iterate on it (using doseq).

maleghast10:09:05

I am assuming that part of the problem is that doseq is not going to work with the laziness, that in order to doseq the collection (which is a lazy sequence of vectors) it will need to realise the whole collection first..? TLDR; I am running out of memory.

guy10:09:58

from my csv reading experience from before

guy10:09:04

take it with a pinch of salt though

guy10:09:23

i thought when you do with-open, you load the whole file into memory first?

guy10:09:53

so you will have memory problems at that point regardless?

guy10:09:06

but like i said i could be totally wrong

iaint10:09:31

from my reading of doall that's going to force the sequence to be realised (i.e. the whole CSV file to be read in)

mccraigmccraig10:09:02

@maleghast you can use either laziness or streaming to help you with the OOM problem - have a look at the start of this for our lazy-tsv parser - https://gist.github.com/5207b5069bac7188fbb6dfce2d38c490

dominicm10:09:20

bingo, doall is the problem

guy10:09:52

so the in-file, is that just not evaluated then?

guy10:09:17

i dont really know much about readers to be fair

guy10:09:22

ill go do some googling

mccraigmccraig10:09:14

readers are a file-handle, maybe a buffer and some character interpretation atop the raw input-stream - they don't read the whole file
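
For reference, a minimal sketch of a reader being consumed lazily; line-seq pulls one buffered line at a time, so the whole file is never in memory at once (the filename is hypothetical):

(require '[clojure.java.io :as io])

(with-open [r (io/reader "big-file.csv")]
  ;; line-seq is lazy: each line is read from the buffer only as
  ;; the doseq asks for it
  (doseq [line (line-seq r)]
    (println (count line))))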

guy10:09:33

got ya thanks craig :thumbsup:

dominicm10:09:43

You were missing an i

maleghast10:09:56

I will take a look at @mccraigmccraig's link-y / gist and let you know if I understand anything of it…

mccraigmccraig10:09:48

it has some custom line-end parsing behaviour in there @maleghast , which is why we aren't just using clojure.core/line-seq iirc, but it's illustrative of the mechanics of using lazy-seqs for processing large files

maleghast10:09:19

@mccraigmccraig - Yeah, I noticed the custom line-ending stuff…

maleghast10:09:03

I am reading through it now…

maleghast10:09:30

@mccraigmccraig - My reading of clojure.data.csv is that it already provides a "lazy" approach, i.e. that I don't need to take the steps that you've taken here to create lazy seqs of chars, lines etc. to build the lazy collection of rows. I guess my question is more that if I want it to behave the way I think it should, do I have to do all the "work" inside the function that is "creating" the lazy collection of rows, in order to benefit from the laziness?

mccraigmccraig10:09:24

no @maleghast - just map / reduce over the seq that you get back from clojure.data.csv or whatever

mccraigmccraig10:09:43

it's the doall in your code above which was causing the whole file to be read into memory

glenjamin10:09:12

without the doall the file would close before reading the seq

glenjamin10:09:29

doing the work inside the lazy bit is a good way to think about it imo
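
A sketch of that idea, reusing the io/csv aliases from the ingest snippet above. The broken version returns an unrealised lazy seq after the reader has closed; the fixed one does the work inside with-open:

(defn ingest-broken [path]
  ;; returns a lazy seq, but the reader is closed before anyone
  ;; realises it - consuming the result throws an IOException
  (with-open [r (io/reader path)]
    (csv/read-csv r)))

(defn ingest-and-process [path process-row!]
  (with-open [r (io/reader path)]
    ;; the work happens while the reader is still open; dorun
    ;; forces the seq for side effects without retaining the rows
    (dorun (map process-row! (csv/read-csv r)))))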

maleghast10:09:58

Right, that's kind of what I was asking - just doing it badly… If I did this:

(defn format-row
  [row]
  ;; do some data cleaning, data casting etc. here
  row)

(defn insert-row
  [row]
  (sql/insert {:? row} (:conn db)))

(defn csv-to-db [from]
  (with-open [reader (io/reader from)]
    (->> (read-csv reader)
         (map format-row)
         (map insert-row))))

maleghast10:09:23

would that work..?

jonpither10:09:20

could use a pmap for the insert-row bit 🙂 - but this ^ would be lazy and eval would need to be forced

maleghast10:09:59

Ok, and this is where the whole lazy thing hurts my brain - how do I enforce the eval?

glenjamin10:09:52

i'd have csv-to-db be something more like with-csv-row and pass in a function which does the stuff

maleghast10:09:06

i.e. how do I get the actual data "out" on a row by row basis?

maleghast10:09:06

@mccraigmccraig - well yes, I like both reduce and doseq, but I feel you are suggesting that they have powers I am unaware of, that I nonetheless believe that they do have… i.e. I need to level up a bit, don't I?

glenjamin10:09:09

(defn with-csv-row [from action]
  (with-open [reader (io/reader from)]
    (doseq [row (read-csv reader)]
      (action row))))

glenjamin10:09:12

something like that

maleghast10:09:44

@glenjamin - Where action is a function that cleans up the row and runs the query?

glenjamin10:09:50

but with balanced parens and correct function usage - i just typed that in slack
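
And a variant along the lines of jonpither's earlier pmap suggestion, forcing evaluation with dorun (a sketch, using the same hypothetical read-csv):

(defn with-csv-row-parallel [from action]
  (with-open [reader (io/reader from)]
    ;; pmap applies action to rows in parallel; dorun forces the
    ;; whole seq for side effects before with-open closes the reader
    (dorun (pmap action (read-csv reader)))))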

maleghast10:09:28

so doseq only realises one row at a time from the lazy sequence of vectors that read-csv emits..?
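
For reference: doseq does pull rows incrementally (modulo chunked seqs, which realise up to 32 elements at a time) and it does not retain the head, so already-processed rows can be garbage-collected. A sketch that runs in bounded memory with doseq but would exhaust the heap under doall:

;; one hundred million lazily-generated rows, never all in memory
(doseq [row (map vector (range 100000000))]
  (when (zero? (mod (first row) 10000000))
    (println "reached row" (first row))))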

maleghast10:09:08

Also, there's nothing to stop me passing in the row and 2 functions, right, so that I can have a separate function for each of data clean / prep and db insert, right?

Ben Hammond10:09:29

sounds to me like a good fit for transduce

maleghast10:09:35

(more for helping my brain not hurt than anything else, though I like to think I would keep side-effects to their own function(s))

maleghast10:09:21

@ben.hammond - Oh no, you've just invoked my third, super-secret Achilles heel of Clojure-Dev-Imposter syndrome… Transduce / Transducers…

glenjamin10:09:54

@maleghast you could compose a bunch of functions defined independently to create action, then the csv function doesn't need to know about the distinction
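
A sketch of that composition, reusing the hypothetical format-row / insert-row and with-csv-row from above (the filename is made up):

;; comp applies right-to-left: format-row runs first, then insert-row
(def action (comp insert-row format-row))

(with-csv-row "data.csv" action)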

glenjamin10:09:26

and it should be straightforward to write reduce or transduce based variants of with-csv-row with the same signature
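
A sketch of what a transduce-based variant could look like, keeping the same shape (xform is any transducer, e.g. (map format-row)):

(defn with-csv-row-xf [from xform action]
  (with-open [reader (io/reader from)]
    ;; transduce consumes the rows eagerly inside with-open,
    ;; running each through xform and calling action for side effects
    (transduce xform
               (fn
                 ([acc] acc)          ; completion arity
                 ([acc row] (action row) acc))
               nil
               (read-csv reader))))

;; e.g. (with-csv-row-xf "data.csv" (map format-row) insert-row)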

maleghast10:09:54

@glenjamin - That is only true if you've grok-ed transduce

glenjamin11:09:19

yeah, my point being "do this on each line of csv" is a decent interface to code to

minimal11:09:41

i wrote a thing to import massive CSVs recently, I used semantic-csv's transducer functions to process the rows

maleghast11:09:45

I think that I could write it with reduce instead, but I tend to think "reduce means reduction" and this use-case does not take one collection and produce a smaller one…

maleghast11:09:24

I realise that this is a simplistic view, but it helps me to keep my decision-making consistent, at the moment, while I level up from keen amateur to coding in clojure every day

maleghast11:09:01

if doseq will get the job done, in the parameters I need, I will stick to that for now and then get cleverer later… 🙂

sundarj11:09:15

@maleghast reduce is a swiss army knife. you can implement map and filter purely with reduce, for example. a better name for it would be accumulate imo
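
For instance, map and filter written with nothing but reduce (a sketch; eager versions returning vectors):

(defn my-map [f coll]
  (reduce (fn [acc x] (conj acc (f x))) [] coll))

(defn my-filter [pred coll]
  (reduce (fn [acc x] (if (pred x) (conj acc x) acc)) [] coll))

(my-map inc [1 2 3])     ;; => [2 3 4]
(my-filter odd? [1 2 3]) ;; => [1 3]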

Ben Hammond11:09:52

I found this to be an extremely helpful explanation

maleghast11:09:03

@sundarj - I know what you mean, I really do, I just find it easier, at the moment, at my level of grok, to keep reduce in the box that gets opened when I want to do something to a collection where the result is smaller in size or different in structure (or both) from the original collection

sundarj11:09:38

fair enough - i tend to put reduce in the same box as loop

sundarj11:09:08

a more imperative, and thus more powerful, way of doing something to a collection
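
The same fold both ways, to make the loop analogy concrete (a sketch):

;; explicit loop/recur
(loop [xs [1 2 3 4]
       acc 0]
  (if (seq xs)
    (recur (rest xs) (+ acc (first xs)))
    acc))
;; => 10

;; the equivalent reduce
(reduce + 0 [1 2 3 4])
;; => 10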

maleghast11:09:25

*nods* I think I will too in the fullness of time… 🙂

sundarj11:09:41

i've come to clojure from js, which has reduce too, and in that you can just use reduce everywhere you would've used a for/while loop

sundarj11:09:57

if you so choose

sundarj14:09:04

maybe i should just leave you alone, but this is a good description of reduce: http://www.lispcast.com/annotated-clojure-core-reduce

maleghast11:09:17

@ben.hammond - thx, I will go look now…

maleghast11:09:51

@ben.hammond - It is a good video, thanks. I think I am going to have to watch it again, and possibly some more times, but more of the idea(s) make sense now than they did before, so thanks 🙂

Ben Hammond11:09:42

I'd recommend all his videos. his pay channel represents pretty good value

Ben Hammond11:09:18

since you can consume the whole thing in a couple of months

maleghast11:09:27

@ben.hammond - I will consider it, but I don't watch videos when at home, we have a metered, satellite connection and watching YouTube (etc.) videos is not really an option. This is the price I pay for living in the most beautiful spot ever. I happen to be in an AirBnB in London right now so I watched it 🙂

minimal12:09:44

youtube-dl ftw (if not drm)

otfrom12:09:55

@ben.hammond that transducer video was the first time I properly understood what was going on inside a transducer and why (I've been happily using them before that, just not properly understood how it worked internally)

jasonbell13:09:07

Video bookmarked. Something I do need to wrap my head around.

chrisjd16:09:00

@maleghast I also found this really useful; it's reasoned from first principles: https://labs.uswitch.com/transducers-from-the-ground-up-the-essence/

maleghast16:09:18

I will have a "proper" look later on, but this looks very helpful.