2017-09-27
Morning
How are things in the Netherlands, @thomas ?
fine... and very misty at the moment @yogidevbear
I get the feeling that no government can be a good thing
Anyone got 5 minutes to help me with something that I should know better about..? I am trying to consume a 2.5Mn row CSV file and the approach I have been using for sub 250K row files is no longer cutting the mustard. I have a sense that two things I don't really grok, laziness and streaming, are going to help me here, but I am not really sure where to start… 🙂 (I am using clojure.data.csv)
Current ingest looks like this:
(defn ingest-csv-data
  "Read in a CSV file from the filesystem based on a configured path :external-files-path"
  [csv-resource]
  (with-open [in-file (io/reader (io/file (str (:external-files-path ingest-cfg) csv-resource)))]
    (doall
     (csv/read-csv in-file))))
and I pass the result to functions that iterate on it (using doseq). I am assuming that part of the problem is that doseq is not going to work with the laziness, that in order to doseq the collection (which is a lazy sequence of vectors) it will need to realise the whole collection first..? TL;DR: I am running out of memory.
from my reading of doall, that's going to force the sequence to be realised (i.e. the whole CSV file to be read in)
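For illustration, a minimal sketch of the difference (assuming clojure.data.csv required as csv and clojure.java.io as io; the file name is hypothetical): consuming the lazy seq inside with-open, without doall, streams the file one row at a time.

(require '[clojure.data.csv :as csv]
         '[clojure.java.io :as io])

;; Sketch: counting rows never holds more than one row in memory,
;; because reduce consumes the lazy seq element by element.
(with-open [in-file (io/reader "big.csv")]
  (reduce (fn [n _row] (inc n)) 0 (csv/read-csv in-file)))

;; Wrapping (csv/read-csv in-file) in doall instead would realise
;; all 2.5M rows before anything else could happen.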
@maleghast you can use either laziness or streaming to help you with the OOM problem - have a look at the start of this for our lazy-tsv parser - https://gist.github.com/5207b5069bac7188fbb6dfce2d38c490
readers are a file-handle, maybe a buffer and some character interpretation atop the raw input-stream - they don't read the whole file
Thanks @mccraigmccraig @dominicm @guy and @iain.tatch
I will take a look at @mccraigmccraig's link-y / gist and let you know if I understand anything of it…
it has some custom line-end parsing behaviour in there @maleghast, which is why we aren't just using clojure.core/line-seq iirc, but it's illustrative of the mechanics of using lazy-seqs for processing large files
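Those mechanics, as a rough sketch (essentially what clojure.core/line-seq does, not the gist's custom version): lazy-seq defers reading the next line until a consumer asks for it.

(defn lazy-lines
  "Sketch: a lazy seq of lines; each step reads one line on demand."
  [^java.io.BufferedReader rdr]
  (lazy-seq
   (when-let [line (.readLine rdr)]
     (cons line (lazy-lines rdr)))))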
@mccraigmccraig - Yeah, I noticed the custom line-ending stuff…
@mccraigmccraig - My reading of clojure.data.csv is that it already provides a "lazy" approach, i.e. that I don't _need_ to take the steps that you've taken here to create lazy seqs of chars, lines etc. to build the lazy collection of rows. I guess my question is more that if I want it to behave the way I think it should, do I have to do all the "work" inside the function that is "creating" the lazy collection of rows, in order to benefit from the laziness?
no @maleghast - just map / reduce over the seq that you get back from clojure.data.csv or whatever
it's the doall
in your code above which was causing the whole file to be read into memory
Right, that's kind of what I was asking - just doing it badly… If I did this:
(defn format-row
  [row]
  ;; Do some data cleaning, data casting etc. here
  )

(defn insert-row
  [row]
  (sql/insert {:? row} (:conn db)))

(defn csv-to-db [from]
  (with-open [reader (io/reader from)]
    (->> (read-csv reader)
         (map format-row)
         (map insert-row))))
could use a pmap for the insert-row bit 🙂 - but this ^ would be lazy and eval would need to be forced
Ok, and this is where the whole lazy thing hurts my brain - how do I force the eval?
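Two common ways to force it, as a sketch (reusing the hypothetical format-row / insert-row from above; from is anything io/reader accepts): dorun walks a lazy seq purely for its side effects and discards the results, while run! does the mapping and the walking in one step.

;; Sketch: force the side effects while the reader is still open.
(with-open [reader (io/reader from)]
  (dorun (map (comp insert-row format-row) (csv/read-csv reader))))

;; or, equivalently:
(with-open [reader (io/reader from)]
  (run! (comp insert-row format-row) (csv/read-csv reader)))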
i'd have csv-to-db be something more like with-csv-row, and pass in a function which does the stuff
@glenjamin - *whoosh*
@maleghast reduce is good, as is doseq
@mccraigmccraig - well yes, I like both reduce and doseq, but I feel you are suggesting that they have powers I am unaware of, that I nonetheless believe that they do have… i.e. I need to level up a bit, don't I?
(defn with-csv-row [from action]
  (with-open [reader (io/reader from)]
    (doseq [row (read-csv reader)]
      (action row))))
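So the call site would look something like this (a sketch; the path and the inline fn are hypothetical):

(with-csv-row "/data/huge.csv"
  (fn [row] (insert-row (format-row row))))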
@glenjamin - Where action is a function that cleans up the row and runs the query?
so doseq only realises one row at a time from the lazy sequence of vectors that read-csv emits..?
Also, there's nothing to stop me passing in the row and 2 functions, so that I can have a separate function for each of data clean / prep and db insert, right?
sounds to me like a good fit for transduce
(more for helping my brain not hurt than anything else, though I like to _think_ I would keep side-effects to their own function(s))
@ben.hammond - Oh no, you've just invoked my third, super-secret Achilles heel of Clojure-Dev-Imposter syndrome… Transduce / Transducers…
@maleghast you could compose a bunch of functions defined independently to create action, then the csv function doesn't need to know about the distinction
@glenjamin Good point, ok
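For example (a sketch; process-row! is a hypothetical name): comp glues the independently defined steps into a single action, so with-csv-row stays agnostic about what happens per row.

;; comp applies right-to-left: format-row runs first, then insert-row
(def process-row! (comp insert-row format-row))

(with-csv-row "/data/huge.csv" process-row!)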
and it should be straightforward to write reduce or transduce based variants of with-csv-row with the same signature
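Something along those lines, perhaps (a sketch; reduce-csv / transduce-csv are hypothetical names, and xform is any transducer, e.g. (map format-row)):

(defn reduce-csv [from f init]
  (with-open [reader (io/reader from)]
    (reduce f init (csv/read-csv reader))))

(defn transduce-csv [from xform f init]
  (with-open [reader (io/reader from)]
    (transduce xform f init (csv/read-csv reader))))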
@glenjamin - That is only true if you've grokked transduce
yeah, my point being "do this on each line of csv" is a decent interface to code to
i wrote a thing to import massive CSVs recently, I used semantic-csv's transducer functions to process the rows
I think that I could write it with reduce instead, but I tend to think "reduce means reduction" and this use-case does not take one collection and produce a smaller one…
I realise that this is a simplistic view, but it helps me to keep my decision-making consistent, at the moment, while I level up from keen amateur to coding in clojure every day
if doseq will get the job done, in the parameters I need, I will stick to that for now and then get cleverer later… 🙂
@maleghast reduce is a swiss army knife. you can implement map and filter purely with reduce, for example. a better name for it would be accumulate imo
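A sketch of that point (eager, vector-returning variants; hypothetical names):

(defn map-via-reduce [f coll]
  (reduce (fn [acc x] (conj acc (f x))) [] coll))

(defn filter-via-reduce [pred coll]
  (reduce (fn [acc x] (if (pred x) (conj acc x) acc)) [] coll))

;; (map-via-reduce inc [1 2 3])     => [2 3 4]
;; (filter-via-reduce odd? [1 2 3]) => [1 3]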
I found this to be an extremely helpful explanation
@sundarj - I know what you mean, I really do, I just find it easier, at the moment, at my level of grok, to keep reduce in the box that gets opened when I want to do something to a collection where the result is smaller in size, or different in structure, or both, compared to the original collection
i've come to clojure from js, which has reduce too, and in that you can just use reduce everywhere you would've used a for/while loop
maybe i should just leave you alone, but this is a good description of reduce: http://www.lispcast.com/annotated-clojure-core-reduce
@ben.hammond - thx, I will go look now…
@ben.hammond - It is a good video, thanks. I think I am going to have to watch it again, and possibly some more times, but more of the idea(s) make sense now than they did before, so thanks 🙂
I'd recommend all his videos. his pay channel represents pretty good value
since you can consume the whole thing in a couple of months
@ben.hammond - I will consider it, but I don't watch videos when at home, we have a metered, satellite connection and watching YouTube (etc.) videos is not really an option. This is the price I pay for living in the most beautiful spot ever. I happen to be in an AirBnB in London right now, so I watched it 🙂
@ben.hammond that transducer video was the first time I properly understood what was going on inside a transducer and why (I'd been happily using them before that, just hadn't properly understood how they worked internally)
@maleghast I also found this really useful; it's reasoned from first principles: https://labs.uswitch.com/transducers-from-the-ground-up-the-essence/