#data-science
2017-06-05
husain.mohssen 20:06:30

What would this channel recommend for processing 3-10 GB CSV files? My REPL is choking with out-of-memory errors, even on a beefy machine.

husain.mohssen 20:06:35

am I thinking about this wrong?

donaldball 20:06:27

What are you trying to do with them?

husain.mohssen 20:06:20

All types of things. For starters, I need to group by the first column.

husain.mohssen 20:06:52

Later on we want to do all sorts of processing and output the results into another CSV file.

jsa-aerial 20:06:12

Do you need all of it at once, or can you reduce it one row at a time? If the latter, read each line and process it. If the resulting processed data is too large, you may want to use this
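
(For reference, a minimal sketch of that row-at-a-time reduction with clojure.data.csv; `big.csv` is a hypothetical stand-in, and the aggregate assumed here is a row count per first-column value. If you need the full groups materialized, the accumulator itself can grow too large to hold in memory.)

```clojure
(require '[clojure.java.io :as io]
         '[clojure.data.csv :as csv])

;; Stream the file and keep only the aggregate in memory.
(defn count-by-first-column [path]
  (with-open [rdr (io/reader path)]
    ;; csv/read-csv is lazy, so rows are realized one at a time
    ;; and become garbage once the reducing fn is done with them.
    (reduce (fn [counts row]
              (update counts (first row) (fnil inc 0)))
            {}
            (csv/read-csv rdr))))

(count-by-first-column "big.csv")
```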

husain.mohssen 21:06:09

hmm seems like a good approach

husain.mohssen 21:06:35

I did some basic tests, and reading a CSV in Clojure is twice as slow as in pandas.

husain.mohssen 21:06:21

I'm guessing that's because clojure.data.csv/read-csv is lazy
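
(One way to make sure laziness isn't skewing a comparison like that is to force the whole sequence inside the timing; a rough sketch, again with a hypothetical `big.csv`:)

```clojure
(require '[clojure.java.io :as io]
         '[clojure.data.csv :as csv])

;; count forces the entire lazy sequence, so the timing covers
;; reading and parsing every row, not just building the seq.
(with-open [rdr (io/reader "big.csv")]
  (time (count (csv/read-csv rdr))))
```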

husain.mohssen 22:06:07

Interestingly, iota can be as fast as Python, but I have to do the CSV parsing myself.
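
(For context: iota memory-maps the file and works with clojure.core.reducers. A rough sketch of the same group-count done that way, assuming a hypothetical `big.csv` with simple unquoted fields:)

```clojure
(require '[iota]
         '[clojure.core.reducers :as r]
         '[clojure.string :as str])

;; iota/vec memory-maps the file so r/fold can reduce it in
;; parallel via fork/join. The CSV parsing is hand-rolled, so
;; this only handles simple fields with no quoting.
(->> (iota/vec "big.csv")
     (r/filter identity)           ; iota uses nil for blank lines
     (r/map #(str/split % #","))
     (r/fold (r/monoid (partial merge-with +) hash-map)
             (fn [counts row]
               (update counts (first row) (fnil inc 0)))))
```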

husain.mohssen 22:06:54

@jsa-aerial what do you recommend if you want to parse CSVs? Just send each line through clojure.string/split?

aaelony 22:06:57

Worth taking a look at Onyx as well: https://github.com/onyx-platform/onyx

husain.mohssen 22:06:58

Onyx would be overkill for what I'm looking at.

husain.mohssen 22:06:21

And to be completely honest, I'd rather just switch over to Spark if I needed a distributed system (given that I know how to use it).

wiseman 23:06:04

@husain.mohssen clojure.string/split should be fine as long as you have simple CSVs with no quoting, embedded commas, etc.
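
(A small illustration of that caveat, with made-up rows: naive splitting mangles a quoted field containing a comma, while clojure.data.csv parses it correctly.)

```clojure
(require '[clojure.string :as str]
         '[clojure.data.csv :as csv])

;; Simple rows split fine:
(str/split "a,b,c" #",")              ;=> ["a" "b" "c"]

;; A quoted field with an embedded comma gets mangled:
(str/split "a,\"b,c\",d" #",")        ;=> ["a" "\"b" "c\"" "d"]

;; A real CSV parser handles the quoting:
(first (csv/read-csv "a,\"b,c\",d"))  ;=> ["a" "b,c" "d"]
```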