#data-science
2017-06-06
jsa-aerial05:06:01

@husain.mohssen as @wiseman says split would work fine. But you can also use clojure.data.csv per line; you just have to use first on it to take the 'one row' from the 'matrix'

jsa-aerial05:06:09

given a line has been read, (-> line csv/read-csv first) would do the job even with quoting and embedded commas
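
For instance, a minimal sketch of that per-line approach (file path invented):

(require '[clojure.data.csv :as csv]
         '[clojure.java.io :as io])

(with-open [rdr (io/reader "/tmp/data.csv")]
  (doseq [line (line-seq rdr)]
    ;; read-csv on a single line yields a one-row "matrix";
    ;; first pulls out that row, honoring quoting and embedded commas
    (println (-> line csv/read-csv first))))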

jsa-aerial05:06:44

Also, @husain.mohssen if you are OK with stuff like python / pandas, which is all mutation all the time, you should have no reservations about using loop for this
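
Something like this throwaway loop/recur sketch, say (file path invented):

(require '[clojure.java.io :as io]
         '[clojure.string :as str])

(with-open [rdr (io/reader "/tmp/data.csv")]
  (loop [lines (line-seq rdr)
         acc   []]
    (if-let [line (first lines)]
      ;; accumulate parsed rows one at a time, imperative-style
      (recur (rest lines) (conj acc (str/split line #",")))
      acc)))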

blueberry08:06:23

@jsa-aerial but he also wants to group-by, which boils down to sorting.

jsa-aerial14:06:00

Can still use split or read-csv for the "parse", but stick with the iota reducer model

husain.mohssen17:06:16

any recommendations for how to connect to a warehousing DB through SQL? I'm looking at korma and it looks sane, any other suggestions? is it worth the hassle or should I just use jdbc?

tanzoniteblack17:06:45

@husain.mohssen I generally recommend using jdbc directly, but instead of doing string bashing to make queries, use https://github.com/jkk/honeysql (a light library that converts clojure maps into parameterized sql strings)
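
As a quick illustration (table and column names invented here), honeysql turns a plain map into a query vector you can hand to clojure.java.jdbc:

(require '[honeysql.core :as sql])

(sql/format {:select [:id :name]
             :from   [:events]
             :where  [:> :created_at "2017-01-01"]})
;; => ["SELECT id, name FROM events WHERE created_at > ?" "2017-01-01"]

;; then something like (jdbc/query db-spec that-vector), with db-spec
;; defined for your warehouse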

husain.mohssen18:06:02

thanks @tanzoniteblack It seems like a winner

husain.mohssen18:06:10

I did a little bit more experimenting with CSV parsing and clojure seems to have real problems parsing non-trivial CSVs. After playing with it a while I realized that it uses many times the memory it needs to parse a file.

husain.mohssen18:06:32

for example for a simple csv file that looks like this:

husain.mohssen18:06:48

2016,12,17,20
2016,12,19,7
2016,12,19,10
2016,12,19,12
2016,12,19,13
2016,12,19,13
2016,12,19,13

husain.mohssen18:06:08

the overhead of memory usage is almost 50x in some cases. To show this I attempted to parse a 150MB.csv file that looks like the above (i.e. 4 columns) and see how much memory I need just to store the result:

(require '[clojure.data.csv :as csv])
(def raw (csv/read-csv (clojure.java.io/reader "/tmp/150MB.csv")))

now if I try to read the last value of the lazy sequence raw I get:

(time (last raw))
"Elapsed time: 77097.499053 msecs"
["2017" "06" "02" "15"]

i.e. it takes 77 seconds to parse a 150 MB file

husain.mohssen18:06:27

even more worrisome is the RAM usage:

husain.mohssen18:06:35

(/ (- (. (Runtime/getRuntime) totalMemory) (. (Runtime/getRuntime) freeMemory)) 1000000.)
6442.394584

husain.mohssen18:06:59

I'd understand a factor of 2 or 3 or even 5, but we are talking about 6442/150 =~ 42x more memory than needed to store the raw data.

husain.mohssen18:06:34

I have to admit I'm a bit shocked that anyone can do serious data science in clojure with this limitation. Am I missing something big?

john18:06:39

For data that size and larger, you probably want to mess with transducers

donaldball18:06:44

These csv libs weren’t written with memory utilization or performance as the primary characteristic. For working with huge csv files, you probably want to mmap them into byte buffers and write a parser that yields tuples of offsets.
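
A rough sketch of the mmap half of that idea using java.nio (just the mapping; the offset-yielding parser is left out):

(import '[java.nio.channels FileChannel FileChannel$MapMode]
        '[java.nio.file OpenOption Paths StandardOpenOption])

(defn mmap-file [path]
  (with-open [ch (FileChannel/open
                  (Paths/get path (make-array String 0))
                  (into-array OpenOption [StandardOpenOption/READ]))]
    ;; the MappedByteBuffer remains valid after the channel closes
    (.map ch FileChannel$MapMode/READ_ONLY 0 (.size ch))))

;; a parser would then scan the buffer for comma and newline bytes,
;; yielding [start end] offset pairs instead of allocating Strings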

wiseman18:06:09

i get a factor of 20x (did you force a GC?). 10x in python.

husain.mohssen18:06:33

I did force GC (System/gc)

husain.mohssen18:06:02

in python it depends on what you use; if you are doing data science you will be using pandas, which will probably take less than 10% extra space.

aaelony18:06:03

you never want to do this for large files: (def raw (csv/read-csv (clojure.java.io/reader "/tmp/150MB.csv")))

wiseman18:06:40

yeah, it is unfortunate.

aaelony18:06:16

avoid holding onto the head

john18:06:15

Their site appears to be down, but grammarly had a helpful blogpost recently on using transducers to minimize memory usage on medium sized data in clojure: https://tech.grammarly.com/blog/building-etl-pipelines-with-clojure

husain.mohssen18:06:21

@wiseman I'm sorry my experiments on python seem to indicate it is using 2-3x the size.

wiseman18:06:47

@husain.mohssen i believe you 🙂 and like i said, it’s an unfortunate situation.

husain.mohssen18:06:56

having said that python/pandas parses the strings to integers

jsa-aerial18:06:44

husain.mohssen: You can do that with semantic-csv
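
Roughly along these lines (a hedged sketch; assumes cast-with accepts a single casting function applied to every column, per the semantic-csv docs):

(require '[clojure.data.csv :as csv]
         '[clojure.java.io :as io]
         '[semantic-csv.core :as sc])

(with-open [rdr (io/reader "/tmp/150MB.csv")]
  (->> (csv/read-csv rdr)
       ;; cast every column from String to long as rows stream through
       (sc/cast-with sc/->long)
       (take 3)
       doall))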

husain.mohssen18:06:13

This is not realistic when doing EDA, you want to load the data and then start to slice and dice it...

aaelony18:06:00

it's realistic. you just need to change the way you think...

aaelony18:06:24

it depends also on your goal

aaelony18:06:46

you can do EDA on samples as well

aaelony18:06:30

I recommend also reading up on transducers

husain.mohssen18:06:41

@aaelony @john any general pointers about why transducers would be the way to go here? I understand that they are generalizations of processes, so I'd understand their value when doing stream processing, but the problem statement here is "here is a 150 MB file, cut up the data in 3-5 different ways and generate the result on the screen or into another CSV file"

jsa-aerial18:06:45

@husain.mohssen , @aaelony is really on target here - you don't want to lazy seq the whole thing into memory, you can use iota's reducers / mmap model and/or use transducers. The problem is your approach is straightforward, but naive.
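
Something like this hedged sketch of the iota + reducers route (the fold just counts rows as a stand-in for real work):

(require '[iota]
         '[clojure.core.reducers :as r]
         '[clojure.string :as str])

;; iota/seq mmaps the file and hands out lines without retaining the head
(->> (iota/seq "/tmp/150MB.csv")
     (r/filter identity)               ; drop the nils iota can emit
     (r/map #(str/split % #","))
     (r/map #(mapv (fn [s] (Long/parseLong s)) %))
     (r/fold + (fn [acc _row] (inc acc))))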

husain.mohssen18:06:46

@jsa-aerial I'll look carefully at both transducers and mmap's ...

john19:06:03

@husain.mohssen it doesn't solve your sorting problem. Transducers might just help you with any extra collections you may be creating elsewhere in your logic.

jsa-aerial19:06:27

the nice thing about iota is it effectively wraps mmaps in a clojure-esque way

wiseman19:06:58

there is something to be said for being able to use the straightforward, naive approach because your language + library uses 1/4 as much memory so your dataset just fits, without the cognitive overhead of having to go with a more sophisticated, complicated approach.

wiseman19:06:01

if your dataset fits.

jsa-aerial19:06:56

Yes, there is a lot to recommend the straightforward naive approach in general. But in cases like this, it may not be robust enough to get the job done...

john19:06:20

aye, clj seems to favor experimentation in repl-oriented workflows. Then, when you're dealing with 100s or thousands of megs, you bolt on powertools.

aaelony19:06:35

or get a box with lots of memory (https://aws.amazon.com/ec2/instance-types/x1/) and hold-onto-the-head with impunity.

jsa-aerial19:06:30

If you are going straightforward naive, it's hard to beat 2TB of ram 😂

john20:06:36

@husain.mohssen here are some thoughts on how to structure a sort in clojure over data too large to fit in memory: http://blog.malcolmsparks.com/?p=42

john20:06:28

You could potentially pair that up with something like https://github.com/datacrypt-project/hitchhiker-tree#outboard and lazily sort, piece-wise over very large data on disk.

john20:06:02

Might be faster to just rent 2TB of ram though 🙂

husain.mohssen20:06:11

a 20x tax is so high that I think clojure may not be the right tool for this.

jsa-aerial20:06:36

Use Java arrays and turn the stuff into integers

husain.mohssen20:06:50

Even Mathematica is at the 7x range.

husain.mohssen20:06:37

@jsa-aerial yup that would do it.

husain.mohssen20:06:57

core.matrix will also work.

jsa-aerial20:06:13

core.matrix would 'work' but not for the actual loading. There was talk (Mike Anderson) about adding streaming and DB style loading to the c.m.datasets but it is not there yet AFAIK

jsa-aerial20:06:29

You need to realize that both Mathematica and Python are just punting directly to C libs for this.

husain.mohssen20:06:05

the problem is not the loading AFAICT; (slurp) loads the whole thing in less than 2 seconds. The problem is that the parsing is generating data structures with too much overhead; it seems to be a combination of unicode strings + persistent data structures as well as laziness
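
A rough back-of-envelope for where that overhead comes from (approximate 64-bit JVM sizes, so treat the numbers as ballpark):

;; a line like "2016,12,19,13" is ~14 bytes on disk, but parsed it becomes:
;;   4 java.lang.Strings  ~48 bytes each (object + char[] headers, 2 bytes/char)
;;   1 persistent vector  ~100+ bytes of node overhead
;;   1 lazy-seq cell      ~32 bytes
;; => roughly 300+ bytes in memory per 14-byte row, i.e. ~20x or more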

jsa-aerial20:06:09

So, punting to Java (and even C under the covers) would be reasonable here (even if not as nice)

jsa-aerial20:06:01

You can get rid of laziness with transducers and reducing (no consing for 'intermediate' seq nodes); use Java arrays to remove any persistent DS overhead; use conversion per row to get rid of string/unicode overhead
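
Putting those three together, a hedged sketch (naive split-based parse, so no quoting support):

(require '[clojure.java.io :as io]
         '[clojure.string :as str])

(defn parse-file [path]
  (with-open [rdr (io/reader path)]
    ;; into + a transducer consumes the lines eagerly (no lazy-seq
    ;; retention); each row becomes a primitive long[] instead of a
    ;; persistent vector of Strings
    (into []
          (map (fn [^String line]
                 (long-array (map #(Long/parseLong %) (str/split line #",")))))
          (line-seq rdr))))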

jsa-aerial20:06:05

If you do all of that you will be close to what you would have in C (aka Python, et al.)

jsa-aerial20:06:45

Actually, you could use c.c.matrix here with the vectorz impl instead of Java arrays, but you would have doubles not longs, if that is ok
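
For example (hedged; assumes vectorz-clj is on the classpath):

(require '[clojure.core.matrix :as m])
(m/set-current-implementation :vectorz)

;; vectorz stores these as flat double arrays, not boxed objects
(def rows (m/matrix [[2016 12 17 20]
                     [2016 12 19 7]]))
(m/shape rows)  ;; => [2 4]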

husain.mohssen20:06:26

Thanks everyone. These are very very helpful hints

aaelony20:06:12

again, you might want to look into onyx