#data-science
2017-06-07
luxbock01:06:32

You can't really measure memory usage by looking at the total RAM the JVM has claimed the way you do, because the JVM takes whatever memory you give it and tries its best to use it all to get the best runtime performance
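
The three Runtime numbers make that distinction visible; a minimal sketch, using only standard JVM calls:

(let [rt (Runtime/getRuntime)]
  {:max-mb   (/ (.maxMemory rt) 1e6)     ; the -Xmx ceiling
   :total-mb (/ (.totalMemory rt) 1e6)   ; heap currently claimed from the OS
   :free-mb  (/ (.freeMemory rt) 1e6)})  ; claimed but not holding live data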

husain.mohssen01:06:45

@luxbock it's not meant to be a totally accurate method of measuring memory, but there really aren't many other methods for reliably measuring how much memory the system is using.

husain.mohssen01:06:09

I know I'm not far off, because the GC stops freeing up memory and the system stalls.

john01:06:10

Yeah, the JVM trades lots of space for time, artificially inflating RAM usage. Then Clojure's structures have heavy garbage churn. And then you're trying to sort over hundreds of megs. You'll start thrashing your memory pretty quickly.

husain.mohssen01:06:34

@john can you elaborate more?

john01:06:07

Well, you're up against those three factors when you're looking at memory usage there

husain.mohssen01:06:27

(side point) I've been running a few experiments using (into-array) to bypass the persistent vectors, and I still haven't had much luck.

john01:06:17

I think they were referring to a ByteBuffer over the raw byte stream. I still think you'll see a memory improvement with transducers, which wouldn't dramatically change the shape of your code.
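
E.g., a minimal sketch of the ByteBuffer idea, assuming the same /tmp/180MB.csv file; a memory-mapped buffer keeps the bytes off the JVM heap entirely:

(let [ch  (.getChannel (java.io.RandomAccessFile. "/tmp/180MB.csv" "r"))
      buf (.map ch java.nio.channels.FileChannel$MapMode/READ_ONLY 0 (.size ch))]
  ;; buf is a MappedByteBuffer backed by the file, not by heap objects
  (char (.get buf 0)))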

john01:06:42

How much memory does your system have?

husain.mohssen01:06:25

It's a regular laptop with 16GB, but this is a toy CSV file that is a copy of a large AWS Redshift DB.

john01:06:52

Yeah that's plenty

husain.mohssen01:06:55

When doing complicated analysis I find it easier to load the data into RAM and start playing with it instead of writing complex SQL queries.

john01:06:16

Did you set your JVM memory options?

husain.mohssen01:06:22

my real data set is in the 2-5GB range, which means I wouldn't be able to do these analyses.

husain.mohssen01:06:33

yes, it's set to a max of 16GB
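
For reference, a sketch of one standard way to set that, with the value mirroring the number above:

;; project.clj, if using Leiningen:
:jvm-opts ["-Xmx16g"]

;; or on a plain JVM invocation:
;; java -Xmx16g -cp <classpath> clojure.main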

husain.mohssen01:06:35

Here is my best attempt to save memory (and the results):

user=>  (/ (- (. (Runtime/getRuntime) totalMemory) (. (Runtime/getRuntime) freeMemory)) 1000000.)
35.268248
user=> (time (def parsed (into-array (mapv #(into-array (clojure.string/split % #",")) (line-seq (io/reader "/tmp/180MB.csv"))))))
"Elapsed time: 19111.031364 msecs"
#'user/parsed
user=> (System/gc)
nil
user=>  (/ (- (. (Runtime/getRuntime) totalMemory) (. (Runtime/getRuntime) freeMemory)) 1000000.)
2490.754304
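
The same measurement wrapped in a helper, purely for convenience:

(defn used-mb []
  (let [rt (Runtime/getRuntime)]
    (System/gc)  ; request a collection first, as in the session above
    (/ (- (.totalMemory rt) (.freeMemory rt)) 1e6)))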

husain.mohssen01:06:32

the variable parsed is responsible for more than 2400MB, from a 180MB file

husain.mohssen01:06:52

if I do

(def parsed nil) 

husain.mohssen01:06:58

I get all my memory back:

husain.mohssen01:06:20

(def parsed nil)
#'user/parsed
user=> (System/gc)
nil
user=>  (/ (- (. (Runtime/getRuntime) totalMemory) (. (Runtime/getRuntime) freeMemory)) 1000000.)
51.476048

luxbock01:06:44

what type of data is in the rows? do you want to keep them as strings or to eventually parse them into something else?

john01:06:50

small data

john01:06:04

like four small items in a vector for each row?

husain.mohssen01:06:18

the data is just an array of arrays

husain.mohssen01:06:26

here is a sample:

husain.mohssen01:06:40

head /tmp/180MB.csv 
0,0,1,0
1,0,4,0
2,0,5,1
3,0,9,1
4,0,2,1
5,0,4,3
6,0,6,2
7,0,6,1
8,0,5,1
9,0,5,0

john02:06:30

That's a lot of vectors. Are you ending up with an array of arrays?

husain.mohssen02:06:56

yes, it's an array of arrays: (aget (aget parsed 9) 0) -> gives "9"

john02:06:04

Vectors are going to have some minimal amount of overhead

husain.mohssen02:06:30

but I don't think there are any vectors anymore; (into-array) changed everything to Java arrays

john02:06:58

I'm not sure into-array is multi-level like that

husain.mohssen02:06:46

Not sure, but it seems to be doing the right thing: parsed returns #object["[[Ljava.lang.String;" 0x7e206dd "[[Ljava.lang.String;@7e206dd"]

john02:06:49

hmm, not sure either. In any case, there may be some overhead in the vector itself, taking up a few bytes - perhaps a few bytes more than the few they're storing there.
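
Worth noting: even as raw arrays, every cell here is still a boxed java.lang.String, each costing tens of bytes of object overhead on a 64-bit JVM. Since the rows are all small integers, parsing straight to primitive int arrays should land far closer to the raw file size; a sketch (untested), assuming the same file:

(def parsed
  (into-array
    (map (fn [line]
           (int-array (map #(Integer/parseInt %)
                           (clojure.string/split line #","))))
         (line-seq (io/reader "/tmp/180MB.csv")))))

;; parsed is now an int[][]; (aget (aget parsed 9) 0) returns 9 as an unboxed int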

husain.mohssen02:06:12

maybe I should ask the general clojure channel?

luxbock02:06:37

(into []
  (comp (map #(str/split % #","))
        (map (fn [xs]
               (mapv #(Integer/parseInt %) xs))))
  (-> csv-file io/file io/reader line-seq))
how about this?

john02:06:42

Yeah, and perhaps put together a gist, so others can code golf out a solution

john02:06:00

and some example data that can be downloaded

john02:06:39

nice 🙂

john02:06:01

Do you really need the mapv though?

john02:06:19

@husain.mohssen you don't need it to be a vector, right?

luxbock02:06:27

if you use a regular map you get a vector of lists

luxbock02:06:08

right, yeah but if the data represents a matrix then I figured you'd want to be able to use get-in on it
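
E.g., assuming parsed is the vector of vectors built by the snippet above:

(get-in parsed [9 0])  ;=> 9, row 9, column 0
(get-in parsed [9])    ;=> [9 0 5 0], the whole row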

john02:06:25

yeah, def depends on how he's consuming them

john02:06:09

(into []
  (comp (map #(str/split % #","))
        (map (mapv #(Integer/parseInt %) xs)))
  (-> csv-file io/file io/reader line-seq))

luxbock02:06:12

or you could just go straight to core.matrix with vectorz, which is the Clojure equivalent of NumPy
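
A minimal sketch of that route, assuming net.mikera/core.matrix and net.mikera/vectorz-clj are on the classpath (the rows here are just the sample data):

(require '[clojure.core.matrix :as m])
(m/set-current-implementation :vectorz)  ; dense, double-backed storage

(def mat (m/matrix [[0 0 1 0]
                    [1 0 4 0]
                    [2 0 5 1]]))
(m/shape mat)     ;=> [3 4]
(m/mget mat 1 2)  ;=> 4.0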

john02:06:15

cause then you could just do that

luxbock02:06:36

that would not work because you're passing a transducer to map

luxbock02:06:55

or actually you are not, but the types still don't match

john02:06:04

(into []
  (comp (map #(str/split % #","))
        (map #(Integer/parseInt %) xs))
  (-> csv-file io/file io/reader line-seq))

john02:06:07

typo lol

john02:06:20

That works, right?

luxbock02:06:21

xs is not in scope

john02:06:06

(into []
  (comp (map #(str/split % #","))
        (map #(Integer/parseInt %)))
  (-> csv-file io/file io/reader line-seq))

luxbock02:06:47

you will need the extra call to map, because (map #(str/split % #",")) returns a seq

john02:06:49

Then just throw your group-by in there and you're golden

(defn group-by [f]  ; note: shadows clojure.core/group-by
  (fn [rf]
    (let [grouped-value (volatile! (transient {}))]
      (fn
        ([] (rf))
        ([result]
          ;; at completion, emit the accumulated map downstream, then finish
          (rf (rf result (persistent! @grouped-value))))
        ([result input]
          (let [key (f input)]
            (vswap! grouped-value assoc! key (conj (get @grouped-value key []) input))
            result))))))
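
Hypothetical usage (untested), grouping the parsed rows by their second column:

(into {}
  (comp (map #(str/split % #","))
        (map (fn [xs] (mapv #(Integer/parseInt %) xs)))
        (group-by second))
  (line-seq (io/reader "/tmp/180MB.csv")))
;; => a map from each second-column value to a vector of its rows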

luxbock02:06:26

so the #(Integer/parseInt %) will receive a seq instead of a string

john02:06:35

Probably should have thought through the problem before editing your solution 🙂

john02:06:53

I usually avoid using mapv in a transducer comp, instead just doing an extra (map vec) if I want it. Same performance, I think, and it can easily be added, removed, or commented out.

john02:06:52

btw, that group-by transducer above was written in ClojureScript, I believe. If it gives you any problems, you may want to tweak it or find an actual Clojure version out there.

john03:06:47

np! On another topic, I just ran across this paper (http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005110), which argues that Zipfian distributions are much more common in nature than previously thought. Perhaps those are the domains that give ANNs the feature compression and the unreasonable effectiveness that Max Tegmark talks about: https://www.technologyreview.com/s/602344/the-extraordinary-link-between-deep-neural-networks-and-the-nature-of-the-universe/

john20:06:51

Any chance Cortex can leverage this?

john20:06:09

Hashing!!!

john20:06:07

I've been thinking: neural nets are just really big Bloom filters with the error rate jacked up to some large portion.