This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2017-06-07
Channels
- # aleph (19)
- # aws (1)
- # beginners (75)
- # boot (28)
- # cider (1)
- # cljs-dev (12)
- # cljsrn (20)
- # clojure (350)
- # clojure-argentina (1)
- # clojure-chicago (2)
- # clojure-dev (2)
- # clojure-russia (5)
- # clojure-spec (2)
- # clojure-uk (14)
- # clojure-ukraine (3)
- # clojurescript (68)
- # component (87)
- # core-async (25)
- # core-logic (13)
- # cursive (4)
- # data-science (72)
- # datascript (59)
- # datomic (15)
- # defnpodcast (7)
- # emacs (33)
- # hoplon (5)
- # immutant (73)
- # jobs (21)
- # klipse (6)
- # lumo (14)
- # off-topic (26)
- # om (23)
- # onyx (6)
- # parinfer (37)
- # protorepl (4)
- # re-frame (13)
- # ring (2)
- # rum (3)
- # spacemacs (2)
- # specter (22)
- # sql (47)
- # uncomplicate (10)
- # unrepl (79)
- # untangled (66)
- # vim (47)
- # yada (17)
You can't really measure memory usage by looking at the total RAM used by the JVM the way you do, because the JVM takes whatever heap you give it and tries its best to use all of it before collecting, to get the best runtime performance
@luxbock it's not meant to be a totally accurate method of measuring memory, but there really aren't many other ways of reliably measuring how much memory the system is using.
I know I'm not far off because GC stops freeing up memory and the system stops.
Yeah, the JVM trades lots of space for time, artificially inflating RAM usage. Then Clojure's structures have heavy garbage churn. Then you're trying to sort over hundreds of megs. You'll start thrashing your memory pretty quickly.
@john can you elaborate more?
(side point) I've been running a few experiments using (into-array) to bypass the persistent vectors and I still don't have too much luck.
I think they were referring to a bytebuffer off the raw byte stream. I still think you'll see memory improvement with transducers, which wouldn't dramatically change the shape of your code.
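A minimal sketch of that transducer idea (hypothetical file path and column index, not the actual code from this conversation): reducing straight off the reader means only one line is realized at a time, instead of holding the whole parsed file in memory.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

;; sum one CSV column without ever materializing the full data set;
;; `path` and `col` are placeholders for your file and column index
(defn sum-column [path col]
  (with-open [rdr (io/reader path)]
    (transduce (comp (map #(str/split % #","))
                     (map #(Integer/parseInt (nth % col))))
               +
               0
               (line-seq rdr))))
```

Because `transduce` is eager and the reader is closed by `with-open`, peak heap stays proportional to one line rather than the whole file.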
It's a regular laptop with 16GB but this is a toy csv file that is a copy of a large redshift AWS DB.
when doing complicated analysis I find it easier to load up the data in ram and start playing with it instead of writing complex SQL queries
my real data set is in the 2-5GB range, which means I'd not be able to do these analyses.
yes it's set to max of 16GB
Here is my best attempt to save memory (and the results):
user=> (/ (- (. (Runtime/getRuntime) totalMemory) (. (Runtime/getRuntime) freeMemory)) 1000000.)
35.268248
user=> (time (def parsed (into-array (mapv #(into-array (clojure.string/split % #",")) (line-seq (io/reader "/tmp/180MB.csv"))))))
"Elapsed time: 19111.031364 msecs"
#'user/parsed
user=> (System/gc)
nil
user=> (/ (- (. (Runtime/getRuntime) totalMemory) (. (Runtime/getRuntime) freeMemory)) 1000000.)
2490.754304
the variable parsed is responsible for more than 2400MB based on a 180MB file
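That blow-up is roughly what per-object overhead predicts. A back-of-envelope estimate (the ~48 bytes per short java.lang.String and the line shape are assumptions, not measurements):

```clojure
;; why a 180MB CSV of short strings balloons on the heap.
;; Assumptions: ~8 bytes per line of text (e.g. "5,0,4,3\n"),
;; 4 cells per line, ~48 bytes of JVM overhead per short String
;; (object header + char array header + padding).
(let [file-bytes       180e6
      bytes-per-line   8
      cells-per-line   4
      lines            (/ file-bytes bytes-per-line)   ; ~22.5M lines
      cells            (* lines cells-per-line)        ; ~90M Strings
      bytes-per-string 48]
  (/ (* cells bytes-per-string) 1e6))
;; => 4320.0  (MB) -- the same order of magnitude as the 2400MB observed
```

So even before counting the array objects themselves, per-String overhead alone predicts gigabytes of heap for a file of single-character cells.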
if I do
(def parsed nil)
I get all my memory back:
(def parsed nil)
#'user/parsed
user=> (System/gc)
nil
user=> (/ (- (. (Runtime/getRuntime) totalMemory) (. (Runtime/getRuntime) freeMemory)) 1000000.)
51.476048
what type of data is in the rows? do you want to keep them as strings or to eventually parse them into something else?
the data is just an array of arrays
here is a sample:
head /tmp/180MB.csv
0,0,1,0
1,0,4,0
2,0,5,1
3,0,9,1
4,0,2,1
5,0,4,3
6,0,6,2
7,0,6,1
8,0,5,1
9,0,5,0
yes, it's an array of arrays: (aget (aget parsed 9) 0)
-> gives "9"
but I don't think there are any vectors anymore; (into-array) changed everything to a Java array
Not sure but it seems to be doing the right thing: parsed
returns #object["[[Ljava.lang.String;" 0x7e206dd "[[Ljava.lang.String;@7e206dd"]
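One way to double-check that the result really is a nested Java array (toy data standing in for the real file):

```clojure
(require '[clojure.string :as str])

;; two sample lines instead of the 180MB file
(def parsed
  (into-array (mapv #(into-array (str/split % #","))
                    ["0,0,1,0" "1,0,4,0"])))

(class parsed)                      ;; => [[Ljava.lang.String;
(.getComponentType (class parsed))  ;; => [Ljava.lang.String;  (each row is a String[])
(aget parsed 1 2)                   ;; => "4"
```

The `[[L...` prefix in the class name is the JVM's notation for a two-dimensional object array, so the printed `#object["[[Ljava.lang.String;" ...]` is exactly what you'd expect.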
hmm, not sure either. In any case, there may be some overhead in the vector itself, taking up a few bytes - perhaps a few bytes more than the few they're storing there.
maybe I should ask the general clojure channel?
(into []
      (comp (map #(str/split % #","))
            (map (fn [xs]
                   (mapv #(Integer/parseInt %) xs))))
      (-> csv-file io/file io/reader line-seq))
how about this? @husain.mohssen you don't need it to be a vector, right?
right, yeah but if the data represents a matrix then I figured you'd want to be able to use get-in
on it
(into []
      (comp (map #(str/split % #","))
            (map (fn [xs]
                   (map #(Integer/parseInt %) xs))))
      (-> csv-file io/file io/reader line-seq))
or you could just go straight to core.matrix
with vectorz, which is the Clojure equivalent of NumPy
Then just throw your group-by in there and you're golden
;; note: this shadows clojure.core/group-by
(defn group-by [f]
  (fn [rf]
    (let [grouped-value (volatile! (transient {}))]
      (fn
        ([] (rf))
        ([result]
         (rf (rf result (persistent! @grouped-value))))
        ([result input]
         (let [key (f input)]
           (vswap! grouped-value assoc! key (conj (get @grouped-value key []) input))
           result))))))
I usually avoid using mapv in a transducer comp, instead just doing an extra (map vec)
if I want it. Same performance I think, and it can be easily added or removed or commented out.
btw, that group-by transducer above was written in ClojureScript, I believe. If it gives you any problems, you may want to tweak it or find an actual Clojure version out there.
Ty everyone !
np! On another topic, just ran across this paper (http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005110) which argues that zipfian distributions are much more common in nature than previously thought. Perhaps those are the domains that give ANNs the feature compression and the unreasonable effectiveness that Max Tegmark talks about: https://www.technologyreview.com/s/602344/the-extraordinary-link-between-deep-neural-networks-and-the-nature-of-the-universe/