#data-science
2020-06-07
chrisn14:06:28

tech.ml.dataset can load that file, I believe. It is far more memory-efficient in general.

user> (require '[tech.ml.dataset :as ds])
nil
user> (def ds (ds/->dataset ""))
#'user/ds
user> (require '[clj-memory-meter.core :as mm])
nil
user> (mm/measure ds)
"5.7 MB"
For a one-stop data exploration pathway that should work well for you: https://github.com/cnuernber/simpledata/
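A minimal sketch of the kind of exploration loop described above, using calls from the tech.ml.dataset API; `"data.csv"` is a placeholder filename, not one from the conversation:

```clojure
(require '[tech.ml.dataset :as ds])

;; load a CSV into a column-oriented dataset
(def dataset (ds/->dataset "data.csv"))

(ds/column-names dataset)       ;; list the columns
(ds/head dataset)               ;; first few rows as a printable table
(ds/descriptive-stats dataset)  ;; per-column summary statistics
```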

👍 4
metasoarous18:06:10

Dude, that memory-meter shit is dope! Going to stow that one away in my toolbox.

chrisn19:06:49

Haha, yeah totally. I really wish I had found that earlier; tracking down which object graph in a program is hogging RAM is a serious problem sometimes 🙂.

jumar06:06:19

@UDRJMEFSN Do I need any special setup for tech.ml.dataset? The 2.0-beta... didn't work well, so I tried 1.73, but that gives another error:

1. Caused by java.lang.IllegalArgumentException
   Missing config value: :tech-io-cache-local

                  core.clj:  198  tech.config.core/get-config
                  core.clj:  193  tech.config.core/get-config
             providers.clj:   51  tech.io.providers/fn/fn
                  core.clj: 2753  clojure.core/map/fn
              LazySeq.java:   42  clojure.lang.LazySeq/sval
              LazySeq.java:   51  clojure.lang.LazySeq/seq
                   RT.java:  535  clojure.lang.RT/seq
                  core.clj:  137  clojure.core/seq
                  core.clj: 2809  clojure.core/filter/fn
              LazySeq.java:   42  clojure.lang.LazySeq/sval
              LazySeq.java:   51  clojure.lang.LazySeq/seq
                   RT.java:  535  clojure.lang.RT/seq
                  core.clj:  137  clojure.core/seq
                  core.clj:  930  clojure.core/reduce1
                  core.clj:  947  clojure.core/reverse
                  core.clj:  947  clojure.core/reverse
             providers.clj:   42  tech.io.providers/provider-seq->wrapped-providers
             providers.clj:   35  tech.io.providers/provider-seq->wrapped-providers
             providers.clj:   48  tech.io.providers/fn
             providers.clj:   47  tech.io.providers/fn
                  AFn.java:  152  clojure.lang.AFn/applyToHelper
                  AFn.java:  144  clojure.lang.AFn/applyTo
                  core.clj:  665  clojure.core/apply
                  core.clj: 6353  clojure.core/memoize/fn
               RestFn.java:  397  clojure.lang.RestFn/invoke
                    io.clj:   35  
                    io.clj:   35  
                    io.clj:   80  
                    io.clj:   76  
               RestFn.java:  410  clojure.lang.RestFn/invoke
                  base.clj:  572  tech.ml.dataset.base/->dataset
                  base.clj:  515  tech.ml.dataset.base/->dataset
                  base.clj:  580  tech.ml.dataset.base/->dataset
                  base.clj:  515  tech.ml.dataset.base/->dataset

jumar06:06:38

With 2.0-beta-57 I get this error even earlier, when requiring the lib:

Syntax error (IllegalArgumentException) compiling . at (tech/ml/dataset/math.clj:136:11).
No matching method fit found taking 5 args for class smile.clustering.KMeans

jumar07:06:04

I solved it. It seems that fastmath pulls in older versions of the smile-* dependencies; I had to manually specify the 2.4.0 versions in my project.clj. The memory footprint is indeed considerably lower compared to a Clojure vector of hashmaps: https://github.com/jumarko/clojure-experiments/blob/master/src/clojure_experiments/csv.clj#L39-L49

(def csv-ds (csv/read-csv (slurp "")))
  ;; don't be fooled by lazy seqs when measuring memory -> use vector
  (mm/measure (vec csv-ds))
  ;; => "23.1 MB"
  (mm/measure (vec (csv-data->maps csv-ds)))
  ;; => "31.8 MB"

  (require '[tech.ml.dataset :as ds])
  (def ds (ds/->dataset ""))
  (mm/measure ds)
  ;; => "5.1 MB"  ;;
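The dependency fix above might look roughly like the following project.clj fragment. This is a sketch, not jumar's actual file: the project name, Clojure version, and the exact smile artifact list are assumptions; only the smile 2.4.0 pin and the tech.ml.dataset 2.0-beta-57 version come from the conversation.

```clojure
;; Hypothetical project.clj: pin smile at 2.4.0 so it wins over the
;; older smile-* versions that fastmath would otherwise bring in.
(defproject clojure-experiments "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.10.1"]
                 [techascent/tech.ml.dataset "2.0-beta-57"]
                 ;; explicit pins take precedence over transitive deps
                 [com.github.haifengl/smile-core "2.4.0"]
                 [com.github.haifengl/smile-math "2.4.0"]])
```

In Leiningen, a top-level dependency shadows any transitive version of the same artifact, which is why explicitly listing the newer smile coordinates resolves the `No matching method fit` error.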

chrisn13:06:28

For more on that pathway: https://gist.github.com/cnuernber/26b88ed259dd1d0dc6ac2aa138eecf37

If you get a dataset where the numeric data can be represented by short integers and the string columns have low numbers of unique items, then the dataset library really will shine.

Also, if you measure the memory used by ds/mapseq-reader you will see that the maps are really referring back to the original table data; you only pay for what you read in terms of converting a dataset back into a sequence of maps. https://github.com/techascent/tech.ml.dataset/blob/master/java/tech/ml/dataset/FastStruct.java I got that idea from @U05100J3V’s semantic-csv library.
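The pay-for-what-you-read behaviour can be sketched like this; `"data.csv"` is a placeholder filename, and the calls assume the tech.ml.dataset and clj-memory-meter APIs shown earlier in the thread:

```clojure
(require '[tech.ml.dataset :as ds]
         '[clj-memory-meter.core :as mm])

(def dataset (ds/->dataset "data.csv"))

;; a seq of map-like views over the dataset's columns
(def rows (ds/mapseq-reader dataset))

;; each element behaves like an ordinary Clojure map...
(first rows)

;; ...but the maps index back into the shared column storage
;; (via FastStruct) instead of copying every value, so measuring
;; them adds little beyond the dataset itself.
(mm/measure rows)
```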