#data-science
2022-07-06
Benjamin 08:07:47

(ds/->dataset [{:foo 10} {:foo 3} {:foo 3}])
;; desired =>
[{:foo 3} {:foo 3}]
I'd like to filter the rows where :foo is the smallest. Is the solution something with group-by and sort? Basically I have a date in my data and I'd like to look at the first day.

Benjamin 08:07:50

(-> (ds/->dataset [{:foo 10} {:foo 3} {:foo 3}])
    (tc/order-by :foo :asc)   ;; sort ascending so the smallest :foo comes first
    (tc/group-by :foo)
    (tc/groups->seq)          ;; a seq of datasets, one per distinct :foo
    first)
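
An alternative sketch without grouping, assuming tablecloth's predicate form of select-rows: compute the column minimum, then keep every matching row (so ties are preserved):

(let [d  (tc/dataset [{:foo 10} {:foo 3} {:foo 3}])
      lo (reduce min (d :foo))]          ;; a dataset is callable; (d :foo) returns the column
  (tc/select-rows d #(= lo (:foo %))))   ;; the predicate receives each row as a map
;; => a two-row dataset holding both {:foo 3} rows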

genmeblog 08:07:19

In this degenerate example I would reach for (reduce min (map :foo [{:foo 10} {:foo 3} {:foo 3}])), or sort-by :foo and first.
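
Spelled out, those two plain-Clojure routes look like this; note that sort-by plus first returns only one of the tied rows, while the question asked for all of them:

;; value of the smallest :foo
(reduce min (map :foo [{:foo 10} {:foo 3} {:foo 3}]))
;; => 3

;; one smallest row; ties beyond the first are dropped
(first (sort-by :foo [{:foo 10} {:foo 3} {:foo 3}]))
;; => {:foo 3}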

genmeblog 09:07:58

In case you want to group on another field and take the smallest :foo within each group, something like this:

genmeblog 09:07:54

(-> (tc/dataset [{:foo 10 :group 1} {:foo 3 :group 2} {:foo 3 :group 1}])
    (tc/group-by :group)
    (tc/order-by :foo) ;; done within each group
    (tc/first)         ;; done within each group
    (tc/ungroup))
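
For reference, on that input the pipeline should return one smallest-:foo row per group; tc/rows with :as-maps pulls them back out as maps (group order may vary, and note that tc/first keeps a single row per group, so ties within a group are dropped):

(-> (tc/dataset [{:foo 10 :group 1} {:foo 3 :group 2} {:foo 3 :group 1}])
    (tc/group-by :group)
    (tc/order-by :foo)
    (tc/first)
    (tc/ungroup)
    (tc/rows :as-maps))
;; => e.g. ({:foo 3 :group 1} {:foo 3 :group 2})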

Benjamin 09:07:44

👀 👍

Konrad Claesson 15:07:04

Is it possible to use tech.ml.dataset to process larger-than-memory datasets? It actually seems like Clojure crashes with a java.lang.OutOfMemoryError well before all the memory on my machine is used up. Presumably this is due to some JVM limit. For my current use case, increasing that limit may be a sufficient solution.

jsa-aerial 16:07:33

Yes. You will need to use the Arrow support of TMD. You will get more and quicker responses here: https://clojurians.zulipchat.com/#narrow/stream/151924-data-science and in particular here: https://clojurians.zulipchat.com/#narrow/stream/151924-data-science/topic/tech.2Eml.2Edataset
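
For context, the Arrow pathway looks roughly like this, a sketch per the TMD arrow docs linked below; the file name is made up, and the :open-type :mmap option (memory-map instead of reading into the heap) should be checked against the current docs:

(require '[tech.v3.libs.arrow :as arrow])

;; memory-map an Arrow stream file as a sequence of datasets
(def ds-seq (arrow/stream->dataset-seq "data.arrow" {:open-type :mmap}))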

👍 3
chrisn 17:07:47

Zulip is definitely a better place to discuss this, but there are two ways to process larger-than-memory datasets.

The first is to divide the dataset up into sections that fit in memory and process each section. In this case you are processing a sequence of datasets, and you want to use transducers, specifically eduction, to avoid Clojure's chunking mechanism.

The second is to load the dataset from an Arrow file using the memory-mapping pathway, but even in that case you had to write the file somehow. So usually we just process sequences of datasets. Note that https://techascent.github.io/tech.ml.dataset/tech.v3.libs.parquet.html#var-parquet-.3Eds-seq, https://techascent.github.io/tech.ml.dataset/tech.v3.libs.arrow.html#var-stream-.3Edataset-seq and the https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.io.csv.html#var-csv-.3Edataset-seq loading mechanisms all have ways to load a sequence of datasets from a file. This overall design (sequences of in-memory datasets) is supported specifically in https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.reductions.html.

To increase the usable RAM of the JVM, look into the -Xmx and -Xms options for JVM startup. That pathway works well if you have enough RAM, but we normally just divide the dataset up into million-row batches or something like that and process a sequence of datasets.
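
A minimal sketch of that sequence-of-datasets pattern, using the csv->dataset-seq loader from the linked docs; the file name and the :batch-size value are assumptions:

(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.io.csv :as csv])

;; load a large CSV lazily as a sequence of in-memory datasets, then
;; fold over it with an eduction to sidestep lazy-seq chunking
(def total-rows
  (reduce + 0 (eduction (map ds/row-count)   ;; per-batch work goes here
                        (csv/csv->dataset-seq "big-file.csv" {:batch-size 100000}))))

Raising the heap limit itself is just a startup flag, e.g. clj -J-Xmx8g.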

👍 1