#data-science
2022-07-01
Konrad Claesson 18:07:45

Given a tablecloth dataset like

(tc/dataset [{:name "alice"} {:name "bob"} {:name "bob"}])
| :name |
|-------|
| alice |
|   bob |
|   bob |
how can I count the number of times that each value appears in the :name column? The output I want is
| alice | 1 |
|   bob | 2 |
In SQL terms, I am trying to do
SELECT column, count(*)
FROM table
GROUP BY column

genmeblog 19:07:47

group-by and aggregate

genmeblog 19:07:53

I will provide an example soon (on mobile now)

Konrad Claesson 20:07:13

This works

(->
   (tc/dataset
    [{:name "alice"}
     {:name "bob"}
     {:name "bob"}])
   (tc/group-by [:name])
   (tc/aggregate
    {:name-count (fn [coll] (tc/row-count coll))}))
  _unnamed [2 2]:

| :name | :name-count |
|-------|------------:|
| alice |           1 |
|   bob |           2 |
but it is too slow for my real use-case. My dataframe has 50 columns and 2,333,409 rows. Is this more than tablecloth/tech.ml.dataset can handle? By "too slow" I mean that it did not complete within 10 minutes.

chrisn 20:07:21

@U03A8HUF1C2 - The tech.v3.dataset.reductions namespace (https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.reductions.html#var-group-by-column-agg) has high-performance reductions for larger datasets that avoid intermediate dataset creation and take sequences of datasets. What you are looking for is something like:

tech.v3.dataset.reductions-test> (def ds (ds/->dataset "test/data/stocks.csv"))
#'tech.v3.dataset.reductions-test/ds
tech.v3.dataset.reductions-test> (ds-reduce/group-by-column-agg "symbol" {:num-sym (ds-reduce/row-count)} [ds])
symbol-aggregation [5 2]:

| symbol | :num-sym |
|--------|---------:|
|   AAPL |      123 |
|    IBM |      123 |
|   AMZN |      123 |
|   MSFT |      123 |
|   GOOG |       68 |
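
Mapped back onto the original question, that would look roughly like this (a minimal sketch, assuming the column is the keyword :name and the data is loaded as ds; the small in-memory dataset here just stands in for the real 2.3M-row one):

(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.reductions :as ds-reduce])

;; small stand-in for the real dataset
(def ds (ds/->dataset [{:name "alice"} {:name "bob"} {:name "bob"}]))

;; group-by-column-agg takes the grouping column, an aggregation map,
;; and a sequence of datasets
(ds-reduce/group-by-column-agg :name {:name-count (ds-reduce/row-count)} [ds])

| :name | :name-count |
|-------|------------:|
| alice |           1 |
|   bob |           2 |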

chrisn 20:07:08

Note that for larger datasets it may be better to either subsample or perhaps keep them as a sequence of datasets and use something like eduction along with various operations.
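
Sketched out, that might look like the following (assumptions: tech.v3.dataset.io.csv/csv->dataset-seq is used as the chunked CSV reader, "big-file.csv" is a placeholder path, and the per-chunk filter is just an example transform):

(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.io.csv :as csv]
         '[tech.v3.dataset.reductions :as ds-reduce])

;; read the file as a sequence of smaller datasets instead of one big one
(def ds-seq (csv/csv->dataset-seq "big-file.csv" {:key-fn keyword}))

;; eduction applies per-chunk transforms lazily as the reduction consumes
;; the sequence, e.g. dropping rows where :name is missing
(ds-reduce/group-by-column-agg
 :name
 {:name-count (ds-reduce/row-count)}
 (eduction (map #(ds/filter-column % :name some?)) ds-seq))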

chrisn 20:07:20

Also, parquet will automatically chunk a large dataset into many large-ish chunks (row groups) within the same parquet file. For arrow, you need to do the chunking yourself before writing the dataset sequence out to a file.
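
For example, something along these lines (a sketch using the tech.v3.libs.parquet namespace; it needs the optional Parquet dependencies on the classpath, and "big-file.parquet" is a placeholder):

(require '[tech.v3.libs.parquet :as parquet]
         '[tech.v3.dataset.reductions :as ds-reduce])

;; writing one dataset produces a single file internally split into row groups
(parquet/ds->parquet ds "big-file.parquet")

;; reading it back yields one dataset per row group - a ready-made dataset sequence
(def ds-seq (parquet/parquet->ds-seq "big-file.parquet"))

(ds-reduce/group-by-column-agg :name {:name-count (ds-reduce/row-count)} ds-seq)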

genmeblog 00:07:23

TC might be slow in certain cases; I'll check whether something can be done about it (although 2M rows is not too much, imho). TMD has many more options, as Chris described.

Konrad Claesson 06:07:23

Thank you for the detailed responses. The reductions approach outlined above also does not work, but for a different reason. I start getting a null reference exception the moment I try to perform the aggregation on 64793 rows rather than 64792 (see the snippet below).

(ds-reduce/group-by-column-agg "id" {:id (ds-reduce/row-count)} (ds/head ds 64792))
;; Instantly returns the results

(ds-reduce/group-by-column-agg "id" {:id (ds-reduce/row-count)} (ds/head ds 64793))
;; Instantly fails with a null reference exception
Interestingly, this limit does not seem to be related to the total size of the dataset: after dropping 45 of the 50 columns, the breaking point is still 64793 rows. The stack trace for the null reference exception can be found here: https://pastebin.com/qX68G6f2

Konrad Claesson 07:07:16

I think that the issue may be that my source CSV contains commas in some columns, but also uses comma as a separator.

✔️ 1
Konrad Claesson 08:07:39

The issue was indeed the input data. With commas properly escaped, everything works smoothly.
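
For reference, the quoting rule looks like this (a minimal sketch; the data is made up, and ->dataset is fed an InputStream with an explicit :file-type, which is one way tech.ml.dataset accepts in-memory CSV):

(require '[tech.v3.dataset :as ds])

;; a field containing the separator must be wrapped in double quotes,
;; otherwise the parser sees an extra column on that row:
;;   1,"Claesson, Konrad"   <- two fields, ok
;;   2,Claesson, Konrad     <- three fields, broken row
(def csv-text "id,name\n1,\"Claesson, Konrad\"\n2,\"Doe, Jane\"\n")

(ds/->dataset (java.io.ByteArrayInputStream. (.getBytes csv-text))
              {:file-type :csv :key-fn keyword})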

bananadance 1