#data-science
2022-07-02
Benjamin08:07:33

Jo, is ds/mapseq-reader the correct thing to use when, for example, I want to provide data for #clerk? (I have a dataset and I'd like a seq of maps.) Yeah, OK, I figure the docstring literally says my description 😅

Konrad Claesson12:07:48

How can I perform an aggregation like

SELECT
    category,
    stragg(id, ', ') AS ids
FROM table
GROUP BY category
using tech.ml.dataset? stragg is an imaginary function that aggregates all values in the id column into a comma-separated string. For example, given a dataset like
(ds/->dataset [{"id" 1, "name" "bob"} {"id" 2, "name" "bob"}, {"id" 3, "name" "alice"}])
|  name | id |
|-------|---:|
|   bob |  1 |
|   bob |  2 |
| alice |  3 |
I would like to create a dataset like
| name  | ids  |
|-------+------|
| bob   | 1, 2 |
| alice | 3    |

genmeblog14:07:44

tc/fold-by is your friend here:

genmeblog14:07:50

(-> (tc/dataset [{"id" 1, "name" "bob"} {"id" 2, "name" "bob"}, {"id" 3, "name" "alice"}])
    (tc/fold-by ["name"] (partial str/join ", ")))

;; => _unnamed [2 2]:
;;    |  name |   id |
;;    |-------|------|
;;    |   bob | 1, 2 |
;;    | alice |    3 |

chrisn14:07:12

And there is the reducer (https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.reductions.html#var-reducer) for larger datasets, which avoids constructing the intermediate column values:

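;; Evaluated in the tech.v3.dataset.reductions-test namespace;
;; ds-reduce is an alias for tech.v3.dataset.reductions.
(ds-reduce/group-by-column-agg
 "name"
 {"id" (ds-reduce/reducer
        "id"
        ;; Accumulate ids into one StringBuilder per group.
        (fn [ctx val]
          (let [first? (nil? ctx)
                ^StringBuilder ctx (or ctx (StringBuilder.))]
            (when-not first? (.append ctx ", "))
            (.append ctx val)))
        ;; Finalize each group's StringBuilder into a string.
        #(.toString ^Object %))}
 [ds])
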
name-aggregation [2 2]:

|  name |   id |
|-------|------|
|   bob | 1, 2 |
| alice |    3 |

Konrad Claesson14:07:55

This works great, but on my real dataset I get

1. Unhandled java.lang.Exception
   Column appId has value whose length (109583) is greater than max-chars-per-column (65536).
when trying to export it to a CSV file using ds/write!. Is there any workaround? CIDER also can't show a preview of the dataset because the column values are too long.