data-science 2022-07-02 | Slack Archive

Benjamin08:07:33

Jo, is ds/mapseq-reader the correct thing to use when, for example, I want to provide data for #clerk? (I have a dataset and I'd like a seq of maps.) Yeah OK, I see the docstring literally matches my description 😅
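For reference, a minimal sketch of that conversion, assuming tech.ml.dataset is on the classpath. The dataset `people` and the name `rows` are made up for illustration; `ds/mapseq-reader` is the function named in the message above.

```clojure
(require '[tech.v3.dataset :as ds])

;; Hypothetical example data; any dataset works the same way.
(def people
  (ds/->dataset [{:id 1 :name "bob"}
                 {:id 2 :name "alice"}]))

;; One map per row, keyed by column name.
(def rows (ds/mapseq-reader people))
```

`rows` can then be handed to anything that consumes a sequence of maps.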

Konrad Claesson12:07:48

How can I perform an aggregation like

SELECT
    category,
    stragg(id, ', ') AS ids
FROM table
GROUP BY category

using tech.ml.dataset? stragg is an imaginary function that aggregates all values in the id column into a comma-separated string. For example, given a dataset like

(ds/->dataset [{"id" 1, "name" "bob"} {"id" 2, "name" "bob"}, {"id" 3, "name" "alice"}])
|  name | id |
|-------|---:|
|   bob |  1 |
|   bob |  2 |
| alice |  3 |

I would like to create a dataset like

| name  | ids  |
|-------+------|
| bob   | 1, 2 |
| alice | 3    |

genmeblog14:07:44

tc/fold-by is your friend here:

genmeblog14:07:50

(require '[tablecloth.api :as tc] '[clojure.string :as str])

(-> (tc/dataset [{"id" 1, "name" "bob"} {"id" 2, "name" "bob"}, {"id" 3, "name" "alice"}])
    (tc/fold-by ["name"] (partial str/join ", ")))

;; => _unnamed [2 2]:
;;    |  name |   id |
;;    |-------|------|
;;    |   bob | 1, 2 |
;;    | alice |    3 |
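The folded column keeps its original name, `id`, whereas the question asked for `ids`. A minimal sketch of one way to get that header, assuming tablecloth is on the classpath and using `tc/rename-columns` as an extra step that was not part of the answer above; the names `people` and `ids-by-name` are illustrative only.

```clojure
(require '[tablecloth.api :as tc]
         '[clojure.string :as str])

(def people
  (tc/dataset [{"id" 1, "name" "bob"}
               {"id" 2, "name" "bob"}
               {"id" 3, "name" "alice"}]))

(def ids-by-name
  (-> people
      ;; Collect the non-grouping columns per name and join them into a string.
      (tc/fold-by ["name"] (partial str/join ", "))
      ;; Rename the folded column to match the requested output.
      (tc/rename-columns {"id" "ids"})))
```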

chrisn14:07:12

And there is the reducer from the reductions namespace (https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.reductions.html#var-reducer) for larger datasets, which avoids constructing the intermediate column values:

tech.v3.dataset.reductions-test> (ds-reduce/group-by-column-agg 
                                  "name"
                                  {"id" (ds-reduce/reducer 
                                         "id" 
                                         (fn [ctx val]
                                           (let [first? (nil? ctx)
                                                 ^StringBuilder ctx (or ctx (StringBuilder.))]
                                             (when-not first? (.append ctx ", "))
                                             (.append ctx val)))
                                         #(.toString ^Object %))}
                                  [ds])
                                                           
                                                                
name-aggregation [2 2]:

|  name |   id |
|-------|------|
|   bob | 1, 2 |
| alice |    3 |

Konrad Claesson14:07:55

This works great, but on my real dataset I get

1. Unhandled java.lang.Exception
   Column appId has value whose length (109583) is greater than max-chars-per-column (65536).

when trying to export it to a CSV file using ds/write!. Is there a workaround? CIDER also can't show a preview of the dataset because the column values are too long.

Benjamin14:07:17

For the preview, https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.print.html I guess. See the print options here: https://techascent.github.io/tech.ml.dataset/quick-reference.html
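A hedged sketch of what those print options can look like in practice. It assumes the `:print-column-max-width` and `:print-line-policy` metadata keys described in the linked print documentation; the dataset `wide` and the chosen width are made up for illustration.

```clojure
(require '[tech.v3.dataset :as ds])

;; Hypothetical dataset with one very long string cell (1500 characters).
(def wide
  (ds/->dataset [{:name "bob" :ids (apply str (repeat 500 "1, "))}]))

;; Print options are read from the dataset's metadata, so the stored
;; values stay untouched; only the printed table changes.
(def narrow
  (vary-meta wide assoc
             :print-column-max-width 20   ; assumed option: max printed column width
             :print-line-policy :single)) ; assumed option: print only the first line per cell

(println narrow)
```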

Carsten Behring17:07:23

max-chars-per-column can be changed in the options passed to write!: https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.html#var-write.21
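A minimal sketch of that workaround, assuming the `:max-chars-per-column` option documented for `write!` at the link above. The dataset, the file name `long-ids.csv`, and the limit of 200000 are illustrative choices, not values from the thread.

```clojure
(require '[tech.v3.dataset :as ds])

;; Hypothetical dataset with one cell longer than the 65536 limit
;; quoted in the error message.
(def long-ids
  (ds/->dataset [{"name" "bob"
                  "appId" (apply str (repeat 70000 "x"))}]))

;; Raise the per-cell limit above the longest value before writing.
(ds/write! long-ids "long-ids.csv" {:max-chars-per-column 200000})
```

Reading such a file back may need a similar limit raised on the parsing side.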

