#data-science
2022-12-07
Eric Dvorsak 08:12:33

Is there any way to make aggregates that produce more than one row with tablecloth or dataset?

otfrom 08:12:37

row-mapcat?

otfrom 08:12:56

(I may not understand the problem)

Eric Dvorsak 08:12:58

I have a table of user answers, and I want to calculate the daily progress of each user. So I'm grouping by user, ordering by day, and want to produce a row for each day with the aggregation of that day's answers plus the previous day's result. A wrong answer resets the progress to 1, e.g.:

day 1: t t f -> 1
day 2: t f t t -> 3
day 3: t t -> 5

So I need a reduction on each user (simplifying the problem for the example).
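The rule described above can be sketched in plain Clojure. This is only an illustration of the reduction, not a tablecloth solution: the row keys `:user`, `:day`, and `:correct?` and all function names are assumptions, and progress is assumed to start at 0 before the first answer.

```clojure
;; Assumption: each row is a map with hypothetical keys :user, :day, :correct?,
;; and rows within a day arrive in answer order.

(defn step
  "A correct answer adds 1; a wrong answer resets progress to 1."
  [progress correct?]
  (if correct? (inc progress) 1))

(defn daily-progress
  "days is a seq of [day answers] pairs ordered by day.
   Returns one {:day :progress} map per day, carrying progress across days."
  [days]
  (let [totals (rest (reductions (fn [p [_ answers]] (reduce step p answers))
                                 0
                                 days))]
    (mapv (fn [[day _] p] {:day day :progress p}) days totals)))

(defn progress-by-user
  "Groups rows by user, orders them by day, and computes daily progress per user."
  [rows]
  (into {}
        (map (fn [[user user-rows]]
               [user (->> (sort-by :day user-rows)   ; stable sort keeps answer order
                          (partition-by :day)
                          (map (fn [rs] [(:day (first rs)) (map :correct? rs)]))
                          daily-progress)]))
        (group-by :user rows)))

(daily-progress [[1 [true true false]]
                 [2 [true false true true]]
                 [3 [true true]]])
;; => [{:day 1, :progress 1} {:day 2, :progress 3} {:day 3, :progress 5}]
```

`reductions` is used instead of `reduce` because every intermediate value (one per day) is part of the result, which is what makes the output have more than one row per user.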

Eric Dvorsak 08:12:11

So far I managed to do it by adding a progress column in tablecloth, but I have 2 million answers, and for the next step I need to extrapolate the progress to the missing days. The solutions I tried are too slow (I didn't let them finish after an hour). A vanilla Clojure solution with map and reduce, building a vector of n (number of days) elements, only took 8 seconds for this dataset, 7 seconds of which were spent getting the 2 million rows from the database with next.jdbc.
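A minimal sketch of the "vector of n elements" idea, in plain Clojure. The message does not say how missing days are extrapolated, so this assumes the simplest rule: a day with no answers carries the previous day's progress forward. Days are assumed to be 0-based integer indices, and `fill-missing-days` is a hypothetical name.

```clojure
(defn fill-missing-days
  "known is a map of day index -> progress for days that have answers.
   Returns a vector of n elements; a missing day repeats the previous value
   (assumed carry-forward rule), starting from 0 before the first known day."
  [n known]
  (vec (rest (reductions (fn [prev day] (get known day prev))
                         0
                         (range n)))))

(fill-missing-days 5 {0 1, 1 3, 3 5})
;; => [1 3 3 5 5]
```

This is a single linear pass over the day range, which is consistent with the vanilla approach being fast even for millions of rows.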

genmeblog 10:12:37

Tablecloth can aggregate a grouped dataset. So first you apply group-by and then aggregate. This way you will have an aggregation per day. See some examples here: https://scicloj.github.io/tablecloth/index.html#Aggregate

Eric Dvorsak 10:12:51

but I need to know the result of the aggregate of the previous day to start the aggregate of each day

genmeblog 10:12:49

OK... then the first approach should be fine: group by user, order by day, and then apply a custom reduction function.

genmeblog 10:12:05

Tablecloth (TC) operations are very similar to SQL, so if it's hard to do in SQL it can be hard in TC as well.

genmeblog 10:12:01

Anyway, digesting 2 million records should be really fast.

Eric Dvorsak 16:12:24

I guess my issue is that I don't see how you'd write a custom reduction function that lets you replace a dataset subgroup with another.

genmeblog 17:12:59

When you group by the pair (user, day), you reduce many rows to one, right?

genmeblog 17:12:29

Then you ungroup and group by user to reduce again.
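One way to read this two-stage suggestion, sketched in plain Clojure rather than with the tablecloth API: first reduce each (user, day) group to a small summary, then fold the summaries per user in day order. The summary shape (`:reset?`, `:tail`) and the function names are assumptions made for illustration. It relies on the observation that, under the reset-to-1 rule, a day's effect depends only on whether it contains a wrong answer and on how many correct answers follow the last wrong one.

```clojure
(defn summarize-day
  "Stage 1: reduce one day's answers (booleans, in order) to a single summary."
  [answers]
  {:reset? (boolean (some false? answers))                  ; any wrong answer that day?
   :tail   (count (take-while true? (reverse answers)))})   ; correct answers after the last wrong one

(defn apply-day
  "Stage 2 step: combine the previous day's progress with one day summary."
  [prev {:keys [reset? tail]}]
  (if reset?
    (+ 1 tail)       ; the last wrong answer reset progress to 1
    (+ prev tail)))  ; no reset: every answer that day was correct

(defn progress-per-day
  "Folds day summaries (ordered by day) for one user, keeping every day's value."
  [day-summaries]
  (vec (rest (reductions apply-day 0 day-summaries))))

(progress-per-day (map summarize-day [[true true false]
                                      [true false true true]
                                      [true true]]))
;; => [1 3 5]
```

The second stage still has to keep one value per day, so it is a running reduction rather than an aggregate that collapses each user to a single row.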

Eric Dvorsak 17:12:16

I want the progress of every day, and it needs to take the previous day into account