hi! Is there an initiative similar to cuDF in Python? (GPU accelerated dataframe)
Hi everybody ... I have another tech.ml.dataset question. I do not appear to really understand what's going on with ds/row-map. I'm trying to derive missing values in one column based on the contents of another (details in thread).
> I have a code convention I try to stick to with ->> and -> where I'm not allowed to change the shape mid-thread
Keeping this stuff simple is definitely a good idea (especially when we expect someone to need to read the code later).
One area where this pops up for me a lot is with ds/group-by-* which produces a map instead of a dataset, and I find that I'm often -> into ds/group-by-* and then want to ->> from there... Most commonly, I actually just do this, and nest them. Maybe the nesting sort of 'follows' the change of shape?
I find my own style has evolved to extracting either a let binding or a function once the (-> ,,, (->> ,,,)) gets going, based on past regrets.
or, (as-> x $ ,,,) is a catch-all signal for "here be cleverness; consider some more coffee first"
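(A small sketch of the two styles discussed here, in plain Clojure so it runs without the library. `keep-positive` and `group-rows` are hypothetical helpers; `group-rows` only stands in for something like ds/group-by-column, in that it takes the collection first and hands back a map instead of the shape that went in.)

```clojure
;; Hypothetical collection-first helpers, mimicking a dataset-first API.
(defn keep-positive [rows]
  (filterv (comp pos? :n) rows))

(defn group-rows [rows k]
  ;; Shape change: seq of maps in, map of key -> rows out.
  (group-by k rows))

(defn sum-groups [groups]
  (->> groups
       (map (fn [[kind rs]] [kind (reduce + (map :n rs))]))
       (into {})))

;; Style 1: nest ->> inside ->, so the nesting follows the change of shape.
(defn totals-nested [rows]
  (-> rows
      keep-positive
      (group-rows :kind)
      (->> (map (fn [[kind rs]] [kind (reduce + (map :n rs))]))
           (into {}))))

;; Style 2: name the intermediate value in a let, then start a fresh thread.
(defn totals-let [rows]
  (let [groups (-> rows
                   keep-positive
                   (group-rows :kind))]
    (sum-groups groups)))

(def sample-rows
  [{:kind :a :n 1} {:kind :b :n 2} {:kind :a :n 3} {:kind :b :n -5}])
```

Both functions compute the same totals; the let version just gives the point where the shape changes a name, which is the part that tends to get lost in a long thread.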
ohhh, ok, yeah ... this is not the first time I've confused myself by not lifting my brain up into thinking about whole datasets at once
I find it pretty easy to accidentally flip over from column-thinking (ds) into row-thinking (seq of maps)
I have a code convention I try to stick to with ->> and -> where I'm not allowed to change the shape mid-thread
I'm mulling one where maybe I just don't thread datasets and instead it has to be (ds/bind-> ,,,) as a signal to be more careful
Thought about this a little more and the 'surprising' behavior ultimately makes sense. I put some comments in the issue: https://github.com/techascent/tech.ml.dataset/issues/452
If anyone has ideas about how to improve the row-map docs, there's probably room for improvement there.
I had understood a dataset as something I could thread and run row-map on in multiple passes. But it seems like I've accidentally set up a "last pass wins" situation ... I'd have expected the other article numbers to fill in but they stay blank (see 2 in code here).
Also, I'm trying to understand these prn statements ... it makes sense to me that each row-map is running 1x per row, so 9 statements, but I'm struggling to understand what I'm doing that's blowing away previously added data.
hm, never mind ... it seems that returning the whole row gives me the behaviour I want. Now I'm trying to figure out where I got the idea that I could return just a patch from ds/row-map
Perhaps some of the confusion is coming from the fact that backfill-description sometimes returns nil (!)...
could be ... where it got odd and confusing is that I had another row-map function in a when (so it could return nil), BUT the patch that it returned didn't collide with any existing column names, ie. it added columns instead of mutating ones that existed before the row-map
oh here's where I got the idea of just returning the columns that I was interested in changing
anyway, I can assoc in; no big deal
In general, you don't need to return a map with all the keys:
user> (-> (ds/->dataset {:a [1 2 3]})
          (ds/row-map (fn [{:keys [a]}]
                        {:b (inc a)})))
_unnamed [3 2]:
| :a | :b |
|---:|---:|
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
user> (-> (ds/->dataset {:a [1 2 3]})
          (ds/row-map (fn [{:keys [a]}]
                        (when (= 2 a)
                          {:b (inc a)}))))
_unnamed [3 2]:
| :a | :b |
|---:|---:|
| 1 | |
| 2 | 3 |
| 3 | |
right, and both of those examples are adding a :b column where none existed before the row-map
Ah! Perhaps this is the surprising behavior:
user> (-> (ds/->dataset {:a [1 2 3]})
          (ds/row-map (fn [{:keys [a]}]
                        (when (= 2 a)
                          {:a 0 :b (inc a)}))))
_unnamed [3 2]:
| :a | :b |
|---:|---:|
| | |
| 0 | 3 |
| | |
yes!
hm, happens even if it's the first row:
user> (-> (ds/->dataset {:a [1 2 3]})
          (ds/row-map (fn [{:keys [a]}]
                        (when (= 1 a)
                          {:a 0 :b (inc a)}))))
_unnamed [3 2]:
| :a | :b |
|---:|---:|
| 0 | 2 |
| | |
| | |
I could be wrong but it seems like the distinction is in touching the existing column, versus adding a new one?
Unsure if this is satisfying:
user> (-> (ds/->dataset {:a [1 2 3]})
          (ds/row-map (fn [{:keys [a] :as row}]
                        (if (= 1 a)
                          {:a 0 :b (inc a)}
                          row))))
_unnamed [3 2]:
| :a | :b |
|---:|---:|
| 0 | 2 |
| 2 | |
| 3 | |
Seems to be 'touching the existing column sometimes and returning nil sometimes'...
well, it seems like returning a complete row unconditionally gets me over some very confusing behaviour, maybe at some hard to measure performance cost
but my datasets are in the 10k-100k lines range here so it's fine
yeah, row-map is fast, and returning an unchanged row is likely to be fast as well.
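(One way to package the "always return the complete row" workaround is a tiny wrapper, sketched below. `with-full-row` and `zero-first` are hypothetical names, not part of tech.ml.dataset. The sketch assumes the row handed to the function supports `assoc`, as the "I can assoc in" remark above suggests. The checks use plain maps; the ds/row-map call sits in a comment block and is not evaluated here.)

```clojure
(defn with-full-row
  "Wraps patch-fn (row -> partial map or nil) so the result always
  returns the whole row with the patch applied on top."
  [patch-fn]
  (fn [row]
    ;; reduce-kv over a nil patch returns row unchanged.
    (reduce-kv assoc row (patch-fn row))))

(def zero-first
  (with-full-row
    (fn [{:keys [a]}]
      (when (= 1 a)
        {:a 0 :b (inc a)}))))

(zero-first {:a 1}) ;; => {:a 0, :b 2}
(zero-first {:a 2}) ;; => {:a 2}

(comment
  ;; Assumed usage with the library; not evaluated as part of this sketch.
  (require '[tech.v3.dataset :as ds])
  (-> (ds/->dataset {:a [1 2 3]})
      (ds/row-map zero-first)))
```

The patch function can keep returning nil or a partial map, and the wrapper turns that into the unconditional full-row return that sidestepped the confusing results earlier in the thread.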
maybe it all just mixes badly with missing values
it could be a bug, we can dig into it: https://github.com/techascent/tech.ml.dataset/issues/452 Thanks for bringing it up!
yw! Thanks for talking it through with me
glad there's a sensible workaround as well.
(love the lib, by the by ... been replacing a bunch of brittle docjure with 1/3 to 1/2 the LOC by keeping things as datasets as long as possible)
Nice! To me this is always one of the selling points - once things are up and running it's easy to consume a variety of formats. Years ago we wrote a lot of one-off stuff to ingest excel over and over, now (like you're saying) we do it a lot more quickly and reliably.
perhaps of interest: I was delighted to find that I can rename these columns with the stacked headers with a column header map of eg. {:arrangement "2 Screw" :spacing-vertical 16.0} or {:arrangement "2 Screw" :spacing-vertical 24.0}, and then it's a fairly straightforward mapcat to unroll the row into one map per cell.
That sounds rad - we've also found it very powerful to have the ability to do Object columns with arbitrary clojure data in them. The fact that .nippy serialization can maintain these kinds of arrangements with acceptable freeze/thaw times gives us flexibility that isn't frequently seen in other systems/languages.
all hail value equality semantics