Getting started using dtype-next and tech.ml.dataset... really awesome stuff, thank you @chris441!
One question I can't really answer from the docs... is there any kind of optimized support for columns whose values are arrays (i.e., the whole column is a tensor)? Or is an object column containing an array the best approach at this time?
There is a tensor datatype. https://cnuernber.github.io/dtype-next/tech.v3.tensor.html
Yes, but you can't use an instance of it as a column:
Execution error (IllegalArgumentException) at tech.v3.dataset.io.mapseq-colmap/column-map->dataset$fn (mapseq_colmap.clj:122).
No matching clause: :tensor
You can't indeed. But you can convert a dataset to a tensor and back.
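For reference, a minimal sketch of that dataset/tensor round trip, assuming the `tech.v3.dataset.tensor` namespace and its `tensor->dataset` / `dataset->tensor` functions are available in the versions being used. The aliases and var names (`ds-tens`, `tens-ds`, `round-trip`) are made up for illustration, and the tensor is created as `:float64` here to keep the conversion back simple:

```clojure
(require '[tech.v3.datatype :as dtype]
         '[tech.v3.tensor :as dtt]
         '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.tensor :as ds-tens])

;; A 12x3 tensor; :float64 is an assumption made for this sketch.
(def tens (dtt/->tensor (partition 3 (range 36)) :datatype :float64))

;; Tensor -> dataset: each tensor column becomes one dataset column.
(def tens-ds (ds-tens/tensor->dataset tens))

;; Dataset -> tensor: dataset columns become the tensor's columns again.
(def round-trip (ds-tens/dataset->tensor tens-ds))

;; Inspect the sizes on both sides of the conversion.
(println (ds/row-count tens-ds) (ds/column-count tens-ds))
(println (dtype/shape round-trip))
```

This gives one numeric column per tensor column rather than a single column of tensors, so it answers a slightly different question than the one asked above.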
That is a bug - mapseq-colmap should support the tensor argtype 🙂 - for its purposes it should interpret that as a reader.
Well - I guess it depends which dimension you want the dataset to expose as the column rows.
user> (def tens (dtt/->tensor (partition 3 (range 36))))
#'user/tens
user> tens
#tech.v3.tensor<object>[12 3]
[[ 0 1 2]
[ 3 4 5]
[ 6 7 8]
[ 9 10 11]
[12 13 14]
[15 16 17]
[18 19 20]
[21 22 23]
[24 25 26]
[27 28 29]
[30 31 32]
[33 34 35]]
user> (ds/->dataset {:a tens})
_unnamed [1 1]:
| :a |
|-------------------------------|
| #tech.v3.tensor<object>[12 3] |
| [[ 0 1 2] |
| [ 3 4 5] |
| [ 6 7 8] |
| [ 9 10 11] |
| [12 13 14] |
| [15 16 17] |
| [18 19 20] |
| [21 22 23] |
| [24 25 26] |
| [27 28 29] |
| [30 31 32] |
| [33 34 35]] |
user> (ds/->dataset {:a (dtt/rows tens}))
Syntax error reading source at (REPL:81:40).
Unmatched delimiter: }
user> (ds/->dataset {:a (dtt/rows tens)})
_unnamed [12 1]:
| :a |
|----------------------------|
| #tech.v3.tensor<object>[3] |
| [0 1 2] |
| #tech.v3.tensor<object>[3] |
| [3 4 5] |
| #tech.v3.tensor<object>[3] |
| [6 7 8] |
| #tech.v3.tensor<object>[3] |
| [9 10 11] |
| #tech.v3.tensor<object>[3] |
| [12 13 14] |
| #tech.v3.tensor<object>[3] |
| [15 16 17] |
| #tech.v3.tensor<object>[3] |
| [18 19 20] |
| #tech.v3.tensor<object>[3] |
| [21 22 23] |
| #tech.v3.tensor<object>[3] |
| [24 25 26] |
| #tech.v3.tensor<object>[3] |
| [27 28 29] |
| #tech.v3.tensor<object>[3] |
| [30 31 32] |
| #tech.v3.tensor<object>[3] |
| [33 34 35] |
user> (ds/->dataset {:a (dtt/columns tens)})
_unnamed [3 1]:
| :a |
|------------------------------------|
| #tech.v3.tensor<object>[12] |
| [0 3 6 9 12 15 18 21 24 27 30 33] |
| #tech.v3.tensor<object>[12] |
| [1 4 7 10 13 16 19 22 25 28 31 34] |
| #tech.v3.tensor<object>[12] |
| [2 5 8 11 14 17 20 23 26 29 32 35] |
user>
When you say it's a bug, do you mean it'd be helpful for me to put together a minimal test case and submit a report (and possibly try to debug it myself)? Or do you have a pretty good idea of what's going on? (Using dtt/rows is a good workaround though, at least semantically... I don't have a big enough brain to reason about whether there are perf gains to be had from using a true tensor.)
Creating an issue here that points to this thread might already help: https://github.com/techascent/tech.ml.dataset/issues
Data analysis and visualization in this report brought to you by tablecloth, vega-lite, and clerk: https://hellgatenyc.com/nypd-shotspotter-data-report/
Do you have a way to unlock the article so we can read it?
https://bds.org/assets/files/Brooklyn-Defenders-ShotSpotter-Report.pdf
Thanks, and great work! Loved your conj talk
NICE!