#core-matrix
2016-03-09
otfrom13:03:02

rickmoynihan and mikera I've been thinking about plumatic schema and core.matrix (esp datasets) lately. Any thoughts on how to do something like that?

otfrom13:03:19

it is soooo easy to do with a vector of maps - it would be nice to be able to do it on a dataset
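
For reference, the vector-of-maps case looks something like this (a minimal sketch; the Row schema and sample data are illustrative):

(require '[schema.core :as s])

(def Row {:a s/Num :b s/Num :c s/Num})

(s/validate [Row] [{:a 10 :b 11 :c 12}
                   {:a 20 :b 21 :c 22}])
;; => returns the data unchanged when every row conforms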

rickmoynihan13:03:26

@otfrom: funny you should say that... we've been having similar discussions at Swirrl

rickmoynihan13:03:51

I've not made much use of core.matrix yet... but we use incanter datasets in grafter quite a lot, though our use case is a little different. Basically incanter/core.matrix like to load the whole dataset into memory... but because we want to use it for ETL, we've been trying to avoid that and instead keep a lazy-seq of :rows in the Dataset

rickmoynihan13:03:52

but this means that validation of the rows at least is somewhat delayed - because you don't want to have to consume everything all the time in order to validate the rows
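
One way to get that delayed validation (a sketch, not Grafter's actual code) is to check each row only as it is realised, so a lazy seq of rows stays lazy:

(require '[schema.core :as s])

(defn validating-rows
  "Wraps a (possibly lazy) seq of rows so each one is schema-checked
  only when something actually consumes it."
  [row-schema rows]
  (map (partial s/validate row-schema) rows))

Because map is lazy, a malformed row only throws when it is reached.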

rickmoynihan13:03:21

but my problem with incanter Datasets is that they allow arbitrary types for keys. From my perspective this can cause a lot of problems and it'd be much nicer if they were always keywords (though I'd accept always strings too) - allowing them to be either causes problems with equality
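
The equality problem is easy to reproduce at the REPL:

(= {:a 1} {"a" 1})  ;; => false
(get {"a" 1} :a)    ;; => nil - a keyword lookup misses a string key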

rickmoynihan13:03:34

I'd like to move away from incanter though - and perhaps define a Dataset type of our own that conforms to the core.matrix Dataset protocol

rickmoynihan13:03:48

I'd also like to experiment with perhaps a reducer based implementation... but I've not seen many examples of people using reducers for I/O

otfrom14:03:45

rickmoynihan: AFAIK incanter 1.9 (and later 2.x) uses core.matrix.dataset

otfrom14:03:19

rickmoynihan: iota is a good one to look at for reducers and I/O https://github.com/thebusby/iota
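
iota's basic pattern, roughly as shown in its README (the file path here is illustrative):

(require '[iota]
         '[clojure.core.reducers :as r])

;; Memory-map the file, then fold over its lines in parallel.
(->> (iota/vec "/path/to/file.tsv")
     (r/filter identity)  ; empty lines come back as nil; drop them
     (r/map count)        ; e.g. line lengths
     (r/fold +))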

otfrom14:03:37

I really should look again at what you have done in grafter

rickmoynihan15:03:10

@otfrom: yeah I know incanter plans to use core.matrix.... but incanter 1.9 is basically a snapshot release... and there's been almost no movement on incanter for a long time as far as I can see

rickmoynihan15:03:49

I've actually been looking at iota - it's one of the few examples of reducers and I/O that I've found - from the little I've seen it seems to assume too much about the file parsing...

rickmoynihan15:03:07

but I need to look at it in more depth

otfrom15:03:57

@rickmoynihan: got schema and core.matrix.datasets working together

otfrom15:03:25

;; assumes (require '[schema.core :as s]
;;                  '[clojure.core.matrix.dataset :as ds])
matty.core> (def DataSet {:column-names [s/Keyword]
                          :columns [[s/Num]]
                          :shape [(s/one s/Num "x-shape")
                                  (s/one s/Num "y-shape")]})
;; => #'matty.core/DataSet
matty.core> (def foo (ds/dataset [:a :b :c] [[10 11 12] [20 21 22]]))
;; => #'matty.core/foo
matty.core> (s/validate DataSet foo)
;; => {:column-names [:a :b :c], :columns [[10 20] [11 21] [12 22]], :shape [2 3]}

otfrom15:03:42

not quite sure what my problem was before

otfrom15:03:07

ds is [clojure.core.matrix.dataset :as ds]

otfrom15:03:29

so I just need to constrain the column-names to the keywords I want

otfrom15:03:45

not sure if I can do coercion yet, but at least I can do validation
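
Coercion is possible with plumatic schema via schema.coerce - a minimal sketch, reusing the DataSet schema from the REPL session above and assuming string column names should become keywords:

(require '[schema.coerce :as coerce])

(def coerce-dataset
  (coerce/coercer DataSet coerce/string-coercion-matcher))

(coerce-dataset {:column-names ["a" "b" "c"]
                 :columns [[10 20] [11 21] [12 22]]
                 :shape [2 3]})
;; => {:column-names [:a :b :c], ...} on success;
;;    a schema.utils.ErrorContainer on failure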

rickmoynihan15:03:05

I noticed the other day that core.matrix has a column-wise representation now - I'm guessing the protocol doesn't require that

rickmoynihan15:03:18

Regarding Grafter - our use cases are probably a little different. Firstly we have to avoid using incanter quite a lot, because incanter is eager... so the API isn't as expressive as what incanter provides... again we've been preferring laziness to eagerness (though that brings its own problems, for sure). Also the main idea with Grafter was to support an OpenRefine-like interface for building transformations - so the DSL functions are intentionally dumbed down for those reasons. Also syntactically we thread the ds through the first argument of functions rather than the last - mainly because I wanted the option of optional arguments on dataset functions in the DSL. The basic idea is each step in a -> is an undo point - allowing stepwise debugging at the granularity of the dataset functions via the UI
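
The thread-first shape described here looks roughly like this - the functions below are tiny illustrative stand-ins, not Grafter's actual implementations:

(defn drop-rows [ds n]
  (update ds :rows #(drop n %)))

(defn derive-column [ds new-col from-cols f]
  (-> ds
      (update :column-names conj new-col)
      (update :rows (fn [rows]
                      (map #(assoc % new-col (apply f (map % from-cols)))
                           rows)))))

;; Each step takes the dataset first, leaving room for optional trailing
;; arguments, and each step is a potential undo point in the UI.
(-> {:column-names [:price :qty]
     :rows [{:price 2 :qty 3} {:price 5 :qty 4}]}
    (drop-rows 1)
    (derive-column :total [:price :qty] *))
;; => {:column-names [:price :qty :total],
;;     :rows ({:price 5, :qty 4, :total 20})}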

otfrom15:03:14

I'm still a wiki gnome at heart ❤️

rickmoynihan15:03:46

hmm the api docs also need updating - we're on 0.7.0 now

rickmoynihan15:03:15

@otfrom: You can see a prototype Grafter UI that was built by SINTEF (an FP7 project partner from the Dapaas project) as the centrepiece of this: https://datagraft.net/ Basically we'd decided to build Grafter to support a UI that we were planning to build as part of our product - but we didn't have enough resources in the project, so we let SINTEF (a Norwegian research institution) build a prototype UI for us... They took a fair bit of direction from us about what to do and how to do it - and they did a pretty good job - but it's very much prototype quality

rickmoynihan15:03:44

there's a youtube video on that page you can watch

otfrom15:03:24

cool. Will have a look

rickmoynihan16:03:12

One thing I've been wondering is whether it'd also be possible to have reducer/channel/seq (and therefore transducer) backed datasets... from your core.matrix experience, would this be possible on the API?

rickmoynihan16:03:28

I mean now that there's a Dataset protocol presumably you could do this

otfrom16:03:07

IIRC mikera was suggesting getting at each row and processing it that way in a transducer (and I presume) reducer style
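
Row-at-a-time in transducer style might look like this (a sketch; foo is the dataset from the earlier REPL session, and row-maps comes from clojure.core.matrix.dataset):

(require '[clojure.core.matrix.dataset :as ds])

(def xf (comp (filter #(> (:a %) 10))
              (map :b)))

(into [] xf (ds/row-maps foo))
;; => [21] - the :b value of the one row whose :a exceeds 10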

otfrom16:03:35

I think it partly comes down to whether or not the backing matrix implementation is faster than the trans|re ducer would be

otfrom16:03:52

as a lot of the performance stuff is baked into the matrix implementations themselves

rickmoynihan17:03:38

that's my understanding too... As I said though - our use case is perhaps a little different - in that firstly there isn't really a suitable backing matrix implementation that I know of - and we want to avoid loading the file into RAM - so the idea in e.g. grafter 2 would be to just build up a reducer inside the Dataset somehow - I'm guessing we could perhaps use the IReduce/CollReduce protocols for this... as our representation is currently #Dataset {:rows (...) :column-names [:foo :bar :baz]} and we'd need operations to keep the column names in sync with the row data. I'm not quite sure how we could back it with a transducer yet though... as I'm not sure there are protocols for that
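
A minimal sketch of that idea, assuming clojure.core.protocols/CollReduce is the protocol meant (LazyDataset and its fields are illustrative, not Grafter's types):

(require '[clojure.core.protocols :as p])

;; A dataset whose :rows can be any reducible thing; reducing the
;; dataset reduces its rows without realising them all up front.
(defrecord LazyDataset [column-names rows]
  p/CollReduce
  (coll-reduce [_ f] (reduce f rows))
  (coll-reduce [_ f init] (reduce f init rows)))

(def lazy-ds (->LazyDataset [:a :b]
                            (map (fn [i] {:a i :b (* 2 i)}) (range 1000))))

(reduce (fn [acc row] (+ acc (:a row))) 0 lazy-ds)
;; => 499500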

rickmoynihan17:03:32

but it'd be very cool if you could switch a dataset between being pull based, push based, reducible/foldable, and sequence-able - but I think I need to learn a lot more about reducers and transducers
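
Transducers are at least context-independent, which is what would make that switching plausible - the same xf can drive a lazy seq (pull), an eager reduction, or a core.async channel (push):

(require '[clojure.core.async :as async])

(def xf (comp (filter even?) (map inc)))

(sequence xf (range 10))       ;; pull: lazy seq => (1 3 5 7 9)
(transduce xf + 0 (range 10))  ;; eager reduction => 25
(async/chan 10 xf)             ;; push: a channel that applies xf to puts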

rickmoynihan17:03:38

I've not made much use of either yet