
rickmoynihan and mikera I've been thinking about plumatic schema and core.matrix (esp datasets) lately. Any thoughts on how to do something like that?


it is soooo easy to do with a vector of maps; it would be nice to do on a dataset


@otfrom: funny you should say that... we've been having similar discussions at Swirrl


I've not made much use of core.matrix yet... but we use incanter datasets in grafter quite a lot, though our use case is a little different. Basically incanter/core.matrix like to load the whole dataset into memory... but because we want to use it for ETL, we've been trying to avoid that and instead keep a lazy-seq of :rows in the Dataset
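For context, the shape we use is roughly this. A minimal sketch; `Dataset` and `lazy-dataset` here are illustrative names, not a core.matrix or Grafter API:

```clojure
;; A sketch of a dataset whose :rows are a lazy seq, so the source is
;; only consumed as rows are realised. (Field names mirror what we use
;; in Grafter; this is not core.matrix.)
(defrecord Dataset [column-names rows])

(defn lazy-dataset
  "Build a Dataset over a (possibly huge) seq of row vectors
  without forcing it."
  [column-names row-seq]
  (->Dataset column-names (lazy-seq row-seq)))

;; Only the rows you take are realised, even over an infinite source:
(def d (lazy-dataset [:a :b] (map (fn [i] [i (* i i)]) (range))))
(take 3 (:rows d))
;; => ([0 0] [1 1] [2 4])
```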


but this means that validation of the rows at least is somewhat delayed - because you don't want to have to consume everything all the time in order to validate the rows


but my problem with incanter Datasets is that they allow arbitrary types for keys. From my perspective this can cause a lot of problems and it'd be much nicer if they were always keywords (though I'd accept always strings too) - allowing them to be either causes problems with equality
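A quick REPL illustration of the equality problem (plain Clojure, nothing dataset-specific):

```clojure
;; Keyword and string keys never compare equal, so two datasets that are
;; "the same" apart from key type won't be = either.
(= {:a 1} {"a" 1})
;; => false

(= [:a :b] ["a" "b"])
;; => false
```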


I'd like to move away from incanter though - and perhaps define a Dataset type of our own that conforms to the core.matrix Dataset protocol


I'd also like to experiment with perhaps a reducer-based implementation... but I've not seen many examples of people using reducers for I/O


rickmoynihan: AFAIK incanter 1.9 (and later 2.x) uses core.matrix.dataset


rickmoynihan: iota is a good one to look at for reducers and IO


I really should look again at what you have done in grafter


@otfrom: yeah I know incanter plans to use core.matrix.... but incanter 1.9 is basically a snapshot release... and there's been almost no movement on incanter for a long time as far as I can see


I've actually been looking at iota - it's one of the few examples of reducers and I/O that I've found - from the little I've seen it seems to assume too much about the file parsing...


but I need to look at it in more depth


@rickmoynihan: got schema and core.matrix datasets working together


matty.core> (def DataSet {:column-names [s/Keyword]
                          :columns [[s/Num]]
                          :shape [(s/one s/Num "x-shape")
                                  (s/one s/Num "y-shape")]})
;; => #'matty.core/DataSet
matty.core> (def foo (ds/dataset [:a :b :c] [[10 11 12] [20 21 22]]))
;; => #'matty.core/foo
matty.core> (s/validate DataSet foo)
;; => {:column-names [:a :b :c], :columns [[10 20] [11 21] [12 22]], :shape [2 3]}


not quite sure what my problem was before


ds is [clojure.core.matrix.dataset :as ds]


so I just need to constrain the column-names to keywords as I want


not sure if I can do coercion yet, but at least I can do validation
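Coercion might actually be doable with schema's own coercion namespace. A sketch, assuming `json-coercion-matcher` turning strings into keywords for `s/Keyword` schemas (which is the documented behaviour, but I haven't tried it against a real dataset):

```clojure
;; Sketch: coerce string column names to keywords with schema.coerce.
;; Assumes prismatic/schema is on the classpath.
(require '[schema.core :as s]
         '[schema.coerce :as coerce])

(def ColumnNames [s/Keyword])

(def coerce-column-names
  (coerce/coercer ColumnNames coerce/json-coercion-matcher))

(coerce-column-names ["a" "b" "c"])
;; => [:a :b :c]
```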


I noticed the other day that core.matrix has a column-wise representation now - I'm guessing the protocol doesn't require that


Regarding Grafter - our use cases are probably a little different. Firstly we have to avoid using incanter quite a lot, because incanter is eager... so the API isn't as expressive as what incanter provides... again we've been preferring laziness to eagerness (though that brings its own problems for sure). Also the main idea with Grafter was to support an OpenRefine-like interface for building transformations - so the DSL functions are intentionally dumbed down for those reasons. Also syntactically we thread the ds through the first argument of functions rather than the last - mainly because I wanted the option of optional arguments on dataset functions in the DSL. The basic idea is that each step in a -> is an undo point - allowing stepwise debugging at the granularity of the dataset functions via the UI


I'm still a wiki gnome at heart ❤️


hmm the api docs also need updating - we're on 0.7.0 now


@otfrom: You can see a prototype Grafter UI that was built by SINTEF (an FP7 project partner from the Dapaas project) as the centrepiece of this: Basically we'd decided to build Grafter to support a UI that we were planning to build as part of our product - but we didn't have enough resources in the project; so we let SINTEF (a Norwegian research institution) build a prototype UI for us... They took a fair bit of direction from us about what to do and how to do it - and they did a pretty good job - but it's very much prototype quality


there's a youtube video on that page you can watch


cool. Will have a look


One thing I've been wondering is whether it'd also be possible to have reducer/channel/seq (and therefore transducer) backed datasets... from your core.matrix experience, would this be possible with the API?


I mean now that there's a Dataset protocol presumably you could do this


IIRC mikera was suggesting getting at each row and processing it that way in a transducer (and, I presume, reducer) style
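Row-at-a-time processing in that style is easy to sketch with plain transducers (nothing core.matrix-specific here; rows are just vectors):

```clojure
;; Sketch: transform and filter rows in one pass, with no intermediate
;; seqs between the steps.
(def rows [[10 11 12] [20 21 22]])

(into []
      (comp (map #(mapv inc %))         ; transform each row
            (filter #(> (first %) 15))) ; keep rows whose first cell > 15
      rows)
;; => [[21 22 23]]
```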


I think it partly comes down to whether or not the backing matrix implementation is faster than the transducer/reducer would be or not


as a lot of the performance stuff is baked into the matrix implementations themselves


that's my understanding too... As I said though - our use case is perhaps a little different - in that firstly there isn't really a suitable backing matrix implementation that I know of - and we want to avoid loading the file into RAM - so the idea in e.g. grafter 2 would be to just build up a reducer inside the Dataset somehow - I'm guessing we could perhaps use the CollReduce/IReduceInit protocols for this... as our representation is currently #Dataset { :rows (...) :column-names [:foo :bar :baz]} and we'd need operations to keep the column names in sync with the row data. I'm not quite sure how we could back it with a transducer yet though... as I'm not sure there are protocols for that
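A rough sketch of what that could look like, reifying clojure.core.protocols/CollReduce (the protocol reduce and transduce dispatch on) over the row seq. The record and its fields are hypothetical, not an existing Grafter or core.matrix type:

```clojure
(require '[clojure.core.protocols :as p])

;; Sketch: a dataset that is reducible over its rows, so transducers can
;; run over it directly without realising an intermediate seq.
(defrecord ReducibleDataset [column-names rows]
  p/CollReduce
  (coll-reduce [_ f]
    (reduce f rows))
  (coll-reduce [_ f init]
    (reduce f init rows)))

(def d (->ReducibleDataset [:a :b] [[1 2] [3 4] [5 6]]))

;; transduce sees the dataset as a reducible collection of rows:
(transduce (map first) + 0 d)
;; => 9
```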


but it'd be very cool if you could switch a dataset from being pull based, push based, reducible/foldable, and sequence-able - but I think I need to learn a lot more about reducers and transducers


I've not made much use of either yet