#data-science
2019-09-02
Daniel Slutsky 05:09:13

Video of last Thursday's meeting: https://youtu.be/3Hx7kbub9YE Thanks, @jsa-aerial and @jr0cket!

🚀 4
agigao 09:09:23

Hello Clojurians, any ideas on how to use a Python pickled model in a Clojure web app?

chrisn 13:09:04

Maybe use libpython-clj on the server side to unpickle and run the model.

āœ”ļø 4
chrisn 13:09:59

New releases of a lot of the techascent stack:

• https://github.com/techascent/tech.io - A somewhat drop-in replacement for clojure.java.io that lets you read/write data in the form of nippy, json, or csv/tsv to/from a URL. We have backends for AWS and Azure, so you can switch between a local file and AWS just by changing the URL. We make it easy to add new protocols; for instance, the Azure blob storage backend can be written to as: (io/put-nippy! "" {:a 1 :b 2}). Adding support for a new protocol means implementing only a multimethod and a protocol. We have found that gzipped TSV makes a great format for datasets: it has decent compression, is robust against locale changes (such as locales that use commas as the thousands separator), and can be used easily from Clojure, R, and Python.

• https://github.com/techascent/tech.ml.dataset - Load those gzipped TSVs into a column-store system like an R data frame or pandas. Much better REPL printing support, auto-detection between tsv and csv, and column-datatype inference. Much better support for descriptive statistics, so you can quickly load your data and get an overview of what is going on; simply call descriptive-stats from the REPL and let the returned dataset print itself out. There are also some small nice things: datasets are functions that, given a column name, return the column (like a map), but when you iterate a dataset it returns a sequence of columns (as columns have names themselves). Columns also print nicely, so you can safely load really large things and print them; only the first 20 or so entries are shown. Strings and missing values are well supported.

• https://github.com/techascent/tech.ml - As mentioned earlier, smile was updated, so we have officially supported versions of some new models, most importantly ElasticNet. XGBoost support has been upgraded to fully support early stopping and per-round error metrics, so you can see whether you need early stopping in the first place. XGBoost has also been upgraded from upstream and now has (among other things) a better default regression objective (squared loss as opposed to linear).

We are actively using these tools on a client project that involves some intermediate-sized data (millions of rows, many columns wide), so they are getting some nice refinements. Columnwise operations, such as differences between columns or sums of a few of them, are effectively instant at that scale (millions of rows), so you can do exploratory programming and get nice, quick feedback. Our last training run involved:
1. Load the dataset, remove outliers, do some other dataset-specific filtering, and make various new columns based on linear combinations of other columns.
2. Do an n-wise column run-off: train on many random sets of n columns, where n goes from 2 to 5, and gather information about which collections of columns perform best. This is far more effective than a correlation table, and xgboost and elasticnet are fast enough to make it feasible (with moderate subsampling of the dataset).
3. Gridsearch across hyperparameters given the top Y results of the n-wise column search.
4. Do a full-dataset train of the top Y models from the gridsearch and gather stats. For our purposes, both loss and model size are important details. Choose the 'best' model based on tradeoffs among the stats gathered above.
These tools let you do high-level dataset engineering and some pretty sophisticated training regimes while staying in the functional realm. Enjoy 🙂.
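A minimal sketch of the tech.ml.dataset workflow described above, assuming the library is on the classpath; "data.tsv.gz" is a hypothetical file:

```clojure
(require '[tech.ml.dataset :as ds])

;; ->dataset auto-detects tsv vs csv (including gzipped files)
;; and infers column datatypes.
(def dataset (ds/->dataset "data.tsv.gz"))

;; A dataset acts as a function of column name, like a map:
;; (dataset "price") => the "price" column

;; Descriptive statistics come back as a dataset and print themselves:
(ds/descriptive-stats dataset)
```

Because the returned stats are themselves a dataset, they get the same truncated, readable REPL printing as any other dataset.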
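Step 2 above (the n-wise column run-off) can be sketched in plain Clojure; `train-and-eval` is a hypothetical stand-in for a tech.ml train/evaluate call that returns a validation loss. This version enumerates subsets exhaustively, whereas the run described above samples random subsets of a subsampled dataset:

```clojure
(require '[clojure.math.combinatorics :as combo])

(defn n-wise-runoff
  "Train on every n-column subset for n in 2..5 and rank subsets
  by the loss reported by train-and-eval."
  [train-and-eval column-names]
  (->> (range 2 6)
       (mapcat #(combo/combinations column-names %))
       (map (fn [cols]
              {:columns cols
               :loss    (train-and-eval cols)}))
       (sort-by :loss)))

;; (take 5 (n-wise-runoff my-eval-fn ["a" "b" "c" "d" "e"]))
;; => the five best-performing column subsets
```

The best subsets from this ranking then feed the hyperparameter gridsearch in step 3.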

šŸ‘ 32
🎺 4