#off-topic
2017-12-21
qqq06:12:53

Are there any Statistics / Machine Learning books that focus on implementation? I.e. instead of "here's a bunch of math, prove these theorems", it says: here's a dataset, here's a model that achieves error rate BLAH, write some code that gets error rate < BLAH

burke07:12:22

@qqq If you want to compare different algorithms (neural network, decision tree, etc.) on the same dataset, I would recommend playing with KNIME. If you're interested in implementing the algorithms to understand how they work, I would recommend reading some blogs (maybe https://machinelearningmastery.com/implement-decision-tree-algorithm-scratch-python/ ) or watching some YouTube videos

qqq07:12:49

no, it's not "diff algos, same dataset"

qqq07:12:09

it's "here are the N most important ML algorithms; for each algorithm, here's a toy data set, now go implement the algo and get error rate foobar"

burke07:12:48

I'm working on a platform that provides unit tests as a service: someone can request test data, classify it, and send back the results, and the service will report the error rate and other metrics. There will also be unit tests for small fragments of the algorithms. It's a project for the Machine Learning & Pattern Recognition course at my university. At the moment there are no plans to make it available for everyone; I didn't know there was demand outside of universities.

qqq07:12:31

There's http://poj.org/problemlist which is nice for classical, CLRS-style algorithms. I'm looking for something similar for machine learning algorithms, where "correct" means "error rate good enough" instead of "string equivalence of output."

qqq07:12:38

I think we're discussing something very similar.

burke08:12:30

On my platform you send your results to a REST API as JSON/EDN data (it's language-independent) and it compares them with the results of a reference algorithm, not with string equivalence of the output. The next step I'm working on is reference algorithms for common mistakes in the implementation, so when you make a mistake, like forgetting to sort the numbers in your median algorithm, the software won't just tell you that your median is implemented incorrectly; it will also suggest checking whether you forgot to sort the numbers.
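A minimal sketch of the mistake-diagnosis idea described above, using a median exercise as the example; the function names and hint text here are illustrative assumptions, not the platform's actual API:

```python
def median(xs):
    """Correct reference: sort first, then take the middle element(s)."""
    s = sorted(xs)
    n, mid = len(s), len(s) // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

def median_forgot_sort(xs):
    """Deliberately buggy reference: the common mistake of skipping the sort."""
    n, mid = len(xs), len(xs) // 2
    return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2

def diagnose_median(submitted, xs):
    """Compare a submitted result against the correct reference and a
    known-buggy one; matching the buggy one triggers a targeted hint."""
    if submitted == median(xs):
        return "correct"
    if submitted == median_forgot_sort(xs):
        return "incorrect - did you forget to sort the numbers?"
    return "incorrect"
```

Each additional buggy reference is cheap to run against a submission and buys one more targeted hint.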

qqq08:12:55

re "string equivalence" -- I think we're in agreement here. I'm saying: for classical CLRS-style algorithms, you can do string equivalence on the output; but for ML algorithms, you have to take the output, compute an error rate, and see if it's "good enough"
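The two grading styles being contrasted can be sketched in a few lines of Python; the function names and the 5% threshold are assumptions for illustration:

```python
def grade_classical(expected: str, actual: str) -> bool:
    """CLRS-style grading: the output must match the expected string exactly."""
    return expected.strip() == actual.strip()

def grade_ml(true_labels, predicted_labels, max_error_rate=0.05) -> bool:
    """ML-style grading: the error rate just has to be good enough."""
    errors = sum(t != p for t, p in zip(true_labels, predicted_labels))
    return errors / len(true_labels) <= max_error_rate
```

The ML grader never looks at the submission's raw output format, only at how often its labels disagree with the ground truth.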

qqq08:12:48

what machine learning algorithms / datasets are you covering? are they from the UCI repository, or did you create your own?

burke08:12:54

If you request test data for regular exercises, the platform generates synthetic test data; I use machine learning algorithms to generate binary and multiclass classification datasets. There are also a few competitions where you compete against other students in your class: everyone gets the same dataset, and a leaderboard shows how well the other students classified it compared to your solution. For the prototype of the competition feature I used the Iris dataset.
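The platform's generator isn't public; as a stand-in, here is a toy synthetic binary-classification generator in plain Python (two Gaussian blobs in 2-D, where a hypothetical `separation` parameter controls how hard the classes are to tell apart):

```python
import random

def make_binary_dataset(n_per_class=50, separation=3.0, seed=0):
    """Generate 2-D points from two Gaussian blobs, one blob per class.
    Larger separation moves the class means apart, making the task easier."""
    rng = random.Random(seed)
    X, y = [], []
    for label, center_x in ((0, 0.0), (1, separation)):
        for _ in range(n_per_class):
            # Unit-variance Gaussian noise around each class center.
            X.append((rng.gauss(center_x, 1.0), rng.gauss(0.0, 1.0)))
            y.append(label)
    return X, y
```

Fixing the seed makes the same "exercise" reproducible for every student, which is what a leaderboard-style competition needs.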

qqq09:12:27

@burke: this is really cool, you should commercialize it. I'd happily pay $10/month for a "continue to learn by writing code" service

burke09:12:25

@qqq thanks for your interest. I will consider opening the platform to other people when it's ready to use. Commercialization would be okay, but isn't really needed in my situation; I prefer donation-based development 🙂 When it's available I will contact you for test access 😄

qqq09:12:45

great! I look forward to it; but I can promise you, if it's donation-based, I wouldn't donate anything 🙂

burke09:12:52

Testing and feedback is also a kind of donation 🤔

cvic09:12:37

Yes. Time is a donation.