data-science 2017-08-16 | Slack Archive

c25l 01:08:59

I doubt I speak for most, but I go to python every time for data science work, and the notebook is more of an annoyance than a feature. I go because pandas can seemingly parse everything, most analyses are already implemented there in some form or (like tensorflow) aren't convenient anywhere else, and they're all built on matrix ops that are about as efficient as possible.

jhemann 04:08:51

I agree, in that pandas was really the killer library that led to Python’s accelerated adoption for data science work. (The foundation of numpy/scipy/matplotlib was of course critical.)

zirmite 11:08:54

pandas+scikit-learn is still a pretty great combo. notebooks are sometimes nice too, but I think those two libraries are the key to python adoption for data-science. I’ve also been using pymc3 a bit lately but I think clojure has a decent story for bayesian modeling with anglican and bayadera now (although the docs for those are lacking).

joelkuiper 22:08:20

I’m probably the odd one out, but coming from an R background I generally find Pandas annoying and lacking. Numpy, Scipy, scikit-learn, pymc, gensim, theano/tensorflow/keras and now PyTorch are amazing though, and really have no equivalent elsewhere. Not to mention all the visualization libraries, and the sheer number of examples, well-written documentation, and reference implementations for papers & blog posts.

joelkuiper 22:08:51

for large datasets I generally use H5fs for its memory map capabilities, which is also something that is oddly lacking in other ecosystems
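A rough sketch of what memory mapping buys you for a dataset too large to load at once. This is not the HDF5-style setup mentioned above: it swaps in numpy's `np.memmap` as a stand-in, and the file name, shape, and chunk size are made up for illustration.

```python
import os
import tempfile

import numpy as np

# Hypothetical file name and shape, chosen only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "big_matrix.dat")
shape = (10_000, 64)

# Create a disk-backed array and fill it in row chunks, so the whole
# matrix never has to be built in memory at once.
writer = np.memmap(path, dtype=np.float32, mode="w+", shape=shape)
for start in range(0, shape[0], 1_000):
    writer[start:start + 1_000] = start / 1_000  # chunk k is filled with the value k
writer.flush()
del writer

# Reopen read-only; only the slices that are touched get paged in from disk.
reader = np.memmap(path, dtype=np.float32, mode="r", shape=shape)
chunk_mean = float(reader[2_000:3_000].mean())
```

The read side only touches rows 2000 to 2999, so the OS pages in just that part of the file rather than the full array.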

joelkuiper 22:08:18

I sometimes wish Clojure/JVM had a better data-science story, but on the other hand … apart from some notion of language purism, I really don’t see the benefit given all the nice distributed systems frameworks we have these days.

joelkuiper 22:08:07

so the real question to me would be: what would be the killer thing Clojure/Lisp can do in the data science world, that would be incredibly hard elsewhere?

zirmite 22:08:21

that’s a great framing, @joelkuiper.

zirmite 22:08:23

i have an inkling that perhaps as data-science moves towards more streaming workloads (if, in fact, it does), that something like #onyx + ML will be a contender.

c25l 13:08:18

I built the stream processing for the platform that Rally Software was using a few years ago. We were able to do a ton of stuff, and I wanted onyx but it was still marked as "don't use for stuff you care about" or equivalent at the time. Everything was in clojure on samza, and it worked. Clojure itself was great to do stream work in.

c25l 13:08:07

That said, Riemann was an amazing base platform to do stream work from; the built-in database capabilities were super convenient.

zirmite 13:08:49

we are slowly approaching some streaming use cases where I work. I’m hoping now that onyx is more mature, I can get folks on board with using it. thanks for reminding me about riemann too 🙂

joelkuiper 22:08:09

possibly, the idea of some “immutable” model might be nice, especially in the context of online learning / reinforcement learning
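One way to picture the “immutable” model idea for online learning, as a minimal sketch: every update returns a new model value instead of mutating the old one, so earlier versions stay around for comparison or rollback. `LinearModel`, `sgd_step`, and the tiny data stream are hypothetical names and data invented for this illustration, and Python is used only to keep the example short.

```python
from typing import NamedTuple, Tuple


class LinearModel(NamedTuple):
    """An immutable model value: weights and bias never change in place."""
    weights: Tuple[float, ...]
    bias: float


def predict(model, x):
    return sum(w * xi for w, xi in zip(model.weights, x)) + model.bias


def sgd_step(model, x, y, lr=0.1):
    """Return a NEW model after one squared-error SGD step; the old one is untouched."""
    error = predict(model, x) - y
    new_weights = tuple(w - lr * error * xi for w, xi in zip(model.weights, x))
    return LinearModel(new_weights, model.bias - lr * error)


# Feed a small stream of (x, y) pairs, keeping every intermediate model.
stream = [((1.0, 0.0), 1.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 0.0)]
history = [LinearModel((0.0, 0.0), 0.0)]
for x, y in stream:
    history.append(sgd_step(history[-1], x, y))
```

Because each entry in `history` is a separate value, any earlier model can still be evaluated after later updates have happened.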

joelkuiper 22:08:19

my inkling (though I do a bunch of NLP, so I’m biased) is that some time soon people will rediscover symbolic manipulation (i.e. logic programming). maybe someone really clever can figure out how to bolt logic programming on top of statistical optimization; that could probably be a really nice fit for a Lisp

zirmite 22:08:44

interesting

blueberry 23:08:39

@joelkuiper for me, the killer feature is that i can create stuff that i need, which is not available off the shelf. sure, if i needed something that is already available in python, it would be most effective to just use it. but, as soon as i need a custom thing (and i need that often) i'd have to implement it in a very awkward python (with poor performance and messy code) or, if i wanted acceptable performance, implement it in c/c++ and then integrate it with the rest of python's ecosystem. for me, the decision is clear. for someone who needs only standard stuff, python is a clear choice. another factor is that it would be really expensive and hard to find people who could do the custom clojure stuff that i do, so if i had to hire someone to program that stuff, that would be another pair of shoes...

joelkuiper 23:08:20

kinda depends on what you need off the shelf I guess. in machine learning I often need numerically heavy stuff, and having something like numpy makes that a lot easier. even finding a good enough random number generator for the JVM is tricky, whereas in numpy you can just say np.random.dirichlet, for example. Most programming is stringing together known things in my experience: something like coding up a custom loss function for a neural network is easy in Python, harder on the JVM. So I have to disagree with “only needs the standard stuff”, and I guess so does most of the academic community. I guess a factor in that is that when you do academic science you often compare against baselines, and not having to code those up too (but instead having the baseline “off the shelf”) makes iterating to the next thing a lot easier. But I guess it boils down to preference as well 🙂
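For a sense of how short these two things are on the Python side, a small sketch: a Dirichlet draw with `np.random.dirichlet`, plus a hand-written loss. The Huber loss, the `alpha` values, and the sample inputs are picked purely for illustration, and this computes only the loss value; a framework like tensorflow or PyTorch would be the thing supplying gradients in an actual network.

```python
import numpy as np

np.random.seed(0)  # seeded so the run is repeatable

# Draw 5 probability vectors from a symmetric Dirichlet over 3 categories.
alpha = [0.5, 0.5, 0.5]
samples = np.random.dirichlet(alpha, size=5)
row_sums = samples.sum(axis=1)  # each row is a distribution, so each sums to 1


def huber_loss(y_true, y_pred, delta=1.0):
    """Mean Huber loss: quadratic for small errors, linear for large ones."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    small = np.abs(err) <= delta
    per_item = np.where(small, 0.5 * err ** 2, delta * (np.abs(err) - 0.5 * delta))
    return per_item.mean()


# One small error (0.5) and one large error (3.0).
loss = huber_loss([0.0, 0.0], [0.5, 3.0])
```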

joelkuiper 23:08:40

I mean, case in point: the amazing Neanderthal library you wrote is a fairly recent development, and /you/ had to code that. And it wasn’t that there weren’t other fast numerical interfaces out there already. I guess personally I just don’t like coding all that much, and would rather reuse whatever I find, even if somewhat suboptimal 😉
