#data-science
2019-02-26
jlmr11:02:19

Hi, I’ve just started experimenting with JupyterLab and the IClojure kernel and I’m wondering if there is a way to suppress the output of a command. In Python it can be done by having ; at the end of the command, but in Clojure this has no effect of course. Is there some other character?

metasoarous17:02:28

@aria42 Great talk! Super excited to see Oz here! Thanks for the shout out. I'll make one point of correction though, which is that Oz does have a REPL API, which is how I predominantly use it.

✔️ 5
dcj21:02:07

@aria42: agreed, great talk! What follows are my random, "stream-of-consciousness" thoughts on Clojure & data-science, highly speculative and rough, and by no means worked out in detail:
Obviously Incanter was the early/first major contribution. IMHO Incanter included a data frame, math/stat functions, and a charting package. I found that the charts I created in Incanter were not easy to share, and the path to getting them into a web application wasn’t obvious. My initial attempts to use the Incanter data frame for other purposes were frustrating and not successful; however, the fault may have been all mine….
I am not (yet) a “notebook” person, but my perception is that notebooks are a great way to share, and I’m thrilled that Clojupyter and others are doing well in this space, and I plan/hope to use them as I transition to more of a “sharing” mode…
You wrote/asked: What is the “figwheel” for data-science? and Are notebooks better than a REPL? My feeling is that notebooks are great for sharing, but the REPL is awesome, and so I say “NO, REPL is better/different than a notebook” (for development, not for sharing). My opinion is that the awesomeness of figwheel is that it brought the REPL experience to web development. So, I think that focusing on data-science in the REPL is most likely to achieve an unparalleled experience, more unique to the Clojure value proposition.
Lately, for charting, I’ve been playing around with Oz (and will likely take a look at Saite at some point soon-ish). My initial impression is that Oz/vega-lite is far superior to the Incanter charting library, and I think that the sharing story, and the “how do I move my visualization to the web” story, is way better for these vega-lite based libs. I am excited about kixi.stats as a math/stats library, and I am hoping that this or something like it could grow into a very complete library.
Your opinions about the lack of a winning dataframe format for Clojure (as Pandas is for Python) resonate with my experience, but I never realized it until you said it. So, to speculate, perhaps as a community we can work on something like:
- a good/great dataframe
- get it to work well with math/stat libs (kixi.stats being one candidate)
- and vega-lite-based charting libs
All usable easily in the REPL. And it seems like it might be straightforward to package REPL exploration/development work in the above into a notebook when it is ready to share….

jsa-aerial21:02:15

@dcj I think a lot of that is 'on target'. Over on Zulip data-science there has been a lot of talk about the "pandas" issue, which really starts at the "NumPy" issue. If we had a 'NumPy' based on something that really leveraged things like BLAS and LAPACK and CUDA and OpenCL for GPU, with minimal overhead and some other goodies (automatic inference of best order of operation application, e.g.), then you would have the foundation for a great "pandas" as well as stats lib(s). I have made the pitch over there that this 'thing' should be Neanderthal, as Dragan has put in an impressive effort to achieve that minimal overhead and support for 'write naive, but get best performance anyway'.
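To illustrate why "best order of operation application" matters, here is a minimal Python sketch. It is not Neanderthal code and makes no claim about how Neanderthal chooses an order; it only counts scalar multiplications for a chain of matrix products using the textbook matrix-chain dynamic program. The shapes and function names are made up for illustration.

```python
def matmul_cost(p, q, r):
    """Scalar multiplications for a (p x q) @ (q x r) product."""
    return p * q * r


def best_chain_cost(dims):
    """Minimum multiplication count over all parenthesizations.

    dims has length n + 1; matrix i has shape dims[i] x dims[i + 1].
    """
    n = len(dims) - 1
    cost = [[0] * n for _ in range(n)]
    for length in range(2, n + 1):          # length of the sub-chain
        for i in range(n - length + 1):
            j = i + length - 1
            cost[i][j] = min(
                cost[i][k] + cost[k + 1][j] + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)         # split point between k and k + 1
            )
    return cost[0][n - 1]


# Hypothetical chain: A (1000 x 1000), B (1000 x 1000), v (1000 x 1).
dims = [1000, 1000, 1000, 1]

# Naive left-to-right evaluation: (A @ B) @ v
left_to_right = matmul_cost(1000, 1000, 1000) + matmul_cost(1000, 1000, 1)

# Best order found by the dynamic program, here A @ (B @ v)
best = best_chain_cost(dims)

print(left_to_right, best)
```

For this chain the naive order needs roughly 500 times more multiplications than the best one, which is the kind of decision a library can make on the user's behalf.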

jsa-aerial21:02:47

I think the big problem for Incanter was that it was simply too early to the table.

jsa-aerial22:02:55

Your basic proposition of some sort of 'repl' based data-science lever, plus what Aria points out as winning in deployment, sounds like what the potential 'figwheel' for DS could be. Now, the new datafy/nav capability in 1.10 and REBL may well be a major piece of that vision. Also, note that it is really simple (and easy!) to do interactive repl dev in Neanderthal - even on the GPU.

dcj22:02:40

@jsa-aerial I neglected to mention datafy/nav/REBL. I think these are very promising also. Not clear to me yet if REBL the codebase will have the hooks people need to extend it. The fundamental ideas behind REBL are solid and important. Recall incanter/view as a way to browse/view a dataset. REBL is far superior to that. One can imagine vastly extending the REBL graphing based on Saite/Oz/etc.

metasoarous03:02:55

I hope I'm not getting anyone's hopes too high, but y'all may be interested to know that I had a great chat with @stuarthalloway & @marshall about integrating Vega-Lite & Vega (and maybe even Voyager) directly into REBL, and they were highly receptive to the idea. It sounds like if they can overcome the technical hurdles, we may see this come to fruition! 🤞

5
jsa-aerial22:02:52

Yes, I think we are 'singing the same song' here!

metasoarous04:02:41

Agreed; To me it's wonderful to see the Clojure community begin to fall into harmony around Vega-Lite & Vega, as visualization is such an important piece to the overall puzzle. I also see it as a major sign of strength that both our libraries are out there and gaining traction as it signals a robust intersection with lots of choices depending on need/taste. Support in REBL would only add to that momentum!

jsa-aerial16:02:48

Yes, it looks good in that regard, but we will see. But again, the key to moving forward is having that solid base (which I argue Neanderthal is by far the best for this) which is "Clojure's NumPy". Then you could build a Pandas 'DSL equivalent' on top of that. If this isn't the path taken, then we will continue to have an inconsistent, scattered set of half baked things, most of which don't play well together.

jsa-aerial22:02:43

Which is good - there have been too many unnecessarily fragmented views on how/where to proceed

dcj22:02:57

@jsa-aerial Crucial disclaimers: I haven't used NumPy or Neanderthal. I am currently more concerned with how I organize and manage my exploration/analysis; how long a particular calculation takes is not my current focus. Ideally there would be a well integrated set of tools for managing data, browsing/viewing/charting it, and defining/invoking computations, and then if you need that computation to run with maximal performance, there would be a smooth/straightforward path to getting to done via a high performance GPU library.

jsa-aerial22:02:46

@dcj I think having the solid base is the key to all of this. If NumPy sucked in performance, Pandas would suck and nobody would be talking much about it

👍 5
dcj22:02:08

But I understand that performance can have a qualitative effect on exploration, if your computation runs too slowly, then it turns into a batch job and you can't sit around waiting for it, and you lose the REPL experience

jsa-aerial22:02:23

The key is to have that solid base while maintaining the wonderful flexibility and ease of data 'wrangling' that Clojure has. But having all that on a poor performing base would just be a waste as it wouldn't really be useful.

jsa-aerial22:02:46

That's a major reason why core.matrix was (and to a large extent still is) a flop. It really makes no sense to base any serious application on something that won't scale. It's all a massive waste of time and effort. Sadly, I'm speaking from experience...

👍 5
metasoarous03:02:37

Just curious: did you test out all of the implementations? The default vectorz impl is pretty slow for bigger stuff, but depending on what you're doing some of the other impls aren't too bad (assuming you have, but figured I'd check). It actually has some sort of data-frame like table notion, but it was a little half-baked still the last time I looked into it.

jsa-aerial16:02:18

No, I did not waste my time trying all of the half baked impls for this. I did waste time trying to get the Clatrix backend (at least on BLAS even if not optimal) to work consistently. I can't recall the details at the moment but I hit a couple situations where things didn't just fail outright, but actually gave bad results. Obviously even worse than just blowing up.

jsa-aerial17:02:04

But that is actually irrelevant to the main issue. After all you could always point out that, well, that can be fixed. With enough effort the Clatrix backend could be as solid as the vectorz. And that would be correct, but a major waste of time. That's because core.matrix isn't set up to use all of the BLAS API, and uses none of the LAPACK API, since it is really focused on vectorz.

jsa-aerial17:02:02

Now, why waste time trying to 'fix' this when there is already a base available that targets all of BLAS and LAPACK and does so with an informed, carefully crafted minimal overhead interface? One that also simultaneously directly supports the GPU? One that supports ordering operations for best performance so that you don't have to figure that out for each calculation? Et al.

metasoarous01:02:59

Yeah, I hear you. Just asking.

metasoarous01:02:59

I found the same sort of issues with Incanter back in the day. Reported a bug in the pca algorithm that took years to get fixed. If it ever did...

Ben Kamphaus23:02:01

“I haven’t used NumPy” etc. (e.g. Matlab, IDL, w/e) is something I hear too commonly in these discussions; it’s really hard to compete with something you don’t understand. Same re: pandas, I don’t think it’s a global optimum, but what it gets right is a semantically rich DSL layered over a performant numerics impl. Granted it’s tied to tables, it’s very much a builder patterny sort of series of intermediate transforms that string SQL-named things together, etc., but it’s a huge speedup over native Python code and significantly faster than just using tables or libraries just over naive contiguous array impls. And you get to interact with named columns, do all your pivots, etc.
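For anyone who hasn't used pandas, a minimal sketch of the "series of intermediate transforms over named columns" style described above. It assumes pandas is installed; the column names and numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical sales table with named columns.
df = pd.DataFrame(
    {
        "region": ["north", "north", "south", "south", "north"],
        "quarter": ["Q1", "Q2", "Q1", "Q2", "Q1"],
        "revenue": [1000.0, 2000.0, 500.0, -100.0, 3000.0],
    }
)

summary = (
    df[df["revenue"] > 0]                                  # filter rows
    .assign(revenue_k=lambda d: d["revenue"] / 1000)       # derive a named column
    .pivot_table(                                          # pivot on named columns
        index="region", columns="quarter", values="revenue_k", aggfunc="sum"
    )
)

print(summary)
```

Each step returns a new table, so the chain reads like a pipeline of SQL-named operations while the arithmetic itself runs over whole columns.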

Ben Kamphaus23:02:48

so a lot of people will look at pandas and think dataframes are working around limitations in Python sequence operations or something, and say “well we have group-by, merge-with, reduce, etc. and I’d do the transform that way”, and they get that the Clojure case is already a good deal faster than the naive Python code, but don’t get that when it happens in pandas you use your groupby etc. code but what happens underneath the hood is a fast vectorized array operation, not something tree or list like.
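A minimal sketch of that contrast, assuming pandas is installed; the data and function names are invented. Both functions give the same answer. The difference is that the first walks Python objects one row at a time, while the second hands the whole column to pandas' groupby machinery.

```python
from collections import defaultdict

import pandas as pd

# Hypothetical rows, as you might hold them in plain Python.
rows = [
    {"city": "Oslo", "sales": 10.0},
    {"city": "Lima", "sales": 4.0},
    {"city": "Oslo", "sales": 2.5},
    {"city": "Lima", "sales": 1.0},
]


def naive_group_sum(rows):
    """Row-at-a-time grouping with ordinary Python collections."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["city"]] += row["sales"]
    return dict(totals)


def pandas_group_sum(rows):
    """Same grouping written with the dataframe DSL, run column-wise by pandas."""
    df = pd.DataFrame(rows)
    return df.groupby("city")["sales"].sum().to_dict()


naive = naive_group_sum(rows)
vectorized = pandas_group_sum(rows)
print(naive, vectorized)
```

On four rows there is nothing to measure; the gap only shows up as the column grows.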

☝️ 5
Ben Kamphaus23:02:32

of course once you’re out of smallish-mediumish data and it’s big enough that network io dominates, the pandas desktop story is less compelling, so ymmv.

Ben Kamphaus23:02:33

(good talk above, and discussion here, I was mainly responding to the blog posts)

Daniel Slutsky23:02:39

Regarding a semantically rich DSL layered over a performant numerics impl. as @bkamphaus suggested, it is worth looking into the recent progress of @chris441 with the tech.ml stack. https://clojurians.zulipchat.com/#narrow/stream/151924-data-science/topic/tech.2Eml.2Edataset/near/159465711 The DSL part of this is still under construction, but looks really promising IMHO. It would be of great help if you could join that discussion on Zulip and add your thoughts about the DSL choices, etc.
