#data-science
2020-05-21
niveauverleih08:05:44

I am gathering some selling arguments for clojure with clj-python over plain python for evangelism. Is it faster, more stable, easier to put to production?

Daniel Slutsky10:05:50

libpython-clj's docs offer some thoughts on what clojure is about, with a pythonista reader in mind: https://github.com/clj-python/libpython-clj/blob/master/docs/new-to-clojure.md

metasoarous18:05:09

A few big selling points to me:
• Clojure is a data-driven language, and so actually much better for basic data manipulation than python (functional pipelines etc are the shit)
• Clojure is much faster than python, generally speaking
• Clojure runs on JVM, which has real threads and no GIL, making it a better choice for parallelization
• Moreover, RH lists parallelization/concurrency as one of his core motivations for Clojure, and so it has lots of built-in primitives like atoms, refs, agents, futures (not to mention libs like core.async), which greatly empower this sort of work.

aaelony19:05:57

No more "indentation errors" 😄

🔥 4
aaelony19:05:32

Also: "One language to rule them all..."

💯 4
metasoarous20:05:04

Absolutely! And ClojureScript is a huge win here as well! A solid dynamic front end target is something python lacks, and which is gold for productizing data-science & data-viz.

metasoarous20:05:20

No more commas...

bananadance 8
metasoarous20:05:24

Seriously; fuck commas

metasoarous20:05:54

Here's the 🎫:
• Host typing races where clojurists and pythonists have to type vectors/lists of numbers
• let the naysayers eat their "BuT tHE paREnS?!?"
• ...
• profit

aaelony21:05:23

I am hoping that eventually, the management of sandboxed conda-like environments (or docker) will be driven by clojure and allow libpython-clj to be a "run anywhere" seamless thing

metasoarous21:05:39

Jesus... the package management system there is a nightmare

💯 4
metasoarous21:05:00

If we cleaned that nightmare up for them, that could be a huge sell.

👍 8
niveauverleih11:05:30

Thanks all! Why does Chris Nuernberger say that one should use containers when using clojure and clj-python?

niveauverleih11:05:43

@U05100J3V I have to read up on the GIL. As for parallelization, doesn't python have a library for that? I know distributed computing is not the same as parallel computing, but doesn't Spark solve part of their speed problems?

niveauverleih11:05:19

@U066L8B18 so the main take away from the long text are fun, easy to extend, high productivity, better REPL, right?

niveauverleih11:05:34

@U0CDMAKD0 I'm afraid "one language to rule them all" will just make pythonistas suspicious. For the package-management problems, could you please tell me more? I used pip before, and it worked.

Daniel Slutsky11:05:59

@UQLUWPRQD I think that is a nice way to summarize it! I like the point it makes about the REPL in the last paragraph.

Daniel Slutsky11:05:40

Because of this visibility advantage a common way to use Clojure is to model your problem as a transformation from datastructure to datastructure, testing each stage of this transformation in the repl and just letting the REPL printing show you the next move. 

niveauverleih11:05:50

@U05100J3V typing races: are you saying that one can type vectors faster in Clojure? (Sorry, my command of English is mediocre, I don't have access to hidden meanings).

niveauverleih12:05:20

@U066L8B18 I don't know these python functional pipelines that @U05100J3V talks about. Can they be used for transforming datastructure to datastructure? Background info: I've just been accepted as junior data scientist and will have to survive in python until I have gained enough credibility to start clojure evangelism. Worse, I'm newbie to both clojure and python.

niveauverleih12:05:56

... I've used transducers with core.async, so I have a vague idea what these transformations might look like.

Daniel Slutsky12:05:11

I don't know so much about the different possibilities of doing Functional Programming in Python. When I have to, I like to compose things with toolz.curried.pipe as demonstrated here: https://toolz.readthedocs.io/en/latest/streaming-analytics.html but probably there are other options. Hoping your time with Python will not be so bad : ) .. till your time of evangelism comes.

Daniel Slutsky12:05:22

Yes, one can work with datastructures in Python in a somewhat functional fashion. But I think the Clojure REPL experience makes it clearer and simpler. Probably one reason for that is that things are Printed in the same notation in which they are Read. So the Read and Print parts of the Read-Eval-Print-Loop speak the same language. Does it make sense?

niveauverleih12:05:52

Yes, that makes sense.

Daniel Slutsky12:05:10

user=> ;; Evaluate something:
user=> (update {:x 9} :x inc)
{:x 10}
user=> ;; Take the printed result and pass it to be read for the next evaluation:
user=> (update {:x 10} :x inc)
{:x 11}

niveauverleih12:05:44

That's clear. Different question. Are neanderthal and numpy overlapping in functionality? Would you mix them?

Daniel Slutsky12:05:44

Afaik they overlap. Never tried to mix them.

niveauverleih12:05:34

and thanks for the toolz link.

metasoarous17:05:11

@UDRJMEFSN's tech.dataset has some level of interop with numpy matrices, but I don't think we get this with neanderthal.

metasoarous17:05:49

toolz looks cool; I've used some things like that. I just hate that I have to go to a second-hand library to get all the Functional Programming goodies. Basically every python program I've written since learning Clojure results in me implementing a few clojure.core functions. I'd rather not have to do that work if I don't have to.

metasoarous17:05:03

And yes, Python is a fairly functional language as far as OOP languages go. In fact, I'd argue that at a core design level, it's actually functional first! The reason is that in Python, Objects & Methods are built using basic functional primitives. So it's not like Ruby, which for all the hype over having lambdas, is different in that in Ruby methods are not functions. It's possible to get to the function (lambda, really) for a given method, but it's an extra step. By contrast, in python, every object has a "magic" o.__dict__ attribute which points to all of the Object attributes and methods (as functions!) that the Object (and class) knows about. Which is hella elegant in my opinion. For an OOP language, Python is pretty nice (again, IMHO).
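A small Python sketch of the `__dict__` point above. The `Greeter` class is hypothetical and only for illustration. One detail worth hedging: in CPython, instance attributes normally sit in the instance's `__dict__`, while methods are stored as plain functions in the class's `__dict__` (and classes that define `__slots__` may have no instance `__dict__` at all).

```python
class Greeter:
    def __init__(self, name):
        self.name = name  # stored in the instance's __dict__

    def greet(self, punctuation="!"):
        return "Hello, " + self.name + punctuation


g = Greeter("Ada")

# Instance attributes live in the instance's __dict__.
instance_attrs = g.__dict__

# Methods are stored in the class's __dict__ as ordinary functions.
greet_fn = Greeter.__dict__["greet"]
is_plain_function = type(greet_fn).__name__ == "function"

# Calling the plain function with the instance passed explicitly
# gives the same result as the usual method call.
same_result = greet_fn(g, "?") == g.greet("?")
```

So a method call like `g.greet("?")` is, roughly, a lookup of a function on the class followed by a call with the instance as the first argument.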

metasoarous17:05:36

However... And this is a big "however": Functional programming is deeply limited in power (compared to its full potential) when you don't bake persistent/immutable data structures into the language. JS has this problem too; it has first class functions, and at least used to require you to use prototyping to build objects (which is also a functional pattern, if sometimes a messy one). But the lack of persistent data structures means you can't squeeze every bit of juice out of those FP patterns. And this isn't just academic: Perhaps the single strongest part of the JS ecosystem right now is React, which is fundamentally a Functional Reactive Programming paradigm. Time and time again I have found that vanilla JS React is significantly limited relative to ClojureScript+React due to this limitation. As with toolz in python, there are libraries for adding persistent data structures to JS. In fact, mori just rips off ClojureScript's persistent data structures as a JS lib! Which is great, but also not the same as being a pervasive and fundamental assumption in the language. In Clojure everything is built up around these ideas and so they carry much more heft and power.

metasoarous17:05:52

Taking comments/questions in turn... <deep-breath>

metasoarous17:05:10

Next up for package management (now let the real grilling begin).

metasoarous17:05:23

Yes, pip works fine for the first few packages you install.

metasoarous17:05:17

Until you end up needing to install a package that depends on a different version of a package than the one you already have installed. In python you cannot install more than one version of a package in a given python environment! Which, to be honest, is fucking lunacy. Ruby, JS, Java and by implication Clojure all dodge this bullet. Each project specifies which version it needs, but you can have multiple versions installed alongside each other.

metasoarous17:05:24

Python does not allow this, and thus there have been scores of projects which try to patch over the fundamental failing, and none of them have really done a good job of it (again, IMHO).

metasoarous17:05:16

In virtualenv you actually create separate environments for each project, and have to install things separately in each one of them. Managing and switching between these becomes a pain.

metasoarous17:05:38

Pip has finally embraced this pipenv project that gives you something close to what we have had forever in a lot of languages, which is a single tool for specifying packages and versions in a file, and managing virtualenvs based on those: https://pypi.org/project/pipenv/

metasoarous17:05:00

Hopefully that project is going well. It's very new, and I haven't looked at it since maybe 1.5yr ago. When I did, I had some issues with it, but it seems to be doing the right thing by copying "bundler, composer, npm, cargo, yarn, etc." (aka all the things that other languages have had for years, and Pythonistas are only now coming around to...).

metasoarous18:05:38

To be fair, I've found Python libraries to be pretty good about not breaking by themselves, probably because of lived experience with the pains of package management in this kind of system. But also because of the Python etiquette for simplicity (as they see it, at least: "there should be one obvious right way to do things"; to be contrasted with the ruby philosophy of "monkeypatching goes vrrrrrrmmmm!"), I think there's a natural inclination towards some of the Clojure philosophy of simplicity, which may rub off in part as "try not to break things". In Ruby, any upgrade of any major Rails or Active whatever infrastructure meant a big refactor. Not as frequently the case in Python, in my experience. So credit where due.

metasoarous18:05:01

However, and we all knew this was coming. Drumroll please!

metasoarous18:05:50

The language itself likes to break things! 🎉 (Actually I think it may be worse than Ruby in this ironically, though I really haven't used Ruby in years)

metasoarous18:05:51

It has taken Python something like 15 years to transition to Python 3. People are still running Python 2! It's getting more and more uncommon, but contrast the situation with that in pretty much any other language.

metasoarous18:05:39

A lot of Pythonistas hate this, and I think it will prime them nicely for Clojure, which in language and community has an exceptional ethic of not breaking things.

metasoarous18:05:43

I have seen such an ethic in exactly 0 other languages I have spent time with. Everyone seems to think that by incrementing a version number it's safe to break things (semver). Wrong! It still causes pain! https://www.youtube.com/watch?v=oyLBGkS5ICk

metasoarous18:05:49

Anyway, the part that pipenv (see above) doesn't solve is the rest of your system (native dependencies, etc). This is more in the realm of conda, which to some extent does a decent job of aiming at that problem, but also (to my knowledge) doesn't interoperate with pipenv 😂🔫

metasoarous18:05:33

Clojure has a sort of unfair advantage here, which is that it's hosted on the JVM, and so there generally aren't as many system-level libraries and such needed.

metasoarous18:05:36

This is all important context for understanding how Clojure could potentially help, and is where we dovetail with @UDRJMEFSN's point about containers.

metasoarous18:05:58

Because Clojure also doesn't really solve the problem of native/system dependencies (again, because it mostly doesn't have the problem, except when doing things like this: trying to interop with python or low level computational libraries), there's a space here where if we build good tooling (ideally as an extension of, or at least compatible with, the deps.edn config) we could potentially solve some of the combined problems of conda+pipenv 🤔 🏗️

metasoarous18:05:14

Imagine a world where pythonists would use this cool new tool for solving their combined conda+pipenv needs, that lets them write in python! 🌈 🦄

metasoarous18:05:16

And where that tool just happens to be implemented in Clojure and comes preconfigured with libpython-clj so that curious Pythonistas can dabble in the divine art of true parallelization and immutable data! 🕍 🎆 🙏

metasoarous18:05:10

I get that none of this is easy, but all of this brings me to my last point in response to @UQLUWPRQD's questions/comments: Parallelism!

metasoarous18:05:18

Python does have some libraries for parallel computation, but they are nothing like the support we have in Clojure. A lot of this (to be honest) comes from being on the JVM. The JVM is awesome for threading. Clojure takes it to the next level by providing pervasive persistent & immutable data, which solves a lot of the problems one gets with place oriented (and Object Oriented) programming in particular (see https://www.infoq.com/presentations/Are-We-There-Yet-Rich-Hickey/).

metasoarous18:05:06

But it goes even further than that! Clojure provides state management primitives like atoms, refs, agents, futures, and abstract/custom dereffables (like Reagent's ratoms)!

metasoarous18:05:35

It also comes with concurrency tooling like core.async, which, it's worth mentioning, is a testament to Clojure's macro capabilities as a lisp! 💪 With macros, lisps allow users (programmers, libraries) to extend the syntax of the language, and so we were able to copy design features from the Go programming language 🎉

metasoarous18:05:29

What python has (the multiprocessing library) only allows you to spawn new processes from a parent python process.

metasoarous18:05:11

This is different than being able to spawn new threads in that processes can't share data in memory like threads can. You're stuck with message passing which requires data serialization/deserialization, and can severely constrain the kinds of things you can do.
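A minimal, stdlib-only Python sketch of the difference described above. It does not spawn a real process: a `pickle` round trip stands in for the serialization hop that ordinary Python objects go through when sent between processes. The `bump` helper and the variable names are hypothetical.

```python
import pickle
import threading


def bump(counter):
    counter["count"] += 1


shared = {"count": 0}

# A thread gets a reference to the very same dict, so the caller sees the change.
worker = threading.Thread(target=bump, args=(shared,))
worker.start()
worker.join()
count_after_thread = shared["count"]

# Data sent to another process is serialized and rebuilt on the other side.
# The pickle round trip below stands in for that hop: the receiver gets a copy.
received = pickle.loads(pickle.dumps(shared))
bump(received)
count_after_copy = shared["count"]  # the original is untouched by the copy's bump

# Not every value survives serialization; a lambda, for example, does not.
try:
    pickle.dumps(lambda x: x + 1)
    lambda_is_picklable = True
except (pickle.PicklingError, AttributeError):
    lambda_is_picklable = False
```

The thread mutates the caller's data in place, while the "received" copy diverges from the original, and some values (here a lambda) cannot be sent at all. That is the kind of constraint message passing between processes brings with it.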

metasoarous18:05:32

In python, the GIL means that only one thread can ever be running at a time.

metasoarous18:05:05

So you can share memory between "threads", but they're not really threads, because you can't get them to run in parallel.

metasoarous18:05:09

Spark is basically just a layer of abstraction over lots of separate processes, which might be python or other (in fact it's JVM, so we win there as well cause it's easier to interop there, and we have some very snappy spark wrappers, or you can interact with the jvm code directly).

metasoarous18:05:29

But you're still going to have the same constraints around processes passing messages, and will have the added overhead of... Spark as infrastructure.

metasoarous18:05:51

Bottom line: There's a lot that Pythonistas stand to gain by working with the Clojure community, but the reverse is also true. We should be thinking about how our communities can benefit each other.

metasoarous18:05:03

The idea of "one language to rule them all" is in some ways a sort of lisp pipe dream, but also one which Clojure has sort of come closest to with its hosted philosophy & design.

metasoarous18:05:25

But it's not really "one language to rule them all", it's "one language to connect them all" 🌈 🦄 ❇️

metasoarous18:05:58

Oh; looked up and saw two more things to respond to:
> typing races: are you saying that one can type vectors faster in Clojure? (Sorry, my command of English is mediocre, I don't have access to hidden meanings).
Yes, not as in static typing but as in literally keyboard typing. Try it:

[1 2 3 4 5 6 7 8 9]
;vs
[1, 2, 3, 4, 5, 6, 7, 8, 9]

metasoarous18:05:00

Obviously, it's a little bit silly calling out little things like commas, but that's exactly the point! Anyone who says "BuT thE COmMaS!?!" is a) focused on minutiae, and b) not seeing the whole picture.

metasoarous19:05:05

I would take Clojure's parens over the commas and semicolons and syntactic conflation/complexion of code blocks with data structures (looking at you { ... }) of other languages any day.

metasoarous19:05:24

Last thing: functional pipelines. @UQLUWPRQD You are already familiar with transducers, so you more or less know what I'm talking about. But also, transducers are the more decomplected (simpler/more-general/more-powerful) versions of a thing we have long done with the -> and ->> (and cond->/cond->>) macros, which more or less do the same thing but greedily. (-> x (f :arg) (g 1 2)) is the same as

(g (f x :arg) 1 2)
Meanwhile, (->> xs (map f) (filter g)) is the same as
(filter g
        (map f xs))

metasoarous19:05:08

Basically just a nice way of taking (certain) deeply nested Clojure expressions and rewriting them as a sequence of operations, much like with transducer composition. I guess the above example looks better like:

(->> xs
     (map f)
     (filter g))

metasoarous19:05:06

I actually usually end up writing threadf and threadl functions in python projects now, where I pass vectors of [fn, arg1, arg2], because I find it easier to read.
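The actual helpers are not shown in the thread, so the following is only a guess at what such `threadf` and `threadl` functions might look like in Python: each step is either a bare function or a list of `[fn, arg1, arg2]`, and the running value is inserted as the first or last argument respectively. The helper names `_split`, `inc`, and `is_even` are hypothetical. (toolz also ships `thread_first` and `thread_last`, which cover similar ground.)

```python
import operator
from functools import reduce


def _split(step):
    """Accept either a bare function or a [fn, arg1, arg2, ...] sequence."""
    fn, *args = step if isinstance(step, (list, tuple)) else [step]
    return fn, args


def threadf(value, *steps):
    """Like Clojure's ->: the running value becomes the FIRST argument."""
    def apply_step(acc, step):
        fn, args = _split(step)
        return fn(acc, *args)
    return reduce(apply_step, steps, value)


def threadl(value, *steps):
    """Like Clojure's ->>: the running value becomes the LAST argument."""
    def apply_step(acc, step):
        fn, args = _split(step)
        return fn(*args, acc)
    return reduce(apply_step, steps, value)


def inc(x):
    return x + 1


def is_even(x):
    return x % 2 == 0


# Roughly (-> 10 (- 3) (* 2))
first_result = threadf(10, [operator.sub, 3], [operator.mul, 2])

# Roughly (->> [1 2 3 4] (map inc) (filter even?))
last_result = list(threadl([1, 2, 3, 4], [map, inc], [filter, is_even]))
```

Because Python's `map` and `filter` take the collection last, `threadl` lines up with them the same way `->>` does with Clojure's sequence functions.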

metasoarous19:05:14

OK; Sorry for the long rant. Hope that was helpful.

👍 8
jsa-aerial19:05:51

Wow, this thread is crazy long. Even so, I think a few points of clarification might help.

Neanderthal is analogous to Numpy while tech.ml.dataset (TMD) covers more the Pandas, data.table et al. space. So, Neanderthal is for real number crunching while TMD is for dataframe style slicing and dicing.

Python doesn't really have any parallelism capabilities (due to the GIL issues) - it does have concurrency. Sort of. Anyone who has ever tried to do concurrency in Python knows it is just awful. There have been loads of libs over the decades trying to 'fix' this in various ways. Asyncio was supposed to be the final true way and was 'blessed' by Guido. But it is ludicrously complex and error prone. Trio is a lib that actually finally does (mostly) work for concurrent / async work. Thankfully. I threw away a bunch of brittle horrible asyncio code and replaced it with trio code which actually worked w/o random exceptions and other idiocies.

The python2 vs python3 fiasco isn't as bad as the perl 5 / 6 mess, but it is closer than you might think. There is good evidence (and reasons) that py2 will be around more or less forever.

IMO, Python is very non functional in almost all aspects. You can beat on it to some extent to use it in a broken functional manner, but it fights you all the way. This isn't really surprising as Guido is on record dissing functional programming. Indeed, he tried hard to get rid of the simple functional constructs that it does have but failed.

metasoarous20:05:00

Thanks for the clarifications and additional context @U06C63VL4.

👍 8
metasoarous20:05:54

My point about tech.ml.dataset is that IIRC libpython-clj does a bit of work to give you zero-copy mappings to/from tech.ml.dataset structures and what I thought were numpy data structures. I could have gotten that wrong, and that mapping is more directly to pandas. In any case, your point is well taken that conceptually tech.ml.dataset & pandas serve the same role.

metasoarous20:05:38

The only real way that python is functional is in embracing first class functions as values. Which in my book is sort of the minimal sufficient condition. And again, their OOP approach leans heavily on that assumption, hence my perspective on it. But you're absolutely right @U06C63VL4 that this is pretty weak sauce relative to what you can do with first-class functions when you build around a core of immutable data structures & functions, and provide utilities for managing state separate from value (atoms, software transactional memory, etc).

Daniel Slutsky20:05:11

BTW, not having commas is actually important in making code-editing easier. It is not just about aesthetics and space-saving. In languages that require commas, whenever I have to comment out some part of a big data structure (say, a long list, or some nested dict, or something), it becomes quite annoying. Lots of editing is required to take care of the commas. In clojure, you just comment out (`#_`) the relevant inner form, without any such bother.

✔️ 4
jsa-aerial21:05:04

@U05100J3V I don't think you are wrong about that bit (as Pandas sits on top of Numpy to various extents), but TMD is focused on the Pandas columnar data slicing and dicing stuff - Clojisr also uses it to map to/from R dataframe things (dplyr / data.table / et al.). TMD also tries (like Pandas on top of Numpy) to utilize efficient native memory.

✔️ 4
jsa-aerial21:05:38

Re: Python func stuff: Yeah, I think that is a reasonable point Chris. And following that line of thought, (it hurts to say this but...) JS is much more functional than Python

✔️ 4
jsa-aerial21:05:35

How do you quote in F-ing slack anyway??? Daniel: "BTW, not having commas is actually important in making code-editing easier." This is definitely a VERY big deal. This has made Saite code transformation (at the editor level) far more simple. Having extraneous, and totally irrelevant, syntax is an f-ing disaster.

jsa-aerial21:05:29

Geez as long as I am on this, another thing that sucks like the tar pit from hell, is Python (and R just as much - maybe even worse) scoping 'rules'. Yeah, if it weren't such a geyser of bugs, this would be a LOL statement.
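The message above does not say which scoping rules are meant, so the following is only an assumption: two Python scoping behaviors that are commonly cited as surprising, shown as a small sketch with hypothetical names.

```python
# 1. Closures capture the variable itself, not its value at creation time,
#    so every lambda here sees the final value of i.
callbacks = [lambda: i for i in range(3)]
late_bound = [cb() for cb in callbacks]

# Binding the current value through a default argument is a common workaround.
callbacks_fixed = [lambda i=i: i for i in range(3)]
early_bound = [cb() for cb in callbacks_fixed]

# 2. Assigning to a name anywhere in a function makes it local to that
#    function, so the read on the right-hand side fails before the assignment.
counter = 0


def increment():
    counter = counter + 1
    return counter


try:
    increment()
    raised_unbound_local = False
except UnboundLocalError:
    raised_unbound_local = True
```

Declaring `global counter` (or `nonlocal` in a nested function) inside `increment` is the usual way to rebind the outer name instead.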

🎉 4
metasoarous22:05:56

You can quote with > followed by the quote

👍 8
🙏 8
🆒 4
jumar05:05:51

@U05100J3V wow, you should write a blog post about this 😉

metasoarous18:05:41

Yeah... I kinda started thinking that about half way through

metasoarous18:05:29

I'm going to assume that since this is a more or less public forum it would be fine for me to use folks' handles or names (happy to link out to folks' Twitter accounts as well), but if you've contributed to this thread please let me know if you'd like yours to be elided.

niveauverleih18:05:48

+1 for the blog post.

✔️ 4
chrisn14:05:24

I don't have much to add here as @U05100J3V really did sum things up really well.
1. I think containers are important because I don't have the time to dive into everyone's individual machine configuration and figure out why the thing does not work. If you come to me with an issue that seems to be hardware or machine config related then my first response will be to ask you to reproduce it in a container. You need containers anyway to move to production, and conda+docker gives you a completely reproducible pathway for a lot of things. We have put effort into making a container development flow as painless as possible; for instance, we have a container that mounts the local directory and runs as your logged in user so it reads/writes files as your logged in user.
2. We do have support for zerocopy to/from neanderthal. In my talk at time around 14:38 (https://www.youtube.com/watch?v=vQPW16_jixs) I show zerocopy from neanderthal to numpy via the tech platform. I haven't kept up the neanderthal bindings mainly because it is tough to dig through the class hierarchy that Dragan uses, but the bindings used in the talk are at https://github.com/techascent/tech.neanderthal. They are somewhat out of date.

🙏 4
David Pham10:05:33

It is easier to combine with the JVM. I think the sweet spot is saying that the data transformation and IO are done with Clojure and the computational part on python.