#clojure-europe
2020-10-13
javahippie07:10:58

Good morning!

otfrom07:10:14

@borkdude I should probably get over shelling out feeling like cheating

Ben Hammond09:10:10

does it then tie you to running the JVM on a specific platform?

borkdude09:10:27

Yes, of course: it then depends on that executable being available in your external environment.

borkdude09:10:37

I'm writing a package manager that should help solve this problem: https://github.com/borkdude/glam

πŸ‘ 3
borkdude07:10:31

In babashka it's natural, on the JVM it feels like cheating :)

otfrom08:10:36

shelling out is the whole point of babashka πŸ˜„
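
(A minimal sketch, not from the thread: shelling out from JVM Clojure with the built-in clojure.java.shell. babashka bundles the richer babashka.process library, but the stdlib version looks like this.)

(require '[clojure.java.shell :as shell])

;; Runs the command, waits for it to exit, and returns
;; {:exit <int> :out <string> :err <string>}.
(let [{:keys [exit out err]} (shell/sh "ls" "-l")]
  (if (zero? exit)
    (print out)
    (binding [*out* *err*] (print err))))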

otfrom10:10:38

@borkdude this seems to have worked reasonably well

;; Assumed requires (not shown in the original snippet):
(require '[taoensso.nippy :as nippy]
         '[clojure.java.io :as io]
         '[net.cgrand.xforms :as x])

;; Writes the data out in chunks of 100 simulations per file, each file
;; named after the :simulation value of the first map in the chunk.
;; dirname is concatenated directly, so it should end with a path separator.
(defn ->nippy [dirname data]
  (run!
   (fn [chunk]
     (let [idx (-> chunk first :simulation)]
       (nippy/freeze-to-file (str dirname "simulated-transitions-" idx ".npy") chunk)))
   (partition-by (fn [{:keys [simulation]}] (quot simulation 100)) data)))


;; Reads every .npy file back, sorted lexicographically by file name,
;; and concatenates the thawed chunks into a single vector.
(defn nippy->data [dirname]
  (into []
        (comp
         (filter (fn [f] (re-find #"npy$" (.getName f))))
         (x/sort-by (fn [f] (.getName f)))
         (mapcat nippy/thaw-from-file))
        (.listFiles (io/file dirname))))
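
(For reference, a hypothetical round trip with the two functions above; the directory and sample data are made up, and the directory is assumed to exist already.)

;; 250 fake simulation maps -> three files: ...-0.npy, ...-100.npy, ...-200.npy
(def sample-data
  (for [sim (range 250)]
    {:simulation sim :transition [:a :b]}))

(->nippy "/tmp/transitions/" sample-data)

;; Back into one vector, in file-name order.
(def round-tripped (nippy->data "/tmp/transitions/"))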

otfrom10:10:48

the data is at least pretty easy to partition

otfrom10:10:15

I ran out of memory (heap space) when trying to do it all as one vector

otfrom10:10:41

reading in takes 78 seconds from nippy compared to 394 seconds converting from csv (with no compression)

borkdude10:10:02

and what about zip or gzip?

borkdude10:10:27

I guess nippy is nice to use since it can deserialize to EDN directly

otfrom10:10:24

I've not had a go with zip or gzip, tbh. I'm pretty happy I can dump my code that was doing the type conversions from csv

otfrom10:10:54

and nippy uses LZ4 for compression which is pretty fast and compact

otfrom10:10:59

@borkdude any reason why you think I should not use nippy? Other than compatibility with other languages (which would probably drive me to arrow and tech.ml.dataset really)

borkdude10:10:56

Don't know. Btw, there's also a CLI for nippy (which can also be used as a pod from babashka: https://github.com/justone/brisk)

borkdude10:10:49

So then you could use it from other languages as well, by shelling out, or from bb scripts
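
(Not from the thread, but roughly what the bb-script route could look like. pods/load-pod is real babashka API; the pod.brisk namespace and function names below are taken from brisk's README and should be verified there, and the brisk binary is assumed to be on the PATH.)

(require '[babashka.pods :as pods])
(pods/load-pod "brisk")

(require '[pod.brisk :as brisk])

;; Freeze/thaw via the pod, no JVM needed.
(brisk/freeze-to-file "pod.nippy" {:hello "world"})
(prn (brisk/thaw-from-file "pod.nippy"))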

otfrom10:10:16

arrow is really good for going to R (which we use) or Python (which we sometimes use but not often)

borkdude10:10:47

the benefit of using zip or gzip is that it's natively supported in many stdlibs (of Java as well)
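
(A sketch of that stdlib route: gzipped EDN text using nothing beyond java.util.zip and clojure.edn. You pay the EDN parse cost on the way back in, but anything that can gunzip can read the file.)

(require '[clojure.edn :as edn]
         '[clojure.java.io :as io])
(import '(java.util.zip GZIPOutputStream GZIPInputStream))

;; Write any Clojure value as gzipped EDN.
(defn spit-edn-gz [path data]
  (with-open [w (io/writer (GZIPOutputStream. (io/output-stream path)))]
    (binding [*out* w]
      (pr data))))

;; Read it back as plain Clojure data.
(defn slurp-edn-gz [path]
  (with-open [r (java.io.PushbackReader.
                 (io/reader (GZIPInputStream. (io/input-stream path))))]
    (edn/read r)))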

otfrom10:10:03

that is true

otfrom10:10:25

I'm sort of enjoying the faster save and load times and not having more conversion code to maintain πŸ™‚

otfrom10:10:32

I will probably regret this one day

otfrom10:10:49

but then babashka will save me right??!?!?! right?!?!?! πŸ˜‰

borkdude10:10:22

I hope! You could also write your own company-branded GraalVM-based CLI tool around your data and then call it from Python, R, whatever.

otfrom10:10:03

If I did that then I'd have to use even more interrobangs :interrobanghugs:

borkdude10:10:38

or if it takes seconds anyway, just use a JVM Clojure script. That has better perf than bb

otfrom10:10:59

true enough

otfrom10:10:07

it isn't like I'm going to notice that latency

otfrom10:10:30

it is quite nice working with an annoying size of data again

πŸ˜‚ 3
otfrom10:10:37

so many of my datasets have been small lately

otfrom16:10:35

πŸ‘‹ @jasonbell

πŸ‘‹ 3