clojure-europe 2020-11-12

Anyone here got a good suggestion for something like nippy but that writes out records in a file rather than just a big take it or leave it data structure?

ordnungswidrig14:11:43

file per record?

otfrom14:11:43

I've done something before with baldr and record separators, but that felt a bit janky

otfrom14:11:24

file per record would overwhelm the OS file handles I think. There are about 2-10 million records

otfrom14:11:00

I like the speed of nippy, and the compression is pretty good too, but I lose a lot of compression by needing to split things up and I lose a lot of file efficiency by having each file be a single vector of records that gets read in

otfrom18:11:43

looks like transit, based on fressian, might be the sweet spot? Looks like you can read and write individual objects from a stream. https://cognitect.github.io/transit-clj/#cognitect.transit/read

otfrom18:11:28

and there is a reducible friendly wrapper already https://gitlab.com/pjstadig/reducibles

plexus08:11:40

probably not the performance you are looking for, but this is the main reason for ednl https://github.com/lambdaisland/edn-lines
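
The one-record-per-line idea can be sketched with nothing but `clojure.edn` and `clojure.java.io`. This is not the edn-lines API; the function names `write-edn-lines!` and `edn-lines-reducible` are made up for illustration, and it assumes every record prints to a single line with `pr-str`.

```clojure
(require '[clojure.edn :as edn]
         '[clojure.java.io :as io])

;; Hypothetical helper: write each record as one EDN form per line.
(defn write-edn-lines! [file records]
  (with-open [w (io/writer file)]
    (doseq [r records]
      (.write w (str (pr-str r) "\n")))))

;; Hypothetical helper: a reducible over the file that opens it only while
;; being reduced, parses one line at a time, and closes it afterwards,
;; including when the reduction stops early via `reduced`.
(defn edn-lines-reducible [file]
  (reify clojure.lang.IReduceInit
    (reduce [_ f init]
      (with-open [rdr (io/reader file)]
        (reduce (fn [acc line] (f acc (edn/read-string line)))
                init
                (line-seq rdr))))))

;; Usage: round-trip a few records through a temp file.
(def tmp (java.io.File/createTempFile "records" ".ednl"))
(write-edn-lines! tmp [{:id 1} {:id 2} {:id 3}])
(def first-two (into [] (take 2) (edn-lines-reducible tmp)))
```

Records are read individually, so nothing forces the whole file into memory, at the cost of plain-text parsing speed and no compression.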

otfrom08:11:39

thx 🙂

otfrom18:11:58

as this is often the eduction channel, I've been looking at @ben.hammond's blog post here: https://juxt.pro/blog/ontheflycollections-with-reducible and thinking that you don't need a reducible for the directory of files, you just need a reducible for each file type. You can then have a vector of eductions over those reducibles, which would give you all your short-circuiting / reduced? functionality if you did something like

(eduction ;; changed from sequence thanks to Ben Hammond's advice
  cat
  [(eduction mappify-record (reducible-type-1 file-1))
   (eduction mappify-record (reducible-type-1 file-2))])

otfrom18:11:46

you can replace sequence with eduction depending on whether or not you want to have the results in memory or recalculate them each time (from what I understand)

otfrom18:11:11

(errors of misunderstanding of the blog post are mine)

otfrom18:11:36

I think this simplifies the chaining-reducible bit. I think

otfrom18:11:50

the real magic happening in cat
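
A small self-contained sketch of that shape, to show the short-circuiting. Everything here is a stand-in: `reducible-type-1` just returns an in-memory vector instead of reading a file, and `mappify-record` is an assumed transducer that also counts how many records it touches.

```clojure
;; Counts how many records were actually transformed.
(def seen (atom 0))

;; Assumed transducer: turn a raw [id value] pair into a map.
(def mappify-record
  (map (fn [[id v]]
         (swap! seen inc)
         {:id id :value v})))

;; Hypothetical stand-in for a per-file-type reducible: a plain vector.
(defn reducible-type-1 [file]
  (get {"file-1" [[1 "a"] [2 "b"]]
        "file-2" [[3 "c"] [4 "d"]]}
       file))

;; One eduction per file, concatenated with `cat`.
(def all-records
  (eduction
    cat
    [(eduction mappify-record (reducible-type-1 "file-1"))
     (eduction mappify-record (reducible-type-1 "file-2"))]))

;; `take` signals `reduced` after three items; `cat` passes that signal
;; through, so the fourth record is never transformed.
(def first-three (into [] (take 3) all-records))
```

After this runs, `@seen` is 3 rather than 4, which is the early termination propagating from `take` through `cat` into the inner eduction.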

Ben Hammond18:11:20

An eduction of a reducible might not implement ISeq, at which point things start breaking
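
A minimal sketch of the failure mode, assuming a hand-rolled reducible that implements only `clojure.lang.IReduceInit` (the name `reduce-only` is made up for illustration):

```clojure
;; A reducible that can be reduced but is neither Iterable nor seqable.
(def reduce-only
  (reify clojure.lang.IReduceInit
    (reduce [_ f init]
      (loop [acc init, xs [1 2 3]]
        (cond
          (reduced? acc) @acc          ; honour early termination
          (empty? xs)    acc
          :else          (recur (f acc (first xs)) (rest xs)))))))

(def ed (eduction (map inc) reduce-only))

;; Reducing works, because eduction reduces its source directly.
(def total (reduce + 0 ed))

;; The eduction itself is not a seq.
(def is-seq (seq? ed))

;; Seq functions need an iterator over the source, which this
;; reducible cannot provide, so they throw.
(def first-result
  (try
    (first ed)
    (catch IllegalArgumentException _ :not-seqable)))
```

So anything built this way is safe to `reduce`, `transduce`, or `into`, while `first`, `map`, `sequence` and friends only work if the underlying source is also iterable or seqable.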

otfrom19:11:51

Ah, TIL.
