2025-05-03 data-science | Clojure Slack Archive

john 2025-05-03T01:11:53.835109Z

Idea: a wire format that chunks columnar data, maybe in chunks of 32 rows (or 1024?). Then each chunk can be easily consumed either as a sequence of maps or in column format. Is that already a thing?

john 2025-05-03T01:14:46.352819Z

You'd lose a little performance, getting worse the smaller your chunk size. But you might get a decent performance profile for both use cases - serial maps and columnar - making it a more universal wire format

john 2025-05-03T01:18:24.149649Z

So you're basically sending tables. I suppose for the use case where you're sending heterogeneous maps of lots of sizes, you could make the table a superset of all the keys and just designate holes where maps shouldn't contain keys
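A rough sketch of the idea so far, in plain Python rather than any real wire format: rows are grouped into fixed-size chunks, each chunk stores one column per key seen in that chunk, and a sentinel marks the holes. The names `to_chunks`, `to_rows`, and `MISSING`, and the chunk layout itself, are assumptions made up for illustration.

```python
from itertools import islice

# Sentinel marking a "hole": the key is absent from that row.
MISSING = object()

def to_chunks(rows, chunk_size=32):
    """Yield column-oriented chunks from an iterable of maps."""
    it = iter(rows)
    while True:
        batch = list(islice(it, chunk_size))
        if not batch:
            return
        # Superset of all keys in this chunk, in first-seen order.
        keys = dict.fromkeys(k for row in batch for k in row)
        yield {
            "n": len(batch),  # row count, needed when every row is empty
            "columns": {k: [row.get(k, MISSING) for row in batch] for k in keys},
        }

def to_rows(chunks):
    """Consume chunks as a flat sequence of maps, dropping the holes."""
    for chunk in chunks:
        for i in range(chunk["n"]):
            yield {k: col[i]
                   for k, col in chunk["columns"].items()
                   if col[i] is not MISSING}
```

A consumer that wants columns reads `chunk["columns"]` directly, and one that wants maps goes through `to_rows`, so the chunk size is the knob that trades off between the two access patterns.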

john 2025-05-03T01:20:19.798649Z

That would also be less performant than a format optimized for sequential data, like Fressian

chucklehead 2025-05-03T01:34:59.290299Z

something like https://arrow.apache.org/docs/format/Flight.html?

john 2025-05-03T01:42:34.914579Z

Hmm yeah looks like they do batches

john 2025-05-03T01:44:15.636219Z

I guess you could wrap all non-maps with maps, so all sequences have a row. Then unwrap non-maps on the other side.
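A minimal sketch of that wrapping step, assuming a reserved key (here the made-up `"__value__"`) that real maps never use on its own:

```python
# Hypothetical reserved key; a real format would need a collision-free marker.
WRAP_KEY = "__value__"

def wrap(item):
    """Turn any non-map item into a single-key map so every item is a row."""
    return item if isinstance(item, dict) else {WRAP_KEY: item}

def unwrap(row):
    """Invert wrap(): single-key rows under WRAP_KEY become plain values again."""
    if len(row) == 1 and WRAP_KEY in row:
        return row[WRAP_KEY]
    return row
```

The round trip is ambiguous for a genuine map whose only key is the reserved one, so a real format would want an out-of-band flag instead of an in-band key.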

chucklehead 2025-05-03T01:44:18.408479Z

The underlying IPC stream format is described better here and doesn't require buying into Flight: https://arrow.apache.org/docs/format/Columnar.html#ipc-streaming-format

john 2025-05-03T01:45:51.820389Z

And then just use tech.ml.dataset as the wire format

chucklehead 2025-05-03T01:51:19.068349Z

https://techascent.github.io/tech.ml.dataset/tech.v3.libs.arrow.html

john 2025-05-03T01:52:37.665039Z

Do you know if Flight does this heterogeneous data thing? I'm hoping for something that is decent for sequential data but transparently optimizes sequential maps into column format

chucklehead 2025-05-03T02:19:34.867129Z

Arrow is very much column-oriented, but it does have things like StructArray columns and dense/sparse union types, which all support nesting of types, so you could probably come up with a reasonable schema to fall back on for heterogeneous data. I'm not aware of any tools/prior art that handle it automatically, but it's been several years since I messed around with this stuff.

chucklehead 2025-05-03T02:34:01.729599Z

I would also mention that feature completeness (https://arrow.apache.org/docs/status.html) varies quite a bit across the implementations if you want to get into anything more specialized. I ended up moving from Java to Go for the side project I was working on, which worked great until I ran into a bug in the Go version's support for some combination of types I was working with, and that sent me back to the drawing board again.

chucklehead 2025-05-03T02:36:06.367379Z

I guess it's actually been just over a year, it just felt like several: https://clojurians.slack.com/archives/CG3AM2F7V/p1709149525475239

john 2025-05-03T02:41:10.699409Z

Hmm. Cross platform has its benefits

chucklehead 2025-05-03T05:55:20.464939Z

So, for seqs of heterogeneous/unknown maps you can start with something like a Struct<data: Map<VarBinary,VarBinary>, keyTypes: Map<VarBinary,VarBinary>, valTypes: Map<VarBinary, VarBinary>>, which will be inefficient but in theory should be supported by everything but the Swift implementation, which barely supports anything. You should also be able to do zstd or lz4 compression. Then you could start dictionary-encoding. You can dictionary-encode the map keys and the data types. The dictionaries can be shared so that the key types of all 3 Maps reference the same dictionary, and the val types of the keyTypes and valTypes maps also share a dictionary. Dictionaries are sent once and can be reused across batches, so this could yield some improvements. You can optionally send delta dictionaries or replacement dictionaries, though depending on your implementation maybe only one or the other, or neither, is available. If you can allow for scans of the batch prior to conversion, you could instead do some schema discovery and then do Map<DenseUnion<KeyType1, KeyType2, ...>, DenseUnion<ValType1, ValType2, ...>>, which should also be well supported. You could also dictionary-encode the values of the keys or even the values of the values, depending on how homogeneous the keys and values are across maps in the seq. This gets encoded pretty close to your "superset of all the keys and just designate holes where maps shouldn't contain keys". Eventually you wrap all of these heuristics in a stream-of-streams that maintains a window for schema discovery, restarts the inner IPC stream with more specialized schemas when possible, and falls back to a more generic schema if encoding a batch fails.
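To make the "dictionaries are sent once and reused across batches" part concrete, here is a toy sketch in plain Python. It does not use the Arrow API; `KeyEncoder`, `KeyDecoder`, and the batch layout are hypothetical names, and only map keys are dictionary-encoded. Each batch carries a delta dictionary holding just the keys that have not been sent before.

```python
class KeyEncoder:
    """Assigns integer codes to map keys; codes persist across batches."""

    def __init__(self):
        self.index = {}  # key -> integer code, in first-seen order

    def encode_batch(self, rows):
        start = len(self.index)
        encoded = []
        for row in rows:
            pairs = []
            for key, value in row.items():
                # New keys get the next free code; known keys reuse theirs.
                code = self.index.setdefault(key, len(self.index))
                pairs.append((code, value))
            encoded.append(pairs)
        # Delta dictionary: only the keys first seen in this batch.
        delta = list(self.index)[start:]
        return delta, encoded


class KeyDecoder:
    """Rebuilds maps from coded batches plus their delta dictionaries."""

    def __init__(self):
        self.keys = []  # position in this list == code

    def decode_batch(self, delta, encoded):
        self.keys.extend(delta)
        return [{self.keys[code]: value for code, value in pairs}
                for pairs in encoded]
```

In this sketch a second batch that reuses earlier keys ships an empty or very short delta, which is where the saving would come from when keys repeat heavily across maps.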

john 2025-05-03T15:47:10.923939Z

Very insightful, thanks!

john 2025-05-03T15:51:07.770299Z

It's a super interesting question where that scheme can be efficient, and what the most efficient chunk/batch size would be. Whether there's some Goldilocks happy zone where both sequential and columnar data get relatively efficient performance characteristics

Daniel Slutsky 2025-05-03T13:12:17.445259Z

https://clojurians.slack.com/archives/C03RZRRMP/p1746277929257779
