Idea: a wire format that chunks columnar data, maybe in batches of 32 rows (or 1024?). Each chunk can then be easily consumed either as a sequence of maps or in the column format. Is that already a thing?
You'd lose a little performance, getting worse the smaller your chunk size. But you might get a decent performance profile for both use cases - serial maps and columnar - making it a more universal wire format
So you're basically sending tables. I suppose for the use case where you're sending heterogeneous maps of many different sizes, you could make the table a superset of all the keys and just designate holes where maps shouldn't contain keys
That would also be less performant than a format optimized for sequential data, like fressian
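A rough sketch of the chunked-table idea above, in plain Python with no Arrow dependency. The names `MISSING`, `to_chunks` and `from_chunk` are made up for illustration, and the chunk layout is just an assumption about what "a table with holes" could look like, not an existing format:

```python
from itertools import islice

MISSING = object()  # hypothetical sentinel marking a "hole" (key absent from that map)


def to_chunks(maps, chunk_size=32):
    """Yield column-oriented chunks of at most chunk_size rows."""
    it = iter(maps)
    while True:
        rows = list(islice(it, chunk_size))
        if not rows:
            return
        # Superset of all keys seen in this chunk, in first-seen order.
        keys = list(dict.fromkeys(k for row in rows for k in row))
        # One column per key; rows lacking the key get a hole.
        columns = {k: [row.get(k, MISSING) for row in rows] for k in keys}
        yield {"n_rows": len(rows), "columns": columns}


def from_chunk(chunk):
    """Turn one columnar chunk back into a list of maps, dropping holes."""
    cols = chunk["columns"]
    return [
        {k: col[i] for k, col in cols.items() if col[i] is not MISSING}
        for i in range(chunk["n_rows"])
    ]
```

A columnar consumer would read `chunk["columns"]` directly, while a sequential consumer would call `from_chunk` on each chunk as it arrives. The row count is stored separately so that a chunk of empty maps still round-trips.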
something like https://arrow.apache.org/docs/format/Flight.html?
Hmm yeah looks like they do batches
I guess you could wrap all non-maps with maps, so all sequences have a row. Then unwrap non-maps on the other side.
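The wrap/unwrap step could look something like this. `WRAP_KEY` is a hypothetical reserved key, and the sketch assumes no real map ever consists of exactly that one key:

```python
WRAP_KEY = "__value__"  # hypothetical reserved key; must not collide with real map keys


def wrap(item):
    """Wrap any non-map item in a single-key map so every item becomes a row."""
    return item if isinstance(item, dict) else {WRAP_KEY: item}


def unwrap(row):
    """Undo wrap(): a map whose only key is WRAP_KEY goes back to the bare value."""
    if len(row) == 1 and WRAP_KEY in row:
        return row[WRAP_KEY]
    return row
```

With this, a mixed sequence like `[1, "x", {"a": 1}]` becomes a sequence of maps on the way out and is restored on the way in.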
The underlying IPC stream format is described better here and doesn't require buying into Flight: https://arrow.apache.org/docs/format/Columnar.html#ipc-streaming-format
And then just use tech.ml.dataset as the wire format
https://techascent.github.io/tech.ml.dataset/tech.v3.libs.arrow.html
Do you know if Flight does this heterogeneous data thing? I'm hoping for something that is decent for sequential data but transparently optimizes sequential maps into column format
Arrow is very much column-oriented, but it does have things like StructArray columns and dense/sparse union types, which all support nesting of types, so you could probably come up with a reasonable schema to fall back on for heterogeneous data. I'm not aware of any tools/prior art that handle it automatically, but it's been several years since I messed around with this stuff.
I would also mention that feature completeness (https://arrow.apache.org/docs/status.html) varies quite a bit across the implementations if you want to get into anything more specialized. I ended up moving from Java to Go for the side project I was working on, which worked great until I ran into a bug in the Go version's support for some combination of types I was working with, and that sent me back to the drawing board again.
I guess it's actually been just over a year, it just felt like several: https://clojurians.slack.com/archives/CG3AM2F7V/p1709149525475239
Hmm. Cross platform has its benefits
So, for seqs of heterogeneous/unknown maps you can start with something like a Struct<data: Map<VarBinary, VarBinary>, keyTypes: Map<VarBinary, VarBinary>, valTypes: Map<VarBinary, VarBinary>>, which will be inefficient but in theory should be supported by everything but the Swift implementation, which barely supports anything. You should also be able to do zstd or lz4 compression.
Then you could start dictionary-encoding. You can dictionary-encode the map keys and the data types. The dictionaries can be shared, so that the key type of all 3 Maps references the same dictionary, and the val types of the keyTypes and valTypes maps also share a dictionary. Dictionaries are sent once and can be reused across batches, so this could yield some improvements. You can optionally send delta dictionaries or replacement dictionaries, though depending on your implementation maybe only one or the other, or neither, is supported.
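To illustrate the "sent once, reused across batches, extended by deltas" part, here is a toy dictionary encoder in plain Python. It only mimics the idea and is not Arrow's actual dictionary batch encoding; `DictEncoder` and `DictDecoder` are hypothetical names:

```python
class DictEncoder:
    """Assigns small integer ids to repeated values (e.g. map keys)."""

    def __init__(self):
        self.ids = {}   # value -> id; insertion order matches id order
        self.sent = 0   # number of entries the receiver already has

    def encode(self, values):
        """Return (delta, indices): new dictionary entries plus one id per value."""
        indices = [self.ids.setdefault(v, len(self.ids)) for v in values]
        entries = list(self.ids)
        delta = entries[self.sent:]  # only entries not sent in earlier batches
        self.sent = len(entries)
        return delta, indices


class DictDecoder:
    """Rebuilds values from (delta, indices) pairs, keeping the dictionary between batches."""

    def __init__(self):
        self.values = []

    def decode(self, delta, indices):
        self.values.extend(delta)
        return [self.values[i] for i in indices]
```

Encoding `["name", "age", "name"]` and then `["age", "email"]` sends `["name", "age"]` with the first batch and only `["email"]` as the delta with the second.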
If you can allow for scans of the batch prior to conversion, you could instead do some schema discovery and then do Map<DenseUnion<KeyType1, KeyType2, ...>, DenseUnion<ValType1, ValType2, ...>>, which should also be well supported. You could also dictionary-encode the keys themselves or even the values, depending on how homogeneous the keys and values are across maps in the seq. This gets encoded pretty close to your "superset of all the keys and just designate holes where maps shouldn't contain keys".
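The schema discovery scan could be as simple as the following sketch. It uses Python type names as stand-ins for Arrow types, which is purely an assumption for illustration; `discover_schema` and `schema_string` are hypothetical helpers:

```python
def discover_schema(maps):
    """Scan a batch and collect the key types and value types that occur in it."""
    key_types, val_types = set(), set()
    for m in maps:
        for k, v in m.items():
            key_types.add(type(k).__name__)
            val_types.add(type(v).__name__)
    return sorted(key_types), sorted(val_types)


def schema_string(key_types, val_types):
    """Render the discovered types in the Map<DenseUnion<...>, DenseUnion<...>> shape."""
    return "Map<DenseUnion<{}>, DenseUnion<{}>>".format(
        ", ".join(key_types), ", ".join(val_types)
    )
```

The union members are derived from one full pass over the batch, which is why this only works if a scan before conversion is acceptable.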
Eventually you wrap all of these heuristics in a stream-of-streams that maintains a window for schema discovery and restarts the inner IPC stream with more specialized schemas when possible and falls back to more generic schema if encoding a batch fails.
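A very reduced sketch of that outer stream, covering only the "fall back to a more generic schema" half and leaving out re-specialization. The schema here is just the set of value type names in a window, and `schema_of` and `stream_of_streams` are hypothetical names:

```python
from itertools import islice


def schema_of(batch):
    """Hypothetical schema: the set of value type names seen in the batch."""
    return {type(v).__name__ for m in batch for v in m.values()}


def stream_of_streams(maps, window=2):
    """Yield ("schema", s) whenever the inner stream restarts, then ("batch", rows)."""
    it = iter(maps)
    current = None
    while True:
        batch = list(islice(it, window))
        if not batch:
            return
        needed = schema_of(batch)
        if current is None or not needed <= current:
            # The batch does not fit the current schema: restart with a wider one.
            current = needed | (current or set())
            yield ("schema", tuple(sorted(current)))
        yield ("batch", batch)
```

A receiver would treat each `("schema", ...)` event as the start of a new inner stream and decode the following batches against it.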
Very insightful, thanks!
It's a super interesting question where that scheme can be efficient - what the most efficient chunk/batch size would be, and whether there's some Goldilocks zone where both sequential and columnar consumers get relatively efficient performance characteristics
https://clojurians.slack.com/archives/C03RZRRMP/p1746277929257779