Hello guys! I have a question about TMD and memory-mapped files.
I’m using the dataset Arrow library to read an Arrow file like this:
(arrow/stream->dataset-seq file {:open-type :mmap})
In theory it should not use the JVM heap for that dataset sequence, right?
At some point I want to concatenate those datasets into a single one, so I’m using
(apply ds/concat datasets)
I looked at the source code of the ds/concat function and found that it creates a container like this:
(dtype/make-container :jvm-heap %1 n-rows)
So the main question: does this mean that after concatenation all the datasets end up on the JVM heap, or do they stay mmapped (off-heap)?
(dtype/make-container :jvm-heap %1 n-rows) will for sure create a container on the heap.
I have also played with off-heap TMD columns,
and I would say that some dataset operations "ignore" the container type of their input.
ds/concat might be one of them.
The dataset library has functions that work on a sequence of datasets; you might need to go in that direction.
https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.reductions.html#var-group-by-column-agg
ds/concat-inplace might also keep the data off-heap.
I believe it just "links" the data together, while ds/concat copies it (and always to :jvm-heap, it seems)
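A minimal sketch of the two calls side by side, in case it helps. The small in-memory datasets are stand-ins (an assumption for illustration) for the mmapped batches returned by arrow/stream->dataset-seq; whether the concat-inplace result actually stays off-heap for mmapped input is the belief stated above, not something this snippet proves.

```clojure
(require '[tech.v3.dataset :as ds])

;; Stand-ins for the Arrow record batches (hypothetical data).
(def batches
  [(ds/->dataset {:symbol ["a" "b"] :price [1.0 2.0]})
   (ds/->dataset {:symbol ["a" "b"] :price [3.0 4.0]})])

;; Copying concatenation: allocates new containers and copies every column.
(def copied (apply ds/concat batches))

;; In-place concatenation: presents the existing columns as one dataset
;; without copying the underlying data.
(def linked (apply ds/concat-inplace batches))

;; Both results expose the same rows.
(println (ds/row-count copied) (ds/row-count linked))
```

Either result can be passed to code that expects a single dataset; the difference is in where the data lives and how much each row access costs.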
Maybe you can operate column-wise and use the functions in dtype-next directly:
concat-buffers
coalesce-blocks!
Both work with columns.
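A rough sketch of the column-wise route with those two dtype-next functions. The :native-heap buffers below are hypothetical stand-ins for the columns of the mmapped batches, and the choice of :native-heap for the destination is my assumption about what you would want here.

```clojure
(require '[tech.v3.datatype :as dtype])

;; Two off-heap buffers standing in for one column from two batches.
(def a (dtype/make-container :native-heap :float64 [1.0 2.0]))
(def b (dtype/make-container :native-heap :float64 [3.0 4.0]))

;; concat-buffers returns a single buffer view over both inputs;
;; the data is shared, not copied.
(def joined (dtype/concat-buffers [a b]))

;; coalesce-blocks! copies the blocks into a destination container
;; that you allocate yourself, so you decide where the copy lives.
(def dst (dtype/make-container :native-heap :float64 4))
(dtype/coalesce-blocks! dst [a b])

(println (dtype/ecount joined) (vec (dtype/->reader dst)))
```

The shared view suits a small number of buffers; with many small buffers, copying once into a preallocated container is likely the better trade.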
I see, thanks. I will try concat-inplace first.
Or keep it as a sequence of datasets and work with that. concat-inplace adds overhead to every row-index lookup, so it isn't ideal.
For exactly this scenario we have group-by-column-agg in the reductions namespace.
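Something like the following, as a sketch. The column names and the in-memory datasets are made up for illustration; in your case the last argument would be the sequence returned by arrow/stream->dataset-seq.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.reductions :as ds-reduce])

;; Stand-ins for the Arrow record batches (hypothetical data).
(def batches
  [(ds/->dataset {:symbol ["a" "b"] :price [1.0 2.0]})
   (ds/->dataset {:symbol ["a" "b"] :price [3.0 4.0]})])

;; Aggregate across the whole sequence without concatenating it first.
(def agg
  (ds-reduce/group-by-column-agg
   :symbol
   {:n-rows    (ds-reduce/row-count)
    :price-sum (ds-reduce/sum :price)}
   batches))

;; One output row per distinct :symbol value.
(println agg)
```

The result is an ordinary dataset, so it can be handed to the next step as is.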
Thanks for the suggestion. I’m trying to build a generic pipeline system where data is shared between steps as Arrow files, and I don’t know beforehand what the next step will be. So I have a set of predefined actions that expect a dataset, and users of my library can provide their own functions. Having a single interface for all kinds of steps would make things simpler: you don’t need to switch between a dataset and a sequence of datasets. That’s why I’m trying to provide a single dataset even when there are multiple Arrow batches.