data-science

ts1503 2024-12-10T12:31:22.256759Z

Hello guys! I have a question about TMD and memory-mapped files. I’m using the dataset Arrow library to read an Arrow file like this: (arrow/stream->dataset-seq file {:open-type :mmap}). In theory it should not use the JVM heap for that dataset sequence, right? At some point I want to concatenate those datasets into a single one, so I’m using (apply ds/concat datasets). I went to the source code of the ds/concat function and found that it creates a container like this: (dtype/make-container :jvm-heap %1 n-rows). So the main question: does it mean that after concatenation all datasets will end up on the JVM heap, or will they stay mmapped (off heap)?

2024-12-10T13:50:34.932109Z

(dtype/make-container :jvm-heap %1 n-rows) will for sure create a container on the heap. I have played with off-heap TMD columns as well, and I would say there are some dataset operations which "ignore" the container type. ds/concat might be one of them. Dataset has functions which work on a sequence of datasets, so you might need to go in that direction.

2024-12-10T14:01:02.742789Z

ds/concat-inplace, on the other hand, might keep the data off heap. I believe it just "links" the data together, while ds/concat copies it (and always to :jvm-heap, it seems).

2024-12-10T14:04:19.525859Z

Maybe you can operate column-wise and use the functions in dtype-next directly: concat-buffers and coalesce-blocks! both work on columns.

ts1503 2024-12-10T14:32:14.918779Z

I see, thanks. Will try concat-inplace first.

chrisn 2024-12-10T17:18:50.922369Z

Or keep it as a sequence of datasets and work with that. concat-inplace adds a per-row-index overhead, so it isn't ideal.

chrisn 2024-12-10T17:30:02.796259Z

For exactly this scenario we have group-by-column-agg in the reductions namespace.

ts1503 2024-12-10T17:53:45.843819Z

Thanks for the suggestion. I’m trying to build a generic pipeline system where data is shared between steps as Arrow files, and I don’t know beforehand what the next step will be. So I have a set of predefined actions that expect a dataset, and users of my library can provide their own functions. Having a single interface for all kinds of steps would make things simpler: you don’t need to switch between a dataset and a sequence of datasets. That’s why I’m trying to provide a single dataset even when there are multiple Arrow batches.