data-science

ts1503 2024-12-03T11:37:02.943939Z

Hey guys! Is anybody working with tech.ml.dataset and Arrow files? I noticed that if I have an Arrow file with multiple batches, I have to use the stream->dataset-seq function, which gives me a lazy sequence of datasets. What is a common way to deal with that sequence? Should I concatenate all the datasets into a single one for further processing?

genmeblog 2024-12-03T11:45:20.350129Z

If you want to aggregate data, you can reach for tech.v3.dataset.reductions: https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.reductions.html. The functions defined there are designed to work on a sequence of datasets. Otherwise, concatenate.
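
A minimal sketch of both options, not a definitive recipe. The two small in-memory datasets stand in for the batches that stream->dataset-seq would return, and the column names :symbol and :price are made up for illustration.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.reductions :as ds-reduce])

;; Hypothetical stand-in for the lazy sequence of datasets read from
;; a multi-batch Arrow file; each dataset plays the role of one batch.
(def batches
  [(ds/->dataset {:symbol ["A" "B"] :price [1.0 2.0]})
   (ds/->dataset {:symbol ["A" "B"] :price [3.0 4.0]})])

;; Option 1: aggregate across the batches without first building
;; one large dataset. The result is a single dataset with one row per group.
(def per-symbol
  (ds-reduce/group-by-column-agg
   :symbol
   {:symbol    (ds-reduce/first-value :symbol)
    :price-sum (ds-reduce/sum :price)
    :price-avg (ds-reduce/mean :price)}
   batches))

;; Option 2: concatenate all batches into a single dataset.
;; This realizes the whole sequence, so everything must fit in memory.
(def combined (apply ds/concat batches))
```

With a real file, batches would be replaced by the sequence coming from stream->dataset-seq. The reduction route tends to suit files too large to hold comfortably in memory, while concatenation is simpler when the data is small.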

ts1503 2024-12-03T11:46:27.438609Z

thanks, will take a look

Harold 2024-12-04T02:22:31.283599Z

Yes, it definitely depends on what you're going to do with the data. Generally speaking, reducing over a sequence of datasets is as common as (or perhaps more common than) concatenating. Sequences of datasets are very normal and come up all the time. Being familiar with them is a good idea.

👍 1