Hey guys! Is anybody working with tech.ml.dataset and arrow files? I noticed that if I have an arrow file with multiple batches, I have to use the stream->dataset-seq function, which gives me a lazy sequence of datasets. What is a common way to deal with that sequence? Should I concatenate all the datasets into a single one for further processing?
If you want to aggregate data, you can reach for https://techascent.github.io/tech.ml.dataset/tech.v3.dataset.reductions.html. The functions defined there are designed to work on a sequence of datasets. Otherwise, concatenate.
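A rough sketch of both options, in case it helps. The two small in-memory datasets stand in for the batches you would get from stream->dataset-seq, and the :symbol and :price columns are made up for illustration, so adjust to your own data:

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.dataset.reductions :as ds-reduce])

;; Stand-in for the lazy sequence returned by stream->dataset-seq.
;; With a real file this would be something like
;;   (arrow/stream->dataset-seq "data.arrow")   ; hypothetical path
;; using tech.v3.libs.arrow. Column names here are hypothetical.
(def batches
  [(ds/->dataset {:symbol ["A" "B" "A"] :price [1.0 2.0 3.0]})
   (ds/->dataset {:symbol ["B" "A"]     :price [4.0 5.0]})])

;; Option 1: aggregate across all batches without first building
;; one big dataset. The result has one row per distinct :symbol.
(def per-symbol
  (ds-reduce/group-by-column-agg
   :symbol
   {:n-rows    (ds-reduce/row-count)
    :price-sum (ds-reduce/sum :price)}
   batches))

;; Option 2: concatenate the batches into a single dataset,
;; which is fine when everything fits in memory.
(def combined (apply ds/concat batches))
```

Option 1 only needs to look at one batch at a time, which is why it tends to be the better fit for large files; option 2 is simpler if you need the whole table for row-level work.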
thanks, will take a look
Yes, it definitely depends on what you're going to do with the data. Generally speaking, reducing over a sequence of datasets is as common as (or perhaps more common than) concatenating.
Sequences of datasets are very normal, and come up all the time. Being familiar with them is a good idea.