data-science

stephenmhopper 2025-03-01T13:17:29.158469Z

I have an application which is always running. It runs background jobs. Some of these background jobs create techascent dataset instances.
1. Do dataset instances mostly use off-heap memory?
2. Do I need to explicitly close dataset instances? I couldn't find any examples of this being done and it looks like the underlying https://github.com/techascent/tech.ml.dataset/blob/ab3e2ddd9fe5f8563c32a521264d081209921006/src/tech/v3/dataset/impl/dataset.clj#L138. How is the memory associated with a dataset typically freed after it's no longer in use?
3. I have a number of background jobs which process datasets which can take up several GB of heap space. I assume switching to use datasets instead of basic Clojure sequences of maps will be faster / more memory efficient in most cases?
4. Does a techascent dataset support columns of complex types (i.e. Clojure maps) or does everything need to be a string / primitive?

genmeblog 2025-03-01T13:37:52.985519Z

Re 4: you can store anything inside a column, including Clojure maps.
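To illustrate the point about question 4, here is a minimal sketch, assuming `tech.ml.dataset` is on the classpath. The row data and the `:id` / `:payload` column names are made up for the example; only `ds/->dataset`, `ds/column`, and `ds/row-count` come from the library.

```clojure
(require '[tech.v3.dataset :as ds])

;; Hypothetical sample rows; :payload holds an ordinary Clojure map.
(def jobs
  (ds/->dataset [{:id 1 :payload {:status :ok :retries 0}}
                 {:id 2 :payload {:status :failed :retries 3}}]))

;; Pull the column back out; its elements are the original maps.
(def payloads (vec (ds/column jobs :payload)))

;; The maps can be used like any other Clojure value.
(def statuses (mapv :status payloads))
```

Columns of arbitrary objects are presumably backed by object storage rather than primitive arrays, so they are unlikely to give the memory savings that numeric columns do.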

2025-03-02T15:37:39.009689Z

-> 1: Not by default. They are stored as native Java arrays on heap (if possible).
-> 2: Since the data is on heap by default, no closing is needed; the GC cleans up. If you configure them as off-heap, I believe there is some reference counting going on, including automatic removal.
-> 3: I have used off-heap datasets on some occasions, and on some occasions they work much better than on-heap, but not always.
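A rough sketch of the default on-heap case, assuming `tech.ml.dataset` and `tech.v3.datatype` are on the classpath. The `rows` data and column names are hypothetical. It builds a dataset from a sequence of maps and inspects the element datatype of each column; it does not measure memory or configure off-heap storage.

```clojure
(require '[tech.v3.dataset :as ds]
         '[tech.v3.datatype :as dtype])

;; Hypothetical input: the usual Clojure sequence of maps.
(def rows
  [{:id 1 :score 0.5}
   {:id 2 :score 0.75}
   {:id 3 :score 1.0}])

;; Columnar copy of the same data, held on the JVM heap by default.
(def scores (ds/->dataset rows))

;; Element datatypes of the two columns. Numeric columns get primitive types.
(def id-type    (dtype/elemwise-datatype (ds/column scores :id)))
(def score-type (dtype/elemwise-datatype (ds/column scores :score)))

;; No close call here: once nothing references `scores`, the GC can reclaim it.
```

Whether this beats a sequence of maps for a given job depends on the workload, so it is worth measuring on real data.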

stephenmhopper 2025-03-02T18:15:42.806589Z

Thanks!