#data-science
2022-11-15
Álmos Zöld 12:11:32

Hello! Do you know if there is a way to read arbitrary binary data into tablecloth or neanderthal (I'd eventually like to work with it in neanderthal), à la numpy's fromfile function (https://numpy.org/doc/stable/reference/generated/numpy.fromfile.html)? There you pass a raw binary file and the datatype you want the data read as, for example int32. I looked around the neanderthal and tech.ml.dataset documentation, and eventually the dtype-next documentation (which was especially confusing). I know that it's possible to interop with Python via libpython-clj, but it would be nice if I could get the data into tech.ml.dataset land and neanderthal via Clojure libraries or functions. :)) Thank you in advance for your help.

chrisn 13:11:56

Raw data - read it via https://cnuernber.github.io/dtype-next/tech.v3.datatype.mmap.html then use https://cnuernber.github.io/dtype-next/tech.v3.datatype.native-buffer.html#var-set-native-datatype. If you are using JDK-8 or 11, this will work out of the box. If you are using JDK-17, then you need to set jvm-opts.

chrisn 13:11:08

You can then transfer the data into JVM storage via dtype/clone, which I would recommend. Furthermore, I would do this within a resource context to make the memory usage deterministic:

(resource/stack-resource-context (-> mmap set-native-datatype clone))

chrisn 13:11:24

Else you can read it via normal Java file APIs into a byte array, create a ByteBuffer nio buffer, and get a double or float buffer from that. This way is not as ideal, as the underlying buffer is then opaque, so nothing can use, for instance, memcpy to copy the data into anything else, but for your use case it may not matter. After you have the data and the datatype set, use https://cnuernber.github.io/dtype-next/tech.v3.datatype.html#var-sub-buffer and https://cnuernber.github.io/dtype-next/tech.v3.tensor.html#var-reshape to get one or more ND objects from it and off you go.
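
The plain-Java route described above could look roughly like the following. This is only a minimal sketch using JDK classes through Clojure interop: the helper name `read-int32s` is hypothetical, it targets int32 (the datatype from the original question) rather than double or float, and the little-endian default is an assumption about how the file was written.

```clojure
(import '[java.io File]
        '[java.nio ByteBuffer ByteOrder]
        '[java.nio.file Files])

(defn read-int32s
  "Hypothetical helper: reads the whole file at `path` into a byte array and
  reinterprets the bytes as 32-bit ints. Byte order defaults to little-endian
  (an assumption); pass ByteOrder/BIG_ENDIAN if the file was written that way."
  ([path] (read-int32s path ByteOrder/LITTLE_ENDIAN))
  ([^String path ^ByteOrder order]
   (let [^bytes raw (Files/readAllBytes (.toPath (File. path)))
         ;; View the bytes as ints; asIntBuffer ignores any trailing bytes
         ;; that do not fill a whole 4-byte int.
         ibuf (-> (ByteBuffer/wrap raw)
                  (.order order)
                  (.asIntBuffer))
         out  (int-array (.remaining ibuf))]
     ;; Bulk-copy the view into a plain int[] that outlives the ByteBuffer.
     (.get ibuf out)
     out)))
```

Calling `(read-int32s "data.bin")`, with `data.bin` standing in for your own file, returns a Java `int[]`. Swapping `.asIntBuffer` for `.asFloatBuffer` or `.asDoubleBuffer` (and the output array type to match) gives the float or double variants mentioned above.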

Álmos Zöld 13:11:02

Ooh, this is just what I needed, thank you very much! :))

chrisn 13:11:41

You are welcome! Also, since you mentioned neanderthal: from a 2D tensor you can get a neanderthal matrix via https://cnuernber.github.io/dtype-next/tech.v3.libs.neanderthal.html. What isn't obvious from that namespace documentation is that once that namespace is loaded, https://cnuernber.github.io/dtype-next/tech.v3.tensor.html#var-as-tensor will transform a neanderthal matrix in-place into a tensor, so that kind of completes the loop.

blueberry 01:11:29

no need for any copying or transferring. You can view that vector as various matrices, too, with view-ge and other view-X functions.

blueberry 01:11:48

of course, there is a similar function for mapping tensors to binary files in deep-diamond.

Álmos Zöld 09:11:26

Thank you, this will also help :))