Hello, is there a numpy alternative available in clojure? I need it for these operations:
audio_int16 = np.frombuffer(buffer, np.int16) # ← signed int16
audio_float32 = audio_int16.astype(np.float32) / 32768.0 # ← signed range
Basically, I need a very efficient conversion of a byte array representing signed 16-bit integers to a float32 array, and then to get back the byte buffer of that representation. I could and did write something like this in Java using ByteBuffer, but I’m curious if there is an even more efficient lib that does it before I reach for bit operations myself 😄 The purpose is to send an audio chunk to a VAD analyzer like Silero, which expects float32 input.
https://github.com/cnuernber/dtype-next
This should do what you want - and just like NumPy arrays back Pandas dataframes, dtype-next buffers back the columnar format used by https://github.com/techascent/tech.ml.dataset and the higher-level https://github.com/scicloj/tablecloth wrapper.
Unless you’re doing any further processing, using Java NIO byte buffers should also be pretty straightforward.
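For reference, a minimal sketch of the NIO approach mentioned above. It assumes the incoming bytes are little-endian signed 16-bit PCM (check the `AudioFormat` you actually receive) and that the consumer wants little-endian float32 bytes; the class and method names are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.ShortBuffer;

public class PcmConvert {

    /** Signed 16-bit PCM bytes (assumed little-endian) to floats in [-1.0, 1.0). */
    public static float[] pcm16ToFloat(byte[] pcm) {
        ShortBuffer shorts = ByteBuffer.wrap(pcm)
                .order(ByteOrder.LITTLE_ENDIAN)
                .asShortBuffer();
        float[] out = new float[shorts.remaining()];
        for (int i = 0; i < out.length; i++) {
            // Same scaling as the NumPy snippet: divide by 32768 (signed range).
            out[i] = shorts.get(i) / 32768.0f;
        }
        return out;
    }

    /** Raw float32 bytes of the samples (assumed little-endian output). */
    public static byte[] floatsToBytes(float[] samples) {
        ByteBuffer buf = ByteBuffer.allocate(samples.length * Float.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN);
        // The float view writes straight into buf's backing array.
        buf.asFloatBuffer().put(samples);
        return buf.array();
    }
}
```

From Clojure this would be plain interop, e.g. `(PcmConvert/pcm16ToFloat chunk)`. The loop works on primitives only, so there is no boxing involved.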
Yeah, but that is slow compared to native byte operations from what I saw. This needs to happen in the context of continuous realtime audio coming in where you need to decide if the user started/stopped speaking. Context: https://github.com/shipclojure/simulflow
depending on how you are getting your audio samples, you might be able to request them as floats directly so you don’t have to convert them
Do you mean, from AudioSystem, for example? The base audio format through the pipeline is 16kHz PCM mono. AFAIK, this is usually represented as ints
I think it depends on the platform. You’re allowed to request different formats. Not all formats may be supported
"Yeah, but that is slow compared to native byte operations from what I saw. This needs to happen in the context of continuous realtime audio coming in where you need to decide if the user started/stopped speaking"
Did you profile to see what the slow part was? If anything, I would guess boxed math as the culprit and not byte buffers.
A single audio input is usually not very high data throughput, so you can usually get away with any approach.
One of the challenges with realtime audio isn’t data processing but making sure threads wake up responsively when new data arrives. You want to keep your buffers full so that there are no blips and cracks, but not so full that they introduce delays.
Interesting! I’d love to hear more on this! Currently, audio through the AI pipeline is mostly split into chunks (mostly of 32ms but can be less), and the processors through the pipeline work with those chunks.
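To put rough numbers on the throughput, here is a small sketch of the chunk arithmetic, assuming the 16kHz mono format and 32ms chunks mentioned in this thread and 2 bytes per sample (int16). The class and method names are hypothetical.

```java
public class ChunkMath {

    /** Number of sample frames in one chunk of the given duration. */
    public static int samplesPerChunk(int sampleRateHz, int chunkMillis) {
        return sampleRateHz * chunkMillis / 1000;
    }

    /** Size in bytes of one chunk of interleaved PCM. */
    public static int bytesPerChunk(int sampleRateHz, int chunkMillis,
                                    int channels, int bytesPerSample) {
        return samplesPerChunk(sampleRateHz, chunkMillis) * channels * bytesPerSample;
    }

    /** Sustained data rate in bytes per second. */
    public static int bytesPerSecond(int sampleRateHz, int channels, int bytesPerSample) {
        return sampleRateHz * channels * bytesPerSample;
    }
}
```

With those assumptions, `samplesPerChunk(16000, 32)` is 512 samples, `bytesPerChunk(16000, 32, 1, 2)` is 1024 bytes, and `bytesPerSecond(16000, 1, 2)` is 32000 bytes per second, so each conversion only touches about 1 KB.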
Ok, I feel like you should be good. I think the minimum resolution for sleeps is typically like 2-3ms if I recall correctly.
I’m curious why you thought the short to float conversion was slow. It can’t hurt to make it more efficient, but if there are delays I would expect them to be elsewhere.
the sleep resolution is more of a problem for playing audio rather than consuming audio
I’m away from computer. Maybe it is slow. I don’t have any easy way to check at the moment.
I didn’t want to go ByteBuffer basically because of this answer: https://stackoverflow.com/a/12347176
Those benchmarks are on android so I’m not sure they would hold up on a desktop jvm.
Lol. Didn’t see that
I’m also not sure that code is doing what you want. I think you need to convert the signed integer to a float with a division. https://github.com/phronmophobic/whisper.clj/blob/bae472e3f3d4da0a723b6037bf5aefc6bf1974a3/src/com/phronemophobic/whisper.clj#L54 Unless my code is wrong, which would be good to know. It works for my use case, but might be technically wrong.
That function could be sped up for sure. I’m not sure it’s likely to be a bottleneck.
@ovidiu.stoica1094 Sorry for the late answer, I was on vacation. Deep Diamond has transformers that can convert pretty much any useful tensor to any other tensor backed by Intel's native x86 ops. I doubt anything you'll find would go faster than that (if used properly). The transform function (https://github.com/uncomplicate/deep-diamond/blob/0e054b85579120d90ef861b5562c929d80051ae0/deep-diamond-base/src/uncomplicate/diamond/tensor.clj#L302) will create a custom transformer function optimized for your particular combination of shape, layouts, and data types, and you can then call it many times on ever changing data...