I'm curious… does Neanderthal offer async execution on CUDA? Or can it return a future?
Returning a future or parking a virtual thread would need a stream callback. I see that there is one in clojurecuda at src/java/uncomplicate/clojurecuda/internal/javacpp/CUStreamCallback.java, but I can't find it being used anywhere.
Thinking out loud here (so please feel free to correct my errors). It occurs to me that maybe there is scope to have virtual thread support in Neanderthal.
For instance:
• C++: Implement JNI_OnLoad and keep the VM pointer.
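• C++: Launch a kernel, or call cudaMemcpyAsync, on a stream.
• C++: Enqueue a CUDA completion function on the same stream with cudaLaunchHostFunc (it runs only after the work already enqueued on that stream, so it has to be enqueued after the operation). Then return to the JVM.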
• JVM: Return a CompletableFuture (or await a CountDownLatch, or call LockSupport.park())
• C++: After the CUDA operation is done, the completion function calls vm->AttachCurrentThread() / env->some_neanderthal_function() / vm->DetachCurrentThread()
• JVM: The neanderthal function has been called from C++ land. This function delivers to the CompletableFuture, or decrements the countdown latch, or LockSupport.unpark()
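To make the JVM half of those steps concrete, here is a minimal sketch, not anything that exists in Neanderthal or clojurecuda. All names (AsyncCompletionSketch, launchAsync, onDeviceDone) are made up for illustration, and a plain Java thread stands in for the native thread that would run the CUDA host function and call back through JNI.

```java
import java.util.concurrent.CompletableFuture;

public class AsyncCompletionSketch {

    // Hypothetical callback target: in the real design, the C++ completion
    // function would attach to the VM and invoke something like this via JNI.
    static void onDeviceDone(CompletableFuture<Long> future, long result) {
        future.complete(result);
    }

    // Hypothetical launch function: returns immediately with a future.
    // ASSUMPTION: the background thread below only simulates the device work
    // and the native callback; no CUDA call is made here.
    static CompletableFuture<Long> launchAsync(long simulatedResult) {
        CompletableFuture<Long> future = new CompletableFuture<>();
        Thread fakeCallbackThread = new Thread(() -> {
            try {
                Thread.sleep(50); // pretend the device is busy
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            onDeviceDone(future, simulatedResult);
        }, "fake-cuda-callback");
        fakeCallbackThread.setDaemon(true);
        fakeCallbackThread.start();
        return future;
    }

    public static void main(String[] args) {
        CompletableFuture<Long> pending = launchAsync(42L);
        // The caller is free to do other work here, then wait for the result.
        System.out.println("result: " + pending.join());
    }
}
```

The caller can either chain on the future (thenApply, thenAccept) or call join(). My understanding is that on a virtual thread join() parks the virtual thread instead of pinning a carrier thread, which is the "blocking call that can be preempted" flavour I had in mind.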
From the JVM's perspective, it could look like a Future, or a blocking call that can be preempted. I've worked with operations that took hundreds of milliseconds to transfer data across the bus, or that took a while to execute on the device, so I thought it might be useful to return to the JVM until the device is ready.
All GPU-related calls in Neanderthal are async (when it makes sense).
Thank you!