This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2023-05-02
Channels
- # announcements (1)
- # babashka (4)
- # beginners (39)
- # calva (36)
- # cherry (11)
- # cider (23)
- # clj-on-windows (3)
- # clojure (105)
- # clojure-brasil (1)
- # clojure-chicago (3)
- # clojure-conj (8)
- # clojure-denver (4)
- # clojure-europe (18)
- # clojure-germany (5)
- # clojure-hungary (13)
- # clojure-nl (1)
- # clojure-norway (31)
- # clojure-sweden (9)
- # clojure-uk (2)
- # clojurescript (22)
- # core-async (4)
- # cursive (8)
- # data-science (25)
- # datomic (14)
- # devops (1)
- # emacs (9)
- # events (5)
- # holy-lambda (32)
- # hyperfiddle (26)
- # introduce-yourself (2)
- # kaocha (1)
- # leiningen (11)
- # lsp (17)
- # malli (8)
- # off-topic (84)
- # pedestal (4)
- # polylith (2)
- # re-frame (17)
- # reitit (1)
- # releases (1)
- # remote-jobs (1)
- # shadow-cljs (8)
- # sql (4)
- # tools-deps (8)
- # transit (5)
- # vim (1)
- # vscode (1)
- # xtdb (45)
is there a place where I can find additional documentation on the functions in the dtype-next functional ns? There are no docstrings, so the codox-generated docs don't help much.
Specifically, I'm trying to implement cosine similarity for two dtype-next vectors. Is there a built-in way to get the L2 norm of a vector? I can't tell what the normalize function in functional is doing.
I know that it's not that difficult to calculate that myself, but since I'm a newbie I want to make sure I'm not straying from the beaten path and writing code that I don't have to. Here's my naive approach to writing it myself, assuming I understood the definition:
(Math/sqrt (reduce + (dt.fn/sq [0.5 0.9])))
https://github.com/cnuernber/dtype-next/blob/master/src/tech/v3/datatype/functional_api.clj
For normalize - if you are going to apply it to small vectors you will want a more efficient definition without vary-meta at least
I was trying to follow the formula I found here for vectors. https://www.geeksforgeeks.org/how-to-calculate-cosine-similarity-in-python/ I'm assuming the L2 norm function they describe here is fundamentally different from dtype's normalize because the former returns one number while the latter returns a vector. So maybe that's a red herring. Sorry for my non-precision in terms, I'm new to these concepts
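For what it's worth, the L2 norm in that formula is just the square root of the sum of squared components, so it is indeed a single scalar (whereas normalizing a vector returns a vector). A pure-Clojure sketch, with no dtype-next involved (`l2-norm` is a hypothetical helper name, not a library function):

```clojure
;; L2 norm: the square root of the sum of squared components.
;; It reduces a whole vector to one scalar.
(defn l2-norm [v]
  (Math/sqrt (reduce + (map #(* % %) v))))

(l2-norm [3.0 4.0]) ;; => 5.0
```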
for more context here's what I have so far:
(defn cosine-similarity
  [x y]
  (let [dp   ^double (dt.fn/dot-product x y)
        x-l2 ^double (Math/sqrt (reduce + (dt.fn/sq x)))
        y-l2 ^double (Math/sqrt (reduce + (dt.fn/sq y)))]
    (/ dp
       (* x-l2 y-l2))))
(time (cosine-similarity [0.2 0.4]
                         [0.5 9.0]))
I'm working on modifying this to avoid reflection. But I'm also asking just in case I have code here that could be replaced by an available library function I'm unaware of.
Depending on your use case, you may just need the dot product: https://datascience.stackexchange.com/questions/744/cosine-similarity-versus-dot-product-as-distance-metrics
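For comparison, the same formula can be written in plain Clojure without dtype-next at all; on boxed doubles there are no reflection warnings to chase. A minimal sketch (`dot`, `l2`, and `cos-sim` are hypothetical helper names, not library functions):

```clojure
;; Plain-Clojure cosine similarity: dot(x, y) / (|x| * |y|).
(defn dot [x y] (reduce + (map * x y)))

(defn l2 [v] (Math/sqrt (dot v v)))

(defn cos-sim [x y]
  (/ (dot x y)
     (* (l2 x) (l2 y))))

(cos-sim [1.0 0.0] [1.0 0.0]) ;; => 1.0
```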
My strong recommendation is to convert your TMD matrix to a Neanderthal matrix (TMD has functions that go both ways: TMD<->Neanderthal) and then use the Neanderthal ops. The names may look a bit odd at first, but they are intentionally based on the BLAS and LAPACK APIs that have been around forever (hence you can use any book, tutorial, blog, etc. on them for reference). With just a bit of code, you can (I've done it) blast out cossim values across 100s of thousands of seqs (vector embeddings) in the blink of an eye.
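The matrix trick being suggested here can be sketched in pure Clojure to show the idea: normalize each row to unit length, and then every entry of A times its own transpose is a pairwise cosine similarity (this is the same idea the Neanderthal snippets below express with mmt/dia/rk; the helper names here are hypothetical):

```clojure
;; Pairwise cosine similarity as a matrix product:
;; scale each row to unit length, then entry (i, j) of A * A^T
;; is the cosine similarity of rows i and j.
(defn dot [x y] (reduce + (map * x y)))

(defn unit [v]
  (let [n (Math/sqrt (dot v v))]
    (mapv #(/ % n) v)))

(defn cos-sim-matrix [rows]
  (let [u (mapv unit rows)]
    (mapv (fn [x] (mapv #(dot x %) u)) u)))

(cos-sim-matrix [[1.0 0.0] [0.0 2.0]])
;; => [[1.0 0.0] [0.0 1.0]]
```

Neanderthal does the same multiplication as one BLAS call, which is where the speedup over row-by-row loops comes from.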
It may be worth noting that cosine similarity is not a true metric; if you need one, angular distance may be more what you want (it is a true metric).
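Angular distance is derived from cosine similarity by taking the arccosine and scaling to [0, 1], which is what makes the triangle inequality hold. A small sketch (`angular-distance` is a hypothetical helper name):

```clojure
;; Angular distance: acos of the cosine similarity, scaled by pi
;; so the result lies in [0, 1]. Unlike cosine similarity itself,
;; this satisfies the triangle inequality, so it is a true metric.
(defn angular-distance [cos-sim]
  (/ (Math/acos cos-sim) Math/PI))

(angular-distance 1.0)  ;; => 0.0 (same direction)
(angular-distance -1.0) ;; => 1.0 (opposite direction)
```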
Also, you will likely get more response on zulip. start here https://clojurians.zulipchat.com/#narrow/stream/151924-data-science. There are topics for TMD in there and also streams for TMD/TC and Neanderthal (Uncomplicate)
more info about the data science Zulip chat streams: https://scicloj.github.io/docs/community/chat/
I played with the Neanderthal library some time ago, but I couldn't figure out how to rewrite the cosine similarity function to utilize the GPU. If someone could help me, I'd be really grateful. I came across this post: https://dragan.rocks/articles/21/Pimp-my-Clojure-number-crunching. Based on the information there, I managed to write some code that finally works on my 2015 MacBook Pro with an AMD GPU.
(def apple (->> (for [n (range 23000)]
                  (->> (repeatedly 786 #(float (rand-int 10))) vec))
                vec))

(def a (dge 8000 786
            (take 8000 apple)
            {:layout :row}))

(with-platform (second (platforms))
  (let [dev (second (sort-by-cl-version (devices :gpu)))]
    (with-context (context [dev])
      (with-queue (command-queue-1 dev)
        (opc/with-default-engine
          (with-progress-reporting
            (quick-bench
             (let-release [gpu-x (mmt a)]
               (with-release [a-norms (inv! (sqrt! (copy (dia gpu-x))))
                              ab-norms (rk a-norms a-norms)]
                 (mul! (view-ge gpu-x) ab-norms)
                 gpu-x)))))))))
But it's not clear to me how to rewrite this part to be OpenCL-compatible:
(let-release [gpu-x (mmt a)]
  (with-release [a-norms (inv! (sqrt! (copy (dia gpu-x))))
                 ab-norms (rk a-norms a-norms)]
    (mul! (view-ge gpu-x) ab-norms)
    gpu-x))
If you have any ideas or an example of how to do it, that would be great. I tried several things, but nothing really works. (And with those 8k rows, the calculation time is 1.1 sec.)
You most likely do not have your GPU properly configured. The code for GPU and CPU is the same, modulo needing to swap engines.
TBH, for many (most?) things the GPU doesn't buy you much (especially given the pain of setting it up correctly before it will work), at least in my experience, given the availability of many CPU cores.
I totally changed the code, from matrix comparison to vector comparison. This way I can implement it with OpenCL, I think. Plus I added an extra parallel process (on the CPU side). That's how I got this result.
If you have truly "jumbo" stuff and are careful in how you chunk the work out to your GPU (respecting its memory limits and bandwidth), then GPU comes into its own and is worth it.
Also be aware that typical commodity GPUs intentionally cripple support for doubles. Generally that isn't a big deal, as floats will work just fine for most things. But be aware that you may end up with worse perf if you use doubles.
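To put a number on the float-vs-double trade-off: a float carries roughly 7 significant decimal digits versus roughly 16 for a double, so the rounding error is invisible at similarity-score scale. A small sketch of the magnitude involved:

```clojure
;; A float carries ~7 significant decimal digits, a double ~16.
;; The error from storing 0.1 as a float is on the order of 1e-9,
;; which is negligible for a cosine similarity score in [-1, 1].
(def err (Math/abs (- (double (float 0.1)) 0.1)))
```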