The upcoming release will mainly be about vector index and similarity search, so Datalevin can be used for semantic search, RAG, etc.
The master branch now has an initial version of the vector DB feature.
The second query causes java.lang.OutOfMemoryError. Is this a bug?
Update: https://github.com/juji-io/datalevin/pull/321
It's expected. You don't want to return all vectors.
Basically, you don't need a clause with the vector attribute.
?v is your vector.
Ohhh 💡
besides, what would you do with a vector? not much
if you want to do manual rerank, ?v is sufficient
I wanted to retrieve the vectors for future similarity-searches :^)
basically, vec-neighbors filters out the vectors, and you can then re-rank using ?v, that's fine.
that's why we return [?e ?a ?v] , ?v is the vector.
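For illustration, a manual re-rank over the returned tuples could look roughly like the sketch below. It is plain Python, not Datalevin code: the `(e, a, v)` tuples, the `:doc/embedding` attribute name, and the `rerank` helper are all hypothetical stand-ins, and cosine similarity is just one possible re-ranking score.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rerank(neighbors, query, top_n=None):
    """Re-order (e, a, v) tuples by similarity of v to the query vector.

    `neighbors` is assumed to be the result of a neighbor search,
    where the third element of each tuple is the vector itself.
    """
    ranked = sorted(
        neighbors,
        key=lambda t: cosine_similarity(t[2], query),
        reverse=True,  # most similar first
    )
    return ranked if top_n is None else ranked[:top_n]

# Hypothetical search results: entity id, attribute, vector.
neighbors = [
    (1, ":doc/embedding", [0.0, 1.0]),
    (2, ":doc/embedding", [1.0, 0.0]),
    (3, ":doc/embedding", [1.0, 1.0]),
]
reranked_ids = [e for e, _, _ in rerank(neighbors, [1.0, 0.1])]
```

Since ?v already carries the vector, no extra clause on the vector attribute is needed to do this.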
Got it, I was thinking ?v was short for value and we were returning all EAV tuples of the neighbors.
in the datalog store, vectors are stored as they are, but the triple itself is the vec-ref
that's why we can immediately return [e a v], because it comes directly out of the vec-id to vec-ref mapping, so there's no redirection.
we are not doing another lookup for the datoms.
The general choice we make in Datalevin is to trade space for time. Space issue will be resolved with compression.
We favor speed over space.
Compression and sharding will be the solutions for space issues. We will tackle things in time one by one.
Fixed the namespaced attribute issue.
Cheers, works on my machine :^)
I’ve been testing out the vector features since yesterday, kudos for the great work!
We are using usearch, the same library used in ClickHouse and DuckDB
https://cloogle.phronemophobic.com/doc-search.html uses usearch+datalevin. I'm curious what your API will look like. Are you wrapping the c library or one of the other implementations?
The API will be similar to the full-text API. I wrap their C library at the moment, but in the future, we could switch to the C++ one if there's need for that.
The one feature that I don't have that would be useful is paging (eg. search for the first 50 results and then optionally find the next 50 results, etc).
wouldn't searching for a larger top K and caching the results yourself be sufficient?
If DL implements this, we would be doing the same anyways
Yea, it's fast enough that just doing another search works well enough.
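As a rough sketch of that workaround (search for a larger top K, cache, and slice pages out of the cache), something like the following would do. It is generic Python, not the usearch or Datalevin API: `brute_force_top_k` is a hypothetical stand-in for a real index search, and `PagedSearch` is an illustrative helper name.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity, for equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def brute_force_top_k(index, query, k):
    """Hypothetical stand-in for a vector index: ids of the k nearest vectors."""
    ranked = sorted(index, key=lambda key: cosine_distance(index[key], query))
    return ranked[:k]

class PagedSearch:
    """Emulates paging on top of a top-K-only search function."""

    def __init__(self, search_fn, query, page_size=50):
        self.search_fn = search_fn  # called as search_fn(query, k)
        self.query = query
        self.page_size = page_size
        self.cache = []
        self.fetched_k = 0  # largest k searched so far

    def page(self, n):
        """Return page n (0-based), re-searching with a larger k only if needed."""
        needed = (n + 1) * self.page_size
        if self.fetched_k < needed:
            self.cache = self.search_fn(self.query, needed)
            self.fetched_k = needed
        start = n * self.page_size
        return self.cache[start:start + self.page_size]

# Toy index: id -> vector; ids further from 0 are further from the query.
index = {i: [1.0, float(i)] for i in range(10)}
pages = PagedSearch(lambda q, k: brute_force_top_k(index, q, k), [1.0, 0.0], page_size=3)
first, second, last = pages.page(0), pages.page(1), pages.page(3)
```

Each later page costs one more search with a bigger K, which is fine as long as the search itself is fast.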
I wasn't sure if usearch had extended their API to support paging since I last checked.
They don't have paging
We can file an issue, and get around to it at some point.
It's not a big deal. If usearch doesn't support it, then just doing another search works well enough.
We could add that enhancement, let me file an issue so I don't forget
Just browsing the full-text search API. It seems like one difference is that for vector search, the user would need to provide the vector. I assume that's something datalevin will expect the user to figure out.
Yes, initially. Because state-of-the-art embedding models change so fast nowadays, it doesn't make sense for a database to integrate this feature at this point.
Calculating embeddings is also really resource intensive if you don't have a GPU.
Correct. So it doesn't make sense for us to do this in Datalevin
Cloogle runs on a $5/mo DigitalOcean server. The server is too slow to even calculate embeddings for the search queries, so I ended up using the OpenAI embeddings API. It's very cheap.
yeah, using an external service is a good idea. the price is dropping rapidly too
I think it was less than $5 to get embeddings for all the clojure doc strings I could find on github. Maybe even less than $1.
I haven't updated the cloogle data since I first released it. Maybe I'll try to start making regular updates once datalevin gets official support. Datalevin has been really useful for me. Thanks!