datalevin

Huahai 2025-02-25T17:41:30.849949Z

The upcoming release will mainly be about vector index and similarity search, so Datalevin can be used for semantic search, RAG, etc.

👀 3
🚀 11
Huahai 2025-03-12T20:20:04.145399Z

The master branch now has an initial version of the vector DB feature.

🚀 1
oλv 2025-03-13T15:42:03.574379Z

The second query causes java.lang.OutOfMemoryError. Is this a bug? Update: https://github.com/juji-io/datalevin/pull/321

Huahai 2025-03-13T19:54:22.733259Z

It's expected. You don't want to return all vectors.

Huahai 2025-03-13T19:55:06.946779Z

Basically, you don't need a clause with the vector attribute.

Huahai 2025-03-13T19:55:58.418379Z

?v is your vector.

oλv 2025-03-13T19:56:22.637459Z

Ohhh 💡

Huahai 2025-03-13T19:57:40.749019Z

besides, what would you do with a vector? not much

Huahai 2025-03-13T19:59:06.143109Z

if you want to do manual rerank, ?v is sufficient

oλv 2025-03-13T20:00:07.047949Z

I wanted to retrieve the vectors for future similarity-searches :^)

Huahai 2025-03-13T20:00:14.611389Z

basically, vec-neighbors filters the vectors down to the nearest neighbors, and you can then re-rank using ?v, that's fine.

Huahai 2025-03-13T20:01:13.656409Z

that's why we return [?e ?a ?v]; ?v is the vector.
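
A minimal sketch of the manual re-rank described above, written in plain Python rather than against Datalevin's API. It assumes the neighbor search has already produced `(e, a, v)` tuples where `v` is the stored vector; the function names `cosine_similarity` and `rerank` and the attribute name in the usage note are hypothetical, and cosine similarity is only one possible re-ranking score.

```python
import math
from typing import List, Sequence, Tuple

# One search result: (entity id, attribute, vector), mirroring [?e ?a ?v].
Candidate = Tuple[int, str, Sequence[float]]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Exact cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rerank(query: Sequence[float], candidates: List[Candidate]) -> List[Candidate]:
    """Order candidates from most to least similar to the query vector."""
    return sorted(
        candidates,
        key=lambda c: cosine_similarity(query, c[2]),
        reverse=True,
    )
```

Usage would look like `rerank(query_vec, [(1, ":doc/embedding", vec1), (2, ":doc/embedding", vec2)])`, with the candidate list coming from whatever the neighbor search returned. No second lookup of the vector attribute is needed because each tuple already carries `v`.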

oλv 2025-03-13T20:03:22.210559Z

Got it, I was thinking ?v was short for value and we were returning all EAV tuples of the neighbors.

Huahai 2025-03-13T20:04:17.493819Z

in the datalog store, vectors are stored as they are, but the triple itself is the vec-ref

Huahai 2025-03-13T20:05:28.900189Z

that's why we can immediately return [e a v]: it comes directly out of the vec-id to vec-ref mapping, so there's no redirection.

Huahai 2025-03-13T20:05:51.116269Z

we are not doing another lookup for the datoms.

👍 1
Huahai 2025-03-13T20:08:15.689799Z

The general choice we make in Datalevin is to trade space for time. The space issue will be resolved with compression.

Huahai 2025-03-13T20:08:50.400559Z

We favor speed over space.

Huahai 2025-03-13T20:09:56.690389Z

Compression and sharding will be the solutions for space issues. We will tackle things in time one by one.

Huahai 2025-03-13T21:19:44.579999Z

Fixed the namespaced attribute issue.

oλv 2025-03-13T21:29:12.593989Z

Cheers, works on my machine :^)

👍 1
oλv 2025-03-11T18:52:37.362109Z

I’ve been testing out the vector features since yesterday, kudos for the great work!

❤️ 1
Huahai 2025-02-25T17:43:32.295359Z

We are using usearch, the same library used in ClickHouse and DuckDB.

phronmophobic 2025-02-25T19:13:26.210789Z

https://cloogle.phronemophobic.com/doc-search.html uses usearch+datalevin. I'm curious what your API will look like. Are you wrapping the c library or one of the other implementations?

Huahai 2025-02-25T19:14:21.206869Z

The API will be similar to the full-text API. I wrap their C library at the moment, but in the future, we could switch to the C++ one if there's a need for that.

👍 1
phronmophobic 2025-02-25T19:15:44.421239Z

The one feature that I don't have that would be useful is paging (e.g. search for the first 50 results and then optionally find the next 50 results, etc.).

Huahai 2025-02-25T19:17:12.793859Z

wouldn't searching for a larger top K and caching the results yourself be sufficient?

Huahai 2025-02-25T19:17:42.134599Z

If DL implements this, we would be doing the same anyways
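
A rough sketch of that "larger top K plus cache" approach, in plain Python and not tied to usearch or Datalevin. The `PagedSearch` class and the `make_brute_force_search` helper are hypothetical names; the brute-force search is only a stand-in for a real index, assumed to take a query and a `k` and return `(id, distance)` pairs sorted by distance.

```python
import math
from typing import Callable, List, Sequence, Tuple

Hit = Tuple[int, float]  # (item id, distance to the query)
SearchFn = Callable[[Sequence[float], int], List[Hit]]


class PagedSearch:
    """Serves pages of results by re-searching with a larger K and caching."""

    def __init__(self, search_fn: SearchFn, query: Sequence[float],
                 page_size: int = 50) -> None:
        self._search_fn = search_fn
        self._query = query
        self._page_size = page_size
        self._cache: List[Hit] = []
        self._exhausted = False  # True once the index has no more results

    def page(self, n: int) -> List[Hit]:
        """Return page n (0-based); an empty list means no more results."""
        needed = (n + 1) * self._page_size
        if len(self._cache) < needed and not self._exhausted:
            # Ask for everything up to the end of the requested page.
            self._cache = self._search_fn(self._query, needed)
            self._exhausted = len(self._cache) < needed
        return self._cache[n * self._page_size:needed]


def make_brute_force_search(vectors: List[Sequence[float]]) -> SearchFn:
    """Stand-in for a real index: exact Euclidean top-K over a small list."""
    def search(query: Sequence[float], k: int) -> List[Hit]:
        scored = [(i, math.dist(query, v)) for i, v in enumerate(vectors)]
        scored.sort(key=lambda hit: hit[1])
        return scored[:k]
    return search
```

One caveat under these assumptions: with an approximate index, a search for a larger K is not guaranteed to return the earlier results in exactly the same order, so pages served before and after a re-search may overlap or skip an item.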

phronmophobic 2025-02-25T19:17:57.062199Z

Yea, it's fast enough that just doing another search works well enough.

phronmophobic 2025-02-25T19:18:15.325469Z

I wasn't sure if usearch had extended their API to support paging since I last checked.

Huahai 2025-02-25T19:18:31.957779Z

They don't have paging

Huahai 2025-02-25T19:18:57.047309Z

We can file an issue, and get around to it at some point.

phronmophobic 2025-02-25T19:19:33.402739Z

It's not a big deal. If usearch doesn't support it, then just doing another search works well enough.

Huahai 2025-02-25T19:20:26.081429Z

We could add that enhancement, let me file an issue so I don't forget

phronmophobic 2025-02-25T19:20:39.475729Z

Just browsing the full-text search API. It seems like one difference is that for vector search, the user would need to provide the vector. I assume that's something datalevin will expect the user to figure out.

Huahai 2025-02-25T19:21:50.560739Z

Yes, initially. Because state-of-the-art embedding models change so fast nowadays, it doesn't make sense for a database to integrate this feature at this point.

phronmophobic 2025-02-25T19:22:17.494709Z

Calculating embeddings is also really resource intensive if you don't have a GPU.

Huahai 2025-02-25T19:22:37.839219Z

Correct. So it doesn't make sense for us to do this in Datalevin

👍 1
phronmophobic 2025-02-25T19:23:36.887109Z

Cloogle runs on a $5/mo DigitalOcean server. The server is too slow to even calculate embeddings for the search queries, so I ended up using the OpenAI embeddings API. It's very cheap.

Huahai 2025-02-25T19:24:13.094399Z

yeah, using an external service is a good idea. the price is dropping rapidly too

👍 1
phronmophobic 2025-02-25T19:25:54.739769Z

I think it was less than $5 to get embeddings for all the Clojure doc strings I could find on GitHub. Maybe even less than $1.

🎉 1
phronmophobic 2025-02-25T19:27:29.345979Z

I haven't updated the cloogle data since I first released it. Maybe I'll try to start making regular updates once datalevin gets official support. Datalevin has been really useful for me. Thanks!

👍 1
❤️ 1
☝️ 2