Hello 👋🏽 I have seen that datalevin has full-text search. We have another fork of Datascript internally that uses https://www.foundationdb.org/ as the backing layer instead of lmdb. It also supports key/value get and range queries. Following a similar rationale as described in https://github.com/juji-io/datalevin/blob/master/doc/search.md, we’d like to add full-text search to it. I was wondering if the datalevin full-text search algorithm could be reused. Since this is behind https://github.com/juji-io/datalevin/blob/ede16dd86ff5fcd9fb90d4bcbf632470a695b2c1/src/datalevin/remote.clj#L258, it seems relatively straightforward. Ie. recreating a https://github.com/juji-io/datalevin/blob/ede16dd86ff5fcd9fb90d4bcbf632470a695b2c1/src/datalevin/core.clj#L2043 seems like it would allow playing with the low-level search functions (not the datalog query layer but this is good enough). There are a couple potential concerns I’d like to run by people in this channel: • 1. Is it fair to say that the datalevin search query is expected to be one or more complete words? ◦ ie. how well would it support prefix or substring search? For example given a partial name or the start of an UUID, how good would it be at returning results? • 2. What is the transaction model for datalevin? ◦ ie. does adding a searchable field make it available for search ◦ there is a 5s transaction limit in FoundationDB, could this be a concern when indexing fields? Is it correct to say that datalevin optimises for reads? • 3. What does the Key/Values structure look like? I’ll have to run experiments to fully understand it, but I have two concerns, both coming from how FoundationDB work: ◦ https://apple.github.io/foundationdb/known-limitations.html#:~:text=for%20applicable%20workarounds.-,Large%20keys%20and%20values,if%20these%20limits%20are%20exceeded.&text=FoundationDB%20provides%20efficient%20ways%20to,to%20work%20around%20this%20limitation.: keys cannot exceed 10kb, values cannot exceed 1MB, transaction reads+writes cannot exceed 10MB (excluding value reads) ◦ https://apple.github.io/foundationdb/developer-guide.html#minimizing-conflicts:~:text=use%20FoundationDB%20efficiently.-,How%20FoundationDB%20Detects%20Conflicts,-As%20written%20above: FoundationDB transactions can retry when there is a conflict (somewhat similar to Clojure Atoms). This is usually pretty bad for performance as the whole transaction has to be thrown away and re-run. Can indexing several values concurrently be conflict-free? ie. are there shared attributes in a text index that are updated when different documents are indexed?
Thanks for the pointer. Looks like another effort to improve b-tree. Not mature enough yet.
Based on my (still limited) usage so far, I can only answer your first question.
Datalevin supports all kinds of analyzers, including n-gram. Look in the datalevin.search-utils namespace for the built-in options. You can build other token filters and analyzers as well, but since they run in datalevin using sci, some work might be required in non-trivial cases.
An example of a non-trivial case: Since I work in a language where compound words is a thing (Danish), I have created my own decompounding token filter for use with datalevin. This uses a Java library for performing hyphenation. Getting this to work at the moment requires some very bad Clojure style (i.e. altering the root binding of an internal datalevin var 😱 ). When I have verified that it solves my use case well enough, I plan to contribute some version of it back to datalevin.
> they run in datalevin using sci Ah interesting I didn’t notice that it was using sci. I am guessing this was mean to make it extensible dynamically but I don’t think we’ll need that. I think we’ll also not use sci, so noted that in your experience this was not straightforward to use plain Clojure/Java.
Plain Clojure is easy enough, the trouble starts when you need sci to know about something it wasn't configured know about in datalevin. My hack is basically creating an extension point where none is given.
But if you don't need that kind of configurability, I don't think you will need to write your own token filter, since the built-ins cover a lot of use cases already.
@nha maybe relevant, maybe not; I tried to use Datalevin search capabilities with a Datascript db. It needs more iterations, but I think it is good enough for a proof of concept https://gist.github.com/jeroenvandijk/d113bf41382b514d23a750fbd5a0f818#file-datascript_slash_search-clj-L288
Right, so this seems to imply that exact word match search needs only -transact-kv for writes and -get-value for reads at the lmdb level.
Seems like it, but of course this is still a toy example
The built-in search utilities run in JVM, not in SCI anymore.
FoundationDB doesn't do well with large values, so it's not really suitable, as DL's search engine relies on large binary blobs as values, so it won't work.
In DL search engine, each search term has a doc-freq-sparse-list, which consists of a compressed integer list and a roaring bitmap, and this can be huge for frequent words.
To be honest, there is no better KV storage than LMDB for large binary blobs. Most other DB tech, particularly designed for distributed systems, are based on log-structured merge tree (LSM), which assume small values. This includes all the HOT tech you heard about: FoundationDB, DynamoDB, Cassandra, RocksDB, etc.
None of them can deal with large blobs. So the choice is really limited if that's what you are trying to do.
This includes Rama too.
If the basic index structure is LSM, you can forget about large blobs. B-tree is your only choice, basically, that's traditional DB, Postgres, mongoDB, LMDB. sqlite. There's no free lunch in computer science.
LSM based storage systems are not general purpose, as they are specifically designed to solve problems of large Internet companies. Most companies don't have their use cases. There's really no need to chase that fashion.
The good thing about LSM storage is that the write throughput is level and predicable, whereas b-tree storage is logarithmic and write becomes slower and slower. But for most use cases, that's fine, you can just throw more hardware at it as your data grows.
The new "Postgres for everything" trend has its good reason.
For answer your questions:
2. DL has a single writer. DL is "optimized for read", i.e. it's B-tree based. When you heard "optimized for write", that means it's LSM. After add-doc returns, the doc is immediately searchable, as add-doc is implemented as transaction.
3. DL KV store has a key size limitation, 511 bytes. Value size limitation is 2GB.
However, if you use our Datalog feature, key size and value size has the same limitation, 2GB.
There's single writer, so writes are queued. The only re-try we do is when the DB become larger than pre-designated size. LMDB is memory map based, so it needs a size upfront. When that size is reached during a transaction, we catch the exception , resize the DB, and retry the transaction. That's a slow process, so you want to prevent that by specifying a large size when start the DB.
To summarize, the search capability of Datalevin can be made to work with a storage that supports large binary blob values, this includes Datascript, or another B-tree based storage. However, it will have hard time with LSM based storage, because our search engine relies on storing and loading the document IDs of a search term in one go, i.e. as a large binary blob.
Of course, if the number of the documents you are going to index is small, it can be made to work.
Unfortunately, other search technology, e.g. Lucene, is similar to Datalevin, in the sense that it stores large binary blobs (they are called segments) too. So, if you want to do search, working off a LSM is a no-go, as far as I can tell. But it doesn't hurt to try. Maybe something can be invented along the way 😀
The ability for a DB to store large binary blobs is what enables the extensibility of a DB. Whatever new DB tech coming up, vector indices, full-text index, graph indices, whatever new things people coming up with, you can always fit that into a binary blob. Postgres has tons of extensions because of that. Datalevin plans to be the same.
Sorry to rise this year old thread, but this might be of interest to the crowd here: Bf-Tree: modern read-write-optimized concurrent larger-than-memory range index | Hacker News https://news.ycombinator.com/item?id=46802210
Very useful thanks a lot. Do I understand correctly that: • each term is stored separately, then looked up O(1) and then results are merged in O(query size)? • for substring searches, the brute-force approach in search is to generate n-grams, and store these individually for O(1) lookup, trading partial matches for space? Is that what datalevin does, or does it indexes whole words and lets the user choose a different analyser as needed? I wonder if some of the values could be put in the key, making the value fixed-size at the cost of storing more keys but I don’t understand enough about the compressed integer list + roaring bitmap yet to make a guess. It does sound like it could be expensive in storage in the general case. Lastly, after reading a bit more about the subject I should mention that our first use-case for search are person names (first name + surname), and it seems that there are less optimizations available there than for documents in English (because tricks like getting the root of the words cannot be applied). Does this match your experience?
Yes, each term is associated with a list of doc ids, what they call an inverted list. All search engines do this. The trick of search engine is not about inverted list, the trick is about how to walk these lists efficiently, that's what a search algorithm is all about and where Datalevin's innovation in search is at.
Yes, substring search can use ngram as the analyzer. There can be only one analyzer for a search domain. If you want to index the same data in different ways, you will have to add them to multiple different search domains.
Right, you cannot really do stemming with people names. Datalevin supports stemming for many languages (we basically lifted snowball stemming code).