#xtdb
2023-10-31
zeitstein17:10:00

Is there a way to write a custom LuceneIndexer such that, for the doc {:xt/id 1 :text [{:text/html "foo"}]}, association of 1 :text "foo" is made in Lucene? I've taken a crack at it, but text searches through xt/q return nothing. Using xtdb.lucene/search I see the documents have been indexed. My guess is that I'm missing the fact that the nested map (with :text/html) is hashed before being indexed by XT?

refset17:10:36

Hey @U02E9K53C9L I suspect the issue is that the field-xt-val is expected to be identical to the actual v in the doc during this resolving function (which is essentially a temporal filter) https://github.com/xtdb/xtdb/blob/2fc600840c11aae31ee0246b2d400928d98e3f4e/modules/lucene/src/xtdb/lucene.clj#L190

refset17:10:12

I would recommend just working with xtdb.lucene/search directly if you can (and not attempting to use/customise/fork the text-search Datalog function)

zeitstein17:10:08

Thanks, Jeremy. If I could hash the actual v (the map) before passing it to field-xt-val, would that work?

refset17:10:02

As I understand it: {:text/html "foo"} is hashed (based on its Nippy serialisation) and stored as v in the ave index (in RocksDB), but with your example doc you are storing "foo" as the field-xt-val (in the Lucene index). So when you query via text-search, the resolve-search-results-a-v function (linked above) can't find any ave entries with "foo" as the v value, and the result is filtered out. In principle you could override the definition stored under field-xt-val if you never needed it for wildcard searching. However, I don't think it would be straightforward to calculate and store the hash needed for comparison, since Lucene isn't storing binary here, it's only working with text.
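To make the mismatch concrete, here is a toy model in plain Clojure. It is only a sketch: encode-v, ave-entries and resolve-hits are hypothetical names, Clojure's hash stands in for the Nippy-based hashing, and a set of triples stands in for the ave index. None of it is XTDB's actual code.

```clojure
;; Toy model only: none of these functions are XTDB's.

(defn encode-v
  "Stand-in for XT's value encoding: maps collapse to a hash, other values stay as-is."
  [v]
  (if (map? v) (hash v) v))

(defn ave-entries
  "Toy ave index: a set of [a encoded-v e] triples, one per top-level value."
  [doc]
  (set (for [[a v] (dissoc doc :xt/id)
             x     (if (sequential? v) v [v])]
         [a (encode-v x) (:xt/id doc)])))

(defn resolve-hits
  "Keeps only the Lucene hits whose stored value is present in the toy ave index."
  [ave hits]
  (filter (fn [{:keys [e a field-xt-val]}]
            (contains? ave [a (encode-v field-xt-val) e]))
          hits))

(def nested-doc {:xt/id 1 :text [{:text/html "foo"}]})
(def flat-doc   {:xt/id 2 :text "foo"})

;; The nested map is indexed as a hash, so the stored "foo" never matches.
(def nested-result
  (resolve-hits (ave-entries nested-doc)
                [{:e 1 :a :text :field-xt-val "foo"}]))

;; With a plain string value both sides agree and the hit survives.
(def flat-result
  (resolve-hits (ave-entries flat-doc)
                [{:e 2 :a :text :field-xt-val "foo"}]))
```

In this model nested-result is empty while flat-result keeps its single hit, which mirrors the "indexed in Lucene but nothing returned from xt/q" symptom described above.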

refset17:10:32

For your use case it's likely more useful to view the xtdb.lucene namespace as illustrative: if you don't need field-xt-val then just don't store it (but that also means you have to handle your own temporal 'resolve' step).

zeitstein17:10:50

Thank you for the thorough explanation! I can kind of see how it all fits together. But, I think I'll leave the exercise for the future and just drop the nested map for now.

🙏 1
zeitstein10:11:49

Using pr-str on field-xt-val, the snippet below works.
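This is not the snippet mentioned above. It is a toy model in plain Clojure of one possible reading of the pr-str approach: the printed form of the whole nested map is stored as field-xt-val, then read back before the resolve step so it encodes to the same value as the one in the index. encode-v and resolves? are hypothetical names, Clojure's hash stands in for the Nippy-based hashing, and a set of triples stands in for the ave index.

```clojure
(require '[clojure.edn :as edn])

;; Toy stand-in for XT's value encoding (not XTDB code): maps collapse to a hash.
(defn encode-v [v]
  (if (map? v) (hash v) v))

(def doc {:xt/id 1 :text [{:text/html "foo"}]})

;; Toy ave index: one [a encoded-v e] triple per element of :text.
(def ave
  (set (for [v (:text doc)]
         [:text (encode-v v) (:xt/id doc)])))

;; Hypothetical Lucene hit: the searchable text would be "foo", but
;; field-xt-val carries the printed form of the whole nested map.
(def hit
  {:e 1 :a :text :field-xt-val (pr-str {:text/html "foo"})})

(defn resolves?
  "Reads field-xt-val back into data, encodes it, and looks it up in the toy index."
  [ave {:keys [e a field-xt-val]}]
  (contains? ave [a (encode-v (edn/read-string field-xt-val)) e]))

(def resolved? (resolves? ave hit))
```

Because the map read back from the string is equal to the original, it encodes to the same toy hash and the lookup succeeds. Whether the real db/ave lookup accepts a value prepared this way is a separate question that this model does not answer.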

🆒 1
zeitstein10:11:40

Can I use the same strategy in a custom resolve-search-results-a-v-wildcard or does db/ave expect a nippy'd v? (I'm assuming using the index is faster than xt/q above.)

zeitstein11:11:47

🤔 I'd also like to replace the text-search q predicate, since I'm using it as part of larger queries.

👍 1
refset11:11:28

> I'm assuming using the index is faster than xt/q above
Quite possibly, but I wouldn't be surprised if the difference is marginal (might be worth a quick microbench).

👍 1
Dave20:10:38

I'm curious to know more about implementing custom backends using XTDB. I've already looked through the source code, but I'm not really sure what I'm looking for. Could one glue XTDB to plain text backends, or git repositories? Assuming you had the appropriate transaction history?

refset20:10:50

Hey @UF7A9T2P4 in principle, yep, you can do such things by creating your own independent modules and wiring them into your config. The pluggable storage protocols in 1.x are relatively small and well defined: https://github.com/xtdb/xtdb/blob/9a379bcb188ab37451344c5c935017a9d163addb/core/src/xtdb/db.clj#L72-L86 and https://github.com/xtdb/xtdb/blob/9a379bcb188ab37451344c5c935017a9d163addb/core/src/xtdb/db.clj#L100-L112. You can see implementations of those protocols in various places around the repo, e.g. for KV storage: https://github.com/xtdb/xtdb/tree/master/core/src/xtdb/kv
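As a rough feel for the shape of such a module, here is a minimal sketch of an append-only transaction log kept in a plain text file, one EDN map per line. ToyTxLog, append-tx!, read-txs and PlainTextTxLog are hypothetical names made up for illustration; they are not the protocols linked above, and a real module would need to implement XTDB's own protocols and be wired into the node config.

```clojure
(require '[clojure.edn :as edn]
         '[clojure.java.io :as io])

;; Hypothetical protocol for illustration only; not one of XTDB's protocols.
(defprotocol ToyTxLog
  (append-tx! [this tx-ops] "Appends tx-ops to the log and returns the new tx id.")
  (read-txs [this after-tx-id] "Returns all txs whose id is greater than after-tx-id."))

(defrecord PlainTextTxLog [file]
  ToyTxLog
  (append-tx! [_ tx-ops]
    (locking file
      ;; The next tx id is simply the number of lines already in the file.
      (let [tx-id (with-open [r (io/reader file)]
                    (count (line-seq r)))]
        (spit file (str (pr-str {:tx-id tx-id :tx-ops tx-ops}) "\n") :append true)
        tx-id)))
  (read-txs [_ after-tx-id]
    (with-open [r (io/reader file)]
      (->> (line-seq r)
           (map edn/read-string)
           (filter #(> (:tx-id %) after-tx-id))
           doall))))

;; Usage: a temp file stands in for the plain text backend.
(def log
  (->PlainTextTxLog
   (doto (java.io.File/createTempFile "toy-tx-log" ".edn")
     (.deleteOnExit))))

(def first-id  (append-tx! log [[:put {:xt/id 1 :text "foo"}]]))
(def second-id (append-tx! log [[:put {:xt/id 1 :text "bar"}]]))
```

Counting lines to derive the tx id keeps the sketch short but rereads the file on every append, so it only suits experiments. A git or Fossil backend would presumably replace the file append with a commit, which this sketch does not attempt.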

👍 1
refset20:10:37

I've not seen anyone attempt a plain text or git backend before, fwiw 🙂

Dave20:10:01

Just messing around, might be a terrible idea, might be cool. Seems like a good learning experience regardless.

refset20:10:25

> a good learning experience regardless
For sure! I learned a lot building the https://github.com/xtdb-labs/crux-redis/blob/master/src/crux/redis.clj KV module for Redis and getting the generative tests working etc.

alexdavis20:10:57

@U050CTFRT has at least thought about implementing a git backend, though possibly at the site level rather than XT? Either way he might have some insight

Dave20:10:19

Fossil might actually be a better target as it maintains a more granular transaction log and is git compatible. Though it uses a SQLite backend... So I'm not sure exactly how things would mesh.