2025-02-06 datalevin | Clojure Slack Archive

datalevin 2025-02-06

simongray 2025-02-06T08:34:49.850419Z

I really want to make an analyzer for the search engine that respects a different list of stop words. However, I seem to be unable to actually make a custom analyzer. Even simply explicitly using the default English analyzer returns an exception, e.g. this works

(let [db (-> (d/empty-db "/tmp/glen"
                           {:text {:db/valueType :db.type/string
                                   :db/fulltext  true}}
                           {:search-engine {:analyzer (datalevin.search-utils/create-analyzer {})}})
               (d/db-with
                 [{:db/id 1
                   :text  "The quick red fox jumped over the lazy red dogs."}]))]
    (d/q '[:find ?text ?e
           :in $
           :where
           [?e :text ?text]
           [(fulltext $ "red") [[?e ?a ?text]]]]
         db))

but this doesn’t

(let [db (-> (d/empty-db "/tmp/glen"
                           {:text {:db/valueType :db.type/string
                                   :db/fulltext  true}}
                           {:search-engine {:analyzer (datalevin.search-utils/create-analyzer {:tokenizer datalevin.analyzer/en-analyzer})}})
               (d/db-with
                 [{:db/id 1
                   :text  "The quick red fox jumped over the lazy red dogs."}]))]
    (d/q '[:find ?text ?e
           :in $
           :where
           [?e :text ?text]
           [(fulltext $ "red") [[?e ?a ?text]]]]
         db))

The only difference between the two is that I explicitly provide {:tokenizer datalevin.analyzer/en-analyzer} as opts in the second example which is the exact same function that is defaulted to ANYWAY! Please help me understand, I just don’t get it. This is the exception:

Execution error (ExceptionInfo) at datalevin.binding.java.LMDB/transact_kv (java.clj:802).
Fail to transact to LMDB: #error {
 :cause "Error putting r/w value buffer of \"datalevin/opts\": #error {\n :cause \"Can only freeze an inter-fn\"\n :data {:x #object[datalevin.analyzer$en_analyzer 0x39f10167 \"datalevin.analyzer$en_analyzer@39f10167\"]}\n
...

mdiin 2025-02-06T11:32:51.824779Z

I think what you have to do is wrap datalevin.analyzer/en-analyzer in (datalevin.interpret/inter-fn ...), as that will produce a function that can be serialized to LMDB.

mdiin 2025-02-06T11:34:37.637859Z

If you look at the functions in the datalevin.search-utils ns, they all wrap a fn from datalevin.analyzer using datalevin.interpret/inter-fn.

👍 1

simongray 2025-02-06T13:02:37.059799Z

But that’s what datalevin.search-utils/create-analyzer does?

simongray 2025-02-06T13:03:03.125649Z

simongray 2025-02-06T13:03:18.282259Z

which I am already calling

simongray 2025-02-06T13:04:22.760219Z

hence my complete confusion

mdiin 2025-02-06T13:23:32.767889Z

Yes, create-analyzer creates a function like that, but it does not recurse into the functions you pass in. When you supply it with a config that calls a function which is not an inter-fn, that inner function won't be serializable even though your analyzer function is an inter-fn.

mdiin 2025-02-06T13:24:41.968019Z

So I'm guessing if you do {:search-engine {:analyzer (datalevin.search-utils/create-analyzer {:tokenizer (datalevin.interpret/inter-fn datalevin.analyzer/en-analyzer)})}} it will work.

simongray 2025-02-06T13:36:12.011459Z

Ah! I seemingly got it working with your example code, though the last part needed to be (datalevin.interpret/inter-fn [x] (datalevin.analyzer/en-analyzer x)). Thank you, @mdiin!

❤️ 1

mdiin 2025-02-06T13:46:09.133239Z

Cool! Happy you got it working.

Huahai 2025-02-06T15:45:10.515179Z

:search-engine should be :search-opts, I guess there are some inconsistencies in the docs and tests
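Putting the thread's pieces together, a minimal sketch of the working setup might look like the following. It assumes the inter-fn wrapper form simongray reported as working and the :search-opts key from Huahai's note; the directory /tmp/glen-analyzer is a hypothetical fresh path, and the exact option names may differ between Datalevin versions.

```clojure
(require '[datalevin.core :as d]
         '[datalevin.interpret :as i]
         '[datalevin.search-utils :as su]
         '[datalevin.analyzer])

;; The tokenizer is wrapped in inter-fn so that it can be serialized when the
;; options are stored in LMDB. The body uses the fully qualified name, as in
;; the thread.
(def analyzer
  (su/create-analyzer
    {:tokenizer (i/inter-fn [x] (datalevin.analyzer/en-analyzer x))}))

;; Assumption: :search-opts is the option key, per the note above.
;; "/tmp/glen-analyzer" is a hypothetical directory for this sketch.
(def db
  (-> (d/empty-db "/tmp/glen-analyzer"
                  {:text {:db/valueType :db.type/string
                          :db/fulltext  true}}
                  {:search-opts {:analyzer analyzer}})
      (d/db-with
        [{:db/id 1
          :text  "The quick red fox jumped over the lazy red dogs."}])))

;; Same full-text query as in the original question.
(d/q '[:find ?text ?e
       :in $
       :where
       [?e :text ?text]
       [(fulltext $ "red") [[?e ?a ?text]]]]
     db)
```

Passing datalevin.analyzer/en-analyzer directly as the :tokenizer is what triggered the "Can only freeze an inter-fn" error, so the only change from the failing example is the wrapper and the option key.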

Jeremy 2025-02-06T15:01:11.676159Z

Hi all, I've had to go back to the drawing board to solve my problem. I'm wondering 2 things:
+ Is it ok to have millions of attributes?
+ How can I get the datom right before the min threshold of a range query? This is needed for valid-from queries, e.g. given a #{2 4 6} sorted set, scanning for 3 should give a result that starts from 2 (since there's no 3).

Jeremy 2025-02-06T15:12:03.548359Z

I ask because attributes are represented as integers, and I want my own dynamic attribute representation per market

Huahai 2025-02-06T16:31:32.271219Z

Sure. You can try

👍 1

Huahai 2025-02-06T16:32:38.531329Z

"Right before" can be obtained if you scan backwards

Jeremy 2025-02-06T18:03:58.340519Z

what would be the right function to use for the scan? I see rseek-datoms, but no option to limit the scan to just 1 item. it's eager, so I can't just use first
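The "scan backwards" idea can be illustrated with a plain Clojure sorted set, using the #{2 4 6} example from the question. This is only a model of the lookup semantics, not Datalevin's API; the names valid-from, floor-entry and scan-from are hypothetical.

```clojure
;; Plain-Clojure model of the valid-from lookup; not Datalevin's API.
(def valid-from (sorted-set 2 4 6))

(defn floor-entry
  "Greatest element of sorted collection `sc` that is <= `k`, or nil if none."
  [sc k]
  ;; rsubseq walks backwards from k, so the first element is the one
  ;; at or right before k.
  (first (rsubseq sc <= k)))

(defn scan-from
  "Forward scan of `sc` starting at the element at or right before `k`.
  Falls back to a scan from `k` when nothing precedes it."
  [sc k]
  (if-some [start (floor-entry sc k)]
    (subseq sc >= start)
    (subseq sc >= k)))

(floor-entry valid-from 3) ;; => 2
(scan-from valid-from 3)   ;; => (2 4 6)
(floor-entry valid-from 1) ;; => nil
```

Only the first element of the backward scan is needed here, which is why a limit of 1 on the backward scan would be enough for this use case.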

Huahai 2025-02-06T19:26:56.475139Z

it should be easy to add those functions, PR welcome

Huahai 2025-02-06T19:27:53.159469Z

or add an arity to limit the number of datoms scanned

Huahai 2025-02-06T20:40:11.738059Z

https://github.com/juji-io/datalevin/issues/312

👍 1

Jeremy 2025-02-06T20:45:40.678889Z

I'd work on a PR for it

❤️ 1

Huahai 2025-02-06T20:52:50.947029Z

❤️

Jeremy 2025-02-06T21:27:31.997479Z

(r)seek-datoms both use list-range, which is eager. I'm looking at 3 options:
1. Provide lazy versions of (r)seek-datoms with a limit option
2. #1 as well as adding an extra limit arity to the existing eager versions for completeness' sake
3. The extra arity of (r)seek-datoms should use get-range, but eagerly realize the result.

Jeremy 2025-02-06T21:27:39.987499Z

Thoughts?

Huahai 2025-02-06T21:31:09.075389Z

what do you mean by lazy version?

Huahai 2025-02-06T21:33:02.326619Z

the difference between get-range and list-range is that the latter only works on list DBIs and allows ranges for both keys and values, whereas the former works only on key ranges.

Huahai 2025-02-06T21:36:08.074229Z

all indices of datalog are stored in list DBIs, so list-range is the only option

Huahai 2025-02-06T21:36:56.275379Z

EAV: e is key, av is value; AVE: av is key, e is value

Jeremy 2025-02-06T21:39:55.715689Z

ohh, I had mixed things up 😅. the list-range doc says the range spec is the same as get-range; I somehow took that to mean it uses get-range. And the get-range doc says range-seq is the lazy version.

Jeremy 2025-02-06T21:50:24.968299Z

That makes things simpler. I can add a limit arity to list-range, which (r)seek-datoms uses

👍 1

Huahai 2025-02-07T16:42:12.017619Z

Thanks for the patch. Merged.

👍 1
