I want to build an analyzer for the search engine that respects a different list of stop words, but I can't seem to create a custom analyzer at all. Even explicitly passing the default English analyzer throws an exception. For example, this works:
(let [db (-> (d/empty-db "/tmp/glen"
                         {:text {:db/valueType :db.type/string
                                 :db/fulltext  true}}
                         {:search-engine {:analyzer (datalevin.search-utils/create-analyzer {})}})
             (d/db-with
               [{:db/id 1
                 :text  "The quick red fox jumped over the lazy red dogs."}]))]
  (d/q '[:find ?text ?e
         :in $
         :where
         [?e :text ?text]
         [(fulltext $ "red") [[?e ?a ?text]]]]
       db))
but this doesn’t:
(let [db (-> (d/empty-db "/tmp/glen"
                         {:text {:db/valueType :db.type/string
                                 :db/fulltext  true}}
                         {:search-engine {:analyzer (datalevin.search-utils/create-analyzer {:tokenizer datalevin.analyzer/en-analyzer})}})
             (d/db-with
               [{:db/id 1
                 :text  "The quick red fox jumped over the lazy red dogs."}]))]
  (d/q '[:find ?text ?e
         :in $
         :where
         [?e :text ?text]
         [(fulltext $ "red") [[?e ?a ?text]]]]
       db))
The only difference between the two is that in the second example I explicitly provide {:tokenizer datalevin.analyzer/en-analyzer} as opts, which is the exact same function that is used by default ANYWAY!
Please help me understand, I just don’t get it.
This is the exception:
Execution error (ExceptionInfo) at datalevin.binding.java.LMDB/transact_kv (java.clj:802).
Fail to transact to LMDB: #error {
:cause "Error putting r/w value buffer of \"datalevin/opts\": #error {\n :cause \"Can only freeze an inter-fn\"\n :data {:x #object[datalevin.analyzer$en_analyzer 0x39f10167 \"datalevin.analyzer$en_analyzer@39f10167\"]}\n
...
I think what you have to do is wrap datalevin.analyzer/en-analyzer in (datalevin.interpret/inter-fn ...), as that will produce a function that can be serialized to LMDB.
If you look at the functions in the datalevin.search-utils ns, they all wrap a fn from datalevin.analyzer using datalevin.interpret/inter-fn.
But that’s what datalevin.search-utils/create-analyzer does?
which I am already calling
hence my complete confusion
Yes, create-analyzer creates a function like that, but it does not recurse into the functions you pass in. When you supply it with a config that calls a function which is not an inter-fn, that inner function won't be serializable even though your analyzer function is an inter-fn.
So I'm guessing if you do {:search-engine {:analyzer (datalevin.search-utils/create-analyzer {:tokenizer (datalevin.interpret/inter-fn datalevin.analyzer/en-analyzer)})}} it will work.
Ah! I seemingly got it working with your example code, though the last part needed to be (datalevin.interpret/inter-fn [x] (datalevin.analyzer/en-analyzer x)). Thank you, @mdiin!
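For anyone landing here later, a minimal sketch of just the analyzer definition that ended up working in this thread. It assumes Datalevin is on the classpath; the var name `analyzer` is only for illustration. The names inside the inter-fn body are kept fully qualified, as in the message above, on the assumption that namespace aliases may not resolve inside it.

```clojure
(require '[datalevin.analyzer]
         '[datalevin.interpret]
         '[datalevin.search-utils])

;; `analyzer` is an illustrative name, not something Datalevin defines.
;; The tokenizer is wrapped in inter-fn so that it can be serialized
;; together with the rest of the DB options.
(def analyzer
  (datalevin.search-utils/create-analyzer
    {:tokenizer (datalevin.interpret/inter-fn [x]
                  (datalevin.analyzer/en-analyzer x))}))
```

The resulting value is what goes under the :analyzer key of the search options passed to d/empty-db.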
Cool! Happy you got it working.
:search-engine should be :search-opts. I guess there are some inconsistencies in the docs and tests.
Hi all, I've had to go back to the drawing board to solve my problem.
I'm wondering 2 things:
+ Is it ok to have millions of attributes?
+ How can I get the datom right before a min threshold of a range query? This is needed for valid-from queries.
e.g. given the sorted set #{2 4 6}, scanning for 3 should give a result that starts from 2 (since there's no 3).
I ask because attributes are represented as integers, and I want to have my own dynamic attribute representation per market
Sure. You can try
"Right before" can be obtained if you scan backwards
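To make the backward-scan idea concrete, here is a small sketch in plain Clojure, using a sorted set as a stand-in for the index. It does not use Datalevin at all, and `valid-from` and `floor-key` are made-up names for illustration.

```clojure
;; Stand-in for the index: the sorted set from the question.
(def valid-from (sorted-set 2 4 6))

(defn floor-key
  "Returns the greatest element of sorted-coll that is <= k, or nil if none.
  rsubseq walks the collection backwards, so the first hit is the answer."
  [sorted-coll k]
  (first (rsubseq sorted-coll <= k)))

(floor-key valid-from 3) ;=> 2
(floor-key valid-from 4) ;=> 4
(floor-key valid-from 1) ;=> nil
```

Once the floor key is known, a normal forward scan can start from it, e.g. (subseq valid-from >= 2).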
what would be the right function to use for the scan?
I see rseek-datoms, but no option to limit the scan to just 1 item. it's eager, so I can't just use first
it should be easy to add those functions, PR welcome
or add an arity to limit the number of datoms scanned
I'd work on a PR for it
❤️
(r)seek-datoms both use list-range, which is eager. I'm looking at 3 options:
1. Provide lazy versions of (r)seek-datoms with a limit option
2. #1, as well as adding an extra limit arity to the existing eager versions for completeness' sake
3. The extra arity of (r)seek-datoms should use get-range, but eagerly realize the result.
Thoughts?
what do you mean by lazy version?
the difference between get-range and list-range is that the latter only works on list DBIs and allows ranges for both keys and values, whereas the former works only on key ranges.
all indices of datalog are stored in list DBIs, so list-range is the only option
EAV: e is key, av is value; AVE: av is key, e is value
ohh, I had mixed things up 😅. the list-range doc says the range spec is the same as get-range; I somehow took that to mean it uses get-range. And the get-range doc says range-seq is the lazy version.
That makes things simpler. I can add a limit arity to list-range, which (r)seek-datoms uses
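Roughly the shape of the idea, as a toy sketch in plain Clojure over a sorted collection. This is not Datalevin's list-range and says nothing about the signature that was eventually merged; `scan-range` and its parameters are hypothetical.

```clojure
;; Hypothetical stand-in for an eager range scan; NOT Datalevin's list-range.
(defn scan-range
  "Eagerly returns the elements of sorted collection coll between lo and hi
  (inclusive), walking backwards when backward? is true. The extra arity
  stops after limit elements instead of realizing the whole range."
  ([coll lo hi backward?]
   (scan-range coll lo hi backward? nil))
  ([coll lo hi backward? limit]
   (let [entries (if backward?
                   (rsubseq coll >= lo <= hi)
                   (subseq coll >= lo <= hi))]
     (if limit
       (into [] (take limit) entries)
       (vec entries)))))

;; "Right before 3": scan backwards from 3 and keep only one element.
(scan-range (sorted-set 2 4 6) 0 3 true 1) ;=> [2]
```

The result is still eager (a vector), but the scan itself stops as soon as the limit is reached.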
Thanks for the patch. Merged.