This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2022-08-02
what is the absolute fastest way to get all distinct values of a given attribute? (without caring about the entity at all)… I played around in the repl and dangerously used the idx directly
(time
  (with-open [snap (xtdb.db/open-index-snapshot (:index-store (xdb)))]
    (let [attr-buffer (xtdb.memory/copy-to-unpooled-buffer (xtdb.codec/->id-buffer :my-string-attribute))]
      (mapv xtdb.codec/decode-value-buffer (xtdb.db/av snap attr-buffer "")))))
played around with that and it is a very fast way to get the distinct string values. Compare with:
(time (xt/q (xdb) '[:find v :where [_ :my-string-attribute v]]))
that takes 655ms in my case (22k documents with that attribute)
Accessing av like that won't do any temporal filtering, because it's across all time, and so wouldn't give repeatable consistency either. But it's certainly a neat trick and useful for cases where those things aren't a problem 🙂
The stats index tracks a HyperLogLog distinct-count approximation per attribute, which would be even quicker if you only want an (approx) count... but it also has the same atemporal characteristics
I don't think there's a better 'correct' solution than what you came up with, though you could also model this manually or create a custom secondary index if needed (at some cost of write performance and extra storage)
I was thinking about a case where I would have something like “tags”, where a user can write a new one but also select from the previously added ones: {:xt/id "foo" :tags ["some" "tags" "here"]}
so there would need to be a quick way to get all the tag values
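For reference, the temporally correct (slower) route for the tags case is the same kind of Datalog query as above. This is only a minimal sketch: `distinct-tags` is a made-up helper name, the two documents are illustrative, and it assumes an in-memory node started with an empty config. XTDB 1.x indexes each element of a top-level vector separately, so the query binds one tag at a time and the result set removes duplicates.

```clojure
(require '[xtdb.api :as xt])

(defn distinct-tags
  "Hypothetical helper: sorted vector of every distinct :tags value visible in `db`."
  [db]
  (->> (xt/q db '{:find [tag]
                  :where [[_ :tags tag]]})   ; result is a set of 1-tuples
       (map first)
       sort
       vec))

;; Usage against a throwaway in-memory node (illustrative data only)
(with-open [node (xt/start-node {})]
  (xt/await-tx node
               (xt/submit-tx node [[::xt/put {:xt/id "foo" :tags ["some" "tags" "here"]}]
                                   [::xt/put {:xt/id "bar" :tags ["some" "more"]}]]))
  (distinct-tags (xt/db node)))
```

Unlike the raw av scan, this respects valid time and transaction time because it runs against a db snapshot.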
my other idea was to get it from lucene index, but it doesn’t allow empty query string to return all… If I use raw lucene API, I guess it will have the same temporal filtering problem as av
FWIW, the Lucene module does support configuration to allow ~empty (leading) wildcard lookups, e.g. https://github.com/xtdb/xtdb/blob/e2f51ed99fc2716faa8ad254c0b18166c937b134/modules/lucene/test/xtdb/lucene/extension_test.clj#L189 but I'm not sure if it would be any faster 🤔
Yeah...Lucene is not a good benchmark for beginner friendly full-text search in 2022 🙂
but the .setAllowLeadingWildcard is a good thing to know, as it is useful for providing a good typeahead from a set of tags that works with compound words
e.g. if the user types “ball” it would find both “football” and “ballet” if you surround it with earmuffs (*ball*)
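To make the earmuffs behaviour concrete, here is a plain-Clojure sketch of what a `*ball*` lookup is expected to match. It deliberately uses in-memory substring matching rather than Lucene, so it only illustrates the matching semantics, not the index lookup; `suggest-tags` is a hypothetical name.

```clojure
(require '[clojure.string :as str])

(defn suggest-tags
  "Hypothetical helper: tags containing `typed` anywhere, case-insensitively,
   i.e. what a *typed* wildcard would match."
  [tags typed]
  (let [needle (str/lower-case typed)]
    (->> tags
         (filter #(str/includes? (str/lower-case %) needle))
         sort
         vec)))

(suggest-tags ["football" "ballet" "tennis"] "ball")
;; => ["ballet" "football"]
```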
can this leading wildcard be configured for the text-search available in regular queries?
sadly no, you have to register your own custom tatut-text-search function, as shown in that test file