Fork me on GitHub
#xtdb
<
2022-08-02
>
tatut12:08:14

what is the absolute fastest way to get all distinct values of a given attribute? (without caring about the entity at all)… I played around in the repl and dangerously used the idx directly

tatut12:08:12

(time
   (with-open [snap (xtdb.db/open-index-snapshot (:index-store (xdb)))]
     (let [attr-buffer (xtdb.memory/copy-to-unpooled-buffer (xtdb.codec/->id-buffer :my-string-attribute))]
       (mapv xtdb.codec/decode-value-buffer (xtdb.db/av snap attr-buffer "")))))
played around with that and that is very fast way to get the distinct string values

tatut12:08:22

(like under 1ms)

tatut12:08:24

compare with

(time (xt/q (xdb) '[:find v :where [_ :my-string-attribute v]]))
that takes 655ms in my case (22k documents with that attribute)

refset20:08:22

Accessing av like that won't do any temporal filtering, because it's across all time, and so wouldn't give repeatable consistency either. But it's certainly a neat trick and useful for cases where those things aren't a problem 🙂

refset20:08:20

The stats index tracks a hyperloglog distinct count approximation per attribute, which would be even quicker if you only want an (approx) count...but also has the same atemporal characteristics

refset20:08:50

I don't think there's a better 'correct' solution than what you came up with, though you could also model this manually or create a custom secondary index if needed (at some cost of write performance and extra storage)

tatut04:08:26

yes, I adding extra docs for each distinct value at write time would be one solution.

tatut04:08:00

I was thinking about a case where I would have something like “tags” where a user can write a new one, but also select from the previously added ones {:xt/id "foo" :tags ["some" "tags" "here"]} so there would need to be a quick way to get all the tags values quickly

tatut04:08:50

my other idea was to get it from lucene index, but it doesn’t allow empty query string to return all… If I use raw lucene API, I guess it will have the same temporal filtering problem as av

refset11:08:43

FWIW, the Lucene module does support configuration to allow ~empty (leading) wildcard lookups, e.g. https://github.com/xtdb/xtdb/blob/e2f51ed99fc2716faa8ad254c0b18166c937b134/modules/lucene/test/xtdb/lucene/extension_test.clj#L189 but I'm not sure if it would be any faster :thinking_face:

tatut11:08:24

neat, I’ll try that as well just for completeness sake

tatut11:08:51

the Lucene Java API interop seems pretty involved 😅

refset11:08:41

Yeah...Lucene is not a good benchmark for beginner friendly full-text search in 2022 🙂

tatut11:08:20

ok, I got 172ms for that solution, so the hacky index version is the fastest so far

tatut11:08:06

but the .setAllowLeadingWildcard is a good thing to know, as it is useful for providing a good typeahead from a set of tags that works with compound words

tatut11:08:49

eg. if user types “ball” it would find both “football” and “ballet” if you surround it with earmuffs

refset11:08:31

🙂 cool, glad that was a useful pointer

refset11:08:46

1ms sounds hard to beat

tatut11:08:21

can this leading wildcard be configured to the text-search available in regular queries?

refset11:08:40

sadly no, you have to register your own custom tatut-text-search function, like shown in that test file

refset11:08:44

I guess consistency is not a massive concern if this lookup is only driving a UI selector component (you can always check the selected value is visible as-of whenever before attempting any writes, like a manual bitemp filter)

tatut11:08:26

I would expect tags to only ever accrete in this case

tatut11:08:41

and it isn’t dangerous to show something that isn’t actually used currently

👍 1