Fork me on GitHub
#xtdb
<
2021-03-18
>
Tuomas11:03:55

I'm trying out crux for the first time and I'm very impressed so far. It took some time to figure out how to make some queries effective but I think I've figured it out. I have cases in which I need a count of entities with specific attributes like:

(crux/q (crux/db node) '{:find [(count ?e)]
                           :where [[?e :membership/type :member]]})
But when you have a million documents and no way (?) to narrow the search down, you need to do a lot of work and queries get slow. If instead of a :membership/time :member a v you have a :list.member :list.member.members e a you can narrow down the result and go form seconds tens of milliseconds
(crux/submit-tx
   node
   [[:crux.tx/put
     {:crux.db/id :list.members
      :list.members/members
      (map
       first
       (crux/q (crux/db node)
               '{:find [?e]
                 :where [[?e :membership/type :member]]}))}]])

  (q '{:find [?m]
       :where [[:list.members :list.members/members ?m]]})
      ffirst
      count)
A question: Is it possible to somehow configure/decorate/modify crux to make these sort of things under the hood, or should I make sure to model my data in such a way, that I can always start with some entity to narrow down the list of documents to search from like list.persons list.whatever?

jarohen11:03:04

Hey, I wouldn't recommend storing that many elements in a list in a single document - in order to update that list, you'd need to supply the whole list again

jarohen11:03:55

I wonder whether the performance gains that you're seeing are in fact caused by Crux caching that value, and then Clojure caching the count of the collection 🙂

Tuomas11:03:54

Restarting the repl+node should clear caches?

jarohen11:03:06

it should, yep

jarohen11:03:36

If you've done that and it's still that much faster, that'll require a bit more thought :thinking_face:

Tuomas11:03:50

; after reset
"Elapsed time: 88.236937 msecs"
131077
; second eval
"Elapsed time: 12.391279 msecs"
131077
; the original query after few evals
"Elapsed time: 2674.383644 msecs"
#{[131077]}

jarohen11:03:53

In terms of what Crux will be doing differently in the two cases, the main difference will be that when there's only one entity in play, it'll only have to do one temporal resolution rather than a million - but in that case we're talking (roughly) 2 million operations down to 1 million + 1, which I doubt would yield orders of magnitude different performance

Tuomas11:03:54

Million was a ballpark guess. Hmm how can I count all documents?

jarohen11:03:20

(crux/attribute-stats node) will give you an estimate

jarohen11:03:54

that's across all documents across all time, it doesn't tell you how many documents are visible now (although it doesn't seem it'll be too different in your test case)

Tuomas11:03:36

:crux.db/id 480587

Tuomas11:03:03

Shouldnt be too different

Tuomas11:03:01

I figured that in the original query, you'd have to check those 480k documents for :membership/type :member, and in the second query you'd only need to check 1 document for one value (the vector), which would be the source if this drastic difference

nivekuil11:03:34

I think you're inserting a list there with map, right? crux would actually break down a vector into individual EAV triples in the index, but iirc it keeps a list whole

jarohen11:03:04

good spot 👏 yep, this is what's happening

Tuomas12:03:26

If you shouldn't put too many elements in a list in a single document, then can you recommend another way of speeding up attribute based queries with aggregations where you are counting thousand of documents? I seem to get about 1s query time / 30k documents when searching by attribute with my dataset. Or should I maybe be looking at a materialized view?

Tuomas12:03:04

I'll try out lucene and see what happends with text fields

Tuomas12:03:18

For me, basically 95%+ of queries are get an entity with this id and a graph of data relating to it. These are super fast with crux. 5% or less of the queries are "find some type of entity with these values in these these attributes". If lucene/something helps with these "search from all docs" types of queries, then I can finally sneak in Clojure at work.

🙂 6
refset12:03:22

I suspect you might get a small performance boost by performing the count using a Datalog aggregate, as in :find [(count ?e)] However a user-space materialized view of some kind (either in-memory, or persisted alongside the data in your transactions) will undoubtedly give the best performance. Edit: I should add that implementing a materialized view and keeping it consistent and up-to-date is not a trivial undertaking though. e.g. transaction functions can help but then you will be trading off against write-speed, and the temporal semantics can get tricky

jarohen12:03:35

Even the Crux built-in count will still necessarily have to find all the entities with that type, but yes, it will do less work than getting the entities back and counting in user-space.

jarohen12:03:12

@UH9091BLY: your requirement to 'search from all docs' may well not need to scan all the documents every time - if you have any kind of filter available (say, 'created in the last week', or 'name = "John"' - your example to 'get an entity with this id' is obv the absolute best case 🙂) this will significantly limit the amount of data Crux has to search

jarohen12:03:28

You may well also not need Lucene, unless you're looking to search within a string field, fuzzy match, or any of the other specific benefits of Lucene's full-text search.

jarohen12:03:19

Crux also queries lazily - if you only need to return a page of members at a time, say, you should find that this is relatively fast

Tuomas13:03:55

Thanks this clears things up. For some reason I assumed there wouldn't be any way to know which document had which attribute. With lucene, checkpoints and limits I think crux will work way better than what we have now

🤞 3
jarohen13:03:07

Great to hear - let us know if you have any further questions or issues 🙂

❤️ 3
Steven Deobald17:03:59

> I think crux will work way better than what we have now @UH9091BLY Out of curiosity, do you mind telling us loosely what you have now? (Feel free to turn down that request if it's none of my business, though. 🙂) I'm always curious about the legacy system in the case of folks migrating a production system to Crux.

Tuomas18:03:25

@U01AVNG2XNF Wouldn't call it legacy yet but we're using Apollo Server + MySQL mostly.

👍 3