Fork me on GitHub
#xtdb
<
2020-10-24
>
julienvincent15:10:19

What performance characteristics can I expect from crux? I'm currently running some non-scientific tests using generated queries that aren't particularly complex (I don't think) and on small datasets (sub 1k documents) and seeing some very bad performance. Queries are taking on the magnitude of 0.5-2s Example of a generated query:

{:query {:find [?eid] :where [(or (and [?eid :model "app"] [?eid :organization_id ?org_id] (not [(= ?org_id nil)]) [?eid :crux.db/id ?ref-eid] (not [(= ?ref-eid nil)])))]}}
The above query took ~700ms to complete on a total document set of 918 and a subset where (= :model "app") of 313. The above query, I'm sure, is not particularly efficient. However it still seems slower than I would expect. As a side question, is there anywhere I can find published benchmarks of crux?

julienvincent15:10:11

This is using rocksdb for persistence and running in a Dockerfile on my local machine

julienvincent15:10:47

When using the in-memory standalone distribution I see improved performance of 200-300ms for the same query and dataset

julienvincent15:10:14

I imagine I am doing something wrong somewhere, but not sure exactly what I should be focusing my efforts on. i.e should I focusing on storing my documents in a way that's easier for queries; am I doing something stupid in my query itself that could be improved; etc

refset15:10:02

Hey again, I think you should be able to rewrite your query like so and see much better performance:

{:find [?eid]
 :where [[?eid :model "app"]
         [?eid :organization_id ?org_id]
         [(some? ?org_id)]]}
Unless I'm misunderstanding your intent due to a typo, ?eid and ?ref-eid will always be the same value, so there's definitely no need for this clause:
[?eid :crux.db/id ?ref-eid]
The reason your current approach is slow is because not executes as a kind of subquery that results in iterating over the data more than is strictly necessary.

refset15:10:34

On the point of published benchmarks - we haven't released anything yet, but please feel free to have a poke around the crux-bench subproject in the repo if you're curious. And if you're really curious, feel free to DM me so I can share some numbers 😉

julienvincent16:10:22

Ok, thanks. The main reason for the weird clauses is reuse of clause generation functions. Reworking things as you suggest into the form:

{:query {:find [(distinct ?eid)] :where [(or (and [?eid :model "app"] [?eid :organization_id ?org_id] [(some? ?org_id)]))]}}
Had a massive improvement - down to 40ms for the same query and dataset. This however opens questions about the use of not which seems particularly expensive Take these two query examples:
{:query {
:find [(distinct ?eid)] 
:where [(or 
          (and [?eid :model "app"] 
               [?eid :organization_id ?org_id]
               [(some? ?org_id)]))
          (and [?eid :model "app"]
               [?eid :organization_id ?org_id]
               [(some? ?org_id)]
               [?eid :crux.db/id "5f524113cb3f3d7285fb4fe8"])]}}

;;; with not

{:query {
:find [(distinct ?eid)] 
:where [(or 
          (and [?eid :model "app"] 
               [?eid :organization_id ?org_id]
               [(some? ?org_id)]))
        (not 
          (or 
            (and [?eid :model "app"]
                 [?eid :organization_id ?org_id]
                 [(some? ?org_id)]
                 [?eid :crux.db/id "5f524113cb3f3d7285fb4fe8"])))]}}
I'd expect these two queries to perform about the same amount of work. the first executes in 30-40ms and the second, with a not, executes in 110+ms

👍 3
julienvincent16:10:53

Is this looping over the data twice?

refset01:10:37

There's quite a lot going on here so I'm not sure what the query plans might look like. You can take a look at the planning info shown in the logs if you set crux.query to "DEBUG" (via logback.xml or there's a util function in dev.clj) One issue will be that the query is unnecessarily unifying ?org-id with what's outside of the not. This can be mitigated using not-join instead Also the use of the two or rules seems unnecessary, though that shouldn't impact performance. The query plan should reveal the reasoning fairly clearly if you are able to share the logs, but it's probably the case that not ends up scanning exhaustively through the indexes separately to the outer query.

julienvincent13:10:41

> One issue will be that the query is unnecessarily unifying ?org-id with what's outside of the not. This can be mitigated using not-join instead I arrived at the same conclusion last night after re-reading the datalog docs - I tweaked the generator to wrap sub-queries in or-join. This had a good impact on performance. > Also the use of the two or rules seems unnecessary, though that shouldn't impact performance. More a side-effect of the query generation. Each one of those or blocks could contain n clauses - I'm just feeding it data with n = 1 So after fiddling a bit more this morning I have been able to isolate the issue down to the clause combination:

[?eid :organization_id ?org_id]
[(some? ?org_id)]
I pulled in my production dataset to see what real world queries would look like and with that subclause I am seeing query times of 3-7s depending on various inputs. Removing that set of clauses drops query time down to 30ms partywombat Problem is, I still need some way of saying "this relationship exists"

julienvincent13:10:33

Anyway, this is just a problem of wrangling my query generation. I think I understand the language and execution a lot better now. I'll play around with the query planner next. Thanks for the help!

refset13:10:15

Cool, good to hear 🙂 Certainly the more efficient way to establish "this relationship exists" (and avoid the need for the [(some? ?org_id)]) is to prevent your documents from having :organization_id nil in the first place (via spec or whatever, prior to submit-tx), then you can get away with just [?eid :organization_id] skip the extra ?org_id lvar altogether

euccastro22:10:39

the reference page on queries says "Use of arguments means we can avoid hard-coding values directly into the query clauses." Is that the only reason to use :args (with a single map, at least), or do they enable some caching like in datomic?

refset22:10:01

Yeah query plans get cached in Crux too. We've just updated that section of the docs on master as the next release supports :in bindings which effectively supercedes :args

thanks2 3
euccastro23:10:21

if I want a list of documents that are not referred to in some particular attribute by other documents, e.g., if I transact this:

[[:crux.tx/put {:crux.db/id :a :name "Anne"}] [:crux.tx/put {:crux.db/id :b :name "Bob" :parent :a}]]
and I want to know "who is not anyone's parent?", is this a reasonable way to query that?
{:find ['who]
 :where '[[who :name]
          [other :parent]
          (not [other :parent who])]}

refset00:10:50

hmm, maybe try [[who :name] (not [_ :parent who])]

euccastro14:10:56

that worked, thanks!

🙏 3
euccastro15:10:48

part of my question was: is this reasonable, or is it bound to be expensive on large datasets and I had better avoid it if possible?

refset16:10:31

Yeah it's definitely not ideal for query efficiency. It would probably be best to maintain/enforce an explicit :no-parent true on the who docs at write-time (via transaction functions or otherwise), if query speed is essential

euccastro16:10:14

thanks again!

🙂 3
euccastro23:10:07

note that I can't omit the [other :parent] part, or crux will complain that it doesn't know other inside the not clause. is that by design?

euccastro23:10:28

(also, here assume the :name attribute only appears in documents of a given 'type', hence the [who :name] to select for that type)

euccastro23:10:41

in my particular use case, it is not hard to change my data model a bit so I don't need that kind of query. should I, or is this OK as is?