Fork me on GitHub
#xtdb
<
2021-11-03
>
Tuomas10:11:36

Hi, I'm test calculating aggregates and looking for pointers for improving query performance (atm im 33x slower than with similar queries in sql). My queries group by a parent, make a few joins to reach children and check existence or inexistence of attribute along the way. sql super small dataset 30ms:

select p.id, count(c3.id) from parent p
join child1 c1 on c1.parentId = p.id and c1.someColumn is not null
join child2 c2 on c2.child1Id = c1.id and c2.someColumn is null
where p.id = 1
group by p.id
order by count(distinct c3.id) desc
limit 0,10;
datalog super small dataset ~1000ms:
'{:find [p (count c3)]
       :in [p]
       :where [[c1 :c1/p p]
               [c1 :c1/c2 c2]
               (not [c2 :c2/attribute])
               [c3 :c3/c2 c2]
               [c3 :c3/attribute1]
               (not [c3 :c3/attribute2])]
       :order-by [[(count c3) :desc]]}
       :offset 0
       :limit 10
I'm expecting 50-100x more data what would mean I'd eather have to change my datamodel to not depend on :c/some-attribute checks, require less joins, design my ui in a way that you would "start generating" a report and get notified when its ready or just precalculate values / add another data store for these types on queries

Tuomas11:11:16

I'm not sure but it seems like a custom aggregate helps

(defmethod xtdb.query/aggregate 'count-c3 [_]
  (fn
    ([] 0)
    ([acc] acc)
    ([acc x]
     (let [c3 (xt/entity (xt/db @node) x)]
       (if (and (:c3/attribute1 c3)
                (not (:c3/attribute2 c3)))
         (inc acc)
         acc)))))

refset00:11:31

Hey, so not rules are definitely not free, although you might get some speed up if you change them both to be not-join instead

'{:find [p (count c3)]
       :in [p]
       :where [[c1 :c1/p p]
               [c1 :c1/c2 c2]
               (not-join [c2] [c2 :c2/attribute])
               [c3 :c3/c2 c2]
               [c3 :c3/attribute1]
               (not-join [c3] [c3 :c3/attribute2])]
       :order-by [[(count c3) :desc]]}
       :offset 0
       :limit 10
I think your idea to use a custom aggregate is here is pretty great though - probably the best solution 🙂 and very novel (to me at least!)

refset00:11:43

which SQL engine are you comparing against?

refset00:11:10

in general though, an atemporal and row-oriented query engine will have some fundamental advantages for ~optimally handling this kind of query

refset00:11:21

oh, you could also store explicit nil values to avoid the not rules in the query

Tuomas06:11:31

Yeah I did compare not and not-join earlier and thought not was faster, but after putting some thought to benchmarking I got opposite results. I got down from 1000ms to 730ms using not-join and custom aggregate for c3. I also tried using a custom aggregate at c2 level and pulling for c3 data but got worse performance. About explicit nil values. Are you suggesting that on usually [e :a nil] is more performant than (not-join [e] [e :a]) ? Also seeing that not-join provides better perf I'm wondering why would you ever prefer not over not-join?

Tuomas07:11:43

I tried getting rid of not-join with a custom predicate function and that seems to help. Down to ~650ms

(defn c2-without-attribute->c3 [c2eid]
  (let [c2 (xt/entity (xt/db @node) c2eid)]
    (when (not (:c2/attribute c2))
      (:c2/c3 c2))))

[(product.core/c2-without-attribute->c3 c2) c3]
Oh yeah and MySQL 5.7.12 is around 30-40ms

refset08:11:46

> Are you suggesting that on usually `[e :a nil]` is more performant than `(not-join [e] [e :a])` ? Yes, but only because of how queries execute, because not-join will fully materialize an intermediate result set (because it's really a subquery, and its result set then gets sorted before returning), whereas [e :a nil] will participate in the lazy evaluation

refset09:11:11

not-join may not always be faster than not, but it really helps to have clearer scoping when sanity checking performance, in my experience

refset09:11:47

try doing it as a custom predicate function like this:

(defn c2-without-attribute->c3 [db c2eid]
  (when (not (ffirst (xt/q db {:find [c2] :in [e] [[e :c2/attribute c2]]} c2eid)))
    (ffirst (xt/q db {:find [c3] :in [e] [[e :c2/c3 c3]]} c2eid)))

[(product.core/c2-without-attribute->c3 c2) c3]

Tuomas10:11:32

Doing 2x q (is some cases) instead of 1x entity makes the query almost 2x slower. If the first q short circuited more often It should get more and more performant. Using the $ convention to pass db to the entity version seems to be upgrade though. For this specific dataset I'm down to about 520ms / query. I did setup a playground with a different type of dataset but similar query needs and data structure. Explicit nils seemed to give a huge boost to query performance, but I worry about having to backfill nils into the past when more attributes come up. Also custom aggregates and predicates seemed to degrade performance, so it all seems to really come down to the dataset you are querying. Btw, you can pass db to custom predicates, but it seems you can't do the same for aggregates? For my niche case it could give another small boost to queries

refset23:11:44

> Doing 2x q (is some cases) instead of 1x entity makes the query almost 2x slower. interesting! Makes sense though, since IO is expensive. It would be a different story with an S3 document store + cache miss 🙂

refset23:11:59

> you can pass db to custom predicates, but it seems you can't do the same for aggregates? seems not :thinking_face: I don't think anyone has considered this kind of application for aggregates before, although it doesn't seem like a completely crazy idea to me to expose db as an arg - feel free to create an issue if you're keen to discuss the possibility(!)

refset00:11:07

I don't fully understand everything in your gist after a brief skim (thanks for sharing though 🙏 ) but it's cool to see that the naive case is hard to beat https://gist.github.com/tkovis/e1dfde9df1c20dc11e08f51d8a12d08e#file-clj-L163-L178

Tuomas06:11:36

Yeah. Playing around with the rates at which documents get the must-have and must-not-have attributes and the amout of c1/c4 entities gives the opposite result though (220ms / query for vanilla and 20ms / query for custom aggregate + predicate).

Tuomas07:11:38

Also I had an idea that instead of checking that an attribute doesn't exist, or that 1 attribute doesn't exist and another exists, or checking anything else of that sort, you could also add an attribute that signifies the same thing. Eg. a "potential new customer" could be a customer document that doesn't have a relation to an "order" document, or it could be a customer document with the attribute :customer/potential-new? . You would need to make sure to add and remove those attributes at write time and I'm not sure how you would backfill new "datalog rule"-attributes but it seems like and it makes sense that you get the best perf using those https://gist.github.com/tkovis/92877d637f30b32d0aeb14a193d43d5f

refset11:11:48

Nice! It's hard for me to say what might be best in your case without really diving in, but certainly the full spectrum of modelling and performance is valuable to understand before loading in too much data 🙂 Again, thanks for sharing these gists, it's quite valuable for everyone to similarly be able to approach this kind of experimentation and understanding of tradeoffs

Tuomas12:11:16

full spectrum of modelling and performance is valuable to understand any resources you'd recommend?

refset21:11:02

well, as it relates to XT specifically, there are a few threads worth reading to build an appreciation of how the execution model works. Particularly https://github.com/xtdb/xtdb/discussions/1514 and https://juxt-oss.zulipchat.com/#narrow/stream/194466-xtdb-users/topic/query.20planner.20documented.3F

refset21:11:13

I think you already have a good grasp for exploring the various possibilities quite scientifically ☺️

refset13:11:25

@UH9091BLY I just realised another option here, which looking at my repl is faster than not or not-join (though probably still not faster than entity), is doing:

[(get-attr e :b :not) [n]]
[(= :not n)]
...where :not is a user-supplied default value

Tuomas13:11:06

Thanks for pointers. Relating to this same issue, I was thinking about somehow creating a sort of a tabular materialized view in mysql/pg if order by/aggregration queries prove to be too sluggish. Secondary indices seem a bit daunting to implement and Im not sure if they would even help but I guess you could insert/update/delete rows in a tx fn too and expose the data with something like walkable

refset14:11:04

Interesting - there are certainly a lot of tradeoffs to consider. Are the reports you are generating inherently interactive?

Tuomas08:11:13

Depends on what you mean by interactive. I am looking to support arbitrary filtering of items and sorting by attribute. Attribute could be something from the document, a computed value or an aggregate. I am hoping to be able to provide ”realtime” reports that are generated on load but also historic reports that could be generated by submitting a form and persisted for future referense/sharing. Historic reports are generated once by submitting a form, so I presume users dont expect it to be instant. Realtime reports on the other hand should be a lot faster because users dont actively generate them by submitting something, but instead just navigate to see

refset23:11:23

> Realtime reports Ah yes, that's exactly what I mean by interactive 🙂 I'd be happy to chat about this on a call next week, if you're free at all?