#xtdb
2022-12-30
Nikolas Pafitis 14:12:32

Is it possible in an xtdb query to order by using a custom predicate? For example I'd like to order by a key which is an instant and I'd like to use tick/<

refset 12:01:17

Hey, apologies for the delayed response! The order-by uses Clojure's compare under the hood (per find arg), see https://github.com/xtdb/xtdb/blob/76db6a7448f0a3b8625adc2c90cc70b92aa9a9e9/core/src/xtdb/query.clj#L1728, which means the semantics should respect Clojure's standard ordering of Instants (consistent with tick/<). There's no way to hook into this with a custom comparator currently, but you can always do the order-by sorting in userspace to achieve the same result (the main disadvantages being that you can't make use of XT's spill-to-disk functionality, and that you lose the optimisation whereby pull is run after the order-by)
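A userspace sort along those lines might look like this (a minimal sketch; the node, attribute name, and find args are hypothetical, not taken from the thread):

```clojure
(require '[xtdb.api :as xt])

;; Run the query without :order-by, then sort the full result set in
;; userspace with an arbitrary comparator. The whole result set is
;; realised in memory, so XT's spill-to-disk no longer applies.
(defn q-sorted-by-instant [node]
  (->> (xt/q (xt/db node)
             '{:find [?e ?when]
               :where [[?e :event/timestamp ?when]]})
       ;; sort on the second find arg, descending; any custom
       ;; predicate could be wrapped into the comparator here
       (sort-by second #(compare %2 %1))))
```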

Nikolas Pafitis 15:12:31

Is there more extensive documentation on custom aggregates?
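For reference, XTDB 1.x lets you register a custom aggregate by extending the xtdb.query/aggregate multimethod with a reducing function (a sketch based on the documented pattern; the aggregate name here is made up):

```clojure
(require '[xtdb.query])

;; The multimethod implementation returns a reducing fn:
;; 0-arity = initial accumulator, 2-arity = step, 1-arity = finalise.
(defmethod xtdb.query/aggregate 'sort-reverse [_]
  (fn
    ([] [])                            ; init
    ([acc] (vec (reverse (sort acc)))) ; finalise
    ([acc x] (conj acc x))))           ; accumulate one value

;; usage: '{:find [(sort-reverse ?v)]
;;          :where [[?e :some/attr ?v]]}
```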

Mark Wardle 22:12:03

Hi all. I am experimenting with datalog stores - including datalevin and using xtdb with both an lmdb and in-memory store, but am getting poor performance from aggregate queries and xtdb. For example:

(time (xt/q (xt/db node) '{:find [(max ?v)] :where [[?e :uk-composite-imd-2020-mysoc/UK_IMD_E_rank ?v]]}))
"Elapsed time: 811.955584 msecs"
=> #{[42619]}

(time (xt/q (xt/db node) '{:find [(max ?v)] :where [[?e :uk-composite-imd-2020-mysoc/UK_IMD_E_rank ?v]]}))
"Elapsed time: 262.353125 msecs"
=> #{[42619]}

(time (xt/q (xt/db node) '{:find [(max ?v)] :where [[?e :uk-composite-imd-2020-mysoc/UK_IMD_E_rank ?v]]}))
"Elapsed time: 205.060667 msecs"
=> #{[42619]}

While with datalevin, and essentially the same data, but obviously with a schema, which probably helps:

(time (d/q '[:find (max ?rank) . :in $ :where [_ :uk-composite-imd-2020-mysoc/UK_IMD_E_rank ?rank]] (d/db (.-conn svc))))
"Elapsed time: 53.634125 msecs"
=> 42619

Am I missing something to optimise, or index to help such queries? Apologies if I have missed something in the documentation. Of course, I could fairly easily set up something to query and cache these kinds of aggregate data queries, so they are available more quickly... and I very much appreciate the two are designed for different purposes.

refset 13:01:06

Hi Mark, thanks for sharing your analysis and apologies for the delayed response! I haven't dug into Datalevin's code but I suspect both engines are doing the same thing here, just Datalevin is doing it faster because it has less machinery getting in the way (i.e. no bitemporal filtering or handling of content-hashes, which are currently an unavoidable tax in XT due to the temporal and schemaless aspects respectively). Essentially both engines will do a full scan of a sorted AV index and then filter to get the max value, without recognising that the aggregate could more effectively be "pushed down" into the main query. XT does have an alternative to offer though, which is a "range constraint" (i.e. the built-in > predicate, which is not the same as clojure.core/>). These constraints are detected and accelerated via the index to avoid any unnecessary scanning; however, this only works usefully in the ascending direction (because most KV stores only like to iterate in one direction), and only reliably in simple query scenarios. See https://github.com/xtdb/xtdb/blob/76db6a7448f0a3b8625adc2c90cc70b92aa9a9e9/test/test/xtdb/query_test.clj#L4073-L4168 and this discussion https://github.com/xtdb/xtdb/discussions/1514
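To illustrate, a range-constrained version of Mark's query might look like this (a sketch; the threshold value is made up, and the attribute name is taken from the thread):

```clojure
;; The [(> ?v 40000)] clause uses XT's built-in range predicate
;; (not clojure.core/>), so the engine can seek into the sorted AV
;; index at 40000 instead of scanning every value for the attribute.
(xt/q (xt/db node)
      '{:find [(max ?v)]
        :where [[?e :uk-composite-imd-2020-mysoc/UK_IMD_E_rank ?v]
                [(> ?v 40000)]]})
```

Note this only helps if you already know a lower bound near the maximum; otherwise the scan is unavoidable.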

refset 13:01:01

Out of interest, did you get any further with your analysis/conclusions? I'm guessing you already have a clear baseline performance in mind with Hermes 🙂

refset 13:01:28

Are there certain kinds of queries you were hoping to express more cleanly/efficiently via Datalog?

Mark Wardle 13:01:21

Thanks @U899JBRPF - I appreciate your thoughts on this. In hermes I am making direct use of LMDB and don't plan to replace that with a datalog abstraction layer at this point, as I have hand-optimised the indexing for the domain in question, and versioning/time travel happens by switching snapshots/endpoints. But I do have other services in which datalog makes a lot of sense - including a clinical data repository in which bitemporality will be an amazing fit, and other services such as my wrappers around data products like the UK's dictionary of medicines and devices, and socioeconomic deprivation population data. For those, time travel is less important, but using the same API as other tools would be very attractive, as would being able to declaratively switch between backend implementations depending on use-case (e.g. an interactive backend for an EPR might use a different backend than a small service designed for use in a data pipeline with Apache NiFi). For some calculations I need to find the 'max' value of a particular key, but I could track this on ingestion and build a little cache, and so get the benefit of the xtdb API with better performance.

Mark Wardle 14:01:36

The code I've ported to xtdb from datalevin is https://github.com/wardle/deprivare/blob/main/src/com/eldrix/deprivare/core.clj. It's been very easy to port, and performance is absolutely fine, except for the aggregate query to get 'max'. I could cheat, as I know the max will be in a certain range, so could use a limit and then max would only have to scan a much smaller subset of the values! Or I'll just keep track of a derived 'max' attribute for each 'key' on ingestion. The benefit of using a single datalog implementation outweighs the need to manually manage this one performance issue.
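Tracking a derived max on ingestion could be as simple as upserting a summary document alongside each batch (a hypothetical sketch; the document ID and :cached/max-rank attribute are made up, and it assumes a single ingestion writer rather than using a transaction function to make the read-modify-write atomic):

```clojure
(require '[xtdb.api :as xt])

(defn ingest-with-max!
  "Put a batch of docs and upsert a summary doc caching the max rank,
   so later reads are a single entity lookup instead of an aggregate."
  [node docs]
  (let [batch-max (apply max (keep :uk-composite-imd-2020-mysoc/UK_IMD_E_rank docs))
        current   (:cached/max-rank (xt/entity (xt/db node) :stats/imd) 0)]
    (xt/submit-tx node
                  (into [[::xt/put {:xt/id :stats/imd
                                    :cached/max-rank (max batch-max current)}]]
                        (for [doc docs] [::xt/put doc])))))

;; reading the cached max later:
;; (:cached/max-rank (xt/entity (xt/db node) :stats/imd))
```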

refset 14:01:12

Ah cool, thanks for that pointer and for sharing more context; it's always really fascinating (and useful) to see the problem domain first-hand! If there's anything else like this that you want to discuss sometime, please don't hesitate to bring it up... knowing that XT can be of some use to UK public health is particularly motivating ☺️

πŸ‘ 2