Fork me on GitHub
#datomic
<
2021-05-05
>
joshkh16:05:48

let's say i have one million entities in my database that represent a simple checkin log of the guests at my hotel (business is booming), and i want to find the latest checkin date of all guests. is there a more performant method than using the max aggregate? in my test case it takes ~4 seconds.

(d/q '{:find  [(max ?d)]
       :in    [$]
       :where [[?g :guest/checkin ?d]]}
     db)
"Elapsed time: 4411.448713 msecs"

joshkh16:05:32

i thought the indexes might have something to do with it. this seems like a better option

(first
 (d/index-pull
  (d/db sort-conn) {:index    :avet
                    :reverse  true
                    :selector [:guest/checkin]
                    :start    [:guest/checkin]}))
"Elapsed time: 524.218481 msecs"

Joe Lane16:05:50

Are you timing this operation from your laptop through the client api + socks proxy to a cloud system?

joshkh16:05:06

yup, very crude measurement. though i thought max runs on the query group, so i'm just gaining* time on the i/o of the query result, right?

Joe Lane16:05:38

500ms is at least an order of magnitude more time than you would experience from within an ion for your d/index-pull operation.

joshkh16:05:23

ah yes, i suspect index-pull will be much faster when deployed

joshkh16:05:01

i was still a little surprised to see the max version take so long though

Joe Lane16:05:46

Is this "latest checkin for each guest" or "most recent checkin of any guest"?

joshkh16:05:27

the most recent checkin of any guest

Joe Lane16:05:41

TBH, in your measurements the network will dominate the time, assuming your data is already cached.

joshkh16:05:13

that is true, i'll run some more accurate tests in the vpc. still from my local repl i return "simple" queries in under 100ms so i don't think the network is blame for the four second return from the max aggregate. i understand that sticking with the index-pull/index-range is the best scenario anyhow, but i'm still curious why the max aggregate is so much slower than the index pull. do you know?

Joe Lane16:05:10

You're asking it to do significantly more work.

Joe Lane16:05:03

(dotimes [_ 10] 
 (time (d/q '{:find  [(max ?d)]
              :in    [$]
              :where [[_ :guest/checkin ?d]]}
         db)))

Joe Lane16:05:31

Note the _ . What is the output of that?

Joe Lane16:05:17

Actually, I don't think it matters. The query has to consider all ?d values and thus load them all from storage, put them in memory, etc. Asking the index only considers the first (or last) element and it's done.

Joe Lane16:05:29

So, yeah, query is doing more work here.

joshkh16:05:48

that's what i thought. in one case i'm pull and sorting one million values, and in the other case i'm just stepping along the index and stopping at the first/highest* value in my case (when :reverse true anyway)

joshkh16:05:54

"Elapsed time: 3668.642766 msecs"
"Elapsed time: 4459.585873 msecs"
"Elapsed time: 4711.916313 msecs"
"Elapsed time: 4607.739472 msecs"
"Elapsed time: 5062.974172 msecs"
"Elapsed time: 4710.631811 msecs"
"Elapsed time: 4298.420077 msecs"
"Elapsed time: 4505.261057 msecs"
"Elapsed time: 5294.66838 msecs"
"Elapsed time: 4841.231095 msecs"

Joe Lane16:05:43

And now try the same dotimes but with

(first
 (d/index-pull
  (d/db sort-conn) {:index    :avet
                    :reverse  true
                    :selector [:guest/checkin]
                    :start    [:guest/checkin]}))

joshkh16:05:37

yup, of course it's a lot faster

"Elapsed time: 488.688867 msecs"
"Elapsed time: 383.445716 msecs"
"Elapsed time: 395.543983 msecs"
"Elapsed time: 407.429619 msecs"
"Elapsed time: 394.223538 msecs"
"Elapsed time: 425.687384 msecs"
"Elapsed time: 382.726732 msecs"
"Elapsed time: 397.136071 msecs"
"Elapsed time: 392.209467 msecs"
"Elapsed time: 414.867937 msecs"

Joe Lane16:05:18

An important thing to look at is the variance between min and max here vs when running inside the vpc vs inside an ion. It gets pretty fast.

Joe Lane17:05:12

(might not matter for you in this case though)

joshkh17:05:57

in my opinion it's an interesting bit of "learned knowledge" about datomic. 🙂 i've seen my fair share of queries written to find the latest this-or-that using min and max aggregates. maybe it's just muscle memory from the SQL days:

(time (j/query pg-db ["select MAX(checkin) from guests"]))
"Elapsed time: 187.651582 msecs"
dubious test because that postgres connection is local. anyway i'll follow your advice and try out min/max from inside the vpc and from an ion. you would expect the ion version to run fastest, right?

Joe Lane17:05:50

Yes, but the point of the ion test is to discern "What am I really measuring?"

joshkh17:05:26

because the ion is running in the same memory space

Joe Lane17:05:33

Yes, but more importantly, You're NOT measuring network overhead or auth fns or lambda coldstart, etc You're focusing on JUST the time it takes to do the query/db operation, many times, after a warmup period of issuing the query a few times to warm all the caches (if the data fits in cache).

joshkh18:05:40

Thanks @U0CJ19XAM, those are some very useful details

JohnJ18:05:11

Curious, why are you saving the data to both a rdbms and datomic?

Joe Lane18:05:13

I don't think he is, I think he was using that as an example of what he was used to doing in other systems to achieve the same thing.

👍 3
jdkealy17:05:44

where can i find docs on the map query syntax ?

Joe Lane18:05:58

On-Prem or Cloud?

Joe Lane18:05:55

There is an example of passing a map to d/q, but were you talking about the map form of the query itself?

jdkealy18:05:30

I'm trying to figure out how to programatically generate a datomic query

Joe Lane18:05:02

Do you have a specific scenario in mind?

jdkealy18:05:23

Imagine i had the schema like the above and I had optional fields :firstname and :lastname, and i would reduce over the inputs and programatically add {:where [ ] } clauses depending on whether the key exists or not

Joe Lane18:05:39

(cond-> '{:find [[?e ...]]
          :in [$]
          :where []}
  lastName (update :in conj '?lname)
  lastName (update :where conj '[?e :user/lastName ?lname])
  firstName (update :in conj '?fname)
  firstName (update :where conj '[?e :user/firstName ?fname]))

Joe Lane18:05:11

Just use clojure ¯\(ツ)

jdkealy18:05:59

awesome 🙂 thanks