datomic 2021-11-15 | Slack Archive

Tatiana15:11:54

Hey! I'm using integrant in my app on ion. At the moment, as in ion-event-example, I call d/connect only in transactions. My integrant init-key returns (partial get-connection config). Based on this, do we open the connection only when called? Maybe someone made a similar integrant component. What is the best way to do this? Here's my code

(def get-client
  (memoize
   (fn [config]
     (d/client
        (cond
          (:dev? config) {:server-type :dev-local
                          :storage-dir :mem
                          :system "dev"}
          :default {:server-type :ion
                    :region (utils/get-param "region")
                    :system (utils/get-param "system-name")
                    :endpoint (utils/get-param "endpoint")})))))

(defn get-connection
  "Get shared connection."
  [config]
  (utils/with-retry #(d/connect (get-client config) {:db-name (utils/get-param "db-name")})))

hden04:11:29

Not sure if it’s the best solution, but I’ve made a duct / integrant module for this. https://github.com/hden/duct.module.datomic

kenny20:11:13

Would it be a fair expectation that a single value returning query aggregate would be more performant than pulling the same data & aggregating client side?

favila20:11:43

Client api: almost certainly; peer api or ion: maybe.

kenny20:11:22

(using client api)

kenny20:11:42

& yeah, that's what I thought. Here's some stats for a test of 50 runs with each approach: • aggregate: 2772ms avg, SD 718, min 2150, max 6415 • pull: 2317ms avg, SD 782, min 1807, max 5682

kenny20:11:27

This is over a fairly large dataset. For some reason, pulling & sending thousands of maps over the wire is quicker than aggregating on the db.

favila20:11:08

The query engine itself doesn’t have any aggregation smarts: it realizes a whole result set before it aggregates, and sometimes memory pressure can cause that to be slow. This is why index-pull or d/datoms with an incremental aggregator can sometimes be faster. On a peer, it can very often be much faster.

favila20:11:47

But on client-api especially, the smaller the pipe between the client and peer, the more you want to push the aggregation into the peer.

kenny20:11:11

Wouldn't you realize the whole result set regardless of pull or aggregate?

kenny20:11:45

e.g., (pull ?e my-pattern) vs ?by (sum ?x)

favila20:11:30

using q or qseq?

kenny20:11:39

favila20:11:57

oh, yes, that makes no sense

➕ 1

favila20:11:32

maybe it’s the grouping around ?by?

kenny20:11:32

What do you mean?

kenny20:11:51

Caveat: I am attempting to aggregate 13 optional values. Maybe it doesn't like that many aggregates?

kenny20:11:18

Query looks like this, where the find var & where clause is repeated 12 more times for each value I aggregate

{:find  [?services
           (sum ?v)],
   :where [[(get-else $ ?e my-attr 0.0) ?v]],,
   :with  [?r]}

favila20:11:24

what’s the pull version look like?

kenny20:11:28

Better comparison (first option has 12 additional clauses in where & 12 more find vars)

{:find  [?services (sum ?v)],
 :where [[?r :x ?e]
         [(get-else $ ?e :my-attr 0.0) ?v]],,
 :with  [?r]}

{:find  [(pull ?r [:services
                   {:x [*]}])]
 :where [[?r :x ?e]]}

kenny20:11:23

e.g.,

{:find  [?services (sum ?v) (sum ?2) ... (sum ?vN)],
   :where [[?r :x ?e]
           [(get-else $ ?e :my-attr 0.0) ?v]
           [(get-else $ ?e :my-attr2 0.0) ?v2]
           ...
           [(get-else $ ?e :my-attrN 0.0) ?vN]
           ]
   :with  [?r]}

favila20:11:06

where is services from?

favila20:11:27

I would say the result set after :where is going to be much bigger for the aggregation one

favila20:11:51

what happens if you sum only one ?v at a time? (i.e. 12 queries)

kenny20:11:53

Oh, sorry - omitted that in creating the example. Services is from ?r.

kenny20:11:03

Why do you think the aggregation one is larger?

I'll try that in 5m.

it has more columns

How?

?e ?r ?services ?vX… vs ?r ?e

kenny20:11:23

Hmm. Why'd you add r and e at the end?

favila20:11:40

the pull version only has ?r and ?e

favila20:11:46

I’m comparing the two

kenny20:11:37

Right. Why is there dupes at the end of https://clojurians.slack.com/archives/C03RZMDSH/p1637009275327000?thread_ts=1637006953.319600&channel=C03RZMDSH&message_ts=1637009275.327000

favila20:11:51

vs = versus

kenny20:11:24

Oh haha. Oops

kenny20:11:59

I see what you mean. I was trying to express: for a ?services, sum up all these values.

favila20:11:41

is :services one or many?

kenny20:11:40

one

kenny20:11:52

So, if I understand you correctly, I should compare the result set size of:

{:find  [?service ?v ?v2 ... ?vN],
   :where [[?r :service ?service]
           [?r :x ?e]
           [(get-else $ ?e :my-attr 0.0) ?v]
           [(get-else $ ?e :my-attr2 0.0) ?v2]
           ...
           [(get-else $ ?e :my-attrN 0.0) ?vN]]
   :with  [?r]}

{:find  [(pull ?r [:service
                     {:x [*]}])]
   :where [[?r :service ?service]]}

favila21:11:23

put it that way, you don’t even need ?service, it may be dropping it

favila21:11:32

so :where is just ?r

favila21:11:50

rows is number of unique ?r, cols is 1

favila21:11:37

are your benchmark numbers including whatever aggregation code you rolled yourself for the pull?

favila21:11:04

and the thing I was curious about is how this performs comparatively:

(->> attrs
         (mapv #(d/q '{:find  [?service ?attr (sum ?v)]
                       :with  [?r]
                       :in    [$ ?attr]
                       :where [[?r :service ?service]
                               [?r :x ?e]
                               [(get-else $ ?e ?attr 0.0) ?v]]
                       :with  [?r]}
                     db %))
         (reduce (fn [aggs [service attr s]]
                   (assoc-in aggs [service attr] s))
                 {}))

kenny21:11:41

> you don’t even need ?service, it may be dropping it What do you mean? I need the sums by service

kenny21:11:00

That is so much better. Will try now.

kenny21:11:28

Wowza. Your query version returns avg 531ms.

favila21:11:44

with all 12 attrs?

kenny21:11:48

Yes

favila21:11:09

that’s even issuing them serially

kenny21:11:34

As opposed to 12 concurrent for each attr?

favila21:11:39

yep

favila21:11:18

so, it’s something about large result sets, donno what exactly

favila21:11:34

but in general datomic’s datalog is really bad at aggregating

kenny21:11:43

Ok so stepping back. Do we know why the other was so slow? It seems like it an equivalent query.

favila21:11:44

so keep it simple

kenny21:11:01

Oh yeah. Your query is so much better. Thank you for that.

favila21:11:02

“the other” meaning summing all 12 at once?

Yes

not the pull

Right

I suspect it’s very wide, high-cardinality intermediate result-sets from the where

kenny21:11:03

Is there a way to know?

sure, source code

or maybe a profiler

Datomic's?

yes

✔️ 1

kenny21:11:47

Well, this is extremely exciting. 2-3s -> 0.5s is a huge win for this query. Thank you for taking the time to help.

favila21:11:56

can you indulge me once more?

kenny21:11:03

Sure

favila21:11:04

How does this compare:

favila21:11:07

(->> attrs
     (d/q '{:find  [?service ?attr (sum ?v)]
            :with  [?r]
            :in    [$ [?attr ...]]
            :where [[?r :service ?service]
                    [?r :x ?e]
                    [(get-else $ ?e ?attr 0.0) ?v]]
            :with  [?r]}
          db )
     (reduce (fn [aggs [service attr s]]
               (assoc-in aggs [service attr] s))
             {}))

favila21:11:22

where attrs is a vector of your attrs

favila21:11:51

notice we added ~~columns~~ rows (x12) but not ~~rows~~ columns

favila21:11:34

also I’m wondering why there’s a :with ?r that is thrown away

kenny21:11:23

How did we add columns?

favila21:11:47

sorry, I meant rows not columns

favila21:11:20

also the ?r is obvious now; although I think my reduction may be wrong

favila21:11:32

if your intention is sums per service

favila21:11:38

service+attr

kenny21:11:26

The output needs to be by ?service, but the sum must include all ?r

kenny21:11:01

Btw, I added 1 more clause to pull out the db ident

kenny21:11:20

We're back to 3s with the input version

favila21:11:06

weird

favila21:11:10

maybe it is just memory pressure

favila21:11:23

do you know if GC is happening during this query?

kenny21:11:32

The output is difference btw. The input version returns one map for all ?attr passed in. The non-input returns one map for all ?attr the db is actually using

favila21:11:05

I’m not sure what you mean. Could you just show the query code?

kenny21:11:02

Non-input:

{:query   '{:find  [?services ?attr-k (sum ?value)],
            :where [[?r :cs.model.monitored-resource/service ?services]
                    [?r :cs.model.monitored.cloud-acct/cloud-account ?cloud-acct]
                    [?cloud-acct :cs.model.monitored.cloud-acct/mode :cs.model.monitored.cloud-acct/mode-fetch-all]
                    [?r :cs.model.value-breakdown/orig-value-breakdown ?orig-value-breakdown]
                    [?orig-value-breakdown ?attr ?value]
                    [?attr :db/ident ?attr-k]]
            :keys  [service
                    vb-k
                    sum],
            :with  [?r]},
 :args    (list db),
 :timeout 60000}

Input

{:find  [?services ?attr-k (sum ?value)],
 :in    [$ [?attr ...]]
 :where [[?r :cs.model.monitored-resource/service ?services]
         [?r :cs.model.monitored.cloud-acct/cloud-account ?cloud-acct]
         [?cloud-acct :cs.model.monitored.cloud-acct/mode :cs.model.monitored.cloud-acct/mode-fetch-all]
         [?r :cs.model.value-breakdown/orig-value-breakdown ?orig-value-breakdown]
         [(get-else $ ?orig-value-breakdown ?attr 0.0) ?value]
         [?attr :db/ident ?attr-k]]
 :keys  [service
         vb-k
         sum],
 :with  [?r]}

kenny21:11:39

(count (dbu/q non-input))
=> 15

(count (dbu/q input))
=> 39

kenny21:11:16

The the input version, there are many maps that have a sum of 0.0

favila21:11:10

there is no get-else to turn those missing items into 0, they will just be missing from the output

kenny21:11:20

Right

kenny21:11:26

(and that's totally ok)

favila21:11:31

also this part forces EAVT [?orig-value-breakdown ?attr ?value]

favila21:11:47

is that OK? Is every ?attr in there trustworthy?

kenny21:11:32

Good point. It happens to be right now, but that seems like a very sketchy assumption.

favila21:11:29

you can also drop get-else if it doesn’t matter

kenny21:11:28

Dropping get-else avgs around 2.1s & has the same 15 results.

kenny21:11:06

Geez, yeah I don't think I can use [?orig-value-breakdown ?attr ?value] . Nothing guarantees a non-numeric key won't get added to ?orig-value-breakdown

kenny21:11:16

So now we're

;; non-input 
{:find  [?services ?attr-k (sum ?value)],
 :where [[?r :cs.model.monitored-resource/service ?services]
         [?r :cs.model.monitored.cloud-acct/cloud-account ?cloud-acct]
         [?cloud-acct :cs.model.monitored.cloud-acct/mode :cs.model.monitored.cloud-acct/mode-fetch-all]
         [?r :cs.model.value-breakdown/orig-value-breakdown ?orig-value-breakdown]
         [?orig-value-breakdown ?attr ?value]
         [?attr :db/ident ?attr-k]]
 :keys  [service
         vb-k
         sum],
 :with  [?r]}

;; input 
{:find  [?services ?attr-k (sum ?value)],
 :in    [$ [?attr ...]]
 :where [[?r :cs.model.monitored-resource/service ?services]
         [?r :cs.model.monitored.cloud-acct/cloud-account ?cloud-acct]
         [?cloud-acct :cs.model.monitored.cloud-acct/mode :cs.model.monitored.cloud-acct/mode-fetch-all]
         [?r :cs.model.value-breakdown/orig-value-breakdown ?orig-value-breakdown]
         [?orig-value-breakdown ?attr ?value]
         [?attr :db/ident ?attr-k]]
 :keys  [service
         vb-k
         sum],
 :with  [?r]}

favila21:11:58

The part I don’t get is why this

(mapv #(d/q {:find  [?services ?attr-k (sum ?value)],
 :in    [$ ?attr]
 :where [[?r :cs.model.monitored-resource/service ?services]
         [?r :cs.model.monitored.cloud-acct/cloud-account ?cloud-acct]
         [?cloud-acct :cs.model.monitored.cloud-acct/mode :cs.model.monitored.cloud-acct/mode-fetch-all]
         [?r :cs.model.value-breakdown/orig-value-breakdown ?orig-value-breakdown]
         [?orig-value-breakdown ?attr ?value]
         [?attr :db/ident ?attr-k]]
 :keys  [service
         vb-k
         sum],
 :with  [?r]} db %) attr-vec)

would be 0.5 seconds total, but this:

(d/q {:find  [?services ?attr-k (sum ?value)],
 :in    [$ [?attr ...]]
 :where [[?r :cs.model.monitored-resource/service ?services]
         [?r :cs.model.monitored.cloud-acct/cloud-account ?cloud-acct]
         [?cloud-acct :cs.model.monitored.cloud-acct/mode :cs.model.monitored.cloud-acct/mode-fetch-all]
         [?r :cs.model.value-breakdown/orig-value-breakdown ?orig-value-breakdown]
         [?orig-value-breakdown ?attr ?value]
         [?attr :db/ident ?attr-k]]
 :keys  [service
         vb-k
         sum],
 :with  [?r]} db attr-vec)

would be over 2 seconds

favila21:11:16

All I can think is memory pressure somewhere in the system

kenny21:11:52

It seems happy

kenny21:11:13

Input: 2112ms avg Non-input: 561ms avg

kenny21:11:33

Lol

{:find  [?services ?attr-k (sum ?value)],
 :in [$ ?my-attr-set]
 :where [[?r :cs.model.monitored-resource/service ?services]
         [?r :cs.model.monitored.cloud-acct/cloud-account ?cloud-acct]
         [?cloud-acct :cs.model.monitored.cloud-acct/mode :cs.model.monitored.cloud-acct/mode-fetch-all]
         [?r :cs.model.value-breakdown/orig-value-breakdown ?orig-value-breakdown]
         [?orig-value-breakdown ?attr ?value]
         [?attr :db/ident ?attr-k]
         [(contains? ?my-attr-set ?attr-k)]]
 :keys  [service vb-k sum],
 :with  [?r]}

kenny21:11:32

^ 628ms avg

favila21:11:12

For whatever reason, EAVT on ?orig-value-breakdown is cheaper than 12 AEVT on ?attr ?orig-value-breakdown (assuming these queries are representative)

favila21:11:26

That’s counter to my expectations

kenny22:11:38

> assuming these queries are representative Could you clarify this?

favila22:11:16

I’m just getting confused by the edits and revisions, pseudo and non-pseudo queries, not sure which ones correspond to which timing anymore

kenny22:11:34

Understandable. You should see the size of the comment block I've got now 😅 Here's a summary: Query 1 • 561ms avg • Fastest option • Downside is ?value is not guaranteed to be numeric.

{:find  [?services ?attr-k (sum ?value)],
 :where [[?r :cs.model.monitored-resource/service ?services]
         [?r :cs.model.monitored.cloud-acct/cloud-account ?cloud-acct]
         [?cloud-acct :cs.model.monitored.cloud-acct/mode :cs.model.monitored.cloud-acct/mode-fetch-all]
         [?r :cs.model.value-breakdown/orig-value-breakdown ?orig-value-breakdown]
         [?orig-value-breakdown ?attr ?value]
         [?attr :db/ident ?attr-k]]
 :keys  [service vb-k sum],
 :with  [?r]}

Query 2 • 2112ms avg • Idiomatic alternative to 1 with severe perf impact.

{:find  [?services ?attr-k (sum ?value)],
 :in    [$ [?attr ...]]
 :where [[?r :cs.model.monitored-resource/service ?services]
         [?r :cs.model.monitored.cloud-acct/cloud-account ?cloud-acct]
         [?cloud-acct :cs.model.monitored.cloud-acct/mode :cs.model.monitored.cloud-acct/mode-fetch-all]
         [?r :cs.model.value-breakdown/orig-value-breakdown ?orig-value-breakdown]
         [?orig-value-breakdown ?attr ?value]
         [?attr :db/ident ?attr-k]]
 :keys  [service vb-k sum],
 :with  [?r]}

Query 3 • 628ms avg • Hack around 2 to ensure ?value is sumable.

{:find  [?services ?attr-k (sum ?value)],
 :in [$ ?my-attr-set]
 :where [[?r :cs.model.monitored-resource/service ?services]
         [?r :cs.model.monitored.cloud-acct/cloud-account ?cloud-acct]
         [?cloud-acct :cs.model.monitored.cloud-acct/mode :cs.model.monitored.cloud-acct/mode-fetch-all]
         [?r :cs.model.value-breakdown/orig-value-breakdown ?orig-value-breakdown]
         [?orig-value-breakdown ?attr ?value]
         [?attr :db/ident ?attr-k]
         [(contains? ?my-attr-set ?attr-k)]]
 :keys  [service vb-k sum],
 :with  [?r]}

favila13:11:58

This makes it look like my original supposition was false: intermediate result set size wasn’t the problem. It really looks like index choice is what matters.

favila13:11:55

What is the total time if you run Query 1 12 times with ?attr as an input parameter (a different attr each run)

kenny15:11:05

I think your original supposition was right -- I didn't include the original query because the above are so much cleaner and I forgot about it 🙂 . In these cases, I think you're right again on index choice. I'll try that out.

kenny15:11:05

Do you think it matters whether I use a literal for ?attr or a query input?

kenny15:11:08

Tried both literal and input. Input • 473ms avg

{:find  [?services (sum ?value)],
 :in [$ ?attr]
 :where [[?r :cs.model.monitored-resource/service ?services]
         [?r :cs.model.monitored.cloud-acct/cloud-account ?cloud-acct]
         [?cloud-acct :cs.model.monitored.cloud-acct/mode :cs.model.monitored.cloud-acct/mode-fetch-all]
         [?r :cs.model.value-breakdown/orig-value-breakdown ?orig-value-breakdown]
         [?orig-value-breakdown ?attr ?value]]
 :keys  [service sum],
 :with  [?r]}

Literal • 458ms avg

{:find  [?services (sum ?value)],
 :in [$]
 :where [[?r :cs.model.monitored-resource/service ?services]
         [?r :cs.model.monitored.cloud-acct/cloud-account ?cloud-acct]
         [?cloud-acct :cs.model.monitored.cloud-acct/mode :cs.model.monitored.cloud-acct/mode-fetch-all]
         [?r :cs.model.value-breakdown/orig-value-breakdown ?orig-value-breakdown]
         [?orig-value-breakdown :cs.model.value-breakdown/provider-cost ?value]]
 :keys  [service sum],
 :with  [?r]}

kenny15:11:02

I can try all 12 concurrently too. Curious what the impact on the DB CPU would be.

kenny15:11:39

Concurrent • Code looks similar to the below, just ran 50 tests & took avg. • 2573ms avg

(def futs-f
  (fn []
    (mapv
      (fn [vb-k]
        (let [qmap {:query   '{:find  [?services (sum ?value)],
                               :in    [$ ?attr]
                               :where [[?r :cs.model.monitored-resource/service ?services]
                                       [?r :cs.model.monitored.cloud-acct/cloud-account ?cloud-acct]
                                       [?cloud-acct :cs.model.monitored.cloud-acct/mode :cs.model.monitored.cloud-acct/mode-fetch-all]
                                       [?r :cs.model.value-breakdown/orig-value-breakdown ?orig-value-breakdown]
                                       [?orig-value-breakdown ?attr ?value]]
                               :keys  [service sum],
                               :with  [?r]},
                    :args    [db vb-k],
                    :timeout 60000}
              results (manifold.deferred/future (dbu/q qmap))]
          results))
      cs.model.value-breakdown/value-breakdown-ks)))
(time (mapv deref (futs-f)))

kenny15:11:31

Spikes CPU

favila15:11:36

This strongly suggests that the object cache is simply not large enough to hold the AEVT indexes involved, and it has to swap them in and out as it runs

favila15:11:05

Using EAVT to pull those related stats is the critical difference here

kenny16:11:41

Hmm. To be sure I'm following, we're talking about the internal Datomic call for the AEVT index component {:components [?attr ?orig-value-breakdown]} ?

kenny16:11:49

Full view of the spike during the parallel queries.

favila16:11:23

[?orig-value-breakdown ?attr ?value] This one. If ?attr is bound, query prefers AEVT; ?orig-value-breakdown is bound but ?attr is not, then EAVT. That’s really the only difference between the fast and slow approaches

kenny16:11:45

Wait, let me back up. The query for 1 attr https://clojurians.slack.com/archives/C03RZMDSH/p1637077448348900?thread_ts=1637006953.319600&cid=C03RZMDSH took under 500ms. Running all 12 concurrently took 2573ms on avg. So you're saying that the concurrent one took longer b/c "the object cache is simply not large enough to hold the AEVT indexes involved, and it has to swap them in and out as it runs" ?

favila16:11:19

that’s how it seems

favila16:11:42

running 12 concurrently is no different than running 1 query with 12 attrs as input

favila16:11:53

or, I was trying to see if that would be so

kenny16:11:15

And that is true, given our results, right?

favila16:11:54

yes

favila16:11:45

IO is death

kenny16:11:49

Got it. Very insightful. Our concurrent one is slightly worse than letting Datomic handle that. Ok so the theory on the perf difference is that the query that uses the AEVT index is much slower due to a too small object cache.

favila16:11:41

yeah, to verify this you should observe the object cache hit rate directly, or storage gets, or even just aggregate network

favila16:11:17

if this is right, then the EAVT query will have a higher hit rate, lower storage get, and lower network activity as it runs

favila16:11:31

and the AEVT version the opposite

kenny16:11:24

I see. Ok. So you're supposing that the cache is large enough & warm enough to hold EAVT index but not the AEVT?

favila16:11:34

yes

favila16:11:56

Are there any other query loads on this instance? it could be they are using EAVT of these indexes already (e.g. via pull).

favila16:11:21

so maybe they are already loaded, and when evicted to make more space they are quickly re-loaded

kenny16:11:06

No, just me testing these scenarios. Certainly one of the most common queries executed in this group would be a pull.

favila16:11:39

I mean direct d/pull; I’m fairly sure pull in a query uses AEVT also

kenny16:11:48

Ah. Then there's a lot of both d/pull and d/q + pull

kenny16:11:23

If the object cache was full, would we expect very little available memory?

favila16:11:05

I’m not sure how it works on cloud. On peer, by default half the heap space is reserved for object cache, and this is tuneable

favila16:11:16

it doesn’t just consume heap opportunistically

favila16:11:53

so you may in fact have plenty of heap free, but still have a full and churning OC

kenny16:11:17

> if this is right, then the EAVT query will have a higher hit rate, lower storage get, and lower network activity as it runs This instance is running with ssd valcache so perhaps no network changes?

favila16:11:32

ah, yeah then it would be disk

favila16:11:46

that makes the difference even more alarming

kenny16:11:16

Because you'd expect valcache to make up the difference with object cache not holding the aevt index?

favila16:11:13

I just wouldn’t expect it to be 4x slower

favila16:11:29

valcache is pretty fast

favila16:11:36

it definitely wouldn’t make up the difference completely. A heap pointer is always going to be much much faster than IO+decompression+deserialization

✔️ 1

kenny16:11:45

I thought so 🙂 & I think it should have no problem loading whatever it needs given the ssd is 0.5tb. Total system is a little over 900m datoms, but this db we've been testing on is some fraction of that.

kenny16:11:40

I'll run the tests again & try to align the disk usage metric with the test.

kenny23:11:48

Got ambushed by a flurry of meetings 😢 Now to run the tests...

kenny23:11:57

Booted up a brand new instance to make sure we've got a good starting point. Calling d/connect on the db we've been working with results in 14.6m bytes written.

kenny23:11:04

I started with Query 3, the most likely candidate I will use later. The first run of the query took 3583ms. The second run took 1445ms. We now have a second spike in our graph that happened when I ran query 3 for the first time. It wrote 21.6m bytes.

kenny23:11:44

Running query 3 50 times resulting in an avg time of 666ms and no disk bytes read.

kenny23:11:27

Now I ran the Concurrent query. The first run took 6074ms. The second run took 4343ms. And our 3rd disk write appears. It wrote 20.9mb.

kenny23:11:27

Running Query 4 50 times resulting in an avg time of 4067ms and no disk bytes read.

favila23:11:21

Are you sure this is the valcache drive? None of these seem to have reads. Do you not have access direct oc, valcache, and storage get/hit rate metrics?

kenny23:11:43

Good point. I am not 100% certain - will double check aws docs. Perhaps it's simply not using valcache for some reason.

kenny23:11:00

Perhaps because result is available in object cache?

favila23:11:19

If that’s true the mystery only deepens

favila23:11:46

But these metrics should be published directly, we shouldn’t have to infer from disk activity

kenny23:11:10

This is the full dashboard Datomic provides for a query group (plus my manual disk chart)

kenny23:11:23

I added the vertical lines for each event.

kenny23:11:25

Actually, I think it's highly likely that disk chart is the valcache drive since we are seeing disk writes that directly align with when the queries happen.

favila23:11:52

It claims hit rates of 100%. So my theory is not correct and I am out of ideas

kenny23:11:45

I think that "Cache Hit Ratios" chart is only has lines for EFS, which is different.

kenny23:11:55

From https://docs.datomic.com/cloud/whatis/architecture.html#caching, it seems Cloud's cache order is: 1) object cache 2) valcache 3) efs 4) s3 fallback

kenny23:11:19

So it's skipping valcache for some reason or the chart's legend is misleading.

kenny23:11:55

Erm, ddb should be in that list somewhere, I'd think.

kenny23:11:03

I sshed into the node to make sure we aren't going crazy. From df -h:

Filesystem                                 Size  Used Avail Use% Mounted on
devtmpfs                                   7.5G     0  7.5G   0% /dev
tmpfs                                      7.5G     0  7.5G   0% /dev/shm
tmpfs                                      7.5G  496K  7.5G   1% /run
tmpfs                                      7.5G     0  7.5G   0% /sys/fs/cgroup
/dev/xvda1                                 8.0G  2.5G  5.6G  31% /
  8.0E  173G  8.0E   1% /opt/datomic/efs-mount
/dev/nvme0n1                               436G  102M  414G   1% /opt/ssd1
tmpfs                                      1.5G     0  1.5G   0% /run/user/4242
tmpfs                                      1.5G     0  1.5G   0% /run/user/1000

kenny00:11:57

Very interesting results I might say. Datomic's CF template attaches an EBS gp2 (non-ssd) drive to the nodes automatically. It surprises me that the drive has 2.5gb used. The nvme drive does have 102m written though.

kenny00:11:12

Ok, it's definitely written to the ssd drive.

tree -da /opt/ssd1/
/opt/ssd1/
├── datomic
│   └── valcache
│       ├── 000
│       ├── 001
│       ├── 002

kenny00:11:03

Just need info on what the Cache Hit Ratios chart mean. There are 2 paths: 1. Cache Hit Ratios comprises any Datomic cache type. So those points we've seen could mean that Datomic is reaching for valcache in our test queries. 2. Cache Hit Ratios really only includes data for the EFS cache. If so, why is it reaching for EFS when it likely has the data available in valcache? Both of these would lead to more questions and still leave us with a mystery of why one query is slower than the other.

kenny00:11:03

This has been a very enlightening discussion. I believe we are well into the area of opening a support ticket 🙂

kenny00:11:32

One observation from our previous discussion. We had previously agreed on this: > Ok so the theory on the perf difference is that the query that uses the AEVT index is much slower due to a too small object cache. However, that doesn't seem to hold. I would think that once you load the AEVT index into the object cache subsequent queries using that index would be quick.

favila00:11:57

That’s right. So either these metrics are deceiving or not revealing oc churn, or the theory is incorrect. I am pretty confident that the different indexes for that clause is all that is different, but cannot explain the performance difference

kenny00:11:34

Got it. I've opened a support ticket with everything we learned here. Thank you so much for your help. Happy to update you on our findings if you're interested.

2021-11-15

Channels