#datomic
2016-02-04
currentoor05:02:05

I'm seeing very slow queries and my database is not even that large. Any suggestions for how to proceed?

currentoor06:02:45

If I have an expensive query that returns 5MB of data, I can see that the first time the query is made it takes about 3 seconds. But the second time that same query is made, shouldn't it be way faster because of caching?

currentoor06:02:07

I'm wondering if I've set up something incorrectly.

currentoor07:02:25

I thought perhaps the peer does not have enough memory, but based on New Relic I can see that I haven’t hit the max heap size yet, so that’s probably not the cause.

currentoor08:02:16

Nevermind, turns out to be a different issue.

Ben Kamphaus20:02:35

@currentoor: if you revisit this and can share the query (or an obfuscated form of it), there are common issues, like clause ordering, typos in variable bindings, or unrelated clauses that produce Cartesian-product intermediate sets of tuples, that result in inefficient queries (and sometimes those inefficiencies only become glaringly obvious at scale).
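(For illustration, a sketch with hypothetical attribute names, assuming the usual (require '[datomic.api :as d]): clause ordering alone can make a big difference, since binding the most selective clause first keeps the intermediate tuple set small.)

(d/q '[:find ?order
       :in $ ?email
       :where
       [?order :order/customer ?customer]    ; matches every order in the db first
       [?customer :customer/email ?email]]   ; then filters down
     db "jane@example.com")

;; same query, reordered: narrows to a single customer before touching orders
(d/q '[:find ?order
       :in $ ?email
       :where
       [?customer :customer/email ?email]
       [?order :order/customer ?customer]]
     db "jane@example.com")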

Ben Kamphaus20:02:46

also note that index and log segments go into the object cache, which by default won’t consume the entire heap; you can change that via the objectCacheMax system property (defaults to half of heap). More on that here: http://docs.datomic.com/caching.html#object-cache
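(For reference, a sketch of how that is typically set; the values here are just examples. On a peer it is a JVM system property, while the transactor takes it in its properties file.)

# peer JVM startup flag (example value)
-Ddatomic.objectCacheMax=256m

# transactor properties file (example value)
object-cache-max=256m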

ljosa23:02:01

Does Datomic do okay with the transactor and storage on the other side of the country (~100 ms)? Our west coast people are trying to get started with Datomic and are reporting slowness. From a newly started peer JVM, a query that takes 5 s within the same AWS region as the servers takes 99 s from laptops in our Portland, Oregon, office (~100 ms ping times). And d/connect takes 80 s. It's much better for subsequent queries, as the peer starts to cache most of what it needs. But is this expected behavior, and is it network latency that is the determining factor? Or is something wrong? Should I be looking for Couchbase connection problems?

Ben Kamphaus23:02:59

@ljosa: it sounds like network latency is certainly a contributing factor, and that’s not a configuration I would typically recommend. Is there also a cross-regional consistency setting (i.e. replication or something) that’s a confounding factor?

ljosa23:02:12

no, no couchbase xdcr, as it doesn't guarantee the consistency that Datomic requires. just a transactor and a couchbase cluster, both in us-east-1.

Ben Kamphaus23:02:20

@ljosa: what are your memory-index settings?

ljosa23:02:06

on the transactor?

lockdown23:02:17

yep, I would try direct couchbase queries first

lockdown23:02:36

to make sure you can rule it out

ljosa23:02:10

memory-index-max=512m
memory-index-threshold=32m
object-cache-max=128m

Ben Kamphaus23:02:26

ok, that looks reasonable.

Ben Kamphaus23:02:05

the reason I ask about these two things: (1) a really common cause of sudden latency spikes for users on e.g. Cassandra is cross-datacenter consistency/replication; I’ve seen a two-orders-of-magnitude jump in latency from that. (2) Peers have to accommodate the memory index (and read all log/memory index segments into memory) on the initial call to connect, so that could be a contributing factor where even a small amount of latency has a big impact.

ljosa23:02:02

is the peer able to pipeline its couchbase reads, or is there a lot of read-wait-read?

ljosa23:02:55

I did some couchbase testing from my house in Massachusetts. Ping times around 25 ms. Connecting takes a few seconds. The query that they used in Oregon takes 30 s. Also tested directly with Couchbase, and things look reasonable: 200 ms to create cluster, 930 ms to open bucket, 30 ms to read a small document. No errors from Datomic or Couchbase.

Ben Kamphaus23:02:41

@ljosa: it’s certainly true that (especially with the cross-country latency contributing) a warm query will be significantly faster as it won’t be retrieving segments from storage. If the entire database or most frequently accessed segments can be held in the object cache on the peer, performance should be fine after the warm up period.

Ben Kamphaus23:02:54

do you have peer logging enabled?

ljosa23:02:41

yes, after my 30 s query I get the first metrics:

[Datomic Metrics Reporter] INFO  datomic.process-monitor - {:tid 22, :AvailableMB 2590.0, :StorageGetMsec {:lo 26, :hi 389, :sum 33313, :count 857}, :pid 37440, :event :metrics, :ObjectCache {:lo 0, :hi 1, :sum 75, :count 944}, :LogIngestMsec {:lo 0, :hi 601, :sum 601, :count 2}, :MetricsReport {:lo 1, :hi 1, :sum 1, :count 1}, :DbAddFulltextMsec {:lo 0, :hi 29, :sum 29, :count 2}, :PodGetMsec {:lo 54, :hi 76, :sum 186, :count 3}, :LogIngestBytes {:lo 0, :hi 3581246, :sum 3581246, :count 2}, :StorageGetBytes {:lo 67, :hi 48478, :sum 10179767, :count 857}}

Ben Kamphaus23:02:53

hm, the StorageGetMsec time for the peer doesn’t seem notably slow from the Datomic peer view (39 msec average)
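(Working that figure out from the metrics line above:)

(/ 33313.0 857)   ;=> ~38.9 msec per storage get, ~33.3 s of cumulative storage wait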

ljosa23:02:29

I'm going to try to increase concurrency and see if it changes.

Ben Kamphaus23:02:09

the same query shows an order of magnitude increase? I would only expect that from latency if e.g. the StorageGetMsec time is extremely fast (i.e. an order of magnitude lower if we’re talking 3 vs. 30 sec), though this assumes storage reads dominate.

Ben Kamphaus23:02:08

cold and hot query comparisons, system configs identical re: heap and object-cache size? (i.e. not crossing a memory threshold for intermediate representations on differently configured systems?)

ljosa23:02:19

-Ddatomic.readConcurrency=10 didn't change anything.

ljosa23:02:47

same query, in lein repl on identical laptops. No -Xmx

ljosa23:02:22

The query takes 5.3 s from an AWS instance in the east. Metrics:

{:tid 19, :PeerAcceptNewMsec {:lo 1, :hi 1, :sum 1, :count 1}, :AvailableMB 1200.0, :StorageGetMsec {:lo 0, :hi 5, :sum 444, :count 846}, :pid 12134, :event :metrics, :ObjectCache {:lo 0, :hi 1, :sum 81, :count 936}, :LogIngestMsec {:lo 1, :hi 619, :sum 620, :count 2}, :MetricsReport {:lo 1, :hi 1, :sum 1, :count 1}, :PeerFulltextBatch {:lo 1, :hi 1, :sum 1, :count 1}, :DbAddFulltextMsec {:lo 0, :hi 35, :sum 35, :count 2}, :PodGetMsec {:lo 12, :hi 31, :sum 71, :count 3}, :LogIngestBytes {:lo 0, :hi 5165426, :sum 5165426, :count 2}, :StorageGetBytes {:lo 67, :hi 48478, :sum 10071059, :count 846}}

Ben Kamphaus23:02:55

wow, StorageGetMsec average is 0.52 msec, vs. 39 msec in the other example, so I’d say that could certainly account for the difference (actually a very good fit to the 5.3 second versus 30 second ratio).
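(The same arithmetic on the AWS metrics line, for comparison:)

(/ 444.0 846)   ;=> ~0.52 msec per storage get, ~0.44 s of cumulative storage wait
;; vs. ~33.3 s of cumulative storage wait in the laptop run, roughly in line
;; with the ~30 s vs. 5.3 s query times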

ljosa23:02:17

I tried -Ddatomic.readConcurrency=1000 also, without much effect. (Well, it went from 30.8 s to 28.8 s, not sure if I just got lucky.)

Ben Kamphaus23:02:09

may just be luck; I think the latency is the bottleneck, with the storage retrieval component of the query simply masked by the extremely fast storage access in the primary config.

ljosa23:02:59

Do you have other tricks that may speed up the connect and first query? Or do our people in Oregon just have to get used to long startup times? (This is for dev work and ad-hoc analysis; we don't have Datomic peers on production servers in the west.)

Ben Kamphaus23:02:58

the usual answer for reducing latency in populating the object cache is memcached ( http://docs.datomic.com/caching.html#memcached ), but I’m not sure you’ll want to configure it for the dev work and ad-hoc analysis situation you describe. I’m also not sure where the costs of the queries are going, perf-wise.
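(For reference, a sketch of the memcached wiring with placeholder host:port values: the peer reads a datomic.memcachedServers system property, and the transactor takes a memcached entry in its properties file; see the caching doc linked above.)

# peer JVM startup flag (placeholder address)
-Ddatomic.memcachedServers=localhost:11211

# transactor properties file (placeholder address)
memcached=localhost:11211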

Ben Kamphaus23:02:44

i.e. whether it’s intermediate reps and joins, narrowing, etc., or whether your clauses match a ton of results that then have to be passed on. You could throw up a REST server to return query results for ad hoc analysis and submit queries to the endpoints; that way the peer stays warm, though I’m not sure that would save you much trouble if you’re getting really large result sets.

ljosa23:02:50

does such a memcached have to be reachable by both the transactor and the peer?

Ben Kamphaus23:02:52

some of the costly queries might be tunable via clause re-ordering, or via strategies for handling time/tx provenance, if those are a component?

Ben Kamphaus23:02:33

different Datomic processes can use a different memcached

ljosa23:02:25

so a developer could have a memcached on his laptop without the transactor needing to be configured with memcached as well?

ljosa23:02:11

the query itself is just a simple join and pulling four attributes on ~1285 joined pairs of entities:

(count (d/q '[:find
              (pull ?c [:c/d :c/i])
              (pull ?b [:b/n :b/x])
              :where
              [?c :c/e true]
              [?b :b/c ?c]] db))
=> 1285