#datomic
2021-03-08
danm 15:03:36

Does Datomic have any sort of known issues around creating a number (~300-400) of connections in a very short space of time? We've got an app that on startup spawns 30-40 threads, each of which creates connections to pull data from 10 separate databases (the databases are unique per thread, so only 1 connection per db, but all on the same Datomic cluster). We frequently get a load of 'Datomic Client Timeout' exceptions with anomaly category interrupted on startup, and have to delete/recreate the container, even though the AWS metrics (Datomic Cloud production setup) don't show any particular issues with mem, CPU, etc. Once the app has started it seems to be fine and stable, no timeouts (with transact and q calls calling d/connect each time they run), unless we perform an action that requires it to run through and recreate a lot of those connections rapidly again.

ghadi 15:03:27

I would put in exponential retries with some jitter
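A minimal sketch of what that could look like in Clojure; the attempt count and delay values are illustrative, and retriable? is a placeholder predicate (one possible definition appears further down the thread):

```clojure
;; Sketch: retry a thunk with exponential backoff and full jitter.
(defn backoff-ms
  "Exponential backoff with full jitter, capped at cap-ms."
  [attempt base-ms cap-ms]
  (long (rand (min cap-ms (* base-ms (Math/pow 2 attempt))))))

(defn with-retry
  "Calls (f); retries exceptions matching retriable?, sleeping with
  jittered exponential backoff between attempts."
  [f {:keys [max-attempts base-ms cap-ms retriable?]
      :or   {max-attempts 5 base-ms 100 cap-ms 10000
             retriable?   (constantly false)}}]
  (loop [attempt 1]
    (let [result (try
                   {:value (f)}
                   (catch Exception e
                     (if (and (< attempt max-attempts) (retriable? e))
                       {:retry e}
                       (throw e))))]
      (if (contains? result :value)
        (:value result)
        (do (Thread/sleep (backoff-ms attempt base-ms cap-ms))
            (recur (inc attempt)))))))
```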

ghadi 15:03:52

the exception you receive should be marked with a :cognitect.anomalies/category that should indicate if it's retriable

danm 15:03:10

Yeah, it's interrupted (namespaced of course)

ghadi 15:03:14

you don't want to destroy a container just because 1/400 connections fail

kenny 15:03:15

It’s likely you’re getting throttled ops.

kenny 15:03:57

(Can check in CW dashboard)

danm 15:03:21

I was going to do some work to add that, but there has been a bit of pushback from some of the team, because recommendations/docs from Cognitect elsewhere recommend retrying on unavailable but don't mention other categories

danm 15:03:44

So having feedback that it would be a good idea is 👍

ghadi 15:03:48

interrupted, busy and unavailable are the 3 retriable anomalies
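A retriable? predicate for the sketch above could then key off those categories, assuming the client surfaces :cognitect.anomalies/category in the exception's ex-data (worth confirming against the exceptions you actually see):

```clojure
;; Sketch: treat the three categories listed above as retriable.
(def retriable-categories
  #{:cognitect.anomalies/interrupted
    :cognitect.anomalies/busy
    :cognitect.anomalies/unavailable})

(defn retriable-anomaly? [e]
  (contains? retriable-categories
             (:cognitect.anomalies/category (ex-data e))))
```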

danm 15:03:47

@U083D6HK9 You mean some Datomic-internal throttling, or on DynamoDB? We did use to see a bit of DDB throttling, so we changed from provisioned capacity to on-demand scaling (basically, no scaling needed but pay per request), and don't see them any more. Once we have a better idea of longer-term access patterns we'll probably change that back

ghadi 16:03:12

@U6SUWNB9N cloud or onprem?

danm 16:03:16

We are going between VPCs though, as the CloudFormation for Datomic Cloud sets up its own VPC rather than joining an existing one, and we already had an existing one with EKS etc. in it. We're not currently seeing any limits being hit re: inter-VPC comms though

ghadi 16:03:35

The CloudWatch dashboard should show Throttled Ops

ghadi 16:03:45

(Dashboard for Datomic)

ghadi 16:03:04

This is separate from DDB throttling, but could be caused by DDB throttling

kenny 16:03:43

Also curious if you're pointing all 300-400 to the primary compute group.

danm 16:03:22

Oh yes, with you. Nothing in the dashboard. Occasional OpsTimeout there too, but no OpsThrottled

danm 16:03:22

@U083D6HK9 At the moment, yes. We've not deployed any query groups (right terminology? I'm pretty new to Datomic), so the only instances running in the cluster are the 2x i3.large ones that are part of the standard template. Our access pattern involves a fair bit of writing. In some cases we're 1:1 read:write. There is a small lean towards q requests on startup as it loads initial state, but that is only maybe 10% above the transact requests, so I wasn't sure that query groups would help.

ghadi 16:03:39

plan for exponential retry/backoff on transact, connect, q
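Combining the earlier sketches with the client API might look something like this; the wrapper names and option values are placeholders, not anything prescribed in the thread:

```clojure
(require '[datomic.client.api :as d])

;; Sketch: hypothetical retrying wrappers for the three calls mentioned above.
(def retry-opts
  {:max-attempts 5 :base-ms 100 :cap-ms 10000
   :retriable?   retriable-anomaly?})

(defn connect-with-retry [client db-name]
  (with-retry #(d/connect client {:db-name db-name}) retry-opts))

(defn q-with-retry [arg-map]
  (with-retry #(d/q arg-map) retry-opts))

(defn transact-with-retry [conn tx-data]
  (with-retry #(d/transact conn {:tx-data tx-data}) retry-opts))
```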

danm 16:03:22

👍 We already know our next challenge is "how do we make this faster?", but that's a good start. Thank you. And we'll have metrics to know when we do retry

ghadi 16:03:06

look into Query Groups to isolate read load

ghadi 16:03:22

can scale those independently of the primary compute group
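For reference, pointing a separate client at a query group is a matter of using a different :endpoint in the client config; something along these lines, where the system name, query group name, region, and :server-type are placeholders to check against the current Datomic Cloud docs for your setup:

```clojure
;; Sketch: a client aimed at a query group instead of the primary compute group.
;; All names and the region below are made up for illustration.
(def read-cfg
  {:server-type :ion
   :region      "eu-west-1"
   :system      "my-system"
   :endpoint    "http://entry.my-query-group.eu-west-1.datomic.net:8182/"})

(def read-client (d/client read-cfg))
```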

kenny 16:03:14

You can consider pre-scaling a query group prior to app deploy.