#datomic
2024-04-22
favila 17:04:38

I updated Datomic from 1.0.6610 to 1.0.7075. After this change, the backup process never seems to complete due to DynamoDB timeout exceptions. The invocations are the same (no backup pacing), and both are incremental backups reading from DynamoDB provisioned storage and writing to the local filesystem. Is there some hidden configuration default that changed, or are the reads so much more aggressive that they're hitting provisioned capacity limits where they didn't before?
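(A minimal sketch of the kind of invocation being described, assuming the standard `bin/datomic backup-db` CLI; the region, table, database name, and backup path are all hypothetical placeholders:)

```bash
# Incremental backup: read from DynamoDB provisioned storage, write to
# the local filesystem, with no pacing or other flags. All names below
# are hypothetical.
bin/datomic backup-db \
  "datomic:ddb://us-east-1/my-datomic-table/my-db" \
  "file:/backups/my-db"
```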

Joe Lane 17:04:07

Hey @U09R86PA4, can you share (however you'd prefer) the flags used during your backup process?

favila 17:04:17

there are no flags used

favila 17:04:01

I just added backupPaceMsec and no change, still crashes
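(backupPaceMsec is set as a JVM system property on the backup invocation; a sketch, with a hypothetical pace value and the same hypothetical URIs as above:)

```bash
# Pace the backup by pausing between storage reads. The 100ms value
# here is illustrative, not the value actually used in this thread.
bin/datomic -Ddatomic.backupPaceMsec=100 backup-db \
  "datomic:ddb://us-east-1/my-datomic-table/my-db" \
  "file:/backups/my-db"
```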

favila 17:04:19

I'm beginning to suspect GC. Does the newer version use more memory?

Joe Lane 17:04:39

• Yeah, you may have more memory pressure, because we increased some branch-level concurrency in the backup process. I imagine that because you've got a large enough database, you're going to be taking advantage of that concurrency.
• The DDB timeouts are likely because of the new ddbRequestTimeout setting, which now defaults (IIRC) to 1s and previously was 1m.
• Try increasing mem.
• If increasing mem doesn't solve it, set -Ddatomic.ddbRequestTimeout=10000 (see the sketch below).
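(A sketch of the suggested timeout override; the 10000 ms value is the one Joe gives, while the URIs remain the hypothetical placeholders from above:)

```bash
# Raise the DynamoDB request timeout to 10s, back toward the old SDK
# default (1 minute) from the 1s default introduced in 1.0.7010.
bin/datomic -Ddatomic.ddbRequestTimeout=10000 backup-db \
  "datomic:ddb://us-east-1/my-datomic-table/my-db" \
  "file:/backups/my-db"
```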

Joe Lane 17:04:02

You might be getting DDB ReadThrottled if you're using provisioned capacity.

favila 17:04:27

right, that was my initial suspicion, but why was this not an issue on 6610?

favila 17:04:59

by all appearances the network activity from 7075 seems to have decreased

favila 17:04:27

I'm comparing jstat output from the two versions now
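(One way to make this comparison with stock JDK tooling; the pid and sampling interval here are hypothetical:)

```bash
# Sample GC utilization of the backup JVM every 5 seconds.
# FGC (cumulative full-GC count) and GCT (total GC time) climbing
# quickly indicate heap pressure; pid 12345 is a placeholder.
jstat -gcutil 12345 5000
```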

Joe Lane 17:04:32

https://docs.datomic.com/pro/changes.html#1.0.7010
• "New System Property: ddbRequestTimeout ..."

favila 17:04:57

yeah, so you’re saying the old value was much higher?

favila 17:04:08

(old, hidden value)

Joe Lane 17:04:29

Yeah, old value was the SDK Default of 1 minute (If I could only share the things I've seen...)

favila 18:04:22

I think it’s all memory pressure

favila 18:04:56

running on 6610, jstat shows GC isn’t as active. :backup/segment :msec values remain very low

favila 18:04:25

on 7075, GC is very active, and the msec values degrade over time, I suspect from GC

Joe Lane 18:04:31

Are they being skipped or copied? Since you're doing an incremental backup, they may just be getting skipped.

Joe Lane 18:04:50

(Not disagreeing, just an alternative hypothesis for what you're seeing)

favila 18:04:00

overall network throughput from 6610 is higher, according to CloudWatch metrics from the instance

favila 18:04:09

they are getting skipped, but 7075 never got past that. 6610 just finished a 30-minute incremental in < 5 minutes

favila 18:04:31

7075 just falls over in the middle

favila 18:04:22

6610 invoked full GC 5 times; 7075 invoked it > 150 times

favila 18:04:29

so yeah, I think it’s memory pressure

favila 18:04:35

7075 uses more memory

favila 20:04:59

For those following along at home, I was using a 2g heap with 6610; using a 4g heap with 7075 appears to be enough for it to complete.

🏡 2
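(A sketch of the resolution, assuming the `bin/datomic` script passes -Xmx through to the JVM as the Pro distribution's launcher does; the URIs remain the hypothetical placeholders from above:)

```bash
# A 2g heap sufficed on 1.0.6610; doubling to 4g was enough for the
# 1.0.7075 backup to run to completion.
bin/datomic -Xmx4g backup-db \
  "datomic:ddb://us-east-1/my-datomic-table/my-db" \
  "file:/backups/my-db"
```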