#sql
2023-01-03
cddr 11:01:26

Does anyone run on aurora? How do you teach your connection-pool about autoscale events?

devn 19:01:22

i hadn’t thought about that, no. are you thinking you want to resize your pool dynamically?

devn 19:01:03

at least with hikari i'm not sure how much on-the-fly reconfiguration is even possible without stopping and then creating a new pool

cddr 19:01:00

Well the problem I’ve seen is that you can have a handle to a connection that you think is a primary but it gets “demoted” to replica on autoscale events. In these cases, it is sometimes necessary to basically reacquire a fresh connection. It’s the reason this exists: https://github.com/FundingCircle/pg_failover
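
(A minimal sketch of that “reacquire a fresh connection” idea, not what pg_failover itself does: next.jdbc, the in_recovery alias, and the HikariCP soft-eviction call below are illustrative assumptions.)

```clojure
;; Hypothetical sketch: detect a demoted connection and force the pool to
;; hand out fresh ones. Assumes next.jdbc on top of a HikariCP datasource.
(require '[next.jdbc :as jdbc])

(defn writer?
  "True when queries through this datasource reach the writer;
   pg_is_in_recovery() returns true on a replica or a demoted ex-primary."
  [datasource]
  (false? (:in_recovery
           (jdbc/execute-one! datasource
                              ["select pg_is_in_recovery() as in_recovery"]))))

(defn ensure-writer!
  "If the pool is pointing at a replica, soft-evict its connections so that
   subsequent borrows re-resolve the cluster endpoint."
  [^com.zaxxer.hikari.HikariDataSource pool]
  (when-not (writer? pool)
    (.softEvictConnections (.getHikariPoolMXBean pool))))
```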

devn 19:01:36

ah i see

cddr 19:01:15

I’ve seen it at two jobs now (the only two where I’ve had exposure to aurora) and figured it must be something everyone who uses it needs to solve, but maybe that’s jumping the gun 🙂

devn 19:01:31

hm no it’s interesting and i’m glad you brought it up. i am building against an aurora pg db and i don’t think anyone has considered this/we haven’t run into autoscale issues that changed the primary

devn 19:01:25

i might be showing some ignorance here but… DNS?

devn 19:01:08

as in, if you point at a hostname which always refers to the primary, then maybe you could just get away with low TTL?
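
(Worth noting that with a JVM client there are two TTLs involved: the DNS record behind Aurora’s cluster endpoint and the JVM’s own DNS cache. A tiny sketch of shortening the latter; the property name is standard Java, the 5-second value is just an assumption.)

```clojure
;; The JVM caches successful DNS lookups, and depending on settings that
;; cache can outlive the CNAME change a failover makes to the cluster
;; endpoint. A short TTL lets *new* connections follow the repointed name;
;; connections already checked out of the pool are unaffected.
(java.security.Security/setProperty "networkaddress.cache.ttl" "5")
```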

devn 19:01:33

(just thinking out loud here)

devn 19:01:15

what are you using for connection pooling?

cddr 20:01:13

Ah yeah that might be a better way to solve it. We’re using hikari now.

devn 20:01:09

for some reason i remember in a past gig we were on aurora and we hit recovery mode, but i don’t recall if we switched DNS for the primary or if it just… was broken for a bit

devn 20:01:23

could have been a full deploy away from being fixed

devn 20:01:38

but it was during some known maintenance scenario

devn 21:01:44

ah so i wasn’t crazy!

devn 21:01:20

not really sure what to make of this, but maybe worth being aware of

devn 21:01:19

though i do note that the second to the last post says “we have a ttl of 30sec”…“it takes about 1 minute for the connection to recover”

devn 21:01:08

with a pool size of 100, depending on how they’re testing, that could be about right, but it would also depend on the default hikari opts, which i don’t know all of off the top of my head

devn 21:01:25

as a for instance, I recall the HikariCP docs saying they strongly recommend setting maxLifetime, as the default is like 30min; connectionTimeout defaults to 30sec, etc.
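
(For reference, a sketch of setting those two options explicitly through HikariCP’s Java API; the JDBC URL, credentials, and numbers are placeholders rather than recommendations.)

```clojure
;; Illustrative only: override the defaults mentioned above
;; (maxLifetime 30 min, connectionTimeout 30 s) when building the pool.
(import '(com.zaxxer.hikari HikariConfig HikariDataSource))

(def pool
  (HikariDataSource.
   (doto (HikariConfig.)
     (.setJdbcUrl "jdbc:postgresql://example.cluster-abc123.us-east-1.rds.amazonaws.com:5432/appdb")
     (.setUsername "app")
     (.setPassword "secret")
     (.setMaximumPoolSize 20)
     ;; recycle connections well before the 30-minute default so they
     ;; can't sit on a stale endpoint for long after a failover
     (.setMaxLifetime (* 5 60 1000))
     ;; fail borrows faster than the 30-second default
     (.setConnectionTimeout 10000))))
```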

devn 21:01:07

anyway, seems like it should be easy enough to test that you’re configured correctly during a simulated failover
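
(A rough sketch of such a drill; the cluster name, the one-minute bound, and the polling query are assumptions, and `aws rds failover-db-cluster` promotes a reader on a real Aurora cluster, so this should only be pointed at a test cluster.)

```clojure
;; Hypothetical failover drill: trigger a failover via the AWS CLI, then
;; assert that the pool can reach a write-capable connection again within
;; some bound.
(require '[clojure.java.shell :as sh]
         '[next.jdbc :as jdbc])

(defn simulate-failover! [cluster-id]
  (sh/sh "aws" "rds" "failover-db-cluster"
         "--db-cluster-identifier" cluster-id))

(defn writes-recover-within?
  "Polls once per second until a connection that is not in recovery comes
   back, or the deadline passes."
  [datasource seconds]
  (let [deadline (+ (System/currentTimeMillis) (* seconds 1000))]
    (loop []
      (cond
        (try (false? (:in_recovery
                      (jdbc/execute-one! datasource
                                         ["select pg_is_in_recovery() as in_recovery"])))
             (catch Exception _ false))
        true

        (< (System/currentTimeMillis) deadline)
        (do (Thread/sleep 1000) (recur))

        :else false))))

(comment
  ;; run against a throwaway test cluster
  (simulate-failover! "my-test-cluster")
  (writes-recover-within? pool 60))
```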