#architecture
2023-09-01
pavlosmelissinos 06:09:40

Re: system architecture. We're trying to help our devops people do maintenance, e.g. upgrade our cloud postgres (the tech stack doesn't matter), without affecting the integrity/dependability of the system, so I've been thinking about introducing a mechanism using queues that will pause all background jobs. The process I'm considering goes like this:
1. devops person turns the maintenance flag on somewhere
2. all requests for background work are sent to one or more queues
3. devops do their thing, then turn the flag back off
4. the system "resumes"; it starts consuming the pending events from the queue
However, the more I think about it, the more I believe it should be approached as a "fault tolerance first" problem, rather than a special mode that has to be switched on or off. If you make as few assumptions as possible about whether a certain service will be available, perhaps it won't be necessary to introduce a special mode, and the system will be able to handle unforeseen use cases too. Also, a background job might not be affected by this kind of maintenance, so why should everything be paused?
Is there a tipping point though? I mean, if your application stores practically everything it does in a postgres database and depends on that data existing in the database, does it make sense to try to work around it? How does the industry approach similar problems? I'm not sure where to look, I'd love to be pointed to reading material 🙂
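(As a rough illustration of steps 1-4 above, here is a minimal Python sketch, with a module-level flag and an in-memory queue standing in for whatever flag store and queueing system would actually be used; all names and the single-process setup are hypothetical.)

```
import queue

# Hypothetical stand-ins: a real system would use a shared flag store and a durable queue.
maintenance_on = False          # step 1: devops flip this flag
pending_jobs = queue.Queue()    # step 2: background work parks here while the flag is on


def start_maintenance():
    """Step 1: devops person turns the maintenance flag on."""
    global maintenance_on
    maintenance_on = True


def submit_background_job(job):
    """Run the job now, unless maintenance mode is on."""
    if maintenance_on:
        pending_jobs.put(job)   # park the work instead of running it
    else:
        job()


def end_maintenance():
    """Steps 3-4: devops turn the flag off, then the system drains the queue."""
    global maintenance_on
    maintenance_on = False
    while not pending_jobs.empty():
        pending_jobs.get()()    # resume: consume the pending events
```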

Teemu Kaukoranta 07:09:58

What currently happens when a background job is running and the database is down? Does the job fail gracefully? If yes, why is that not good enough? (I can guess reasons why that is not good enough, I'd just like to hear your reasons).

👍 2
pavlosmelissinos 07:09:45

> Does the job fail gracefully?
Well, yes and no. It's graceful in the sense that it will be as if the job never ran (with some logged failures here and there), but the system won't auto-recover when the database comes back online (as in, the failed requests won't run). Manual intervention is required (and this is what I'm trying to fix/improve). Oh well, it does sound like a fault tolerance problem 😅

potetm 14:09:51

I'm going to guess that these two requirements (DevOps maintenance, fault tolerance) do not overlap as much as you think they do. Running in another mode for a preplanned, controlled, and monitored duration is much different than letting the system operate full time in another mode.

👍 2
potetm 14:09:10

There might be an argument for backing particular endpoints with a queue full time. (i.e. it's not "another run mode," these endpoints just never touch postgres). If you do that, you're kinda pushing the problem around (what happens when you need to update your queuing software?), but in theory it's easier to point to a new queue than a new db.

👍 2
potetm 14:09:22

It all depends really. I can imagine scenarios where it makes sense to back with a queue on some endpoints. I can also imagine scenarios where it's total overkill.

👍 2
fuad 15:09:50

I've worked in large systems where it was pretty common to stop consumers of a certain queue in specific situations: one-off maintenance routines, or a downstream incident where consuming messages would only make things worse. You could also do that automatically using some sort of circuit breaker pattern (if the consumer circuit breaker trips, stop the consumers).
Another approach could be to send the messages whose consumption failed to a dead-letter queue so they can be re-processed later (via a manual trigger). Here it would be important to analyze what the necessary message ordering and idempotence guarantees are.
The dead-letter approach could be enough to cover the regular maintenance, but it could feel a bit dirty. For example, if you have error-rate metrics for the consumer, they would show a high rate (potentially 100%). Being able to control the flow of data by starting/stopping the consumers manually feels a bit cleaner and would give you more meaningful metrics (messages consumed per second = 0).

👍 2
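(A rough sketch of the two ideas above combined: a consumer loop that routes failed messages to a dead-letter queue and trips a simple circuit breaker after repeated failures. The queue objects, handler, and threshold are all hypothetical; a real broker and circuit-breaker library would look different.)

```
import queue

main_queue = queue.Queue()         # hypothetical incoming work
dead_letter_queue = queue.Queue()  # failed messages parked for later re-processing

FAILURE_THRESHOLD = 5              # trip the breaker after this many consecutive failures


def consume(handler):
    """Consume until the queue is empty or the circuit breaker trips."""
    consecutive_failures = 0
    while consecutive_failures < FAILURE_THRESHOLD:
        try:
            message = main_queue.get_nowait()
        except queue.Empty:
            return
        try:
            handler(message)
            consecutive_failures = 0
        except Exception:
            # Park the message for manual replay instead of dropping it.
            dead_letter_queue.put(message)
            consecutive_failures += 1
    # Breaker tripped: stop consuming so a downstream incident isn't made worse.
```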
pavlosmelissinos 19:09:16

Interesting answers, thank you all very much! I'll have to think about your other suggestions, but the DLQ approach is closest to what I roughly had in mind. Like, a background job is invoked, it cannot access a critical resource or faces some other problem, so it sends the input to a DLQ. When maintenance ends, someone feeds the DLQ messages back into the system. What's important here is that the system does not have to be aware that failure X was caused by maintenance. It's definitely less efficient than maintenance mode because the app needs to run, figure out there's a failure and then halt, so maybe both are needed indeed... 🤔 hammock
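(And the replay half of that idea, as a self-contained sketch: after maintenance ends, a manual trigger feeds the dead-letter messages back into the main queue for normal processing. Again, the queue names are hypothetical in-memory stand-ins for a real broker.)

```
import queue

# Hypothetical queues; in practice these would live in a real queueing system.
main_queue = queue.Queue()
dead_letter_queue = queue.Queue()


def replay_dead_letters():
    """Manual trigger after maintenance: feed DLQ messages back into the system."""
    replayed = 0
    while True:
        try:
            message = dead_letter_queue.get_nowait()
        except queue.Empty:
            break
        main_queue.put(message)   # picked up again as ordinary background work
        replayed += 1
    return replayed
```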

Teemu Kaukoranta 09:09:03

Yeah, both may be useful. I would start by making sure that a failing job does not require manual intervention. Can the job just retry automatically later, if it fails once? You're dealing with a distributed system so it's a fallacy to think that the jobs can only fail because the database is in maintenance mode. Then you can add the maintenance mode later. I think it would mainly be useful so you don't pollute your alerting system with false errors.

👍 2
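(A minimal sketch of the "just retry automatically later" suggestion: retry a failed job with exponential backoff so a transient outage doesn't require manual intervention. The attempt count and delays are hypothetical, and a real worker would more likely re-enqueue the job with a delay than block in sleep.)

```
import time

MAX_ATTEMPTS = 5
BASE_DELAY_SECONDS = 2


def run_with_retry(job):
    """Retry the job with exponential backoff; fail loudly only after the last attempt."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return job()
        except Exception:
            if attempt == MAX_ATTEMPTS:
                raise  # only now does it need to reach alerting / a human
            # Back off before retrying: 2s, 4s, 8s, ...
            time.sleep(BASE_DELAY_SECONDS * 2 ** (attempt - 1))
```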