#architecture
2023-06-05
fuad11:06:22

How do you coordinate distributed side-effects in your systems?

Context: Quite frequently I find myself in the situation where I need to, as part of a certain operation, perform multiple side-effects in different resources. One common example is having to write something to the db and then notify an external service about it, or persist something to data store A and then something else to data store B. This obviously opens up the possibility for errors on the subsequent side-effects, which might lead the overall system to an inconsistent state, at least until the problem is fixed.

Embracing eventual consistency is generally acceptable (at least in the contexts I work in), and I have done so in the past (especially when working with microservices) in the following way:
• Design operations so that they are idempotent and can be retried several times, especially out-of-order
• Break down operations with multiple side-effects into smaller ones that each perform a single side-effect
• Sequence these operations by reifying them in some way (usually as records in a db or messages in a durable queue like Kafka)

This is not rocket science, but it does require several infrastructure components to be in place:
• the queue/consumer system with ordering and delivery guarantees
• a monitoring solution to understand when/why/where/how many errors happen and triage problems
• automated and manual retry mechanisms
• a mechanism to drop messages when it makes sense
• etc.

Since I do believe this is a common pattern, I wonder if there are common solutions to this that can be reused without incurring a huge maintenance overhead - which is especially tolling for small teams. For instance, in the past I have seen great solutions for this built on top of Kafka (which would provide ordering and delivery guarantees), but there was a lot of tooling that was custom built to make it operational and easy to use. I'm wondering if there are off-the-shelf-ish solutions out there I might not be aware of.

PS: there's nothing Clojure-specific about this problem, but I'm curious to hear how this is handled by the Clojure community since, as functional-minded programmers, we tend to be mindful about managing state and side-effects.

👀 10
💡 2
👍 2
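A minimal sketch of the pattern described above, not any particular library's API: each operation is reified as a row in a jobs table and processed idempotently, so failed jobs can simply be retried. The table name, columns, and the `process!` multimethod are illustrative assumptions.

```clojure
;; Hypothetical sketch: reify operations as rows in a `jobs` table and
;; process them idempotently. Table/column names and `process!` are
;; illustrative assumptions, not a specific library's API.
(ns example.jobs
  (:require [clojure.edn :as edn]
            [next.jdbc.sql :as sql]))

(defmulti process! (fn [_db job] (:jobs/type job)))

(defmethod process! "notify-external-service"
  [_db job]
  ;; Use the job id as an idempotency key so the receiver can dedupe
  ;; if the job is retried or delivered more than once.
  (println "POST /notifications" {:idempotency-key (:jobs/id job)
                                  :payload (edn/read-string (:jobs/payload job))}))

(defn enqueue!
  "Record the side-effect as data, ideally in the same transaction as the domain write."
  [tx type payload]
  (sql/insert! tx :jobs {:type type :payload (pr-str payload) :status "pending"}))

(defn run-pending!
  "Poll for pending jobs and run them; failures stay pending and are retried."
  [db]
  (doseq [job (sql/query db ["select * from jobs where status = 'pending' order by id"])]
    (try
      (process! db job)
      (sql/update! db :jobs {:status "done"} {:id (:jobs/id job)})
      (catch Exception e
        (println "job failed, will retry" (:jobs/id job) (ex-message e))))))
```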
fuad11:06:10

I have just recently learned about https://github.com/nilenso/goose, which seems to tick some of the boxes above, although it might need some work to integrate the monitoring and ops side of it into a user-friendly workflow.

msolli11:06:47

I’ve built https://github.com/msolli/proletarian, which also ticks some of the boxes. Only on Postgres, though. But if you're on Postgres, it provides the additional benefit of letting you enqueue jobs in the same transaction as your application/domain operations.

👀 2
Thomas Moerman11:06:34

@U3RBA0P4L i had the same question, started out with Goose as a task-queue, but now I'm transitioning to https://temporal.io/ using the #temporal-sdk.

👀 2
Thomas Moerman11:06:35

@U3RBA0P4L check out the video on the Temporal clojure SDK https://youtu.be/gztsbSP5I3s

❤️ 2
Rupert (All Street)12:06:40

> How do you coordinate distributed side-effects in your systems?
We use idempotence and retries. Sometimes the retries are unlimited, so the component is effectively blocked until something succeeds. Some of our components maintain their own state (sometimes as simple as a durable atom on disk), so if they are restarted they continue from wherever they last saved their state. I did look into more comprehensive solutions for side effects, but ended up deciding YAGNI for us for now. We tend to use REST APIs.
I've found that queues/topics can be a real pain:
• How do you remove bad data that is sitting on a queue (e.g. causing a consumer fail/retry loop)?
• How do you cancel messages that are no longer valid once they are on the queue?
• How do you put data back on the queue that was processed wrongly? Is the consumer or producer responsible for this?
• How do you enforce priority so important messages aren't delayed by low-priority ones?
• How do you do RPC/request-response over a queue?
• Has the queue really decoupled anything? The sender and receiver are still tightly coupled in the contents/semantics of the message. Breaking changes to sender/receiver still need to be done in lock step.
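A rough sketch of the "durable atom on disk" idea mentioned above, plus an unlimited-retry loop. The file path, state shape, and function names are illustrative assumptions, not a specific library.

```clojure
;; Hypothetical sketch of a "durable atom": state is restored from disk on
;; start and re-written on every change, so a restarted component resumes
;; from wherever it last saved its state.
(ns example.durable-atom
  (:require [clojure.edn :as edn]
            [clojure.java.io :as io]))

(defn durable-atom [path initial]
  (let [a (atom (if (.exists (io/file path))
                  (edn/read-string (slurp path))
                  initial))]
    (add-watch a ::persist (fn [_ _ _ new-state] (spit path (pr-str new-state))))
    a))

(defn retry-until-success!
  "Unlimited retries: the component is effectively blocked until `f` succeeds."
  [f]
  (loop []
    (let [ok? (try (f) true
                   (catch Exception e
                     (println "failed, retrying in 5s:" (ex-message e))
                     false))]
      (when-not ok?
        (Thread/sleep 5000)
        (recur)))))

(comment
  (def state (durable-atom "component-state.edn" {:last-processed-id 0}))
  (retry-until-success!
   #(swap! state assoc :last-processed-id 42)))
```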

adi13:06:49

> How do you coordinate distributed side-effects in your systems? "Kafka" is how I've heard it described, unironically.

olttwa08:06:50

@U3RBA0P4L, when you say coordinate distributed side-effects and embracing eventual consistency, I presume you mean to build an event-driven architecture. As @U051MHSEK mentioned, I've mostly seen Kafka being used for this when built in-house. If you're looking for a quick start without a maintenance overhead, I'd suggest Confluent Cloud, Google Pub/Sub, AWS Kinesis, Uber Cadence, or Temporal.

olttwa08:06:59

A few side-notes:
1. Event-driven systems are generally high-throughput/low-latency. You will not have a built-in feature for dropping individual messages, only for purging messages in bulk.
2. An important feature you should consider is "replay" of messages.
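To illustrate the "replay" point, a hedged sketch using the plain Java Kafka consumer from Clojure: assign a partition explicitly and seek back to the beginning so all retained messages are consumed again. The broker address, topic name, and group id are assumptions.

```clojure
;; Hypothetical "replay" sketch: re-read a partition from the beginning.
(ns example.replay
  (:import (java.time Duration)
           (java.util Properties)
           (org.apache.kafka.clients.consumer KafkaConsumer)
           (org.apache.kafka.common TopicPartition)))

(defn replay-topic! [handle-record]
  (let [props (doto (Properties.)
                (.put "bootstrap.servers" "localhost:9092")
                (.put "group.id" "replay-example")
                (.put "key.deserializer" "org.apache.kafka.common.serialization.StringDeserializer")
                (.put "value.deserializer" "org.apache.kafka.common.serialization.StringDeserializer"))
        consumer  (KafkaConsumer. props)
        partition (TopicPartition. "events" 0)]
    ;; Assign the partition directly and rewind to offset 0.
    (.assign consumer [partition])
    (.seekToBeginning consumer [partition])
    (doseq [record (.poll consumer (Duration/ofSeconds 5))]
      (handle-record (.key record) (.value record)))
    (.close consumer)))
```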

Jorin07:06:31

I really like the mentioned solutions! To add two more options:
1. Use a database like XTDB or Datomic, which have a log built in. You can then build additional async operations as followers of the log.
2. Use Change Data Capture on top of a DB like Postgres. There are tools like Debezium to do that, or Kafka Connect. In some scenarios you might even want to write your own code that follows the Postgres WAL.
There are tradeoffs with all the options. No solution is trivial and the choice depends on the problem.
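For option 1, a minimal sketch of a "follower of the log" on Datomic on-prem using `datomic.api`'s `d/log` and `d/tx-range`. The follower function and how the resume point is persisted are illustrative assumptions.

```clojure
;; Hypothetical sketch: drive async side-effects off Datomic's built-in log.
(ns example.log-follower
  (:require [datomic.api :as d]))

(defn follow-log!
  "Process every transaction with t >= start-t and return the next t to
   resume from. Persist the returned value somewhere durable to survive restarts."
  [conn start-t handle-tx!]
  (let [log (d/log conn)]
    (reduce (fn [_ {:keys [t data]}]
              ;; `data` is the collection of datoms asserted/retracted in this tx.
              (handle-tx! t data)
              (inc t))
            start-t
            (d/tx-range log start-t nil))))
```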

Jorin07:06:51

One more option I'd like to add: as mentioned before, Kafka is a useful tool for this sort of system. For smaller projects I personally find Kafka non-trivial to work with (even a managed Kafka cluster). A lightweight alternative is https://docs.nats.io/nats-concepts/jetstream. I've had good experience with that. The API and operations are pretty simple.

☝️ 2
fuad07:06:19

@U0DJ4T5U1 thanks for the pointer. I'm familiar with the idea of two-phase commit but never implemented it. It is indeed a solution to this problem and seems to be a good way to go especially if eventual consistency is not acceptable. I did some googling and found this interesting discussion on the topic https://groups.google.com/g/clojure/c/qiKnCVAaZKw I'm a big fan of Kleppmann's work, should probably take a look at Designing Data Intensive Applications again.

Jorin07:06:54

Distributed transactions were all the rage in JavaEE. There are reasons they are not as popular these days and people opt for queues, event streams or tools such as Temporal instead. It's just really hard to get distributed transactions right. You can find some tools for Clojure via Immutant: http://immutant.org/documentation/current/apidoc/guide-transactions.html The Saga pattern works similarly in a microservice architecture: https://microservices.io/patterns/data/saga.html

💯 4
Ivar Refsdal20:06:40

I wrote https://github.com/ivarref/yoltq to solve some of these problems when using Datomic + idempotency on the receiving side. If the receiving side isn't idempotent, multiple deliveries are still better than zero deliveries. I also wrote https://github.com/ivarref/mikkmokk-proxy, a simple HTTP proxy that does fault injection, so that you can (more easily) check how your services respond to duplicate requests, etc.

didibus05:06:04

I think people mentioned most solutions already, but two-phase commits, sagas and idempotency are the only ways I know.
Two-phase commit: tell each system to prepare the commit, and when they all do, tell them to commit for real or revert.
Sagas: make each change one after the other, but if one fails, initiate revert changes in the ones where you'd already made the change.
Idempotency: make each change one after the other, but make the last thing you change your source of truth. For example, send notifications and then commit to the DB. The DB is your source of truth; if something fails before it, you can fail your request and not clean up anything. As the request is retried, it'll see the before-transaction state in the DB, retry cleanly, and send the notifications again - but maybe with the same ID so the receiver can dedupe them. Or just accept some duplicates, like maybe a duplicate email to the user; not the end of the world.

👌 2
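A small sketch of the saga idea described above: run steps in order, and if one fails, run the compensations of the steps that already succeeded, in reverse. The step contents in the usage example are illustrative assumptions.

```clojure
;; Hypothetical saga sketch: each step has an action and a compensation.
;; On failure, compensations for the completed steps run newest-first.
(defn run-saga! [steps]
  (loop [remaining steps
         done      []]
    (if-let [{:keys [action] :as step} (first remaining)]
      (let [result (try {:ok (action)}
                        (catch Exception e {:error e}))]
        (if (:error result)
          (do
            ;; Roll back already-completed steps in reverse order.
            (doseq [{:keys [compensate]} (reverse done)]
              (compensate))
            {:status :rolled-back :error (:error result)})
          (recur (rest remaining) (conj done step))))
      {:status :committed})))

(comment
  (run-saga!
   [{:action     #(println "reserve inventory")
     :compensate #(println "release inventory")}
    {:action     #(println "charge card")
     :compensate #(println "refund card")}
    {:action     #(throw (ex-info "notify shipping failed" {}))
     :compensate #(println "cancel shipment")}]))
```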
chrisblom23:06:20

I'd let the service only read/write from one master data store, and update secondary stores or other services by subscribing to a change data capture (CDC) log or a transactional outbox.
CDC pushes all changes made to a table to a queue or topic, and is supported by most databases (Postgres, MongoDB & Datomic have good support; check out Kafka Connect). A transactional outbox is basically a table where you insert messages to other systems, along with updates to your tables, within the same transaction. These messages are then published to a queue by a separate thread or process. Kafka or Pulsar are good solutions for storing the CDC or outbox log. You can set up one or more consumers that process the outbox or CDC log and push updates to secondary stores or services.
The advantage of only interacting with one master store is that you only couple the service to one datastore instead of many. In my experience this is simpler (less code in the main service), faster (you only write to one store) and more reliable (fewer runtime dependencies). It's also easier to ensure your master data store does not get into an inconsistent state, as you can use normal DB transactions for updates (no distributed transactions), and the CDC log or outbox table is great for debugging. Another pro is that it is easier to extend: you can easily hook up additional consumers to the CDC/outbox log later.
A downside is that updates to secondary stores or other services can be delayed a bit (eventual consistency), but in practice this is often acceptable. If possible, make updates to secondary stores or other services idempotent, so you can force synchronization by replaying the CDC log or outbox table.
One alternative is distributed transactions, but these are more complex, less reliable (all systems involved in a tx need to be up), less performant (you need to write to multiple stores instead of one), and not all external systems support them.
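A minimal transactional-outbox sketch with next.jdbc, illustrating the idea above rather than any specific tool: the domain row and its outbox message are written in one local transaction, and a separate relay loop publishes pending outbox rows. Table names, columns, and `publish!` are illustrative assumptions.

```clojure
;; Hypothetical transactional-outbox sketch.
(ns example.outbox
  (:require [next.jdbc :as jdbc]
            [next.jdbc.sql :as sql]))

(defn create-order!
  "Write the order and its 'order-created' message atomically."
  [db order]
  (jdbc/with-transaction [tx db]
    (let [row (sql/insert! tx :orders order)]
      (sql/insert! tx :outbox {:topic   "order-created"
                               :payload (pr-str row)
                               :status  "pending"})
      row)))

(defn relay-outbox!
  "Separate thread/process: push pending outbox rows to the broker in order."
  [db publish!]
  (doseq [msg (sql/query db ["select * from outbox where status = 'pending' order by id"])]
    (publish! (:outbox/topic msg) (:outbox/payload msg))
    (sql/update! db :outbox {:status "sent"} {:id (:outbox/id msg)})))
```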

fuad11:06:15

Thanks for all the responses! It helped me reflect a bit on the problem and refresh my understanding of distributed systems patterns. First addressing the conceptual bits: I think the problem I'm describing is well understood in academia and industry and the solutions to it are variations around the same core concepts: • ACID ◦ distributed transactions which can be implemented in several ways, one of which is using a two-phase commit • Eventual consistency ◦ can be implemented in several ways, e.g. using some form of event sourcing, sagas, etc These are reasonably well described in literature and the tradeoffs seem to be well documented. A few references on those: • https://martinfowler.com/articles/patterns-of-distributed-systems/two-phase-commit.htmlhttps://martinfowler.com/articles/patterns-of-distributed-systems/https://microservices.io/https://microservices.io/patterns/data/saga.htmlhttps://microservices.io/patterns/data/event-sourcing.htmlhttps://microservices.io/patterns/data/transactional-outbox.htmlhttps://microservices.io/patterns/data/transaction-log-tailing.html Now the more practical bits: As I mentioned in the original message, for my particular use case, eventual consistency is acceptable. It is the approach I have more experience with and therefore I'm leaning towards it, but I think the practical implementation questions would apply similarly to other approaches. The problem, as I see it, is: how to integrate an implementation of (one of) these patterns into a business application without incurring into 1. a maintenance overhead (e.g. spinning up, deploying, monitoring several services, message brokers, etc), which can be especially problematic for small teams and 2. without having to implement any complex algorithms by hand. This is the case with the Kafka setup I mentioned for illustration purposes: you need to setup and maintain a Kafka cluster and implement a lot of the patterns on top of it: retries, circuit breakers, deadletters, discarding messages, etc. Looking deeper at Temporal it does seem to provide a robust programming model to implement some of these patterns. It offers you the semantics to build with out of box and implements the orchestration for you, so you can do "inversion of control". It can be used as-a-service, which can save a lot of effort. I want to learn more about and play with it.

🚀 3
apbleonard20:06:21

The above discussion refers to many concepts, but I saw no references to https://blog.acolyer.org/2019/03/06/keeping-calm-when-distributed-consistency-is-easy/ - the rather lovely theoretical underpinning to things like https://lostechies.com/jimmybogard/2013/06/06/acid-2-0-in-action/ (a briefly popular term) and https://crdt.tech/ - so I thought I'd add it here for those who like a bit of theory. To be fair, while this thread is about making multiple things either happen or not happen together, CALM tackles the similar consistency problem of handling events received out of order by setting out precisely when order doesn't matter. As rabbit holes go it's a fascinating one 🙂
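To make the "order doesn't matter" point concrete, a tiny grow-only counter (G-counter) CRDT sketch: merge is commutative, associative and idempotent, so replicas converge regardless of the order in which updates and merges arrive. Node names in the usage example are assumptions.

```clojure
;; Hypothetical G-counter CRDT sketch: a map of node -> local count.
(defn increment [counter node]
  (update counter node (fnil inc 0)))

(defn merge-counters [a b]
  ;; Taking the max per node makes merge order- and duplicate-insensitive.
  (merge-with max a b))

(defn value [counter]
  (reduce + (vals counter)))

(comment
  (let [a (-> {} (increment :node-a) (increment :node-a))
        b (-> {} (increment :node-b))]
    ;; Same result regardless of merge order or repeated merges: 3.
    [(value (merge-counters a b))
     (value (merge-counters b a))
     (value (merge-counters a (merge-counters b b)))]))
```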

sonnyto19:07:10

Use a log as your source of truth instead of a db. Every db has a log, but it's not exposed to the programmer. As mentioned by someone else, Kafka can be used for this purpose. A log should be at the core of your distributed system: you only need to sync with what's in the log. What's in the log can be transformed into different forms in different places. If it's not in the log, then it doesn't exist. Here's a good book about the humble log - so simple yet so powerful: https://www.oreilly.com/library/view/i-heart-logs/9781491909379/
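A tiny sketch of the "log as source of truth" idea: current state is just a reduction over the ordered events, and different consumers can fold the same log into different read models. The event shapes below are illustrative assumptions.

```clojure
;; Hypothetical event log and a projection derived from it.
(def event-log
  [{:event :account-opened  :id 1 :owner "fuad"}
   {:event :funds-deposited :id 1 :amount 100}
   {:event :funds-withdrawn :id 1 :amount 30}])

(defn apply-event [state {:keys [event id owner amount]}]
  (case event
    :account-opened  (assoc state id {:owner owner :balance 0})
    :funds-deposited (update-in state [id :balance] + amount)
    :funds-withdrawn (update-in state [id :balance] - amount)))

;; State is a fold over the log; replaying the log rebuilds it from scratch.
(def accounts (reduce apply-event {} event-log))
;; => {1 {:owner "fuad", :balance 70}}
```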