Design question:
In an app where some non-trivial "object|data structure" is being constructed using a combination of user input, external computations or generally using input signals produced by async actors, how do you guys coordinate this interplay in your apps?
The typical CRUD app or microservices approach is imo not sufficient:
• it is hard to enforce some kind per-object queue of signals over which is "reduced" to build the final object (data structure).
• it is hard to implement follow-up actions as reaction to certain signals
• it is hard to model the interaction and reactivity as a whole that survive node crashes etc.
• It is hard to model resilience concerns like idempotency, retries, etc...
The gist of the problem is the orchestration or coordination between the different actors in the system (users, async services, events like crashes, etc.)
The three solutions on my radar that explicitly tackle the problem of modeling a reactive durable workflow in code (without resorting to a plethora of point solutions) are:
• http://temporal.io - we have a CLJ client for that, courtesy of @ghaskins, see #temporal-sdk
• http://restate.dev - no CLJ client to my knowledge, but a Java SDK exists
• http://www.resonatehq.io/ - Dominik Tornow, the guy behind this is ex-temporalio employee
So my question for the architecture-community is: what are your approaches?
People implementing fintech products or applications that sit up-stream of datascience applications (the capturing of data through modeled processes) surely bump into this as well?
Ideas? Suggestions? Hot takes?
Thanks!
I've been on an extended Datomic kick so am biased by that, but can't help but feel that starting off with Datomic puts you in a good spot for most the problems you mention, with the caveat that its performance might be unsuitable in practice for some things.
This sounds like reactive pub/sub event driven architecture to me. In the small: I have async code on the boundaries of my code. And I fight to keep it out of the business logic. In the large: Nice infra. Hopefully managed by someone else.
is eventual consistency acceptable?
a sort of lightweight event sourcing might be a good fit. Chuck all events in tables, and derive the aggregate on the fly using sql views. If your logic is too complex for sql, deriving it using clj is also an option. This will work fine as long as you don't have huge volumes of events, but i don't know your use case.
Datomic might also be a good fit. If you to trigger actions based on changes to your objects: subscribing to changes in datomic is really simple compared to other databases.
Atomic architecture might also be good to look into https://www.youtube.com/watch?v=sf3QZ-HBNFk
Example a web app that contains some "process" (very contrived example):
• I am fiddling with some params
• I record my desire to use those params as input for some computation
• if the API i want to post to is up, post, else retry in a while (imagine there is no queue)
• The API comes back up, the params are posted to the API, I have the request ID to fetch the results
• The process waits a bit to query the API, not ready yet
• The process is killed and restarted (K8s doing its thing)
• The process resumes waiting to attempt querying the API again
• The poll succeeds, I obtain the computation result
• The process continues with follow-up side-effecting actions
• The process notifies several users it is ready to accept more user input, and continues when 2 inputs are provided
• I give my input
• Someone else gives input
• The process continues and does some more side-effecting operations until it reaches the next wait state
• The process runs for 7 days, still waiting for input
• The process is restarted because of an app update
• The process keeps waiting until hey presto, the final input arrives and the process finishes to completion
Many of these processes run concurrently, some of which might even signal another one.
Solving this involves:
• some state-machine / domain logic
• keeping state in a DB
• accepting inputs in a serialized way to avoid concurrent mutations
• retry logic when services are unavailable
• performing side-effects correctly (e.g. avoid double writes)
• wait states surviving system crashes
• "continuation" until the next wait state that can pick up work after restarting the system
So there's a lot of problems involved in a single use case that can be solved using a combo of Datomic, an event queue, a resilience library, etc. etc. but the dev experience becomes a big PITA. Basically the code I want to define the "logic" in looks like a reduce over a stream of data items, with all of the infra-related stuff pushed out of the way. That's what the aforementioned tooling offers, with the benefit that the nitty-gritties are abstracted away from the domain logic developer to much higher degree than trying to invent a home-grown combo of FSM, event sourcing, listening to a tx-log and queued events, writing different event handlers.
Anyway, there are many ways to solve this, but many of these are labour intensive and hard to code. Hence the question of how these things are tackled by other community members.
This explains it better https://youtu.be/nKio-9e0Cfg?si=g75gxSRMfV6X0Jlq
IMO a durable execution engine is the ideal companion for a DB like datomic or XTDB, supporting the "Atomic Core" of the app, where the cats (actors) are wrangled while useful conclusions are be transacted to the DB.
It's one of these things you cannot "unsee" anymore once you've seen it 😉
hmm ok, i'd go with a state machine + datomic in that case, but indeed that requires some work. I assume durable execution engines compile down to a state machine + db operations under the hood, but I don't have experience with these.
@brandon342 interesting, I assume it's proprietary? Any chance of sharing more details|code?
@chrisblom yeah the durability makes it a whole lot more tricky. If that wasn't a must-have, something like missionary could also have been an option
i've played around with Rama a bit, i don't think it is an ideal match, but I guess it could take the place of the durable queues + database. You would still need to build some kind of state machine logic on top of it. Also, Rama's dsl takes some getting used to.
@thomasmoerman yeah proprietary. the db row enforces uniqueness and run locking. work is sent via mq to bg handlers which use the dag edn to dispatch to fns. each step fn is meant to be retryable units of work. a bg scheduler handles retries and fairness. audit tables for history.
I used https://www.flowable.com/open-source/docs/bpmn/ch02-GettingStarted in the past (no reactivity). For me the main issue with Flowable is that it only models a "flat" process, and every complex process I had to model had a 1:n relationship in it (e.g., the main process needs to wait for "n" sub processes to finish), so you end up having quite a few things that are a bit hidden. The rest of our technical stack was using event driven architecture with kafka, and I don't thing Flowable provided that much, for our use case anyway. The main advantage was that it allowed me to create a dashboard quite easily.
The timely https://github.com/clojure/core.async/commit/03b97e0b3e0ec329629bcbf76106658dce4a5d61#diff-4df867a8cd7775659f653f7a1280ca7356172e7896d4bb8efeddd7109d0f4eb4R72 seems like it might be in the ballpark
yeah our project has a durable activity orchestrator. it describes activities as edn with fn bindings and has a context. when you kick off an event type with an entity id you get a single instance and every step is written to the db. retries are built in etc. first version only took a couple of days to write but we have been using it successfully for like 7+ yrs.
my first impression of https://github.com/clojure/core.async/commit/03b97e0b3e0ec329629bcbf76106658dce4a5d61#diff-4df867a8cd7775659f653f7a1280ca7356172e7896d4bb8efeddd7109d0f4eb4R72 is that is is a way to structure async message processing flows, but it doesn't handle durability
true
Would all your problem be solved by just using some durable queues? Say have SQS queues + Lambdas. Send request to an SQS queue, a Lambda picks it up, calls the API, if it fails/timesout puts the message back on the queue. If it succeeds, takes the resultID puts it on another SQS queue. Another Lambda picks it up and polls the API for a result, if no result puts it back on queue.
@didibus I can’t speak to the problems @thomasmoerman is solving, but I can say this: While there is some overlap between durable execution frameworks (like Temporal) and durable queues (like Kafka), there is a class of problems where durable queues alone are not enough, and thus you end up needing to build solutions on top.
I started my company as a clojure shop about 5 years ago, and we initially used a CQRS microservice pattern using Kafka. It worked, but it required a lot of fragile constructs built on top to accomplish goals like implementing SAGA state machines. The net result was that building the SAGA logic was hard, and thus, we were held back in implementing features. We swapped out Kafka for Temporal about 2 years ago and this really unlocked the architecture.
A typical SAGA looks like this (for us)
(defworkflow create-vault
[args]
(log/trace "create:" args)
(let [{{:keys [mrn] :as vault-info} :vault-info :as rspec} (input->generate-rspec args)]
(with-lock {:resource-id mrn}
(try+
@(-> (local-invoke vault-exists? rspec)
(p/then (fn [exists?]
(when exists?
(throw+ {:code 6 :msg "vault already exists" ::e/non-retriable? true}))))
(p/then (fn [_]
(-> (saga/execute
(reify saga/Driver
(prepare [_]
(prepare-create rspec))
(abort [_]
(undo-create rspec))
(commit [_]
(commit-create rspec))))
(p/then (fn [_]
[:ok vault-info]))))))))))And the durability framework takes care of all the checkpoint/restart across failures
Trying to do this across bespoke queues and handlers was a bit awkward, especially when you consider things like retries/fault-remediation/observability
I don't know Temporal, what does it do exactly? Is this referring to a saga as in handling multiple distributed commits with handling error in each by rolling them back in reverse order?
Temporal is in a class of software that offers a “durable execution” model.
Basically means you write more or less natural code (e.g. in clojure, go, whatever) and it provides mechanisms for ensuring that the code executes reliably…so there are aspects of durability, retries, replay, etc, built into the model
> Is this referring to a saga as in handling multiple distributed commits with handling error in each by rolling them back in reverse order? Yes, though note that temporal isnt specific to SAGAs…its just that SAGAs are easy to implement using such a framework.
if you were curious to learn more, I did a Boston Clojure meetup presentation on the subject a few years ago: https://youtu.be/gztsbSP5I3s
I've only skimmed the video, and got that it's a bit of a workflow engine, but the workflows are defined dynamically in normal-ish code?
yeah…i actually dont like the term “workflow” for these frameworks….to me, that conjures some BPMN like thing….as you stated, these are more “normalish code” that are durably executed
You can certainly use them to manage things that a traditional workflow engine like Camunda might manage…but you can also do other things that are less about “workflow” and more about reliably running a program in standard languages
(such as the SAGAs that we use at my company)
There probably isnt a meaningul distinction, but to me “workflow” sounds like something the business team came up with 🙂
Seems very similar to AWS SWF, but maybe with a bit more conveniences into the SDK
Which stands for Simple Workflow Engine, and is one of the first "workflow engine" I used. So I associate workflow engine with that 😛
In SWF, you expose Decider APIs and Activity APIs. The backend calls your Decider API to retrieve the next step. And each Activity maps to a step. And it tracks full history, and your Decider gets access to that history.
I’m not familiar with SWF but sounds about right
That's neat. To be honest, I always liked SWF, but it was rough around the edges, and AWS abandoned it a bit. Then they pivoted to Step Functions which feels more competing with AirFlow. Having an improved and open source version is pretty sweet.
I see the big differentiator is that it automatically replays state into your function, and that's also how it lets you define workflows in code in your own programming language. In SWF you'd have to implement all that replaying functionality yourself, and dealing with capturing non-deterministic state and all as well. I also like the continue_to_new and then you can use it for infinite almost Erlang actor-like
Exactly, and that is also the case with functions+durable queues. You can do it, but you are on your own and it gets messy quick
I guess it's worth giving a shout-out to https://redplanetlabs.com/ even though I have not used Rama, I think it is in the category to solve these problems as well.