#architecture
2020-04-22
adamfeldman18:04:32

This release from Flink is fascinating to me — in short, it enables more easily building an event-driven system, where Flink manages the event stream and also manages invocations of external stateless services (FaaS or otherwise). https://flink.apache.org/news/2020/04/07/release-statefun-2.0.0.html. “Flink invokes the functions through a service endpoint via HTTP or gRPC based on incoming events, and supplies state access. The system makes sure that only one invocation per entity (`type`+`ID`) is ongoing at any point in time, thus guaranteeing consistency through isolation. By supplying state access as part of the function invocation, the functions themselves behave like stateless applications and can be managed with the same simplicity and benefits: rapid scalability, scale-to-zero, rolling/zero-downtime upgrades and so on.” It’s so exciting to me to see higher-level distributed systems primitives being born and maturing. I see Flink’s work as part of a pattern where lots of people across the distributed systems community are working on treating (micro|stateless) services/functions as virtual actors:
• Virtual actors in https://dapr.io
• Durable Functions on Azure (Microsoft Research invented virtual actors in https://dotnet.github.io/orleans)
There are parallels here to what’s happening in the Kafka ecosystem as well, with Kafka Streams, ksqlDB, etc. And with Apache Pulsar and its Functions… Where do AWS and GCP initiatives fit into all this?
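A minimal sketch of the shape being described, assuming a made-up JSON payload (this is not the StateFun SDK or its actual wire protocol): the service stays stateless because Flink ships the entity’s current state in with each event and persists whatever comes back.
```python
# Conceptual sketch only -- not the StateFun SDK or wire format.
# Flink delivers the event plus the entity's current state; the endpoint
# returns the new state and any outgoing messages.
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/statefun", methods=["POST"])
def handle_invocation():
    payload = request.get_json()                  # hypothetical JSON shape
    entity_id = payload["id"]                     # entity = type + ID
    state = payload.get("state", {"count": 0})    # state supplied by Flink
    event = payload["event"]

    # Business logic is a pure function of (state, event) -> (state', messages)
    state["count"] += 1
    outgoing = [{"target": entity_id, "value": f"seen {state['count']} events"}]

    # Flink persists the returned state and routes the outgoing messages,
    # while guaranteeing only one in-flight invocation per entity.
    return jsonify({"state": state, "messages": outgoing})

if __name__ == "__main__":
    app.run(port=8000)
```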

richiardiandrea18:04:04

I was happy to see this. Apache Pulsar is also super interesting because it's basically able to store events on different storage backends (like S3). The only thing I'd flag about all these JVM/Apache tools is the ops overhead. It's a pity, because you really have to ponder before adopting them.

adamfeldman18:04:20

That’s also what I most appreciate about Pulsar — having BookKeeper offload to blob stores. Kafka has recently made public moves to support that as well. I expect that eventually everything will have a Kubernetes operator — it doesn’t solve all the problems, but a lot of them

richiardiandrea19:04:32

Yeah but that's then another piece to hold knowledge about. I don't know, I feel we are getting farther and farther from simplicity here 😄

lukasz18:04:18

That's really interesting - so I can create a pipeline which consumes events from somewhere, and on each event it can call an arbitrary HTTP endpoint to do something with the event record?

lukasz18:04:35

and push results somewhere else?

lukasz18:04:49

GCP has Apache Airflow support I believe

lukasz18:04:56

which seems to be an equivalent

adamfeldman18:04:06

@lukaszkorecki I think that’s a good description. Flink StateFun will also ensure there is only 1 concurrent invocation per entity (`type`+`ID`), which I see as actor-like. GCP has Cloud Dataflow, which is based on the Apache Beam programming model (for which Flink is a first-class runner, along with the hosted Dataflow service)
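For context on the Beam relationship, a toy pipeline sketch; the point is that the same code can target different runners via an option (the runner names are real Beam runners, the pipeline itself is illustrative):
```python
# The same Beam pipeline can run on DirectRunner locally, on FlinkRunner,
# or on the hosted Dataflow service (DataflowRunner).
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(["--runner=DirectRunner"])  # swap for FlinkRunner / DataflowRunner

with beam.Pipeline(options=options) as p:
    (p
     | beam.Create(["a", "b", "a"])
     | beam.combiners.Count.PerElement()  # count occurrences per element
     | beam.Map(print))
```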

lukasz18:04:59

Right - that's Apache Beam. Too many competing Apache projects ;-)

adamfeldman18:04:11

(GCP has Cloud Composer, which is Airflow, which IIUC isn’t designed to handle a high volume of workflows — that’s why these workflow tools exist https://netflix.github.io/conductor/ and https://github.com/uber/cadence)

lukasz18:04:01

Need to think about this - I'm trying to figure out how we can create a reliable pipeline of workers to do API fetches, but with rate-limiting per tenant

adamfeldman18:04:02

I’d say AWS Step Functions is similar in spirit to conductor/cadence

adamfeldman18:04:22

I’m also interested in the outbound rate limiting problem

adamfeldman18:04:33

I don’t understand why there isn’t off-the-shelf tooling for that

lukasz18:04:41

at this point we have a janky rate limiter based on redis :-/
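One common shape for a Redis-backed per-tenant limiter like that is a fixed window using INCR + EXPIRE; a rough sketch, with key names and limits made up, and with the usual gap that the key can fail to expire if the process dies between the two calls:
```python
import redis

r = redis.Redis()

def allow_request(tenant_id: str, limit: int, window_seconds: int = 60) -> bool:
    """Fixed-window counter per tenant: allow up to `limit` calls per window."""
    key = f"ratelimit:{tenant_id}"
    count = r.incr(key)                # atomic increment shared by all workers
    if count == 1:
        r.expire(key, window_seconds)  # first hit in the window starts the clock
    return count <= limit

# e.g. if allow_request("acme", limit=100): call_third_party_api(...)
```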

lukasz18:04:56

but still no queues per tenant (because we rely on RabbitMQ)

adamfeldman18:04:19

https://zeebe.io is an OSS-ish workflow tool (from Camunda who do BPM)

adamfeldman18:04:27

Maybe helpful for building that stateful pipeline

adamfeldman18:04:51

Perhaps you could limit calls per-tenant by sending all tenant calls through the same worker? Then you don’t need a distributed rate limiting tool, just a way to partition tenants across workers
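A sketch of that partitioning idea, with hypothetical names and assuming workers are numbered 0..N-1: a stable hash pins every call for a tenant to one worker, so an ordinary in-process limiter is enough. (Resizing the worker pool does reshuffle assignments.)
```python
import hashlib

def worker_for_tenant(tenant_id: str, num_workers: int) -> int:
    """Stable tenant -> worker assignment (unlike built-in hash(),
    which is randomized per process)."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_workers
```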

adamfeldman18:04:40

Or, similarly, a logical queue per tenant

adamfeldman18:04:00

ah you mentioned RabbitMQ

lukasz18:04:24

Logical queues are ok, but only to a point - Rabbit cannot scale well with a large number of queues

lukasz18:04:56

Same worker could potentially do it, but then we tie our deployment to a number of tenants, which come and go

hiredman18:04:06

there is a great talk about rate limiting

lukasz18:04:41

btw, it's not us enforcing the rate limits - it's the 3rd-party APIs we need to call that impose them

lukasz18:04:59

and limits vary per tenant (enterprise accounts vs regular plans) etc etc

lukasz18:04:05

it's a multi-variable nightmare

adamfeldman18:04:15

Salesforce is often the culprit there

lukasz18:04:31

we solved that by pushing processing to Salesforce itself - they call us ;-)

❤️ 4
lukasz18:04:58

but it's terribly hard to debug if something fails so that's a big trade off

hiredman18:04:49

you may also want to look at AWS Step Functions for inspiration