
Fair warning: this is an event sourcing question, not inherently related to Jackdaw. I'm asking here because this channel probably has the best expertise, but I'll redirect chatter elsewhere on request. With that said... I'm building a CQRS+ES system and considering the datastore for commands & events as they flow through Kafka. Traditional advice seems to be (A) "write to an external datastore" (e.g. Postgres), the reasoning being that you need a way to execute "save this event only if the version of the entity is x." However, the newer "exactly-once delivery semantics"[1] and "stateful stream processors"[2] create the possibility of containing your command+event logic entirely within Kafka Streams (B). [1] [2]
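(For context, that "save this event only if the version of the entity is x" check is optimistic concurrency control. A minimal in-memory sketch of the idea — all names here are illustrative, not from any real event store or from Jackdaw:)

```python
# Sketch of optimistic-concurrency append: an event is accepted only if
# the caller's expected version matches the stream's current version.
# Names (InMemoryEventStore, ConcurrencyError) are hypothetical.

class ConcurrencyError(Exception):
    pass

class InMemoryEventStore:
    def __init__(self):
        self._streams = {}  # entity_id -> list of events

    def append(self, entity_id, expected_version, event):
        stream = self._streams.setdefault(entity_id, [])
        if len(stream) != expected_version:
            # Someone else wrote in between: reject, let the caller retry
            raise ConcurrencyError(
                f"expected version {expected_version}, got {len(stream)}")
        stream.append(event)
        return len(stream)  # the new version

store = InMemoryEventStore()
store.append("acct-1", 0, {"type": "opened"})
store.append("acct-1", 1, {"type": "deposited", "amount": 10})
```

An external database gives you this check transactionally; the "pure Kafka" question is whether exactly-once processing plus a state store can stand in for it.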


Does anyone have thoughts on the viability of option B? Seems nice in theory -- your event datastore is simply Kafka. I question what unknown gotchas might be hiding out there though. Is there a sample repo out there that demonstrates the "pure Kafka" approach?


I went hunting for recent Confluent-produced resources on event sourcing and found this: If you haven't already, be sure to check out ksqlDB (formerly KSQL). I've been watching it as a foundation for a CQRS+ES system:

👍 4

we do option B in some cases here (at FC, where Jackdaw is written). One gotcha is that a db or document store has prior art for common maintenance tasks and problems, and Kafka as a store is not as mature or as simple. I don't believe there's any place where we trust Kafka as the store without some backup, and backing up a proper db like Postgres is turnkey; backing up Kafka Streams is not


Do you guys just use RocksDB for Kafka Streams state? How long do you keep things around in KTables etc?


we use RocksDB in at least one service, with indefinite storage


(caveat: it's been a while since I worked on that service, I'm not sure how much of that implementation has remained the same, and I probably shouldn't get too specific :D)


@U051SS2EU Appreciate the info (and thanks to FC for Jackdaw -- excellent little library.) If you still remember... in the case where you have a "backup from Kafka", is this backup datasource actually queried by the app runtime or just written to? @UCHV4JZ7A That's an incredibly relevant resource, thank you!


Also intrigued if you ever, routinely or in an emergency, had to replay the whole log?


yes - emergency iirc


With my current company they had it all running on Kafka, but are about to move to Mongo as the datastore. The three most important reasons: RocksDB, being just a key-value store, was a bit limiting, which meant intermediate topics and additional complexity; several global tables take several gigabytes of data on each instance, and it takes about half a minute to fetch those from the broker; and the company wants to move to a multi-cloud environment — Atlas is already available from multiple clouds, while with Kafka some replication needs to be put in place.


In my hobby bank project I used PostgreSQL to store state, mainly to have easy idempotence.
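(The "easy idempotence" a relational store gives you is typically just a unique constraint on the event id, so a redelivered event becomes a no-op. A sketch using SQLite as a stand-in for PostgreSQL — the schema and names are illustrative:)

```python
# Idempotent event recording via a primary-key constraint: inserting the
# same event id twice silently does nothing the second time.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE events (
    event_id TEXT PRIMARY KEY,
    payload  TEXT NOT NULL)""")

def record_event(event_id, payload):
    # INSERT OR IGNORE: a duplicate delivery of the same event id is a no-op
    # (in PostgreSQL you'd use INSERT ... ON CONFLICT DO NOTHING)
    cur = conn.execute(
        "INSERT OR IGNORE INTO events (event_id, payload) VALUES (?, ?)",
        (event_id, payload))
    return cur.rowcount == 1  # True only on first delivery

assert record_event("evt-1", "deposit 10") is True
assert record_event("evt-1", "deposit 10") is False  # redelivery ignored
```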


a workaround we've implemented is putting documents in an external store, and flowing document ids through kafka (with the convention that the documents are versioned and immutable)

👍 4
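(The workaround above can be sketched roughly like this: documents live in an external store under immutable, versioned keys, and only a small (doc_id, version) reference flows through Kafka. The store and message shapes here are hypothetical, not FC's actual implementation:)

```python
# Versioned, immutable external documents with ids flowing through Kafka.
doc_store = {}  # stand-in for an external document/blob store

def put_document(doc_id, version, body):
    key = (doc_id, version)
    if key in doc_store:
        # The convention: versions are immutable; new content = new version
        raise ValueError(f"{key} is immutable; write a new version")
    doc_store[key] = body
    return key

def to_kafka_message(key):
    doc_id, version = key
    # Only the small, replay-safe reference goes on the topic
    return {"doc_id": doc_id, "version": version}

ref = put_document("invoice-42", 1, {"total": 100})
msg = to_kafka_message(ref)
# Consumers resolve the reference back against the store:
assert doc_store[(msg["doc_id"], msg["version"])] == {"total": 100}
```

Because the referenced documents never change, replaying the topic months later still resolves to exactly the data each message originally pointed at.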

That sounds pretty much like Crux, only in that case the data also goes through Kafka. Having the indexes apart from the application does solve some of the issues with using just Kafka.


Any gotchas implementing custom data stores? Is it a widely accepted architecture to have some tables that just grow monotonically forever?


Basically we do updates or retractions potentially months later and so far Kafka Streams seems completely magic compared to other platforms for keeping downstream stuff up to date, as long as we feed it storage and don’t mind new instances taking a while to drink the data in. But we’d perhaps feel more comfortable if it wasn’t just backed by RocksDB.


You could also use ksqlDB, which basically lets the RocksDB live in another service and gives you an API to get data from there. It does solve the restart cost, but you need another piece — and with another piece, you might as well use a more traditional database. You also need to take into account how you want to restore the data in case something happens. Relying just on replication in Kafka might be too risky.


also consider what happens if your data is subject to "right to be forgotten" type rules