Fork me on GitHub
#architecture
<
2022-02-02
>
Marius13:02:57

Hi all! I’d appreciate to get your feedback and discuss my solution for the following problem: I’m recently started developing an application in which a fairly large number of details of employees (e.g. address, monthly income, marital status) is stored. For each of those fields I need to keep some kind of “valid from” information, so that I can in create a timeline of all changes, e.g. track all dates on which an employee had an salary increase or change of address. This information is then used for payroll calculations. I’ve been pondering about the perfect solution since quite a while, e.g. looking into specific https://en.wikipedia.org/wiki/Temporal_database or manual (self-written) database constructs, but after reading about Data-Oriented Development, I am convinced that I don’t want to handle the problem in the database (and have to define a schema there), but instead do it in the application using maps. My currently preferred approach is to use PostgreSQL and store my data in text blobs. I imagine that the frontend submits the whole employee object and I use the great Diff’ing library https://github.com/juji-io/editscript by @huahaiy to conveniently compare it against the current value and then put only the diff, together with the valid_from, in the database (with pr-str). (Note: I don’t store each edit, only each change per date on which the change is happening, e.g. new salary valid from May 15th 2022). If the employee data is requested by the frontend for a given day, I will retrieve all diffs from the database (with read-string) and apply/patch them one after another to eventually construct the model as it is valid for that given day. If I need to know all dates on which e.g. the salary changed, I would select only those diffs which contain the specific key, e.g. :monthly-salary. The only problems with this approach is that I would prefer JSON blobs, but that is not easily compatible with Editscript diffs, which are Clojure vectors containing keywords. And the querying for specific changes seems a bit awkward. If there should arise a problem with too many diffs/changes per employee, I would store intermediate snapshots of employees, so that the number of diffs is reduced. I’m planning to use Postgres because the application will have the need to also store a large number of numerical data, for which common relational tables are perfect. What do you think, is this approach feasible? What would you do differently? Thank you!!

lukasz14:02:07

You could change your approach slightly: store only the current version of the record in PG, and offload diff storage to S3 - you can even dump EDN files and not worry about JSON.

lukasz14:02:36

This way you can inspect history of changes to an object on demand, without worrying about dumping all of this into PG - you could, but you still need editscript to do its thing

Marius15:02:50

Yes that’s an alternative. But since object size isn’t an issue (many fields but not big data), I guess a text blob on PG would suffice.

ghadi16:02:33

reinventing Datomic

Marius16:02:02

I don’t think so (but please correct me if I’m wrong), but Datomic is keeping a journal of changes with timestamps (transaction time), it’s not intended to change this journal as it would be common in my case (a journal is not necessary).

Huahai17:02:19

At Juji, we do something similar: we store the current JSONB doc in Postgres, but that’s only for persistence in the end. In other places, we work with a EDN doc, with Editscript diffs sending from client to the server. Yes, there’s indeed a need for conversion between JSON and EDN, but if you institute some rules as to what kind of EDN are allowed, the conversion can be lossless.

Huahai17:02:43

Our rules are the following: 1 map keys must be either integer or keyword; 2. map values must be either integer, boolean or strings, not nil. 3. there can only be maps and vectors, no set nor list.

Huahai17:02:03

Obviously, our plan in the future is to do away with Postgres entirely, and store everything in Datalevin, then we would not have any of these problems.

Huahai17:02:21

Basically, my goal of developing Datalevin is to replace both Datomic and Postgres in our code base. We are gradually doing the switch.

seancorfield18:02:45

At World Singles Networks, we mostly store the current version for general access and record each of the changes (timestamp, field, old/new values) but we're contemplating a possible migration to XTDB at some point where we could rely on the database to just track all changes over time and also track valid time/transaction time. I am not sure, from your question @U02G3DBJ2SY, whether the important thing for you is to be able to ask what sequence of changes happened between time A and time B or just what did the employee look like at time A (and at time B etc)?

seancorfield18:02:53

XTDB has the advantage (for us) that it is document-based and schema-less (whereas Datomic requires you to define schema fragments upfront).

seancorfield18:02:09

Since you can get a history of transactions for any given entity (document), I think XTDB would also satisfy the "sequence of changes" requirement?

Huahai18:02:11

For most operational databases, my opinion is that transaction history is not necessary and the overhead is too great, not just in data stores, but also in developer mental models. Most developers treat databases as external states. It is harder to think about if it’s both states and history. YMMV.

Huahai18:02:25

Our experience of using Datomic taught me that most developer cannot think straight about a database that’s both state and history. Mistakes are made and it just doesn’t worth it.

Marius18:02:11

@U04V70XH6 I did look at XTDB, but the sequence of events is only a nice-to-have (for audit trail purposes, “who gave this employee a raise?!“), more important is to be able to model a sequence of changes (“when will this employee get a raise”), whereas this sequence can be modified by the user. This is something that Datomic and XTDB apparently not support.

Huahai18:02:14

I agree that “keep things separated” is the motto of Clojure. Unfortunately, Datomic violated that. If one needs history, an analytics store is perfectly fine for that. We have lots of great options nowadays. Keep it simple. State is just state.

seancorfield18:02:27

@U02G3DBJ2SY So you would be modeling a speculative future history? XTDB supports that to some degree by providing "valid time" (as well as tx time) but being document-based, I guess you would then need to deal with conflicts as other attributes "change" over time. An interesting puzzle to solve.

Marius18:02:34

@huahaiy yes I am thinking about either storing the Editscript diff as EDN in a Postgres text blob, or implementing a custom mapping between Editscript diff and JSON, so that I can store them in a JSONB blob and query them with SQL (useful to select only those diffs affecting a certain field)

Marius18:02:47

@huahaiy I have to admit I was not aware of Datalevin, I’ll check it out

Huahai18:02:15

Guess I need to do more publicity 🙂 Anyway, my plan is to do more of that at version 1.0.0. At that point, the query performance should be competitive to relational databases, then it truly can be a replacement. Right now, all the Datalog stores in the Clojure world are too slow to replace relational stores.

Huahai18:02:44

Datalevin maybe the fastest option, but it is still too slow for large scale data and complex queries.

richiardiandrea16:02:06

At Cohesic we use xtdb for part of this use case, using valid time to correlate multiple records(entities) between each other. Unfortunately XTDB does not allow "patching" an entity, only transact it as a whole, but with transaction functions you have pretty good control of what goes in and then you have got history for all the records. Correlating and augmenting histories is a bit of a challenge at the moment but the XTDB team is working on it 🙂 What we do is to pull all the record histories we need and merge them in memory

Marius07:02:52

That’s very useful, thank you! I had a look at XTDB before, but somewhat missed the interesting part about time travel (https://docs.xtdb.com/language-reference/datalog-queries/#valid-time-travel). It seems to do what I need, with only exception: Querying all timestamps on which only a certain field in the map has changed seems not possible yet. I guess I could track all fields individually, but that would be too much effort.

richiardiandrea16:02:51

@U02G3DBJ2SY yeah there are some limitation around that use case, you would probably still end up diffing previous and current state, but with the history at your disposal that's an easy one.

jmayaalv13:02:44

any specific reason why do you want to use blobs instead of jsonb? json support in postgres is fantastic.

Marius14:02:02

I agree. The problem I encountered is that serializing Clojure objects to JSON is not lossless. In particular the diffs from Editscript are vectors containing keywords, and those get lost after a roundtrip (e.g. :foo becomes "foo"). That is why I thought using text fields and pr-str / read-string is much easier. But that’s probably something to reconsider.

Huahai18:02:33

Obviously, if you have a Clojure code base, JSON is not as convenient as EDN everything. That’s one of the reasons for me to develop Datalevin: to replace Postgres and Datomic. We are not using the time travel feature of Datomic anyway.