Fork me on GitHub
#xtdb
<
2023-12-20
>
emil0r06:12:37

Has anyone found a good solution on referential integrity in XTDB v1 or v2?

1
jarohen10:12:47

In XT1, I'd be looking at transaction functions, in the cases where you really need to enforce it; in XT2, you also have assert-exists and assert-not-exists DML operations which can check for the presence/absence of certain documents

refset22:12:48

Hey @U0545PBND please could describe the scenarios you're most concerned about enforcing? Are you needing referential integrity across time?

emil0r14:12:57

No, I just need it at point in time. What Datomic has would suffice. Basically I want the database tell me when I have messed up on references (ie, deleting something that is still referenced, or adding a reference that doesn’t exist at this point in time). I’m interested in replacing SQL with XTDB or Datomic for a new project I’m working on, and I really, really want to have referential integrity similar to Foreign Keys from SQL. Either enforced by the db itself, or if someone had a really nice library that did it. I wrote a small something using xtdb v1 transactor functions that works (only got put at the moment). But it’s a bit clunky and was wondering if someone knew of a good solution already

refset17:12:38

Got it - very helpful - thank you for the detail 🙂

refset17:12:27

if you want to share your small transaction function example I'm sure others might appreciate seeing it and it might spur others to share also

emil0r05:12:23

The transactor function:

;; :with-schema/put
   {:xt/id :with-schema/put
    :xt/fn '(fn [node schema-name e timestamp]
              (try
                (let [db (xtdb.api/db node)
                      {malli-schema :malli/schema
                       db-schema :db/schema :as schema}
                      (xtdb.api/entity db schema-name)]
                  (when (nil? schema)
                    (swap! novi.xt/tx-log
                           update timestamp conj
                           {:db.error/type :missing-schema
                            :missing-schema schema-name
                            :message (format "Schema %s does not exist in the datatabse" (str schema-name))}))
                  (when malli-schema
                    (when-not (malli.core/validate malli-schema e)
                      (swap! novi.xt/tx-log update timestamp conj
                             (assoc (malli.core/explain malli-schema e)
                                    :db.error/type :malli))
                      (throw (ex-info "Invalid malli schema" {:schema/name schema-name
                                                              :entity e}))))
                  (when db-schema
                    (doseq [{:keys [:db/ident :db/unique :db.ref/path] :as current-schema} db-schema]
                      (when unique
                        (when-let [eid-uniques
                                   (->> (xtdb.api/q db {:find ['?e]
                                                        :where [['?e ident (get e ident)]]})
                                        (map first)
                                        (into #{}))]
                          (when (and (> (count eid-uniques) 0)
                                     (not (eid-uniques (:xt/id e))))
                            (swap! novi.xt/tx-log update-in
                                   [timestamp :db/unique] conj
                                   {:xt/id (:xt/id e)
                                    :db.error/type :db/unique
                                    :conflicting/xtids eid-uniques
                                    ident (get e ident)
                                    :schema current-schema})
                            (throw (ex-info "Invalid db schema" {})))))
                      (when path
                        (let [query '{:find [?target-id]
                                      :in [[?target-id ...]]
                                      :where [[?target-id :xt/id]]}
                              target-ids (reduce (fn [entity lens]
                                                   (let [v (get entity lens)]
                                                     (cond (sequential? entity)
                                                           (reduced (map lens entity))

                                                           (sequential? v)
                                                           v

                                                           (map? v)
                                                           v

                                                           :else
                                                           (reduced v)
                                                           )))
                                                 e path)]
                          (when (and (seq target-ids)
                                     (empty? (xtdb.api/q db query target-ids)))
                            (swap! novi.xt/tx-log update timestamp
                                   conj {:db.error/type :db.ref/path
                                         :msg "Unable to find the reference(s)"
                                         :reference-ids target-ids
                                         :path path})))))))
                (catch Exception e
                  (taoensso.timbre/error "Unable to run :with-schema/put" {:exception e})
                  (throw (ex-info "Unable to execute :with-schema/put" {}))))
              [[:xtdb.api/put e]])}
The put function:
(defonce tx-log (atom {}))

(defn put [node schema-name e]
  (let [timestamp (java.lang.System/nanoTime)
        tx (xt/submit-tx node [[::xt/fn :with-schema/put schema-name e timestamp]])]
    (xt/await-tx node tx)
    (let [report (get @tx-log timestamp)]
      (swap! tx-log dissoc timestamp)
      (when report
        (throw (ex-info "XT Transaction errors" report)))
      tx)))
Example of a schema migration
;; :novi/user
   {:xt/id :novi/user
    :malli/schema novi.specs.user/User
    :db/schema [{:db/ident :user/email
                 :db/unique true}
                {:db/ident :team/members
                 :db.ref/path [:team/members :team.member/id]}]}
It’s setup this way since there was no way to get information out of a transactor function via throwing exceptions. I also built in malli checks, which you could remove or choose to keep. Haven’t really done any performance checks on how much slower it is, but the little I have tried with it suggests it’ll be more than fine for smaller applications.

🙏 1
refset12:12:58

amazing, thank you for sharing that! it's really helpful for us to reflect on 🙂

1
Marius08:12:57

Good morning! I have some more questions regarindg XTQL, I hope you don’t mind…. First question: Is it possible to dig into maps with a query? E.g. like get-in .

FiVo08:12:08

Hey, there is currently nothing in the standard library to go down multiple levels, but that could be added down the road.

FiVo08:12:37

The following shows how to unnest something one leve deep.

(xt/q node '(-> (rel [{:x {:y 1}}] [x])
                  (with {:z (. x y)})))
  ;; => [{:x {:y 1}, :z 1}]

Marius08:12:19

Okay, it seems the recommended approach would be to keep documents rather flat or unravel them into multiple „tables“, right?

FiVo08:12:14

Yes, important information should sit toplevel because this is also what we currently have metadata about and hence are likely to produce more permanent queries.

jarohen10:12:48

I'd make the decision about what to separate out based on update granularity, tbh - if you frequently find yourself wanting to update the nested structures, and it has a natural entity primary key, that's the point at which I'd be considering a separate table.

Marius10:12:18

Okay, just like good old normalisation 🙂

jarohen10:12:03

absolutely, yep 🙂

Marius08:12:08

Second question: I was planning on querying all timestamps (valid-time) on which a certain field on a document has changed. How would such a query look like in XTQL?

jarohen09:12:53

best bet for this will be 'window functions' (which we don't have yet, but very much on the roadmap) - in SQL, you can achieve something like this with:

SELECT field_value AS old_value,
       LEAD(field_value) OVER (PARTITION BY xt$id ORDER BY xt$valid_from) AS new_value,
       xt$valid_from,
       xt$valid_to
FROM your_table
WHERE field_value <> LEAD(field_value) OVER (PARTITION BY id ORDER BY valid_from)

jarohen09:12:01

(IIRC, at least - it's been a while since I've used them in anger!)

jarohen09:12:19

we'll include something similar in XTQL, too - exact syntax tbc

jarohen09:12:03

there're ways you can do this without window functions, using self joins, but I recall that always being relatively convoluted

Marius10:12:15

Okay, thank you! If it was just about finding changes of the whole document, would it be enough to query xt$valid_from? This would assume that documents without changes are not stored / redundant entries are removed, which is probably not the case, right? Every xt/put stores the document, even if it did not change (ignoring internal optimization, which is transparent to the user).

jarohen10:12:18

> Every xt/put stores the document, even if it did not change it does, yes - we store the fact that you've re-asserted it. this is partly because, with bitemporality, it's not always the case that you can coalesce together two updates in this way, particularly if you have back-/forward-in-time updates

Marius10:12:54

Ah right, that 2nd timeline again…. yeah, very tricky…

mesota13:12:03

This is kind of a generic/high level question. I just learned that there has been some significant work in enabling bitemporality in PostgreSQL. Case in point is this project among others: https://pgxn.org/dist/temporal_tables/ How does that compare against the XTDB’s bitemporality? Of course, there’s the obvious fact that bitemporality in XTDB is native while that of postgresql is kind of an afterthought. But what does that mean in practical terms? For example, what capabilities/advantages does a person using temporal extension to postgres ‘miss out’ by not using XTDB instead?

refset22:12:52

Hey @U9FEN7GF6 the only Postgres support for bitemporal tables I've seen is with https://github.com/hettie-d/pg_bitemporal (the extension you linked is system-time only) The biggest complexities with those approaches is that the queries become very complicated to reason about when you have joins across many bitemporal tables, as do the indexes. Schema migration is also very complex. XTDB aims to avoid/simplify all that with good defaults

👍 1
refset22:12:10

I know the author of this blog post is working on a patch to bring valid-time support into Postgres https://illuminatedcomputing.com/posts/2019/08/sql2011-survey/ - but really I'm not sure how successfully Postgres can be retrofitted to make using this stuff either easy or desirable

👍 1
Stef Coetzee13:12:05

Thanks for the presentation, gents! Had to drop off early, but had some questions: https://discuss.xtdb.com/t/ann-upcoming-a-first-look-at-xtdb-v2-live-session/310/4

🙏 1
markaddleman17:12:55

In XTDB v2, I’m having trouble getting a query plan.

markaddleman17:12:26

This returns the query result. I think it should return a query plan.

1
markaddleman17:12:09

In case it matters, I’m running with SNAPSHOT versions and the latest docker image

jarohen17:12:13

yep, bug on our part - flag not being forwarded via the HTTP server

markaddleman17:12:36

so, it should work for an in-process node?

👍 1
markaddleman17:12:05

cool. I’ll switch that until the bug is fixed. Thanks!

jarohen17:12:47

fixed and pushed a docker image - docker pull to update. thanks for the report 🙂

🔥 2
markaddleman17:12:00

Thanks for the quick turnaround!

🙏 1
markaddleman19:12:59

I’m trying to simulate an event log in XTDB. Each document includes an entity id, a timestamp and an arbitrary set of key value pairs. At query time, I want to join the most recent value of some key value pairs on the eid.

markaddleman19:12:38

I expect the result to be

[{:eid "abc", :first "mark", :city "Houston"} {:eid "xyz", :first "Andrea", :city "Houston"}]

markaddleman19:12:54

But, instead, I get

[{:eid "abc", :first "mark", :city "Houston"} {:eid "xyz", :first nil, :city nil}]

markaddleman19:12:31

I’m sure I’m doing something silly but I don’t see my mistake

jarohen22:12:38

@U2845S9KL if you run the inner queries of each of the left joins individually, do you get the results you'd expect?

jarohen22:12:38

I suspect you could probably make it work this way, but I also might be tempted to get the tables and bitemporality working in your favour. could you, for example, have a cities table, use the eid as its xt/id, use the event time as its valid from, and then it's 'just' a current time query on each table?

jarohen22:12:38

alternatively, keep it all in one table, perform updates on it for the changed fields, then it should also be a current time query? then query the table for all time if you need the events back?

markaddleman22:12:51

I considered that but there’s a problem: These documents are coming from remote agents that have no knowledge of the data schema. The schema is arbitrary and imposed at query time. The xt/id is simply a unique identifier of a statement of fact that the agent noticed and the eid is coming from the domain and only known as a primary key at query time.

markaddleman22:12:17

I’m overstating my requirements just a bit: The schema is known before query time but unknown at the time the data is ingested and, more importantly, the schema changes over time. One possibility is that the data is ingested as the agents report it and then rewritten to a new table when the schema is defined but that approach seems pretty cumbersome.,

jarohen22:12:49

so when you get the first update in, is this the agent saying "the doc keyed by abc now has first:mark"?

jarohen22:12:27

as in, does it know that the eid attribute is the identity for the entity?

markaddleman22:12:26

No, it does not. The agents are monitoring other applications with their own information schemas.

markaddleman22:12:06

Th only thing the agent knows is how to obtain key value pairs from those applications and the time the data is collected

jarohen22:12:09

ah, ok, and so the only time you know how to like these statements together into a single real world entity is at query time?

jarohen22:12:47

if so, what happens if the agent doesn't provide enough to get its fact linked in correctly? or am I missing something?

markaddleman22:12:36

You are not missing anything. That situation is possible. In that case, we either treat the missing data as nil or we cannot link the data

👍 1
markaddleman22:12:42

Currently, I use traditional relational database for this work and my queries are filled with left joins for exactly that reason

jarohen22:12:47

mm, suspect your queries here will be the same 🙂

🙂 1
markaddleman22:12:39

Yes. I don't mind that but how to get the most recent value for a given key attribute and value attribute?

jarohen22:12:07

my first thought would be window functions (although we don't have them yet) - first_value(city) filter (where city is not null) over (partition by eid order by t desc)

jarohen22:12:15

so, if that works, question would be (until we do have them) how to best emulate it

markaddleman22:12:17

Ha. Yes, in addition to left joins, my current queries are filled with last-value window functions

😄 1
markaddleman22:12:00

To make those efficient, I think compound a compound index is necessary: the partition key plus the timestamp.

jarohen22:12:35

yeah - unfortunately our current indices aren't working in your favour here...

markaddleman22:12:11

Yeah, that was going to be my next question: how to emulate a compound index

markaddleman22:12:58

I thought there was a way to do that in v1. Something like a vector of values but maybe I'm wrong about that

jarohen22:12:17

a map of values would get you close, but that's an identity index not a range index (as you'd want for your timestamp ordering)

jarohen22:12:01

we can't use the XT bitemp index (which is sorted by timestamp) because you don't have the PK

markaddleman22:12:16

Yes, what I really want is an identity+range index but a combine b tree is probably plenty good enough

jarohen22:12:02

the bitemp index is the only sorted index atm - the content indices are all hash indices for the time being

jarohen22:12:51

could you use valid-from for your timestamp values?

markaddleman22:12:00

My not preferred approach would be to rewrite the agent docs in xtbd's preferred format but that would effectively double the ingestion load. My agents send about 1000 docs per second.

jarohen22:12:27

right, yeah

markaddleman22:12:41

I did not see your performance slide at yesterday's presentation (wink, wink) but I feel like I'd be pushing limits here

jarohen22:12:37

you'd certainly be the highest volume real XT 2 user right now 😄

🙂 1
markaddleman22:12:49

One necessary ingredient to make this approach work would be some kind of listener interface to know when to rewrite new docs.

markaddleman22:12:55

Any plans for something like that?

jarohen22:12:32

yep, that's on our feature list 👍

jarohen22:12:56

up there with window functions 🙂

👍 1
jarohen22:12:18

(we have a busy 2024 in store too)

👍 1
markaddleman22:12:30

Ok. All this is experimental right now for me. I'm unhappy with my current database and looking for options but I am in no rush to make a change

markaddleman22:12:11

Thanks for all your input. I'd like xtdb to be a viable solution for my use case. I'll keep an eye on your progress 🙂

🙏 1
jarohen22:12:47

sounds good 👌 if you don't mind being on our early adopters list (cc @U899JBRPF) it'd be great to work this through with you as we go?

ack 1
markaddleman22:12:22

I'd love to

👍 1
🙏 1