xtdb

2025-08-18T10:48:11.513759Z

finally giving xtdb a go to deal with UK gov't open data updates and as of analytical queries

2025-08-18T10:49:27.099739Z

it has felt the right shape for a while, but my annoyance with the input data has only recently gotten big enough to go for it.

2025-08-18T10:50:06.380339Z

I can see some issues with the valid from dates as sometimes the updated open data can be worse than the previous stuff, but I'll figure out those issues as I come to them

2025-08-18T10:50:39.407409Z

I'll be starting with open data and some open source work so I should be doing everything in public which will hopefully allow people to mock me and learn from my mistakes

jarohen 2025-08-18T10:57:04.355469Z

awesome ☺️ yeah, let us know how it goes!

2025-08-18T11:02:45.129489Z

will do. I think my first question is "what should go in a good :xt/id? I've got a bunch of csv files where the unique key would be something like calendar year and Local Authority ID, but I'm not quite sure what the constraints are on it and I'm not finding it quickly in the docs

jarohen 2025-08-18T11:03:14.291549Z

generally recommended to go with UUID

jarohen 2025-08-18T11:03:34.491849Z

in this case, I'd look at one of the hashing UUIDs, combine the natural key together into a hash

2025-08-18T11:08:33.792549Z

I wonder then if I should start with a problem statement. Various gov't bodies release new datasets on yearly, or other slow schedules. Later releases will often revise earlier ones. I'd like to update the earlier records using any later records. None of the files have a unique ID as such, though there is usually some combination of fields that should satisfy a uniqueness constraint (most often some kind of date or an event, like a census point, plus the ID of the organisation like the LA). I might have to normalise the incoming data, but this holds for most of the input data I'm looking at. Occasionally the event date needs to be derived from the file name or some other out of band knowledge, but it is easy enough to get this into a dataset before input.

2025-08-18T11:09:19.204049Z

so, IIUC, using a UUID won't give me a way of saying "use the event here that is valid from this date" as there won't be a way to match up the updated event records?

2025-08-18T11:10:16.004649Z

(The SEN2 return is a good example of this. There is a return every year where the start date of the new file overlaps the end date of the previous year's file and I want to update the records at the end of the previous year's file with the ones from the new year's file)

jarohen 2025-08-18T11:10:17.270469Z

ah, ok - in that case, I'd use just the local authority ID, and have valid-time be the period of validity for the metric

jarohen 2025-08-18T11:10:35.785549Z

do they need to overlap in time at all?

jarohen 2025-08-18T11:12:01.489309Z

There is a return every year where the start date of the new file overlaps the end date of the previous year's fileso let's say you've got 2024's file and 2025's, what time range do those files apply to, what updates do you make and when?

2025-08-18T11:13:34.646229Z

2023's file applies to 2011-2023, 2024s file applies to 2022-2024, 2025's file applies to 2022-2025

2025-08-18T11:13:45.637039Z

(in this concrete example)

jarohen 2025-08-18T11:14:33.711279Z

aha, interesting - so does 2024's file supersede the 2022-2023 period from the 2023 file, or do you need to have both accessible?

2025-08-18T11:14:39.565269Z

yes

πŸ‘Œ 1
2025-08-18T11:14:56.053289Z

(well sometimes, but at least it is that's years best guess)

jarohen 2025-08-18T11:15:04.251289Z

right, that makes a lot of sense πŸ™‚

2025-08-18T11:15:43.136599Z

this is why I was thinking the as of properties or xtdb would be useful for me

πŸ’― 1
jarohen 2025-08-18T11:15:58.467659Z

ok, so I'd probably model this as one entity for each local authority (assuming that's the granularity of data that's coming in) - XT then gives you a timeline for each entity

2025-08-18T11:16:00.736019Z

I was about to do a Greenspun's 10th law on it

jarohen 2025-08-18T11:16:06.858209Z

haha πŸ˜„

jarohen 2025-08-18T11:16:54.231639Z

so at the start you'd have 2023's file being asserted (in the same sense as Datomic assertions) from valid-time 2011-2023 (2021?)

jarohen 2025-08-18T11:17:07.452239Z

then, when you get 2024 in, you assert that from valid-time 2022-2024 - etc etc

2025-08-18T11:17:36.591429Z

the :xt/id is the essential bit for the bitemporality tho (along with a valid-time)?

2025-08-18T11:18:45.745979Z

bbiab πŸ₯ͺ

2025-08-18T11:18:47.173639Z

and thx

πŸ™ 1
jarohen 2025-08-18T11:19:01.552819Z

two questions you can then ask: β€’ what's our most recent understanding of the data for 2023? (`FOR VALID_TIME AS OF 2023`, for system-time as of now is implicit) β€’ what did the 2023 survey say about the data for 2023? (`FOR VALID_TIME AS OF 2023 FOR SYSTEM_TIME AS OF 2023-12-31` , or whatever the dates are - you could even do FOR VALID_TIME AS OF 2023 FOR ALL SYSTEM_TIME to get the history of what you knew about 2023)

jarohen 2025-08-18T11:20:40.060639Z

yep - xt/id is how you determine what updates form a single entity timeline, what facts you're updating when you insert a new document - so the London results would have their own valid/system-time timeline, East Anglia theirs

2025-08-18T12:53:12.199039Z

I think I'll try to get something together quick but wrong for today and then see how I get on from there.

2025-08-18T12:53:56.929709Z

being able to do everything in process is really handy (I'm going to be doing stuff with clerk, tech.ml.dataset, and xtdb all together). I'll share what I have here

2025-08-18T14:05:02.113349Z

I seem to be holding it wrong. I'd expect this to insert 2 documents?

(ns benchmarking.regional-template
  #:nextjournal.clerk{:page-size            nil
                      :auto-expand-results? true
                      :budget               nil}
  (:require
   [xtdb.node :as xtn]
   [xtdb.api :as xt]))

(defonce node (xtn/start-node))
(xt/status node)

(def data 
  [:put-docs {:into :users :valid-from #inst "2025-01-01"}
   {:xt/id "jms" :first_name "James"}
   {:xt/id "bld" :first_name "Bruce"}])

(comment
  
  (xt/submit-tx node data)

  )

2025-08-18T14:05:25.282689Z

once I evaluate (xt/submit-tx node data)

2025-08-18T14:05:57.004689Z

but I get the exception

1. Unhandled xtdb.error.Incorrect
   expected SQL string/vector (rest of trace elided)

jarohen 2025-08-18T14:06:33.149009Z

try [[:put-docs ...]]?

2025-08-18T14:07:02.421659Z

bingo

2025-08-18T14:07:13.009319Z

of course it is another layer of [] πŸ˜‰

2025-08-18T14:13:28.268189Z

great success!

(ns benchmarking.regional-template
  #:nextjournal.clerk{:page-size            nil
                      :auto-expand-results? true
                      :budget               nil}
  (:require
   [xtdb.node :as xtn]
   [xtdb.api :as xt]))

(defonce node (xtn/start-node))
(xt/status node)

(def data 
  [:put-docs {:into :users :valid-from #inst "2025-01-01"}
   {:xt/id "jms" :first_name "James"}
   {:xt/id "bld" :first_name "Bruce"}])

(comment
  
  (xt/submit-tx node [data])

  )

(-> (xt/q node
          '(from :users [_id first_name])))
;; => [{:xt/id "bld", :first-name "Bruce"}
;;     {:xt/id "jms", :first-name "James"}]

2025-08-18T14:15:58.473589Z

This seems to be the tutorial I'm after? https://docs.xtdb.com/static/learn-xtql-today-with-clojure.html

2025-08-18T14:31:56.068169Z

is there a way to pull out the valid from and to times?

jarohen 2025-08-18T14:33:09.443379Z

yep - '(from :users [xt/id xt/valid-from xt/valid-to first-name])

2025-08-18T14:35:09.629179Z

cool, and putting a new doc updates the valid-to time. That's what I was hoping for

πŸ‘Œ 1
2025-08-18T14:36:50.224869Z

I'm glad that my annoyance driven development has pushed me towards finally using this sweet system you've built

2025-08-18T14:39:08.339949Z

please give everyone involved a hug from me if they'd like one. If not, then just a "nice one, cool hack"

❀️ 1
2025-08-18T14:39:48.295659Z

I'm liking how this solves some of my fundamental problems with golden source systems (there are still the DDD ones, but there are probably other solutions for that)

2025-08-18T14:40:18.544229Z

now to see if I can actually get the things I need done by Friday

2025-08-18T15:26:43.933829Z

just being able to avoid column names that have the years encoded into them is going to be great (see https://geoportal.statistics.gov.uk/datasets/3959874c514b470e9dd160acdc00c97a_0/explore )

2025-08-18T16:05:36.072769Z

After being asked by @seancorfield is there anything that describes the tradeoffs between using sql vs xtql?

refset 2025-08-19T07:01:01.185559Z

Hey @otfrom nothing I can point you to, but there are a few dimensions to consider: β€’ Support - right now we have limited capacity to focus on providing a first class, bug-free experience across both, and SQL is taking the priority by far (this may change however) β€’ Breaking changes - XTQL is not necessarily a finished design and may require changes to support richer functionality in the future, which could impact upgrade paths β€’ Technical capability - SQL has a bigger surface area and therefore there will be a few things you can do in SQL already that you may struggle to achieve in XTQL, e.g. set operations β€’ Culture - SQL will raise fewer eyebrows in most organizations and probably give team members who are less invested some warmer feelings about using XT in general β€’ Etc.

2025-08-19T07:05:17.278189Z

Thx. That is really helpful. The breaking changes bit is probably the one that affects me the most