finally giving xtdb a go to deal with UK gov't open data updates and as of analytical queries
it has felt the right shape for a while, but my annoyance with the input data has only recently gotten big enough to go for it.
I can see some issues with the valid from dates as sometimes the updated open data can be worse than the previous stuff, but I'll figure out those issues as I come to them
I'll be starting with open data and some open source work so I should be doing everything in public which will hopefully allow people to mock me and learn from my mistakes
awesome βΊοΈ yeah, let us know how it goes!
will do. I think my first question is "what should go in a good :xt/id? I've got a bunch of csv files where the unique key would be something like calendar year and Local Authority ID, but I'm not quite sure what the constraints are on it and I'm not finding it quickly in the docs
generally recommended to go with UUID
in this case, I'd look at one of the hashing UUIDs, combine the natural key together into a hash
I wonder then if I should start with a problem statement. Various gov't bodies release new datasets on yearly, or other slow schedules. Later releases will often revise earlier ones. I'd like to update the earlier records using any later records. None of the files have a unique ID as such, though there is usually some combination of fields that should satisfy a uniqueness constraint (most often some kind of date or an event, like a census point, plus the ID of the organisation like the LA). I might have to normalise the incoming data, but this holds for most of the input data I'm looking at. Occasionally the event date needs to be derived from the file name or some other out of band knowledge, but it is easy enough to get this into a dataset before input.
so, IIUC, using a UUID won't give me a way of saying "use the event here that is valid from this date" as there won't be a way to match up the updated event records?
(The SEN2 return is a good example of this. There is a return every year where the start date of the new file overlaps the end date of the previous year's file and I want to update the records at the end of the previous year's file with the ones from the new year's file)
ah, ok - in that case, I'd use just the local authority ID, and have valid-time be the period of validity for the metric
do they need to overlap in time at all?
https://clojurians.slack.com/archives/CG3AM2F7V/p1755515435785549?thread_ts=1755514091.513759&cid=CG3AM2F7V how do you mean?
There is a return every year where the start date of the new file overlaps the end date of the previous year's fileso let's say you've got 2024's file and 2025's, what time range do those files apply to, what updates do you make and when?
2023's file applies to 2011-2023, 2024s file applies to 2022-2024, 2025's file applies to 2022-2025
(in this concrete example)
aha, interesting - so does 2024's file supersede the 2022-2023 period from the 2023 file, or do you need to have both accessible?
yes
(well sometimes, but at least it is that's years best guess)
right, that makes a lot of sense π
this is why I was thinking the as of properties or xtdb would be useful for me
ok, so I'd probably model this as one entity for each local authority (assuming that's the granularity of data that's coming in) - XT then gives you a timeline for each entity
I was about to do a Greenspun's 10th law on it
haha π
so at the start you'd have 2023's file being asserted (in the same sense as Datomic assertions) from valid-time 2011-2023 (2021?)
then, when you get 2024 in, you assert that from valid-time 2022-2024 - etc etc
the :xt/id is the essential bit for the bitemporality tho (along with a valid-time)?
bbiab π₯ͺ
and thx
two questions you can then ask:
β’ what's our most recent understanding of the data for 2023? (`FOR VALID_TIME AS OF 2023`, for system-time as of now is implicit)
β’ what did the 2023 survey say about the data for 2023? (`FOR VALID_TIME AS OF 2023 FOR SYSTEM_TIME AS OF 2023-12-31` , or whatever the dates are - you could even do FOR VALID_TIME AS OF 2023 FOR ALL SYSTEM_TIME to get the history of what you knew about 2023)
yep - xt/id is how you determine what updates form a single entity timeline, what facts you're updating when you insert a new document - so the London results would have their own valid/system-time timeline, East Anglia theirs
I think I'll try to get something together quick but wrong for today and then see how I get on from there.
being able to do everything in process is really handy (I'm going to be doing stuff with clerk, tech.ml.dataset, and xtdb all together). I'll share what I have here
I seem to be holding it wrong. I'd expect this to insert 2 documents?
(ns benchmarking.regional-template
#:nextjournal.clerk{:page-size nil
:auto-expand-results? true
:budget nil}
(:require
[xtdb.node :as xtn]
[xtdb.api :as xt]))
(defonce node (xtn/start-node))
(xt/status node)
(def data
[:put-docs {:into :users :valid-from #inst "2025-01-01"}
{:xt/id "jms" :first_name "James"}
{:xt/id "bld" :first_name "Bruce"}])
(comment
(xt/submit-tx node data)
)once I evaluate (xt/submit-tx node data)
but I get the exception
1. Unhandled xtdb.error.Incorrect
expected SQL string/vector (rest of trace elided)try [[:put-docs ...]]?
bingo
of course it is another layer of [] π
great success!
(ns benchmarking.regional-template
#:nextjournal.clerk{:page-size nil
:auto-expand-results? true
:budget nil}
(:require
[xtdb.node :as xtn]
[xtdb.api :as xt]))
(defonce node (xtn/start-node))
(xt/status node)
(def data
[:put-docs {:into :users :valid-from #inst "2025-01-01"}
{:xt/id "jms" :first_name "James"}
{:xt/id "bld" :first_name "Bruce"}])
(comment
(xt/submit-tx node [data])
)
(-> (xt/q node
'(from :users [_id first_name])))
;; => [{:xt/id "bld", :first-name "Bruce"}
;; {:xt/id "jms", :first-name "James"}]This seems to be the tutorial I'm after? https://docs.xtdb.com/static/learn-xtql-today-with-clojure.html
is there a way to pull out the valid from and to times?
yep - '(from :users [xt/id xt/valid-from xt/valid-to first-name])
cool, and putting a new doc updates the valid-to time. That's what I was hoping for
I'm glad that my annoyance driven development has pushed me towards finally using this sweet system you've built
please give everyone involved a hug from me if they'd like one. If not, then just a "nice one, cool hack"
I'm liking how this solves some of my fundamental problems with golden source systems (there are still the DDD ones, but there are probably other solutions for that)
now to see if I can actually get the things I need done by Friday
just being able to avoid column names that have the years encoded into them is going to be great (see https://geoportal.statistics.gov.uk/datasets/3959874c514b470e9dd160acdc00c97a_0/explore )
After being asked by @seancorfield is there anything that describes the tradeoffs between using sql vs xtql?
Hey @otfrom nothing I can point you to, but there are a few dimensions to consider: β’ Support - right now we have limited capacity to focus on providing a first class, bug-free experience across both, and SQL is taking the priority by far (this may change however) β’ Breaking changes - XTQL is not necessarily a finished design and may require changes to support richer functionality in the future, which could impact upgrade paths β’ Technical capability - SQL has a bigger surface area and therefore there will be a few things you can do in SQL already that you may struggle to achieve in XTQL, e.g. set operations β’ Culture - SQL will raise fewer eyebrows in most organizations and probably give team members who are less invested some warmer feelings about using XT in general β’ Etc.
Thx. That is really helpful. The breaking changes bit is probably the one that affects me the most