xtdb

jussi 2025-12-05T09:02:43.055389Z

Hi, pondering how to approach an XTDB v1 situation where I might have to evict at least 300 000 documents. 😅 Currently we have eviction done like this:
• Fetch ids
• Partition ids into buckets of 1000 documents
• Create tx operations for each bucket
• Run tx operations one bucket at a time
What we have noticed is that this is quite a heavy operation, and in production we have seen that it can block writes, resulting in errors. Are there any other approaches to this particular problem? This is not frequent, and usually the reason is a customer requirement, as they want us to update their data significantly. I can tackle this by timing it well (like a weekend night when activity is low), but as our customer base grows I expect this to become more frequent. I would especially like to know if the id fetching could be streamed or "windowed" somehow to avoid a huge spike in CPU/memory usage.
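
The bucketed approach described in this message might look roughly like the sketch below. `eviction-batches` is a hypothetical helper name, not part of XTDB. The evict operation is written as the full keyword `:xtdb.api/evict` (what `::xt/evict` expands to) so the snippet needs no dependencies. Each returned vector would presumably be passed to `xt/submit-tx` as one transaction.

```clojure
(defn eviction-batches
  "Splits ids into buckets of at most batch-size and turns each bucket
  into a vector of evict operations, i.e. one transaction per bucket."
  [batch-size ids]
  (map (fn [bucket]
         (mapv (fn [id] [:xtdb.api/evict id]) bucket))
       (partition-all batch-size ids)))

;; Three ids in buckets of two yield two transactions.
(eviction-batches 2 [:a :b :c])
;; => ([[:xtdb.api/evict :a] [:xtdb.api/evict :b]] [[:xtdb.api/evict :c]])
```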

jussi 2026-01-09T08:26:20.149079Z

One observation more, the checkpoint restoring is taking "forever" after the eviction of those documents (forever from Cloud Run perspective at least) causing errors as the startup probe fails. I've had to increase the startup probe tolerance quite a lot.

jussi 2026-01-09T08:27:37.796859Z

"forever" means over 20 minutes easily. This is a small instance (2 CPU, 8GB), so if the startup probe tolerance cannot be raised high enough, I may have to double the container specs.

jussi 2026-01-09T10:09:16.291909Z

Note for others 😎 If you are running XTDB v1 on Google Cloud Run and you encounter a similar issue to ours (evicting 566 000 documents), which results in a HUGE checkpoint update, and the Cloud Run startup probe thresholds are not sufficient to allow the checkpoint to be consumed before the startup probe fails and forces an instance restart:
1. https://v1-docs.xtdb.com/administration/checkpointing/#_note_about_using_filesystem_checkpoint_store
2. change the HTTP startup probe to TCP
3. tolerate an unresponsive database for a while
4. profit

jussi 2026-01-09T12:42:02.715869Z

Sidenote, latest checkpoint contains 113 files where ~100 are of size 67MB 👀

jussi 2026-01-09T12:54:01.829069Z

@taylor.jeremydavid if I'm not interested in storing all those 566k [::xt/evict (first id)] operations in checkpoints, in order to hasten the startup of a new node/revision, how should one proceed? Can I just delete stuff 🙈

jussi 2026-01-09T13:24:04.977329Z

We have a single node using XTDB, and basically checkpoints are needed only when we deploy a new revision with new features into pilot/prod. Due to changes in our infra we have seen a drastic change in query frequencies, and query performance is not that critical to us anymore. I just started to wonder whether we actually need checkpointing when our query patterns are random and infrequent (XTDB is the golden storage with versioning, and for most use cases we just export the data to other data processors).

👍 1
refset 2026-01-09T13:49:13.126019Z

> Can I just delete stuff 🙈
you mean from the Postgres table(s) directly, via some one-off SQL statements? That can work, yep. Just be careful 🙂 The checkpoint stores the latest offset, so if you only delete entries that are greater than that offset, the indexer will skip to the next greatest entry without realising anything is missing

jussi 2026-01-09T13:50:33.283999Z

Oh no, that sounds way too scary 😅

🙈 1
jussi 2026-01-09T13:52:04.500339Z

Anyway, the production API's are up and running and new data is constantly flowing in

jussi 2026-01-09T13:53:04.984319Z

So the checkpoints have new content after the evictions

jussi 2026-01-09T13:54:52.748579Z

What would be the effect if I disable checkpointing completely and deploy a new node at this point of time? Would the next deployed instance need to index the whole history of all documents before being usable?

refset 2026-01-09T22:23:37.430899Z

> Would the next deployed instance need to index the whole history of all documents before being usable?
Essentially yes, checkpoints are just a kind of cache

👍🏻 1
jussi 2026-01-10T07:06:53.961049Z

Could I expedite the checkpoint "vanishing" by introducing retention policies?

jussi 2026-01-10T07:07:43.075639Z

I mean, I just want to somehow get rid of the large checkpoints so that I can stop worrying about the startup probes failing 😂

refset 2026-01-10T08:03:30.989299Z

Pruning older checkpoints manually is safe, but yes retention policies are the automatic approach, configurable since https://github.com/xtdb/xtdb/pull/2591

👍🏻 1
jussi 2026-01-10T08:04:13.131709Z

So I could just remove files from a checkpoint?

jussi 2026-01-10T08:04:55.013129Z

I mean from the latest checkpoint or does that cause havoc in indexing?

refset 2026-01-10T16:23:53.302879Z

Ah, you mean files within RocksDB - I wouldn't recommend deleting any files directly, Rocks spreads data throughout them and does integrity checking. But you can evict (delete) kv entries and then run compaction

refset 2026-01-10T16:25:25.824749Z

You're kind of fighting XT's state replication at this point, but if you're careful and only have one active node, and backups, you can be quite confident about the impacts

jussi 2026-01-10T16:50:12.306729Z

Yeah, maybe I'll just implement an aggressive retention policy, wait for it to do its job and then restore a more lenient version. This is not a production blocker, I can use a TCP startup probe to overcome the limitations in GCP startup probe max timeouts

👌 1
jussi 2026-01-08T13:44:15.034879Z

FWIW streaming allows us to maintain normal operations during the eviction, but eviction of 200 documents takes on average 40 seconds meaning 5 ops/sec. This means evicting 283 000 documents takes approx. 15h 😅
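
The estimate above can be checked with a few lines of arithmetic. The figures are the ones quoted in the message; the var names are only illustrative.

```clojure
(def docs-per-batch 200)
(def seconds-per-batch 40)
(def total-docs 283000)

;; 200 documents every 40 seconds
(def docs-per-second (/ docs-per-batch seconds-per-batch))  ; => 5

;; 283 000 documents at 5 per second
(def total-seconds (/ total-docs docs-per-second))          ; => 56600

;; 56 600 s / 3600 s per hour, roughly 15.7 hours
(def total-hours (double (/ total-seconds 3600)))
```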

jussi 2026-01-08T14:06:43.858999Z

(disclaimer: the prod instances are not very large, this could be faster with larger instances)

refset 2026-01-08T16:16:19.329859Z

Interesting. Thanks for following up. Is it still causing a pain then?

jussi 2026-01-08T17:06:17.097009Z

No, I truly hope this is a unique situation. The integrator pushed data to a single owning entity where he should have pushed it to at least 6 different owners, so we needed to evict the docs to re-push the data to the correct owning entities. We considered moving those documents programmatically but chose to clean the slate just to be sure the data is where it belongs. The integrator couldn't provide reliable enough identification for us to be 100% sure of a programmatic change of ownership

🤞 1
jussi 2026-01-14T08:25:26.295159Z

Deploying now a version with 1 minute retention time to our pilot system, since I can't see any signs of checkpoints being cleared. Checkpoints are uploaded normally

jussi 2026-01-14T08:27:40.869749Z

Just for the sake of completeness, this is our kv-storage config now.

(defn kv-store [dir]
  {:kv-store (merge {:xtdb/module 'xtdb.rocksdb/->kv-store
                     :db-dir (io/file dir)
                     :sync? false
                     :enable-filters? true

                     :block-cache {:xtdb/module 'xtdb.rocksdb/->lru-block-cache
                                   :cache-size (* (:xtdb-cache-size-mb config) 1024 1024)}}

                    (when (:xtdb-checkpoint-bucket config)
                      {:checkpointer {:xtdb/module 'xtdb.checkpoint/->checkpointer
                                      :store {:xtdb/module 'xtdb.google.cloud-storage/->checkpoint-store
                                              :path "/mnt/xtdb-checkpoints"}
                                      :approx-frequency (java.time.Duration/ofHours 1)
                                      :retention-policy {:retain-newer-than (java.time.Duration/ofMinutes 1)
                                                         :retain-at-least 5}}}))})
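
As a way to reason about what this policy should delete, here is a small sketch of one plausible reading of the two keys: a checkpoint is deletable only if it is outside the newest `:retain-at-least` entries and also older than the `:retain-newer-than` window. This reading is an assumption, and `deletable-checkpoints` is a hypothetical function, not XTDB's implementation.

```clojure
(import '(java.time Duration Instant))

(defn deletable-checkpoints
  "Given checkpoint instants and a policy, returns those that fall outside
  BOTH the newest retain-at-least entries AND the retain-newer-than window.
  Assumed semantics for illustration; not XTDB's actual implementation."
  [checkpoint-times {:keys [retain-at-least retain-newer-than]} ^Instant now]
  (let [cutoff (.minus now ^Duration retain-newer-than)]
    (->> (sort #(compare %2 %1) checkpoint-times) ; newest first
         (drop retain-at-least)                   ; always keep the newest N
         (filter #(.isBefore ^Instant % cutoff)))))

;; Seven hourly checkpoints, the newest 30 minutes old.
(let [now    (Instant/parse "2026-01-14T08:30:00Z")
      times  (map #(.minus now (Duration/ofMinutes (+ 30 (* 60 %)))) (range 7))
      policy {:retain-at-least 5
              :retain-newer-than (Duration/ofMinutes 1)}]
  (count (deletable-checkpoints times policy now)))
;; => 2
```

Under this reading, a store holding five or fewer checkpoints would never have anything cleared, whatever the one-minute window says.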

jussi 2026-01-14T08:36:54.985429Z

It does not seem to run clearing at all. Below is some recent logging from our GCP instance

2026-01-14 10:25:20.171
time="14/01/2026 08:25:20.171529" severity=INFO message="File system has been successfully mounted." mount-id=pilot-xtdb-checkpoints-5595e368
2026-01-14 10:25:29.413
2026-01-14 08:25:29,411 [main] INFO  xtdb.checkpoint - restoring from {:xtdb.checkpoint/cp-format {:index-version 22, :xtdb.rocksdb/version "7"}, :tx {:xtdb.api/tx-time #inst "2025-08-06T10:49:26.697-00:00", :xtdb.api/tx-id 2219256}, :xtdb.checkpoint/cp-path #object[sun.nio.fs.UnixPath 0x1b6b605 "/mnt/xtdb-checkpoints/checkpoint-2219256-2026-01-14T08:10:23.274-00:00"], :xtdb.checkpoint/checkpoint-at #inst "2026-01-14T08:10:23.274-00:00"} to rocksdb/index-store-for-postgres-pg-67368c0-carbonlink-pilot.f.aivencloud.com-carbonlink-test
2026-01-14 10:25:35.980
2026-01-14 08:25:35,978 [main] INFO  xtdb.tx - Started tx-ingester
2026-01-14 10:35:20.173
time="14/01/2026 08:35:20.170624" severity=INFO message="Starting a garbage collection run." mount-id=pilot-xtdb-checkpoints-5595e368
2026-01-14 10:35:20.207
time="14/01/2026 08:35:20.205627" severity=INFO message="Garbage collection succeeded after deleted 0 objects in 34.887143ms." mount-id=pilot-xtdb-checkpoints-5595e368

jussi 2026-01-14T08:52:54.552449Z

When looking at https://github.com/xtdb/xtdb/blob/1.x/core/src/xtdb/checkpoint.clj#L80

(defn checkpoint [{:keys [dir bus src store ::cp-format approx-frequency] :as checkpoint-opts}]
and then https://github.com/xtdb/xtdb/blob/1.x/core/src/xtdb/checkpoint.clj#L69C1-L69C76
(defn apply-retention-policy [{:keys [store ::cp-format retention-policy]}]
Am I correct to say that the retention arguments are not passed?

👀 1
jussi 2026-01-14T08:54:06.999479Z

:keys does not pick up the :retention-policy key?
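
On the destructuring question itself, a minimal sketch of the Clojure semantics: `:keys` binds only the names it lists, but `:as` still holds the complete map, so an unlisted key survives if that map is passed on. Both functions below are hypothetical stand-ins. Whether XTDB's `checkpoint` actually forwards `checkpoint-opts` this way is not shown in the excerpts above.

```clojure
(defn apply-policy
  "Hypothetical stand-in: destructures :retention-policy from its argument."
  [{:keys [store retention-policy]}]
  (when retention-policy
    [:would-apply retention-policy store]))

(defn checkpoint-step
  "Hypothetical stand-in: :keys lists only :store, but :as keeps the
  whole map, unlisted keys included."
  [{:keys [store] :as opts}]
  (apply-policy opts))

(checkpoint-step {:store :gcs :retention-policy {:retain-at-least 5}})
;; => [:would-apply {:retain-at-least 5} :gcs]
```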

refset 2026-01-14T09:02:38.008229Z

AFAICT the config should pass through okay

refset 2026-01-14T09:02:51.888019Z

just a thought: does the service account have delete permissions on the bucket?

jussi 2026-01-14T09:03:37.006989Z

It should.

jussi 2026-01-14T09:05:01.321909Z

I'm looking for this logging in my service logs, it should be there despite the permissions. https://github.com/xtdb/xtdb/blob/1.x/core/src/xtdb/checkpoint.clj#L77

(log/infof "Clearing up old checkpoint, %s, based on `retention-policy`" checkpoint-opts)
But it never shows up, and above it is a check
(when retention-policy
which is why I confused myself about those config opts being passed 🙈

👍 1
refset 2026-01-14T09:05:37.305029Z

if you have repl access you could see what available-checkpoints and calculate-deleteable-checkpoints return

refset 2026-01-14T09:06:42.359409Z

and I guess if you have repl access you could also monkey patch some more logging

refset 2026-01-14T09:07:39.438829Z

I can probably carve out some time tomorrow to try to repro it with a vanilla setup in GCP

👍🏻 1
jussi 2026-01-14T15:23:35.548409Z

Nope, we don't have repl access to prod.

jussi 2026-01-13T10:26:06.160579Z

Hmm, deployed this config yesterday. Still cannot see any logging regarding clearing checkpoints. How could I verify that it is effective? Should I just wait a tad longer for logging to appear?

:retention-policy {:retain-newer-than (java.time.Duration/ofDays 1) 
                   :retain-at-least 1}

refset 2026-01-13T10:30:20.934269Z

Hmm, yes worth waiting 2 days at most, but if it's not done anything after 1.5 days it's very likely not working. It would be good to verify the config separately. If you have a dev environment you can turn down all the numbers to experiment at the scale of minutes

jussi 2026-01-13T10:43:06.185249Z

If needed, yes I have and can 🙂

🤞 1
refset 2025-12-05T11:03:10.855289Z

Hey @jussi.mononen firstly, is this eviction being done for GDPR-type reasons? Are you using RocksDB? If not you might well be running into this outstanding issue: https://github.com/xtdb/xtdb/issues/1509

jussi 2025-12-05T11:10:55.794619Z

RocksDB for index store, PostgreSQL for everything else.

jussi 2025-12-05T11:12:55.294899Z

The main reason is that the customer wants to move from "one gargantuan chunk of data" to "lots of data per company in our concern structure". I.e., they want to split the data between the respective subsidiaries instead of treating it as one big chunk.

refset 2025-12-05T12:21:16.363119Z

Hmm, rather than evicting you might be better to 'decant' into a fresh database. Bulk eviction is not something we've optimised particularly, so there may be some low hanging performance gains, but it's hard to say if that will help you in any case. You can definitely at least stream the id fetching, see open-q
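
A rough sketch of what streaming the id fetch with `open-q` could look like in XTDB v1, combined with small eviction batches. `evict-streaming!`, the `:owner` attribute and the batch size are assumptions for illustration; `xt/open-q`, `xt/submit-tx` and `xt/await-tx` are from `xtdb.api`. Awaiting each transaction is one possible way to throttle, not a recommendation from this thread.

```clojure
(require '[xtdb.api :as xt])

(defn evict-streaming!
  "Streams matching ids with open-q and evicts them in small batches,
  waiting for each transaction to be indexed before submitting the next."
  [node query batch-size]
  ;; open-q returns a closeable cursor, so results are consumed lazily
  ;; instead of being realised in memory all at once.
  (with-open [results (xt/open-q (xt/db node) query)]
    (doseq [batch (partition-all batch-size (iterator-seq results))]
      (let [tx (xt/submit-tx node (mapv (fn [[id]] [::xt/evict id]) batch))]
        ;; Throttle: block until this batch has been indexed.
        (xt/await-tx node tx)))))

;; Hypothetical usage: evict every document with :owner :owner-1.
(comment
  (evict-streaming! node
                    '{:find [e] :where [[e :owner :owner-1]]}
                    200))
```

Note that the cursor stays open for the whole run, which may matter for a job that takes many hours.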

jussi 2025-12-05T12:23:46.501069Z

One instance of XTDB is shared with multiple customers, so can't easily decant them 😕 The streaming could be sufficient if it allows concurrent writes after the change, since evicting is rare and a background process that can run on its own for as long as it takes. Thanks!

refset 2025-12-05T13:59:43.994809Z

Okay, well let me know if your background batching/partitioning approach is still a struggle and I'll give this more thought next week