xtdb

jussi 2025-03-27T08:44:53.885439Z

Any ideas how to improve eviction performance? I have a situation where I may end up deleting > 10k documents while in development and eviction of a single document takes roughly 0.4 seconds, which is 6.5 minutes for 1000 documents. (PostgreSQL, XTDB v1)

jussi 2025-03-27T09:13:41.205009Z

Hmm, this seems to be directly proportional to the history size of the document in question.

refset 2025-03-27T10:37:01.872719Z

Hey @jussi.mononen this is possibly related to this known issue https://github.com/xtdb/xtdb/issues/1509 - although you're using RocksDB, right? Are you using transaction functions in these transactions?

jussi 2025-03-27T10:45:04.555539Z

Yes to both RocksDB and tx.

(defn tx-ops-succeeded?! [node ops]
  (let [tx (xt/submit-tx node ops)]
    (xt/await-tx node tx)
    (xt/tx-committed? node tx)))

(defn throw-if-ops-fail! [node ops]
  (when-not (tx-ops-succeeded?! node ops)
    (throw (ex-info "Transaction failed" {:type ::transaction-failed}))))
And then I might have several thousands of
[::xt/evict id]
operations executed with throw-if-ops-fail!
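
One thing that could be tried here is grouping the evict operations so that each transaction carries a batch of them instead of one. The sketch below is only an illustration: evict-batches and the batch size are hypothetical names and values, not part of XTDB, and whether batching helps at all is an open question given that the cost appears to follow each document's history size. It only builds the operation vectors, using the fully qualified :xtdb.api/evict keyword (what ::xt/evict expands to when xt aliases xtdb.api).

```clojure
(defn evict-batches
  "Splits ids into groups of at most batch-size and turns each group into
  a vector of evict operations, i.e. one vector per transaction.
  Hypothetical helper, not part of XTDB."
  [ids batch-size]
  (->> ids
       (partition-all batch-size)
       (mapv (fn [batch]
               (mapv (fn [id] [:xtdb.api/evict id]) batch)))))

;; Usage with the helper from the message above (batch size is an assumption):
;; (doseq [ops (evict-batches ids 500)]
;;   (throw-if-ops-fail! node ops))
```

This saves the submit/await round trip per document but does not change the amount of history that each eviction has to touch.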

👍 1
refset 2025-03-27T12:06:08.173349Z

is it making the dev flow unviable as-is? I could potentially look into this soon if it's preventing you moving forward

jussi 2025-03-27T12:19:53.273859Z

It's kind of annoying, I usually defer these operations to the evening and just leave my processes running for the night 🙂

👍 1
jussi 2025-03-27T12:35:10.325069Z

What to do if (xt/sync node) takes ages? Fresh container, first start and very large existing database? We are using checkpoints.

jarohen 2025-03-27T12:41:18.682659Z

hey @jussi.mononen 👋 depending on how long it's taking, one quick fix might be to increase the checkpoint frequency
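
For reference, a rough sketch of where the checkpoint frequency lives in an XTDB v1 node config, assuming a RocksDB index store and a filesystem checkpoint store. The function name, the paths and the 15 minute value are placeholders for illustration; a real deployment may well use a cloud checkpoint store module instead, and the option names should be checked against the checkpointing docs for the XTDB version in use.

```clojure
(import 'java.time.Duration)

(defn index-store-with-checkpoints
  "Builds the :xtdb/index-store part of a node config with a checkpointer.
  Hypothetical helper; db-dir, cp-dir and frequency are supplied by the caller."
  [db-dir cp-dir frequency]
  {:xtdb/index-store
   {:kv-store
    {:xtdb/module 'xtdb.rocksdb/->kv-store
     :db-dir db-dir
     :checkpointer
     {:xtdb/module 'xtdb.checkpoint/->checkpointer
      ;; Assumption: filesystem store; swap in the store module you actually use.
      :store {:xtdb/module 'xtdb.checkpoint/->filesystem-checkpoint-store
              :path cp-dir}
      ;; Shorter duration = more frequent checkpoints = less to replay on start.
      :approx-frequency frequency}}}})

;; Example values only:
;; (index-store-with-checkpoints "/var/xtdb/indexes" "/var/xtdb/checkpoints"
;;                               (Duration/ofMinutes 15))
```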

jarohen 2025-03-27T12:41:50.045559Z

otherwise, how's your indexing speed normally?

jussi 2025-03-27T12:42:16.695789Z

1 hour

jarohen 2025-03-27T12:42:29.189429Z

if this is the same env as the previous thread, then it might be that the significant volume of evictions is slowing your indexing down over time

jussi 2025-03-27T12:42:59.937799Z

It's the same env, yes.

jussi 2025-03-27T12:43:20.219289Z

One of our containers is having a hard time starting due to the long sync, and the startup probes fail

jarohen 2025-03-27T12:43:48.817979Z

eviction is really only intended for cases where you're legally obliged to erase data, the assumption being it's a small fraction of your overall data - it's not optimised for anything beyond that

jussi 2025-03-27T12:44:35.593639Z

Yeah, I'm aware of that, we have a case where the original data has to be replaced completely, and the derived documents as well

👍 1
jussi 2025-03-27T12:45:44.947949Z

i.e. we noticed that one integration provided incorrect data, and since our API is idempotent the base data needs to be evicted before data with the same ids can be inserted 🙈

jarohen 2025-03-27T12:47:13.807599Z

ah, ok, I see 👍 🤔

jarohen 2025-03-27T12:47:43.096809Z

would a cross-time delete be sufficient, rather than an evict? those are likely to be much quicker
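
As a sketch of what a cross-time delete might look like in XTDB v1: a delete operation can take a start and end valid time, so passing a range that covers all the valid time you care about removes the entity across that range. The helper name and the example dates are assumptions for illustration. Unlike evict, the old documents are not erased and stay reachable through transaction-time history.

```clojure
(defn cross-time-delete-ops
  "Builds one delete operation per id covering valid time from start
  (inclusive) to end (exclusive). Hypothetical helper, not part of XTDB."
  [ids start end]
  (mapv (fn [id] [:xtdb.api/delete id start end]) ids))

;; Example range only; pick bounds that cover your actual valid-time history:
;; (throw-if-ops-fail! node
;;   (cross-time-delete-ops ids #inst "1970-01-01" #inst "2100-01-01"))
```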

jussi 2025-03-27T12:48:05.606999Z

Maybe.

jussi 2025-03-27T12:52:12.227809Z

Regarding the slow startup sync, are there ways to expedite it? Does it read all checkpoints or only the latest? Our latest checkpoint contains 81 files, of which roughly 70 .sst files are 68MB each

jarohen 2025-03-27T12:53:13.367559Z

it'll only read the latest, yep

✅ 1
jussi 2025-03-27T13:10:00.216939Z

280 secs it took for a completely clean instance to reconstruct local indexes from prod db

jussi 2025-03-27T13:41:42.426689Z

Aaaaand that's the issue with our container, Google limits startup probe's maximum time to 240 seconds

jussi 2025-03-27T13:45:39.189389Z

Does the time it takes to sync grow linearly with the amount of data?

jarohen 2025-03-27T13:46:17.955349Z

shouldn't do - beyond the initial checkpoint download, it should be proportional to the data entered since the last checkpoint

jussi 2025-03-27T13:47:13.323579Z

👍🏻 good 😅

jussi 2025-03-27T13:47:37.031939Z

So I guess our hassle can be traced back to those numerous evictions done 🙈

jussi 2025-03-27T13:48:05.023819Z

and checkpointing not getting those updates fast enough due to the one hour update cycle

refset 2025-03-27T14:01:56.468599Z

ah okay, so those evictions aren't just a development-time problem for you... (they're in the prod data, if I'm understanding correctly?). If you're simply not able to work around the startup limit (i.e. you're totally blocked) I can take a look at the eviction code and have a proper think soon

jussi 2025-03-27T14:18:04.350499Z

well, it escalated quickly from dev time issue to prod issue since our container decided to dump itself and it took a bit of time to figure out the start probe timeout was the culprit, not the time spent in syncing 😅

jussi 2025-03-27T14:19:06.330849Z

totally unrelated, but the container that went bust couldn't be shut down since the new instance version couldn't satisfy the startup probe 🌪️

jussi 2025-03-27T14:19:27.989199Z

(that's managed cloud services for you 😂 )
