#xtdb
2023-03-17
jarohen08:03:21

Morning folks 👋 Unfortunately Docker Hub are removing free open-source organisations for any project 'with a pathway to commercialisation' (https://news.ycombinator.com/item?id=35166317), and deleting all of the images in 30 days' time, so we've moved our images to GitHub's GHCR - see 'Packages' on the RHS of http://github.com/xtdb/xtdb for the new locations. If anyone needs images older than 1.23.1, please do give us a shout 🙂

👍 4
Emanuel Rylke16:03:33

We are debugging data loss after a recent upgrade to 1.23.1 and wonder if you've heard similar reports.
* xtdb version 1.23.1
* clojure 1.11.1
* postgres + rocksdb
On app start we rebuild the entire index (no snapshots). Data that was present before a recent restart was not present after. We have not been able to reproduce the issue. We have thawed the nippy objects in the Postgres history but are not sure how to read them. Any pointers?

👀 2
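As a side note for readers, a minimal sketch of how thawing those rows could look from a Clojure REPL, assuming next.jdbc and com.taoensso/nippy are on the classpath; the table and column names (docs, id, doc) and the connection details are placeholders here, so check them against the schema your xtdb JDBC module actually created:

```clojure
(require '[next.jdbc :as jdbc]
         '[taoensso.nippy :as nippy])

;; Assumption: requiring xtdb.codec registers the nippy handlers needed to
;; round-trip #xtdb/id values, so run this in a REPL that has xtdb on the classpath.
(require 'xtdb.codec)

(def ds (jdbc/get-datasource {:dbtype "postgresql"
                              :dbname "xtdb"
                              :user "xtdb"
                              :password "..."}))

;; Fetch a few rows and thaw the nippy-frozen bytes back into Clojure data.
;; Table/column names are assumptions - adjust to your schema.
(->> (jdbc/execute! ds ["SELECT id, doc FROM docs LIMIT 10"])
     (map (fn [row]
            {:id  (:docs/id row)
             :doc (nippy/thaw (:docs/doc row))})))
```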
refset16:03:41

Hi @U04R3BE5XR6 what was the previous version where the data in the indexes looked okay?

refset16:03:20

Was this in a pre-production environment? Did you downgrade and get things working as before?

refset16:03:10

> Data present before a recent restart was not present after.
Was the Postgres instance running continuously as normal in the background throughout this restart process? Or was that also part of the process?

Emanuel Rylke19:03:27

All the missing data is from transactions already submitted with 1.23.1 (we'd upgraded from 1.23.0 before that). Yes, we saw this issue in our staging environment. We did not try to downgrade, as we couldn't yet reproduce the issue with the current version and so wouldn't be able to tell if an earlier version made a difference. The Postgres instance was running continuously throughout.

refset20:03:53

Thanks, and presumably there were no log errors?

refset20:03:36

Were the queries that failed to see the missing data using pull or entity?
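For readers following the thread, a minimal sketch of the two access styles being asked about, assuming a started xtdb.api node bound to `node` and a placeholder entity id:

```clojure
(require '[xtdb.api :as xt])

;; Datalog query that projects the whole document via pull:
(xt/q (xt/db node)
      '{:find  [(pull ?e [*])]
        :where [[?e :xt/id :some-id]]})

;; Direct entity lookup by id:
(xt/entity (xt/db node) :some-id)
```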

Emanuel Rylke11:03:53

There were some WARN [xtdb.tx:0] - Transaction function failed when originally evaluated: #xtdb/id b48d34a409405b2f78690ed13ed50f275034fd71 #uuid "920a6bff-1cc2-446b-b7df-b2ffd8068fda" {:crux.db.fn/exception java.lang.OutOfMemoryError, :crux.db.fn/message "Cannot reserve 131072 bytes of direct buffer memory (allocated: 243229838, limit: 243269632)", :crux.db.fn/ex-data nil} entries in the logs in the time frame between when we last saw the data and when we no longer saw it. We usually do a pull in our queries; I have to ask my colleagues if they tried without it.

refset12:03:55

Ah okay, that's useful information. How many nodes were running against this tx-log concurrently?

refset12:03:25

Do you have any idea what may have caused the OOM? Did this happen during the index rebuild whilst the rest of the application was inactive (i.e. it was only xtdb-internal activity creating the memory pressure)?

daveliepmann13:03:38

I need Emanuel to answer that, but another possibility we're investigating is an InterruptedException (inside xt/listen). Zooming out a bit, we'd like to better understand what happens after these failures (OOM & IE, see https://github.com/xtdb/xtdb/commit/4b17cc18a029048ec5289d97625684d2a957129f#diff-fc22ecc4cd13a74c82d57b24fc8c33c4d2eab71c5d912d2a53a34a26f45e3884R268-R269). Do we have it right that other failures (e.g. clojure.lang.ExceptionInfo) get written to both the tx-log and storage as failed, whereas OOMs and IEs are written as if they succeeded to storage, and the tx-log is not updated? For instance, to debug this issue, we intentionally threw an OOM in a transaction function. The data was not present in subsequent queries. We then removed our (throw (java.lang.OutOfMemoryError. "foo")) and restarted the application, after which the data was present. It appears per #1913 that this is intended xtdb behavior, but we don't have a mental model for when we should expect the data to be made available after such a failure. How should we be thinking about this?
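For reference, a rough sketch of the experiment described above, assuming a started xtdb.api node bound to `node`; the function id :debug/throw-oom is made up for illustration:

```clojure
(require '[xtdb.api :as xt])

;; Install a transaction function that deliberately throws an OutOfMemoryError.
(xt/submit-tx node
              [[::xt/put {:xt/id :debug/throw-oom
                          :xt/fn '(fn [ctx]
                                    (throw (java.lang.OutOfMemoryError. "foo")))}]])

;; Invoke it in a later transaction; as of 1.23.1 / #1913 the OOM is treated as a
;; non-deterministic failure rather than a deterministically failed transaction.
(xt/submit-tx node [[::xt/fn :debug/throw-oom]])
```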

Emanuel Rylke13:03:16

Those warnings were from when we deployed to staging. At that time there would have been 4 nodes running (2 from the old deploy, 2 from the new deploy; outside of deploys we have 2 nodes running). Hard to say what caused the OOME originally; I'm interpreting the logged warning as saying that the OOM happened some unbounded time before.

👍 2
Emanuel Rylke14:03:05

We've diffed two SQL dumps and saw a row in the docs table change. Is this expected, or are those supposed to be append-only?

refset14:03:59

Hi @U05092LD5
> Do we have it right that other failures (e.g. clojure.lang.ExceptionInfo) get written to both the tx-log and storage as failed, whereas OOMs and IEs are written as if they succeeded to storage, and the tx-log is not updated?
As of 1.23.1 and https://github.com/xtdb/xtdb/issues/1913 - yes, this is what happens. Previously OOMs and IEs were not reliably (re-)processed. Existing logs written with an earlier version do not get migrated, so they will exhibit the earlier behaviour up until the point at which the 1.23.1-only writes begin. Manually throwing an OOM should work as you describe. Aside from reading the PRs I don't have a better way to convey a mental model right now, but we could set up a call to discuss if that would be easier?
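A small sketch of how this can be probed from a REPL, assuming a started xtdb.api node bound to `node` and a `tx` receipt returned by submit-tx; the timeout is arbitrary:

```clojure
(require '[xtdb.api :as xt])
(import '(java.time Duration))

;; Block until the node has indexed the transaction (or the timeout elapses)...
(xt/await-tx node tx (Duration/ofSeconds 10))

;; ...then ask whether it was committed or recorded as failed.
(xt/tx-committed? node tx)

;; Comparing these across nodes can also help spot a node that has stopped indexing.
(xt/latest-completed-tx node)
(xt/latest-submitted-tx node)
```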

refset14:03:52

> We've diffed two SQL dumps and saw a row in the docs table change. Is this expected, or are those supposed to be append-only?
Docs are modified for eviction (replaced with tombstones) and after transaction function evaluation ('argument' documents are replaced by the resulting tx-ops, or by an error).

👍 2
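For completeness, a minimal sketch of the two write paths mentioned above that rewrite rows in the document store, assuming a started xtdb.api node bound to `node`; the ids are placeholders:

```clojure
(require '[xtdb.api :as xt])

;; Eviction: the evicted document's content is replaced with a tombstone in the doc store.
(xt/submit-tx node [[::xt/evict :some-id]])

;; Transaction function call: the 'argument' document written for this call is later
;; replaced by the resulting tx-ops (or by an error) once the function is evaluated.
(xt/submit-tx node [[::xt/fn :some-tx-fn :an-argument]])
```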