Fork me on GitHub
#xtdb
<
2023-02-07
>
richiardiandrea17:02:00

Hi all! We are playing with XTDB indexing and trying to delete the LMDB files (from a temporary dir) without removing the database, then we shut down the app and start it over to see what happens.. we basically see many errors now - is there an additional step to perform in order to "reindex"? if that makes sense?

richiardiandrea17:02:32

as far as I am reading, the indexer should replay and "catch" up - however there might be no way for us to know when re-indexing is done?

refset18:02:57

Hey! Re-indexing can be considered 'done' once the lag between latest-submitted-tx and latest-completed-tx is ~0

refset18:02:05

What errors are you seeing?

refset18:02:45

I wouldn't be surprised that deleting the files for a running DB causes problems 😅

richiardiandrea18:02:57

it's an application error (could not find an entity)

refset18:02:00

The exact behaviour may even be OS/FS specific

richiardiandrea18:02:13

maybe we need to run a check like that at startup, just to make sure

refset18:02:30

are you using LMDB for the doc store also? is it possible the configs are confused and you're inadvertently deleting the doc store also?

richiardiandrea18:02:51

yeah we actually deleted both just for fun

richiardiandrea19:02:52

well actually, we use postgres as document store

richiardiandrea19:02:45

so this is very weird...deleting the checkpointing file should not result in a missing entity?

richiardiandrea19:02:05

this is the conf

:xtdb/index-store {:kv-store (configure-index opts)}
              :xtdb/tx-log {:xtdb/module 'xtdb.jdbc/->tx-log :connection-pool ::connection-pool}
              :xtdb/document-store {:xtdb/module 'xtdb.jdbc/->document-store
                                    :connection-pool ::connection-pool}
where configure-index returns the :lmdb module in prod.. I think I am missing something at this point...

jarohen21:02:02

possibly silly question - are you deleting the files and then shutting down the node, or shutting down the node then deleting the files? if the former, I wonder whether we're then writing something on node close which is then confusing the node on startup

richiardiandrea21:02:03

This is what we did

stop app
rm -r /opt/run/appserver/xtdb/* (so indexes and checkpoints)
start app
/opt/run/appserver/xtdb/index exists with a data and lock file

richiardiandrea21:02:42

from my DevOps team

jarohen21:02:27

hmm, ok, so not that then. :thinking_face:

👍 2
jarohen22:02:25

> deleting the checkpointing file should not result in a missing entity? yep, that's correct - so long as you preserve the tx-log and doc-store, everything else is replaceable

refset10:02:53

> it's an application error (could not find an entity) to be sure, you mean something along the lines of: (1) a query is returning no results, or (2) entity is returning nil? Can you confirm that the latest-completed-tx is approximately the same before and after seeing this?

refset10:02:01

Which version are you running?

richiardiandrea16:02:58

@U899JBRPF I am on 1.23.0 and the specific query I see returning nil instead of the actual value is basically this

(-> (xt/db xtdb-node valid-time)
                     (xt/q '{:find [(pull ?e [*])]
                             :in [?patient-record-id]
                             :where [[?e :cohesic/type :patient-history]
                                     [?e :patient-history/patient-record-id ?patient-record-id]]}
                           patient-record-id))
It is very intermittent and happened in prod so we will have time to check`latest-completed-tx`

refset17:02:05

Thanks, that's useful to see. Which JDBC doc store are you using? I wonder if there's something problematic happening with the ad-hoc doc retrieval during pull (assuming there's a cache miss)

richiardiandrea17:02:56

We are using a postgresl doc and tx store using the same connection pool as the rest of the app, which is something like

(defn ->duct-database-sql-connection-pool
  {:doc
   "A function for converting from a Duct database sql component to an XTDB
  connection pool.

  For module documentation see:
    "
   :xtdb.system/deps {:dialect nil}
   :xtdb.system/args {:db {:doc "Instance of a running jdbc connection" :required? true}}}
  [{:keys [dialect db]}]
  (let [datasource (get-in db [:spec :datasource])]
    (try
      (xt.jdbc/setup-schema! dialect datasource)
      (catch Throwable t
        (logging/warn t "Error while setting up the schema" {})
        (.close ^Closeable datasource)
        (throw t)))
    (xt.jdbc/->HikariConnectionPool datasource dialect)))

👍 2
richiardiandrea17:02:48

The thing is I cannot repro locally cause I have a very little tx log

richiardiandrea17:02:08

unless there is a way to "stop" the indexing and induce a cache miss

refset17:02:48

you could probably manually override the document-store cache and configure its size to 0 to avoid all cache hits

richiardiandrea17:02:16

ok let me try to do that

richiardiandrea17:02:33

is https://github.com/xtdb/xtdb/blob/e2f51ed99fc2716faa8ad254c0b18166c937b134/core/src/xtdb/document_store.clj#L63 what you were thinking about? Cause we do not use that cached version, we use the vanilla 'xtdb.jdbc/->document-store

refset18:02:30

protocols all the way down...

richiardiandrea19:02:55

I am seeing some errors indeed with

:xtdb/document-store (xt.doc-store/->cached-document-store
                                    {:document-cache (xt.cache/->cache {:cache-size 0})
                                     :document-store
                                     {:xtdb/module 'xtdb.jdbc/->document-store
                                      :connection-pool ::connection-pool}})
I think we might be onto something

refset20:02:02

interesting, and thanks for sharing the override snippet

refset20:02:49

do you have any active instrumentation on postgres? you could try logging a few problematic queries (or prn them with some REPLing) and running them in isolation directly to see if they are similarly intermittent

❤️ 2
richiardiandrea21:02:05

I am going to try things now, I'll open an issue on Github if I find something, thanks for your (late) help Jeremy!

🙏 2
refset21:02:10

You're very welcome, and thanks in turn for your patience while we get to the bottom of it!

richiardiandrea22:02:04

Uhm no sure I am doing something odd but with that conf I get

2023-02-08 14:17:53,356 INFO                                                       xtdb.tx - Started tx-ingester
2023-02-08 14:17:53,390 ERROR                                                      xtdb.tx - Ingester error occurred
java.lang.IllegalStateException: missing docs: #{#xtdb/id "eadd50b4c1b9119f192493b6b7acf150afbae1be" #xtdb/id "4e501973abd620350274ed8252ce4b170ee1d882"}
	at xtdb.tx$strict_fetch_docs.invokeStatic(tx.clj:57)
	at xtdb.tx$strict_fetch_docs.invoke(tx.clj:52)
	at xtdb.tx$__GT_tx_ingester$fn__86588$txs_doc_fetch_fn__86612.invoke(tx.clj:626)
	at clojure.lang.AFn.applyToHelper(AFn.java:156)
	at clojure.lang.AFn.applyTo(AFn.java:144)
	at clojure.core$apply.invokeStatic(core.clj:667)
	at clojure.core$apply.invoke(core.clj:662)
	at xtdb.tx$__GT_tx_ingester$apply_if_not_done__86578.invoke(tx.clj:562)
	at xtdb.tx$__GT_tx_ingester$fn__86588$submit_job_BANG___86589$fn__86590.invoke(tx.clj:597)
	at clojure.lang.AFn.call(AFn.java:18)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)
At node startup

refset10:02:52

and does that happen consistently at least if you restart a few times? Always the same two IDs?

refset11:02:44

I have just opened https://github.com/xtdb/xtdb/issues/1897 to keep track of what would ideally help with debugging here (feel free to comment there if it makes sense at all)

richiardiandrea18:02:19

> and does that happen consistently at least if you restart a few times? Always the same two IDs It does seem to be consistent, we also do consistently receive a nil from a couple of "startup" queries though, meaning a pull returns nil even if the tx/doc is present Consistent here means no indexfiles - the second time around when the files get created I only get the nils but not that exception. If I remove the index files again I see the exceptions again.

richiardiandrea18:02:11

The ids that are collected in the exception are numerous, seems like the whole transaction log

richiardiandrea18:02:54

Feel free to send any patch over and I'll try it out

refset18:02:26

Thanks for the explanations. Are you using transaction functions here (not at all / somewhat / exclusively)? Was the tx-log only written to with 1.23.0? Or is this effectively migrating from an earlier version? If so, do you know which version might have written these transactions?

richiardiandrea18:02:08

This happens on 1.23.0 in isolation, transactions coming from this version...LMDB producing indices for the same version

refset18:02:25

Hmm. Have you observed evidence of JDBC-related errors anywhere? I feel like we need to instrument something, /cc @U0GE2S1NH

richiardiandrea20:02:34

No JDBC error noticed in the logging, I too think that it might be a LMDB/Postgresql failure that is hidden and a nil is returned, I'll try to create a repro at some point

👍 2
wotbrew10:02:24

Sorry I am really late to this conversation, I understand you see issues when deleting the index, can I confirm you have a confirmed working doc store? Is there another node where the index has not been deleted, that behaves correctly?

richiardiandrea21:02:29

@U0GE2S1NH yes the docstore is intact and works perfectly, this is fully reproducible by transacting some data (LMDB index enabled) and then deleting the index files on disk (while the app is down) and restarting the app

🙏 2