2023-02-07
Channel: # xtdb
Hi all! We are playing with XTDB indexing and trying to delete the LMDB files (from a temporary dir) without removing the database, then we shut down the app and start it again to see what happens... we basically see many errors now - is there an additional step to perform in order to "reindex", if that makes sense?
as far as I am reading, the indexer should replay and "catch up" - however there might be no way for us to know when re-indexing is done?
Hey! Re-indexing can be considered 'done' once the lag between `latest-submitted-tx` and `latest-completed-tx` is ~0
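For reference, a minimal sketch of that check against the public `xtdb.api`, assuming a started node bound to `node`; the five-minute timeout is an arbitrary choice:
(require '[xtdb.api :as xt])

;; re-indexing is effectively done when the latest submitted and the
;; latest completed transactions line up
(defn caught-up? [node]
  (= (::xt/tx-id (xt/latest-submitted-tx node))
     (::xt/tx-id (xt/latest-completed-tx node))))

;; or simply block until the node has caught up, with an optional timeout
(xt/sync node (java.time.Duration/ofMinutes 5))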
it's an application error (could not find an entity)
maybe we need to run a check like that at startup, just to make sure
are you using LMDB for the doc store also? is it possible the configs are confused and you're inadvertently deleting the doc store also?
yeah we actually deleted both just for fun
well actually, we use postgres as document store
so this is very weird... deleting the checkpointing file should not result in a missing entity?
this is the conf
:xtdb/index-store    {:kv-store (configure-index opts)}
:xtdb/tx-log         {:xtdb/module 'xtdb.jdbc/->tx-log
                      :connection-pool ::connection-pool}
:xtdb/document-store {:xtdb/module 'xtdb.jdbc/->document-store
                      :connection-pool ::connection-pool}
where `configure-index` returns the `:lmdb` module in prod..
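The actual `configure-index` was not shared; a hypothetical sketch of such a function, where the `prod?`/`index-dir` options and the in-memory dev fallback are assumptions:
(require '[clojure.java.io :as io])

;; hypothetical sketch - returns the LMDB kv-store module in prod and an
;; in-memory kv-store elsewhere
(defn configure-index [{:keys [prod? index-dir]}]
  (if prod?
    {:xtdb/module 'xtdb.lmdb/->kv-store
     :db-dir (io/file index-dir)}
    {:xtdb/module 'xtdb.mem-kv/->kv-store}))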
I think I am missing something at this point... possibly silly question - are you deleting the files and then shutting down the node, or shutting down the node and then deleting the files? If the former, I wonder whether we're writing something on node close which is then confusing the node on startup
This is what we did
stop app
rm -r /opt/run/appserver/xtdb/* (so indexes and checkpoints)
start app
/opt/run/appserver/xtdb/index exists with a data and lock file
from my DevOps team
> deleting the checkpointing file should not result in a missing entity?
yep, that's correct - so long as you preserve the tx-log and doc-store, everything else is replaceable
> it's an application error (could not find an entity)
to be sure, you mean something along the lines of: (1) a query is returning no results, or (2) `entity` is returning `nil`?
Can you confirm that the `latest-completed-tx` is approximately the same before and after seeing this?
@U899JBRPF I am on 1.23.0
and the specific query I see returning `nil` instead of the actual value is basically this
(-> (xt/db xtdb-node valid-time)
    (xt/q '{:find [(pull ?e [*])]
            :in [?patient-record-id]
            :where [[?e :cohesic/type :patient-history]
                    [?e :patient-history/patient-record-id ?patient-record-id]]}
          patient-record-id))
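As an aside, one way to separate case (1) from case (2) above is to look the document up directly at the same basis; `patient-history-id` here is a hypothetical known `:xt/id`:
;; sketch: fetch the same document via entity at the same valid time, to
;; distinguish an index miss from a doc-store fetch miss
(xt/entity (xt/db xtdb-node valid-time) patient-history-id)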
It is very intermittent and happened in prod so we will have time to check `latest-completed-tx`
Thanks, that's useful to see. Which JDBC doc store are you using? I wonder if there's something problematic happening with the ad-hoc doc retrieval during `pull` (assuming there's a cache miss)
We are using a PostgreSQL doc and tx store using the same connection pool as the rest of the app, which is something like
;; assumes requires along the lines of [xtdb.jdbc :as xt.jdbc] and a logging
;; namespace aliased as logging, plus (:import java.io.Closeable)
(defn ->duct-database-sql-connection-pool
  {:doc
   "A function for converting from a Duct database sql component to an XTDB
   connection pool.
   For module documentation see:
   "
   :xtdb.system/deps {:dialect nil}
   :xtdb.system/args {:db {:doc "Instance of a running jdbc connection" :required? true}}}
  [{:keys [dialect db]}]
  (let [datasource (get-in db [:spec :datasource])]
    (try
      ;; create the XTDB tables up front; close the pool if that fails
      (xt.jdbc/setup-schema! dialect datasource)
      (catch Throwable t
        (logging/warn t "Error while setting up the schema" {})
        (.close ^Closeable datasource)
        (throw t)))
    (xt.jdbc/->HikariConnectionPool datasource dialect)))
The thing is I cannot repro locally because I have a very small tx log
unless there is a way to "stop" the indexing and induce a cache miss
you could probably manually override the document-store cache and configure its size to 0 to avoid all cache hits
ok let me try to do that
is https://github.com/xtdb/xtdb/blob/e2f51ed99fc2716faa8ad254c0b18166c937b134/core/src/xtdb/document_store.clj#L63 what you were thinking about? Cause we do not use that cached version, we use the vanilla `xtdb.jdbc/->document-store`
ah, so the jdbc document-store wraps the cached-document-store https://github.com/xtdb/xtdb/blob/32361297e9a15fc569ac694c8bd7951b4f52aa0a/modules/jdbc/src/xtdb/jdbc.clj#L141-L147
oh I see 😄
I am seeing some errors indeed with
:xtdb/document-store (xt.doc-store/->cached-document-store
                      {:document-cache (xt.cache/->cache {:cache-size 0})
                       :document-store {:xtdb/module 'xtdb.jdbc/->document-store
                                        :connection-pool ::connection-pool}})
I think we might be onto something
do you have any active instrumentation on postgres? you could try logging a few problematic queries (or `prn` them with some REPLing) and running them in isolation directly to see if they are similarly intermittent
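One way to run such a check in isolation is to query Postgres directly, sketched here with next.jdbc; the `tx_events` table and its `topic`/`event_key` columns are assumed from the default schema that `setup-schema!` creates, so verify the names against the actual database:
(require '[next.jdbc :as jdbc])

;; sketch: look a reportedly-missing document up directly in Postgres,
;; bypassing XTDB entirely; table/column names are assumptions
(defn doc-row [datasource content-hash]
  (jdbc/execute! datasource
                 ["SELECT event_key, topic FROM tx_events WHERE topic = 'docs' AND event_key = ?"
                  content-hash]))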
I am going to try things now, I'll open an issue on Github if I find something, thanks for your (late) help Jeremy!
You're very welcome, and thanks in turn for your patience while we get to the bottom of it!
Uhm, not sure if I am doing something odd, but with that conf I get
2023-02-08 14:17:53,356 INFO xtdb.tx - Started tx-ingester
2023-02-08 14:17:53,390 ERROR xtdb.tx - Ingester error occurred
java.lang.IllegalStateException: missing docs: #{#xtdb/id "eadd50b4c1b9119f192493b6b7acf150afbae1be" #xtdb/id "4e501973abd620350274ed8252ce4b170ee1d882"}
at xtdb.tx$strict_fetch_docs.invokeStatic(tx.clj:57)
at xtdb.tx$strict_fetch_docs.invoke(tx.clj:52)
at xtdb.tx$__GT_tx_ingester$fn__86588$txs_doc_fetch_fn__86612.invoke(tx.clj:626)
at clojure.lang.AFn.applyToHelper(AFn.java:156)
at clojure.lang.AFn.applyTo(AFn.java:144)
at clojure.core$apply.invokeStatic(core.clj:667)
at clojure.core$apply.invoke(core.clj:662)
at xtdb.tx$__GT_tx_ingester$apply_if_not_done__86578.invoke(tx.clj:562)
at xtdb.tx$__GT_tx_ingester$fn__86588$submit_job_BANG___86589$fn__86590.invoke(tx.clj:597)
at clojure.lang.AFn.call(AFn.java:18)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)
At node startup
and does that happen consistently at least if you restart a few times? Always the same two IDs?
I have just opened https://github.com/xtdb/xtdb/issues/1897 to keep track of what would ideally help with debugging here (feel free to comment there if it makes sense at all)
> and does that happen consistently at least if you restart a few times? Always the same two IDs
It does seem to be consistent, we also do consistently receive a `nil` from a couple of "startup" queries though, meaning a `pull` returns `nil` even if the tx/doc is present
Consistent here means no index files - the second time around when the files get created I only get the `nil`s but not that exception. If I remove the index files again I see the exceptions again.
The ids that are collected in the exception are numerous; it seems like the whole transaction log
Feel free to send any patch over and I'll try it out
Thanks for the explanations. Are you using transaction functions here (not at all / somewhat / exclusively)? Was the tx-log only written to with `1.23.0`? Or is this effectively migrating from an earlier version? If so, do you know which version might have written these transactions?
This happens on `1.23.0` in isolation, transactions coming from this version... LMDB producing indices for the same version
Hmm. Have you observed evidence of JDBC-related errors anywhere? I feel like we need to instrument something, /cc @U0GE2S1NH
No JDBC error noticed in the logging, I too think that it might be an LMDB/PostgreSQL failure that is hidden and a `nil` is returned, I'll try to create a repro at some point
Sorry I am really late to this conversation, I understand you see issues when deleting the index, can I confirm you have a confirmed working doc store? Is there another node where the index has not been deleted, that behaves correctly?
@U0GE2S1NH yes the docstore is intact and works perfectly, this is fully reproducible by transacting some data (LMDB index enabled) and then deleting the index files on disk (while the app is down) and restarting the app
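A minimal sketch of that repro against the public API, following the steps described above; the Postgres `:db-spec` is a placeholder for the real deployment's credentials:
(require '[xtdb.api :as xt]
         '[clojure.java.io :as io])

(defn start-node []
  (xt/start-node
   {:xtdb/index-store {:kv-store {:xtdb/module 'xtdb.lmdb/->kv-store
                                  :db-dir (io/file "/opt/run/appserver/xtdb/index")}}
    :xtdb.jdbc/connection-pool {:dialect {:xtdb/module 'xtdb.jdbc.psql/->dialect}
                                :db-spec {:dbname "xtdb"}} ; placeholder credentials
    :xtdb/tx-log {:xtdb/module 'xtdb.jdbc/->tx-log
                  :connection-pool :xtdb.jdbc/connection-pool}
    :xtdb/document-store {:xtdb/module 'xtdb.jdbc/->document-store
                          :connection-pool :xtdb.jdbc/connection-pool}}))

;; 1. transact some data with the LMDB index enabled, then shut down
(with-open [node (start-node)]
  (xt/submit-tx node [[::xt/put {:xt/id :repro-doc}]])
  (xt/sync node))

;; 2. delete the index files while the app is down:
;;    rm -r /opt/run/appserver/xtdb/*

;; 3. restart and watch for the "missing docs" ingester error / nil results
(with-open [node (start-node)]
  (xt/sync node)
  (xt/entity (xt/db node) :repro-doc))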