#xtdb
2021-05-27
Jacob O'Bryant 20:05:27

I've got some code that calls submit-tx, await-tx, and then tx-committed?. While importing data into a new app, I just got a NodeOutOfSyncException. Is this a Crux bug? code: https://github.com/jacobobryant/biff/blob/master/libs/crux/src/biff/crux.clj#L389-L392 error:

May 27 19:57:14 findka-api run.sh[2007]: crux.api.NodeOutOfSyncException: Node out of sync - requested '{:crux.tx/tx-id 92, :crux.tx/tx-time #inst "2021-05-27T19:57:14.679-00:00"}', available '{:crux.tx/tx-time #inst "2021-05-27T19:57:13.513-00:00", :crux.tx/tx-id 91}'
May 27 19:57:14 findka-api run.sh[2007]: {:crux.error/error-type :node-out-of-sync, :crux.error/message "Node out of sync - requested '{:crux.tx/tx-id 92, :crux.tx/tx-time #inst \"2021-05-27T19:57:14.679-00:00\"}', available '{:crux.tx/tx-time #inst \"2021-05-27T19:57:13.513-00:00\", :crux.tx/tx-id 91}'", :requested {:crux.tx/tx-id 92, :crux.tx/tx-time #inst "2021-05-27T19:57:14.679-00:00"}, :available {:crux.tx/tx-time #inst "2021-05-27T19:57:13.513-00:00", :crux.tx/tx-id 91}}
May 27 19:57:14 findka-api run.sh[2007]:  at crux.error$node_out_of_sync.invokeStatic (error.clj:19)
May 27 19:57:14 findka-api run.sh[2007]:     crux.error$node_out_of_sync.invoke (error.clj:18)
May 27 19:57:14 findka-api run.sh[2007]:     crux.node.CruxNode.tx_committed_QMARK_ (node.clj:135)
May 27 19:57:14 findka-api run.sh[2007]:     biff.crux$submit_tx.invokeStatic (crux.clj:392)
May 27 19:57:14 findka-api run.sh[2007]:     biff.crux$submit_tx.invoke (crux.clj:376)
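For context, the submit/await/check pattern under discussion looks roughly like this. This is an illustrative sketch against the crux.api 1.x API, not the actual biff.crux code; the helper name and error message are made up:

```clojure
;; Sketch of the submit-tx -> await-tx -> tx-committed? pattern.
;; `node` is an open Crux node.
(require '[crux.api :as crux])

(defn submit-and-await!
  "Submit tx-ops, block until the node has indexed the transaction,
  then confirm it actually committed (e.g. no failed match ops)."
  [node tx-ops]
  (let [submitted (crux/submit-tx node tx-ops)]
    ;; blocks until the node has indexed a tx at or past `submitted`
    ;; (no timeout supplied here, matching the thread above)
    (crux/await-tx node submitted)
    ;; throws crux.api.NodeOutOfSyncException if the node's indexes
    ;; haven't caught up to the requested tx-id -- the surprising
    ;; behaviour reported in this thread
    (when-not (crux/tx-committed? node submitted)
      (throw (ex-info "Transaction aborted" {:tx submitted})))
    submitted))
```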

refset 20:05:19

> Is this a crux bug?
Hopefully not! It does seem weird though. Here are some questions:
1. Is your await-tx synchronously blocking and returning before tx-committed? is called?
2. Is there a timeout on the await-tx?
3. Is it the same result of submit-tx definitely being passed to both calls?

Jacob O'Bryant 20:05:20

1. yes 2. no 3. yes

Jacob O'Bryant 20:05:29

it works the vast majority of the time; I only got that exception once. perhaps there's a subtle race condition?

jarohen 20:05:38

@U7YNGKDHA is that a local node?

Jacob O'Bryant 20:05:05

(I'm guessing that explains it?)

jarohen 20:05:57

alas, the other way around I'm afraid - we're aware of https://github.com/juxt/crux/issues/527 for remote nodes

jarohen 20:05:29

what's the return value of the `submit-tx`/`await-tx` calls?

jarohen 20:05:48

oh, submit-tx I'm assuming returns {:crux.tx/tx-id 92, :crux.tx/tx-time #inst "2021-05-27T19:57:14.679-00:00"}

👍 3
Jacob O'Bryant 20:05:02

oh I see. I thought you meant standalone (it is a local standalone node).

👍 3
jarohen 20:05:44

also, does it remain out of sync? i.e. if you were to call await-tx/tx-committed again, would the Crux node have caught up?

jarohen 20:05:08

(looking for race condition vs stuck there)

Jacob O'Bryant 20:05:39

I've since submitted more transactions and they worked fine

Jacob O'Bryant 20:05:38

idk for sure if it would have caught up had I just called await-tx/tx-committed? without submitting another tx, but I could test that if the error comes up again

Jacob O'Bryant 20:05:12

so my guess is race condition

jarohen 21:05:08

mm, me too... :thinking_face:

Jacob O'Bryant 21:05:00

actually, would this be triggered if you submitted a tx that didn't actually change anything? i.e. put a bunch of documents that had already been put.

Jacob O'Bryant 21:05:53

When the exception was thrown, I was submitting documents from a query that was accidentally returning duplicates, and there's a decent chance the whole tx was just documents that I already imported

jarohen 21:05:10

should be alright - at the very least we put the latest-completed-tx metadata

👍 3
jarohen 21:05:45

so there are a few events happening there: kv/store delegates through to the KV store's write-batch implementation, and only once that's returned do we notify the awaiting threads
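The ordering being described could be sketched as follows. This is pseudocode of the sequence as explained in this thread, not the actual Crux source:

```
indexing thread                        awaiting thread (await-tx)
---------------                        --------------------------
1. index tx 92's documents             (blocked, waiting for tx 92)
2. kv/store latest-completed-tx
   -> KV store write-batch
3. write-batch returns
4. notify awaiting threads ----------> wakes up, await-tx returns
                                       tx-committed? then reads the
                                       indexed state; if it somehow
                                       still sees tx 91, it throws
                                       NodeOutOfSyncException
```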

jarohen 21:05:41

I'm guessing we can't get hold of the result from await-tx, given it's a very intermittent issue?

jarohen 21:05:16

we're dealing with tx-ids rather than tx-times, and in any event the two txs (91 and 92) happened at distinct times - if they were the same time and you were calling await-tx-time, that'd be a potential cause

jarohen 21:05:52

hm. :thinking_face:

Jacob O'Bryant 21:05:01

interesting. I could try to reproduce the issue and print the result of await-tx
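An instrumented variant could capture the await-tx result when the error recurs. A minimal sketch, assuming the crux.api 1.x API; the function name and log output are illustrative, not Biff's actual code:

```clojure
;; Log what was submitted vs what await-tx returned whenever
;; tx-committed? throws NodeOutOfSyncException.
(require '[crux.api :as crux])

(defn submit-tx-instrumented [node tx-ops]
  (let [submitted (crux/submit-tx node tx-ops)
        ;; await-tx returns the most recent tx the node has indexed,
        ;; which should be at or past `submitted`
        awaited   (crux/await-tx node submitted)]
    (try
      (crux/tx-committed? node submitted)
      (catch crux.api.NodeOutOfSyncException e
        (println "submitted:        " submitted)
        (println "await-tx returned:" awaited)
        (throw e)))))
```

Comparing the two maps would show whether await-tx genuinely returned a tx at or past the submitted one before tx-committed? complained.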

jarohen 21:05:13

that'd be great, thanks 🙂 in the meantime, will raise an issue

Jacob O'Bryant 21:05:01

:thumbsup: I'll go ahead and do that right now, so maybe I'll have something in a few minutes

jarohen 09:05:03

thanks @U7YNGKDHA, good spot 🙏 I have a fix for this running through CI at the moment, and am releasing a dev-SNAPSHOT. Assuming you're on 1.16, the dev-SNAPSHOT would involve an index version bump because of other changes going into 1.17 shortly