#xtdb
2023-03-13
jmerrifield19:03:32

I just experienced some NodeOutOfSyncExceptions coming from tx-committed? - my code to submit a TX is basically this:

(let [submitted-tx (xt/submit-tx node ops)]
  (xt/await-tx node submitted-tx)
  (if (xt/tx-committed? node submitted-tx)
    nil
    (throw (Error. "TX not committed"))))
The message is:
Node out of sync - requested '{:xtdb.api/tx-id 3970, :xtdb.api/tx-time #inst "2023-03-13T19:15:45.803-00:00"}', available '{:xtdb.api/tx-time #inst "2023-03-13T19:15:45.806-00:00", :xtdb.api/tx-id 3969}'
Am I doing something wrong when submitting the transaction?

jmerrifield19:03:27

If relevant, I'm using a JDBC tx-log and doc-store sharing a connection pool. This is a reasonably low-traffic app that in this scenario got a flurry of parallel requests (frontend making ~12 web requests at once to submit transactions)

jmerrifield19:03:20

Also relevant: the transactions did actually all get committed

refset20:03:57

Is XTDB running embedded in your app here?

jmerrifield21:03:51

Yes, it's a pedestal/lacinia app, I just xt/start-node at startup and attach it to the pedestal request map so that handlers/resolvers can interact with the DB

šŸ‘ 2
refset21:03:57

I can't see anything obviously wrong with your code - maybe there's something happening in the connection pool that is causing the API to return slightly stale results :thinking_face:

refset21:03:49

If your app can tolerate it, I suspect a short Thread/sleep could largely mitigate the effect as a workaround for now

refset21:03:38

Similarly, you could try reducing the parallelism as another mitigation
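For reference, a minimal sketch of one way to cap that parallelism, assuming the app is happy to funnel all transaction submissions through a single shared lock (the function and lock names here are illustrative, not part of XTDB's API):

;; A shared lock object; all transaction submissions run while holding it,
;; which serialises them completely. Coarse, but simple for a low-traffic app.
(defonce tx-lock (Object.))

(defn submit-serialised!
  "Runs f (a no-arg fn that does the submit-tx/await-tx/tx-committed? dance)
  while holding tx-lock, so at most one submission is in flight at a time."
  [f]
  (locking tx-lock
    (f)))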

refset21:03:20

Like, if there were a way to force await-tx to use the exact same underlying JDBC connection as tx-committed?, then this behaviour definitely shouldn't be possible.

refset21:03:13

which version are you using? can you share your connection-pool config?

jmerrifield22:03:32

On version 1.22.1; my whole config map is below - I haven't done any connection-pool tweaking yet. Perhaps it's worth catching that particular error and retrying tx-committed? after a short sleep.

{:xtdb.jdbc/connection-pool
 {:dialect {:xtdb/module 'xtdb.jdbc.psql/->dialect}
  :pool-opts {}
  :db-spec {:host (System/getenv "XTDB_PG_HOST")
            :dbname (System/getenv "XTDB_PG_DBNAME")
            :user (System/getenv "XTDB_PG_USER")
            :password (System/getenv "XTDB_PG_PASSWORD")}}

 :xtdb/tx-log
 {:xtdb/module 'xtdb.jdbc/->tx-log
  :connection-pool :xtdb.jdbc/connection-pool}

 :xtdb/document-store
 {:xtdb/module 'xtdb.jdbc/->document-store
  :connection-pool :xtdb.jdbc/connection-pool}

 :xtdb/index-store
 {:kv-store {:xtdb/module 'xtdb.rocksdb/->kv-store
             :db-dir (io/file "/tmp/rocksdb-ix")
             :checkpointer {:store {:xtdb/module 'xtdb.s3.checkpoint/->cp-store
                                    :bucket (System/getenv "XTDB_CHECKPOINT_BUCKET")}
                            :approx-frequency (Duration/ofHours 1)}}}}

šŸ‘ 2
refset22:03:08

Good idea, catch and retry should be enough to mitigate
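For reference, a minimal sketch of that catch-and-retry approach, wrapping the original submit/await/check logic (the retry budget, sleep duration, and function name are illustrative assumptions):

(require '[xtdb.api :as xt])
(import '(xtdb.api NodeOutOfSyncException))

(defn submit-and-confirm!
  "Submits ops, awaits the tx, then checks tx-committed?, retrying the check a
  few times if the node reports itself out of sync."
  [node ops]
  (let [submitted-tx (xt/submit-tx node ops)]
    (xt/await-tx node submitted-tx)
    (loop [attempts 3]                                    ; illustrative retry budget
      (let [result (try
                     (if (xt/tx-committed? node submitted-tx) ::committed ::aborted)
                     (catch NodeOutOfSyncException _ ::out-of-sync))]
        (cond
          (= ::committed result) nil
          (= ::aborted result)   (throw (ex-info "TX not committed" {:tx submitted-tx}))
          (pos? attempts)        (do (Thread/sleep 100)   ; illustrative back-off
                                     (recur (dec attempts)))
          :else                  (throw (ex-info "Node still out of sync" {:tx submitted-tx})))))))

The 100ms sleep and three retries are arbitrary; whatever the web handler can tolerate works here.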

jmerrifield23:03:59

OK I can add that logic! You think I should be concerned at all that this is cropping up under reasonably light amounts of parallelism? It's a new app that will hopefully get some regular traffic - although I doubt we'll need to scale beyond a single EC2 instance any time soon

sparkofreason10:03:09

Just my own learning: isn't :xtdb.api/tx-time for tx-id 3970 less than that for tx-id 3969? Is that expected?

refset12:03:11

> You think I should be concerned at all that this is cropping up under reasonably light amounts of parallelism?
Only in the sense that XT should be accounting for this properly, but that's on us. I've not seen anyone report anything similar to this before. If later requests show the transactions committed properly then I wouldn't be concerned about broader consistency troubles.
> Just my own learning: isn't :xtdb.api/tx-time for tx-id 3970 less than that for tx-id 3969? Is that expected?
Good spot! I don't think that is meant to be possible... tx-time should only ever be monotonically increasing alongside tx-id. This arguably retracts my "wouldn't be concerned" comment. Can you see these entries in open-tx-log reporting the exact same values?
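One way to check that, as a sketch (assuming node is in scope and the relevant stretch of the log is small enough to scan; open-tx-log returns a closeable cursor, so it needs to be consumed inside with-open):

(require '[xtdb.api :as xt])

;; Scan the tx-log and print any adjacent pair of entries whose tx-time goes
;; backwards even though tx-id increases.
(with-open [log (xt/open-tx-log node nil false)]
  (doseq [[a b] (partition 2 1 (iterator-seq log))]
    (when (pos? (compare (::xt/tx-time a) (::xt/tx-time b)))
      (println "non-monotonic tx-time:"
               (select-keys a [::xt/tx-id ::xt/tx-time])
               "->"
               (select-keys b [::xt/tx-id ::xt/tx-time])))))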

refset12:03:02

Which Postgres setup is actually running behind the scenes? e.g. is this RDS or Aurora?

jmerrifield18:03:53

Yes, same situation reflected with open-tx-log:

:tx-id 3969
:tx-time #inst "2023-03-13T19:15:45.806-00:00"

:tx-id 3970
:tx-time #inst "2023-03-13T19:15:45.803-00:00"
This is an RDS Aurora cluster, configured as per the screenshot - I suppose this situation is explained by the replication lag? To be honest I didn't give it much thought; I just picked default options to get some managed Postgres up and running quickly!

refset11:03:22

Hey again, thanks for sharing that. I just opened up https://github.com/xtdb/xtdb/issues/1917 to keep track of this non-monotonic tx-time weirdness. We'll take a look at a resolution soon. Hopefully it's not a blocker for your use case in the meantime

refset11:03:47

Did your sleep logic resolve the initial problem? I raised an issue for that also https://github.com/xtdb/xtdb/issues/1918

jmerrifield22:03:47

Great, I will keep an eye on both issues! It doesn't seem that my sleep logic has been triggered yet so I haven't been able to see what happens. Is an Aurora cluster generally a good fit for XTDB or would it make more sense to run a single-node Postgres configuration? My traffic levels could easily be handled by a tiny single node DB for the foreseeable future.

refset22:03:06

Aurora should be fine; we've just not tested against it. When AWS claims "Postgres compatible" I'm generally inclined to believe them, but it's certainly quite a different internal architecture and I don't have a good grasp of what's happening inside their purpose-built driver (should you choose to use that instead of the regular driver): https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Aurora.Connecting.html#Aurora.Connecting.AuroraPostgreSQL.JDBCDriverPostgreSQL

refset22:03:54

I'd be inclined to say stick with Aurora at this point, assuming nothing else crops up. I'm unconvinced that either issue is Aurora-specific, although there may be something about the added latencies it introduces that is surfacing these things :thinking_face:

jmerrifield23:03:00

I suppose I was thinking less about Aurora quirks and more about generally running XTDB on top of a reader-writer pair with some unavoidable replication lag? The docs claim 'usually much less than 100ms' of lag, but if my app is frequently trying to read back data immediately after writing it, I'd expect to see stale results somewhat regularly

jmerrifield23:03:30

Presumably I could connect directly to the writer instance rather than the cluster endpoint, or just scale down to a single writer with no readers, to definitely avoid any sort of lag
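If the writer-endpoint route were taken, the only change would be the host in the db-spec - a hypothetical sketch, where XTDB_PG_WRITER_HOST is an assumed env-var name and the rest mirrors the config shared earlier:

;; Hypothetical: point the connection pool at the Aurora writer endpoint (or a
;; single-instance endpoint) instead of the reader-balanced cluster endpoint.
{:xtdb.jdbc/connection-pool
 {:dialect {:xtdb/module 'xtdb.jdbc.psql/->dialect}
  :pool-opts {}
  :db-spec {:host (System/getenv "XTDB_PG_WRITER_HOST") ; assumed env-var name
            :dbname (System/getenv "XTDB_PG_DBNAME")
            :user (System/getenv "XTDB_PG_USER")
            :password (System/getenv "XTDB_PG_PASSWORD")}}}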

refset00:03:43

> 'usually much less than 100ms' of lag
Interesting, I didn't see that in their docs before. 100ms would be plenty to explain the out-of-sync šŸ˜… My intuition is that you shouldn't need more than one pg instance unless you have some fairly extreme requirements of the doc-store and don't want to keep the document-cache on the nodes very slim - so yeah, drop the reader-writer pair setup to reduce confusion šŸ‘

refset00:03:04

by the time you need more scale than the single instance you may want to look at other options anyway (for cost and/or performance)

jmerrifield21:03:46

Interestingly, it looks like I've experienced another few occurrences of the NodeOutOfSyncException even after downsizing the Aurora cluster to a single node. As before, it happened during a handful of parallel requests and shows tx-time going out of order. I really thought this would be impossible after reconfiguring the cluster :thinking_face:

refset09:03:11

Ah šŸ˜• did your sleep logic not fire at all?

jmerrifield18:03:36

Yeah, the sleep logic was fine - the web handler was able to run to completion and the data landed in the DB successfully. I suppose I have a few more instances of out-of-sequence tx-time values in the TX log now. Just figured it was worth mentioning since I wasn't expecting to see it at all now that I have a single-node cluster!

refset21:03:40

Thanks for clarifying that. The out-of-sequence tx-time is looking like a legitimate bug (still as per https://github.com/xtdb/xtdb/issues/1917) - although we've not updated the issue again so far, we are still actively investigating and should have a resolution soon. I will keep you posted, but feel free to leave comments or questions on that issue as needed.

Thomas Moerman07:09:55

I've observed the monotonicity problem too. Snippet:

(let [tx (->> (xt/submit-tx node txs opts)
              (xt/await-tx node))]
  (if (xt/tx-committed? node tx)
    ...
Exception:
xtdb.api.NodeOutOfSyncException: Node out of sync - requested '{:xtdb.api/tx-id 88708, :xtdb.api/tx-time #inst "2023-09-20T20:27:34.074-00:00"}', available '{:xtdb.api/tx-time #inst "2023-09-20T20:27:34.075-00:00", :xtdb.api/tx-id 88707}'
    data: {:xtdb.error/error-type :node-out-of-sync,
           :xtdb.error/message
           "Node out of sync - requested '{:xtdb.api/tx-id 88708, :xtdb.api/tx-time #inst \"2023-09-20T20:27:34.074-00:00\"}', available '{:xtdb.api/tx-time #inst \"2023-09-20T20:27:34.075-00:00\", :xtdb.api/tx-id 88707}'",
           :requested
           {:xtdb.api/tx-id 88708,
            :xtdb.api/tx-time #inst "2023-09-20T20:27:34.074-00:00"},
           :available
           {:xtdb.api/tx-time #inst "2023-09-20T20:27:34.075-00:00",
            :xtdb.api/tx-id 88707}}
Setup:
• local on macOS, wired up via docker-compose
• postgres:14.1-alpine

{:xtdb.jdbc/connection-pool {:dialect   {:xtdb/module 'xtdb.jdbc.psql/->dialect}
                             :db-spec   jdbc
                             :pool-opts {}}

 :xtdb/tx-log               {:xtdb/module     'xtdb.jdbc/->tx-log
                             :connection-pool :xtdb.jdbc/connection-pool}
 :xtdb/document-store       {:xtdb/module     'xtdb.jdbc/->document-store
                             :connection-pool :xtdb.jdbc/connection-pool}
 :xtdb/index-store          {:kv-store {:xtdb/module 'xtdb.rocksdb/->kv-store
                                        :db-dir      (io/file db-dir)}}}

refset08:09:53

hey @U052A8RUT - to confirm, this was observed purely using a local Postgres running within Docker? :thinking_face:

refset08:09:02

can you repro it reliably?

Thomas Moerman09:09:23

Yes, correct - a Postgres running locally within Docker. I haven't been able to reproduce it; it occurs sporadically. Next time I see it I'll try to capture the conditions.

šŸ‘ 1
refset10:09:13

Thanks, and which version are you running? Are you seeing out-of-order timestamp entries in open-tx-log calls?

refset19:12:14

Hey folks, just following up to confirm these issues should now be resolved, per https://clojurians.slack.com/archives/CG3AM2F7V/p1701801414762909