#xtdb
2021-01-13
nivekuil 00:01:51

No idea what I did (it could have been flipping crux/s3 to {:mvn/version "20.12-1.13.0-beta"} from dev-SNAPSHOT, but I thought I did that earlier and it didn't make a difference...), but now the log messages from one instance (another container, the same docker image, still shows UnsafeBuffer) show the hash. The problem remains that I'm seeing f4911330946a WARN [crux.s3:326] - S3 key not found: 7d4e3c1cf1d16755f4c79a655194679b9f5a40f0 1-3 times every 100ms, and the offending query is apparently stuck in this loop. Is it expected for crux to fail so inelegantly on a missing document? How can I tell crux to just accept that there was data loss?
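For context, a minimal sketch of what pinning that release might look like in deps.edn; only the version string comes from the message above, and the juxt/crux-core and juxt/crux-s3 coordinates are assumed:

```clojure
;; deps.edn sketch -- coordinates are assumed, version is the one mentioned above
{:deps
 {juxt/crux-core {:mvn/version "20.12-1.13.0-beta"}
  juxt/crux-s3   {:mvn/version "20.12-1.13.0-beta"}}}
```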

refset 21:01:59

Hmm, the default server-wide query timeout is 30s. Are you saying the loop is lasting longer than that?
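A rough illustration of overriding that timeout for a single query, assuming the per-query :timeout option (in milliseconds) accepted by the query map; node is a started Crux node:

```clojure
(require '[crux.api :as crux])

;; node is assumed to be a started Crux node
(crux/q (crux/db node)
        '{:find    [e]
          :where   [[e :crux.db/id]]
          :timeout 60000})   ;; allow 60s instead of the 30s default
```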

nivekuil 22:01:51

longer than 30s for sure

nivekuil 23:01:12

As a visual: the spikes every 30m are expected; the monotonically increasing QPS is not.

refset 15:01:47

Hey again @U797MAJ8M. If you are able to test the new dev-SNAPSHOT version, we think this problem should now be resolved (based on a fix included with https://github.com/juxt/crux/pull/1367)... if the cause is what we suspect it is. Are you running queries inside transaction functions? Or can you identify the related query to help us repro it?

nivekuil 22:01:37

yup, queries inside tx fns. looks promising, will test :)

🙏 3
🤞 3
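For readers following along, a hypothetical sketch of the "query inside a transaction function" pattern being discussed; the entity ids, attributes and node var are invented, not from the thread:

```clojure
(require '[crux.api :as crux])

;; Install the transaction function (node is assumed to be a started Crux node).
(crux/submit-tx node
  [[:crux.tx/put
    {:crux.db/id :example/count-entities
     :crux.db/fn '(fn [ctx eid]
                    (let [db  (crux.api/db ctx)
                          ids (crux.api/q db '{:find  [e]
                                               :where [[e :crux.db/id]]})]
                      ;; the query result decides what the transaction writes
                      [[:crux.tx/put {:crux.db/id eid
                                      :entity-count (count ids)}]]))}]])

;; Invoke it from a later transaction.
(crux/submit-tx node [[:crux.tx/fn :example/count-entities :some-entity]])
```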
nivekuil 02:01:08

No luck, though this could be a different problem. What I'm seeing is:
1. Removed checkpoints, restarted applications; crux nodes start indexing.
2. A few thousand txs are indexed, but soon all nodes start infinitely looping with "S3 key not found: 38dee3cd44ca717b5795b86fd93ed422f9db6fb2" <- the same doc hash on all nodes.
3. Tried adding the doc manually by uploading that missing hash with some "seems legit" data to the S3 bucket. Now all nodes start looping on a new missing doc: "S3 key not found: d989493daf11321f314dab21412a726e5778e6a2".
Now, I know that these docs are really missing, since I induced that data loss. I have neglected to mention that the tx log is also untrustworthy (redpanda, no idempotence, so expect messages to be duped/out-of-order), but the fact that the behavior is affected by adding the missing doc indicts the doc store (personally I would have expected tx log issues to have surfaced a while ago, given the amount of network partitions this setup has seen).
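One hypothetical way to confirm whether a reported hash is really absent from the bucket, using the Cognitect aws-api; the bucket name, and the assumption that the doc store keys objects by content hash (possibly under a configured prefix), are not from the thread:

```clojure
;; assumes com.cognitect.aws/api and com.cognitect.aws/s3 are on the classpath
(require '[cognitect.aws.client.api :as aws])

(def s3 (aws/client {:api :s3}))

(defn key-exists? [bucket k]
  (let [resp (aws/invoke s3 {:op :HeadObject
                             :request {:Bucket bucket :Key k}})]
    ;; aws-api returns an anomaly map (e.g. not-found) when the object is absent
    (not (:cognitect.anomalies/category resp))))

(key-exists? "my-doc-store-bucket" "38dee3cd44ca717b5795b86fd93ed422f9db6fb2")
```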

refset 11:01:28

Damn, thanks for trying that. I wonder if there are some failures/errors being returned by the S3 putObject requests that we're not capturing properly. I'll take a closer look again on Monday.
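Illustrative only, not the crux-s3 code: one way an async putObject call against the AWS SDK v2 client can have its failures logged rather than silently dropped:

```clojure
(import '(software.amazon.awssdk.services.s3 S3AsyncClient)
        '(software.amazon.awssdk.services.s3.model PutObjectRequest)
        '(software.amazon.awssdk.core.async AsyncRequestBody))

(def client (S3AsyncClient/create))

(defn put-doc! [bucket k ^bytes payload]
  (-> (.putObject client
                  (-> (PutObjectRequest/builder)
                      (.bucket bucket)
                      (.key k)
                      (.build))
                  (AsyncRequestBody/fromBytes payload))
      ;; attach a completion handler so a failed future is at least visible in the logs
      (.whenComplete
       (reify java.util.function.BiConsumer
         (accept [_ _resp ex]
           (when ex
             (println "putObject failed for" k ":" (.getMessage ^Throwable ex))))))))
```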

nivekuil 15:01:06

I got it working again with this patch: https://github.com/nivekuil/crux/commit/02744d5d3ad382adf21a28bb2accd31de33e4d1d (not making a PR, as I think you probably want a more robust look at how to deal with missing docs). In summary, the eventual consistency handling seems to be overly optimistic about the possibility of data loss.
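Not the linked patch, just a sketch of the general shape being argued for: a multi-get that returns whatever documents it can find and logs the rest as missing, instead of retrying the lookup indefinitely. fetch-one is a hypothetical helper, not a Crux API:

```clojure
(defn fetch-docs-tolerantly
  "fetch-one is assumed to return the doc for a content hash, or nil if the
  object is gone from the store. Returns a map of hash -> doc for the hashes
  that were actually found."
  [fetch-one content-hashes]
  (reduce (fn [acc h]
            (if-let [doc (fetch-one h)]
              (assoc acc h doc)
              (do (println "WARN giving up on missing doc:" h)
                  acc)))
          {}
          content-hashes))
```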