#xtdb
2021-01-21
R.A. Porter 14:01:03

Having moved to a JVM-embedded node, I think I’ve made a mistake in my design. I’m storing the node I get back from crux.api/start-node in an atom, which works great for shutdown, but I’m also sharing that node for all my queries by taking db snapshots with crux.api/db. Things seem to be working fine with a single thread, and even up to around 20 concurrent queries. But sometime after that, it just seems to fall over, bailing out without completing but not logging any errors. The queries are long-running (about 15 sec for a single invocation) and include a complex predicate. I did not have this problem when I connected over distinct remote connections (using crux.api/new-api-client), and am leaning toward my sharing of the node being the issue. It could, instead, be that I’m now in a single JVM and am resource constrained, but I’d expect to see some error logged to that effect. If it’s the node sharing, what is the right model to apply for an in-process Crux node to be queried by multiple, concurrent threads?

refset 14:01:12

Hmm, nothing sounds strictly wrong to me, although note that crux.api/db isn't really a lower-level snapshot so might be the source of your issues - you may want to try crux.api/open-db - did you look at that already? How much memory have you allocated to the JVM and how much memory is still available outside the JVM? Are you using crux-rocksdb?

R.A. Porter 15:01:03

I’m still in dev mode on Edge, so running with default heap under rebel. Switched to open-db and halved the number of times I was taking a snapshot as I didn’t need to take two in quick succession in a single thread for two sequential queries. No quick fix with that. I’m using RocksDB for document and index stores on my local disk and Kafka for txact store in Confluent. I wish I were seeing an error somewhere. I may try building a Capsule uberjar and testing against that. That’ll take a little bit.

R.A. Porter 15:01:25

Well, I’m stupid. My testing of concurrent calls was simulated using futures. So there were uncaught exceptions that weren’t being logged. Namely, query timeouts. :picard-facepalm:
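
For anyone hitting the same thing, here is a minimal sketch of how an exception can vanish inside a Clojure `future`, and one way to surface it. It uses plain Clojure only; `slow-query` and `deref-result` are hypothetical names, and `slow-query` is a stand-in that always throws, not a real Crux call.

```clojure
(import 'java.util.concurrent.ExecutionException)

(defn slow-query
  "Hypothetical stand-in for a long-running query that hits its timeout."
  [id]
  (Thread/sleep 20)
  (throw (ex-info "Query timed out" {:query-id id})))

;; Fire and forget: the exception is stored inside the future, so nothing
;; is thrown or logged on the calling thread unless the future is dereferenced.
(def unobserved (future (slow-query :ignored)))

(defn deref-result
  "Dereferences a future, turning a failure into data instead of losing it."
  [fut]
  (try
    {:ok @fut}
    (catch ExecutionException e
      ;; deref wraps the original exception; it is available as the cause.
      {:error (ex-message (.getCause e))
       :data  (ex-data (.getCause e))})))

;; Start three futures eagerly, then collect every outcome, failures included.
(def results
  (mapv deref-result
        (mapv #(future (slow-query %)) (range 3))))
```

Each entry in `results` carries the timeout message and its `:query-id`, so a failed query shows up as data rather than silence. Wrapping the body of the future in its own `try`/`catch` that logs would be another reasonable option when the result is never dereferenced.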

malcolmsparks 19:01:55

I've been there - fun with async

refset 20:01:21

Yeah, it's safe to say we've all been there before 🙂 some positive did come out of it though, I raised an issue that might help mitigate such confusion in future: https://github.com/juxt/crux/issues/1392 (thanks for the prompt!)
