#xtdb
2020-12-04
gklijs08:12:49

Anyone used Crux with event sourcing? Considering giving it a try.

ordnungswidrig10:12:27

I'd be interested, too.

➕ 3
gklijs12:12:50

Not sure that will be the thing I use in the tutorial at BOB Konferenz. But definitely event sourcing with Clojure. https://bobkonf.de/2021/en/

refset23:12:06

I've envisioned using Crux to store event logs (as document histories) and then using (mini) in-memory Crux nodes or DataScript to act as domain-level materialized views that get built & cached on-demand...but never yet tried to build anything 🙂
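
To make the "materialized view built on demand from an event log" idea concrete, here is a minimal sketch in plain Clojure. It does not use Crux or DataScript at all: the event shapes, the `apply-event` and `materialize` names, and the account domain are hypothetical, and the "view" is just a map folded from the events.

```clojure
;; Hypothetical event log; in the idea above these would come from Crux
;; (document histories or discrete entities), here they are plain maps.
(def events
  [{:event/type :account-opened :account/id :acc-1}
   {:event/type :deposited      :account/id :acc-1 :amount 100}
   {:event/type :withdrawn      :account/id :acc-1 :amount 30}])

(defn apply-event
  "Folds one event into the view, a map of account id -> balance.
  Unknown event types leave the view unchanged."
  [view event]
  (let [id (:account/id event)]
    (case (:event/type event)
      :account-opened (assoc view id 0)
      :deposited      (update view id + (:amount event))
      :withdrawn      (update view id - (:amount event))
      view)))

(defn materialize
  "Builds the view from scratch by replaying every event in order."
  [events]
  (reduce apply-event {} events))

(materialize events)
```

Because the view is a pure function of the log, it can be thrown away and rebuilt (or cached) whenever needed, which is the property the in-memory node or DataScript approach would rely on.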

nivekuil23:12:41

cool! why prefer history over discrete entities?

refset23:12:25

I've not benchmarked it, but I suspect it's slightly faster to do things that way round. Although if you ever want to be able to annotate or otherwise link together specific events then using discrete entities is a very wise strategy, and probably the better default choice!

Jorin09:12:13

What would be the advantage of storing events in Crux vs plain Kafka? I totally get the use case of lightweight Crux nodes as a downstream consumer of the event log though 🙂

nivekuil23:12:52

I notice that at some point a malformed doc seems to have gotten transacted and now I see "Transaction function failed when originally evaluated" every time a node starts up, I guess when it indexes that bad transaction. Is it recommended to just evict it?

refset00:12:47

You shouldn't need to evict anything. When you say "every time a node starts up" do you mean from scratch? Or is that happening with persisted Rocks indexes?

nivekuil07:12:48

yes, from scratch -- newly created docker containers. I'm still using the 20.11 RC, can't recall if my previous test cluster was on this version or the old.

nivekuil07:12:32

I believe this is the offending tx fn:

{:crux.db/id :assoc
 :crux.db/fn '(fn [ctx eid attr new-value]
                (let [db     (crux.api/db ctx)
                      entity (crux.api/entity db eid)]
                  [[:crux.tx/put
                    (assoc entity attr new-value)]]))}

nivekuil07:12:53

it looks like it was called on a nil entity, so the resulting doc was just the assoc'd value without a crux.db/id.
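
A minimal sketch of a guard against the nil-entity case described above. The `assoc-ops` helper and the `guarded-assoc-fn` name are hypothetical. The sketch assumes, as I understand the Crux documentation, that a transaction function returning `false` aborts the transaction; treat that as an assumption to verify against the version in use. The function form is quoted, so the block evaluates in plain Clojure without Crux on the classpath.

```clojure
;; (assoc nil k v) yields a map with no :crux.db/id, which is the bug:
(assoc nil :entry/fresh? false) ; => {:entry/fresh? false}

(defn assoc-ops
  "Pure core of the guard (hypothetical helper): returns the tx ops for an
  existing entity, or false when there is no entity to update."
  [entity attr new-value]
  (if entity
    [[:crux.tx/put (assoc entity attr new-value)]]
    false))

;; The same guard inlined into the transaction function document.
;; Assumption: returning false from a tx fn aborts the transaction.
(def guarded-assoc-fn
  {:crux.db/id :assoc
   :crux.db/fn '(fn [ctx eid attr new-value]
                  (let [db     (crux.api/db ctx)
                        entity (crux.api/entity db eid)]
                    (if entity
                      [[:crux.tx/put (assoc entity attr new-value)]]
                      false)))})
```

The guard only prevents new malformed puts; it does nothing about a bad transaction already in the log.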

nivekuil07:12:04

the symptom is typical of broken indexing; I have 6 nodes running and all of them seem to be read-only. nothing else in the logs so I think it has to be crux.

nivekuil07:12:37

just restarted and reproduced. Here is the interesting log portion:

2020-12-04 23:47:43   at java.base/java.lang.Thread.run(Thread.java:832)
2020-12-04 23:47:43   at clojure.lang.AFn.run(AFn.java:22)
2020-12-04 23:47:43   at crux.tx$__GT_polling_tx_consumer$fn__68349.invoke(tx.clj:493)
2020-12-04 23:47:43   at crux.tx$index_tx_log.invoke(tx.clj:448)
2020-12-04 23:47:43   at crux.tx$index_tx_log.invokeStatic(tx.clj:450)
2020-12-04 23:47:43   at crux.tx$index_tx_log$fn__68328.invoke(tx.clj:458)
2020-12-04 23:47:43   at crux.tx$index_tx_log$fn__68328$fn__68333.invoke(tx.clj:470)
2020-12-04 23:47:43   at crux.tx.InFlightTx.abort(tx.clj:396)
2020-12-04 23:47:43   at crux.tx$index_docs.invoke(tx.clj:255)
2020-12-04 23:47:43   at crux.tx$index_docs.invokeStatic(tx.clj:257)
2020-12-04 23:47:43   at crux.error$illegal_arg.invoke(error.clj:3)
2020-12-04 23:47:43   at crux.error$illegal_arg.invokeStatic(error.clj:7)
2020-12-04 23:47:43   at crux.error$illegal_arg.invoke(error.clj:3)
2020-12-04 23:47:43   at crux.error$illegal_arg.invokeStatic(error.clj:12)
2020-12-04 23:47:43  Exception in thread "crux-polling-tx-consumer" crux.IllegalArgumentException: Missing required attribute :crux.db/id
2020-12-04 23:47:43  2020-12-05T07:47:43.667Z 73670e9a8d20 WARN [crux.tx:326] - Transaction function failed when originally evaluated: #crux/id fb5c548c8f8558c093ed35aa916c97e92b798c49 nil {:crux.db.fn/exception crux.IllegalArgumentException, :crux.db.fn/message "invalid tx-op: invalid entity id", :crux.db.fn/ex-data {:crux.error/error-type :illegal-argument, :crux.error/error-key :invalid-tx-op, :crux.error/message "invalid tx-op: invalid entity id", :op [:crux.tx/put {:entry/fresh? false}]}}
2020-12-04 23:47:24  2020-12-05T07:47:24.162Z 73670e9a8d20 INFO [crux.tx:326] - Started tx-consumer
2020-12-04 23:47:24  2020-12-05T07:47:24.042Z 73670e9a8d20 INFO [com.zaxxer.hikari.HikariDataSource:82] - HikariPool-1 - Start completed.
2020-12-04 23:47:23  2020-12-05T07:47:23.761Z 73670e9a8d20 INFO [com.zaxxer.hikari.HikariDataSource:80] - HikariPool-1 - Starting...

nivekuil07:12:18

being that the message is only a WARN I'm not sure if this is actually my problem, as it implies that it's routine.. but I'm still suspicious. regardless I think crux should never break like this, but if it does it should at least do everything it can to be loud about it

refset18:12:42

From what you've shared (thanks) I agree Crux shouldn't be breaking like this - indexing should continue despite errors like this. I will try to reproduce it. We do have various tests for this class of error but perhaps we've missed an edge case (particularly for "still working after errors"): https://github.com/juxt/crux/blob/master/crux-test/test/crux/tx_test.clj#L593-L644

nivekuil18:12:50

please, try to enjoy your weekend instead :) Really the more pressing issue on my mind is: what could I do to handle these types of situations (db going silently read-only) happening in production? Maybe crux exposes some metrics that could work? Or an await-tx heartbeat from my application?

nivekuil18:12:45

and after detection, what could I do to immediately remediate the issue? evict + restart? unfortunately a decent amount of downtime from how long everything takes to start up

refset23:12:27

> what could I do to handle these types of situations (db going silently read-only) happening in production?
Really this should never happen. The most comprehensive heartbeat to check end-to-end functionality is to run an entity lookup on some well-known (and guaranteed to not change) entity
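
A minimal sketch of such a heartbeat, assuming the application supplies the probe. The `healthy?` function, the `:healthcheck` entity id, and the timeout values are hypothetical. The helper itself is plain Clojure and only bounds how long a probe may take; the Crux calls appear solely in the usage comment.

```clojure
(defn healthy?
  "Runs probe (a zero-arg fn) on another thread. Returns true only if it
  yields a truthy value within timeout-ms; a timeout, an exception, or a
  nil/false result all count as unhealthy."
  [probe timeout-ms]
  (let [f      (future (try (probe) (catch Exception _ false)))
        result (deref f timeout-ms ::timeout)]
    (when (= ::timeout result)
      (future-cancel f)) ; best effort: interrupt a probe that is stuck
    (and (not= ::timeout result) (boolean result))))

;; Usage sketch (assumes `node` is a started Crux node and that a
;; :healthcheck entity was transacted beforehand):
;; (healthy? #(crux.api/entity (crux.api/db node) :healthcheck) 5000)
```

A lookup of an entity that never changes may keep succeeding while indexing is stalled, so to catch that case the probe could instead submit a small put and wait for it to be indexed, in line with the await-tx idea raised earlier in the thread.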

refset23:12:16

It's not clear to me how to remedy this particular situation yet (I don't think a normal evict could do it...), but I have managed to reproduce it locally now. We will have it fixed tomorrow hopefully and the fix will definitely be included in the imminent release. Thanks for your patience and for the report!

nivekuil23:12:42

Ah, good to hear it! I actually found another symptom: in the crux http console, the /_crux/sync endpoint will never resolve. Found that out while looking for a way to do healthchecks.
> Really this should never happen
in all honesty I've been trained to never take this line seriously 🙂 I think a heartbeat is probably good enough for now though. It does seem challenging for crux itself to preemptively detect such situations

nivekuil23:12:59

It actually seems like that may have broken all indexing after that point.. not super sure. need to figure out how to forward an nrepl through swarm/traefik

refset00:12:41

Which version are you using? Do you have any clue as to what the offending transaction function might have been?