
just experienced something odd with datomic in production: there was a stream of "transactor unavailable" exceptions in the log, and when i manually repl'd into that peer and tried something like (d/sync conn), i got the same error. we've seen this before, and it never recovers


have any of you experienced this before?


we basically have to restart the jvm

Ben Kamphaus11:03:03

@greywolve: do you have metrics/monitoring (or logs you can grep)? One case where this can happen is with extremely large transaction sizes (1MB+).


we have the transactor logs, and it usually begins with this:


3-5 of those, and then everything goes to hell later


our txes are quite small, and we weren't under load when this happened


it's happened a couple of times now, next time we'll have some flight recorder metrics too


is there anything i can check the transactor for? that's the only thing we have in the peer logs


and connection destroyed follows the above:


and after that the transactor is never available again


this is our onyx cluster, we have other peers up on our regular servers, and they seem fine


we haven't run into this issue there


(also the transactor metrics look perfectly fine throughout this ordeal)

Ben Kamphaus12:03:56

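# Print "<log prefix> <value>" for metric $1 (e.g. :TransactionBytes) and stat key $2 (e.g. :hi).
function metric-grep () {
  cat *.log | perl -n -e 'print "$1 $2\n" if /^(.*) INFO .* '"$1"' \{.*?'"$2"' ([0-9]+).*?\}/' | less
}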

Ben Kamphaus12:03:08

metric-grep :TransactionBytes :hi

Ben Kamphaus12:03:50

or metrics (max over one minute), just to double check, what’s the largest transaction size?
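If the goal is just the single largest value, the metric-grep output can be reduced to one line instead of paged through. A minimal sketch, assuming each output line ends in the numeric value (which is what the print "$1 $2\n" in metric-grep produces). metric_max is a hypothetical helper name, not a Datomic tool, and the sample lines are made up for illustration:

```shell
# metric_max: read "<log prefix> <value>" lines on stdin and print the line
# whose last field is numerically largest. Hypothetical helper, not part of Datomic.
metric_max() {
  awk 'NR == 1 || $NF + 0 > max { max = $NF + 0; line = $0 }
       END { if (NR > 0) print line }'
}

# Usage with made-up sample lines in the shape metric-grep prints:
printf '%s\n' \
  '2000-01-01 08:00:01.000 410' \
  '2000-01-01 06:12:30.000 12030' \
  '2000-01-01 09:59:59.000 95' | metric_max
```

Something like metric-grep :TransactionBytes :hi | metric_max should also work, since less passes its input straight through when its output is not a terminal.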


datomic.transaction_bytes ?


0.41k is the highest during that period


highest over the past day is 12.03k


trouble started around ~8:00am


we had to restart at ~10:00am

Ben Kamphaus12:03:41

Ok, transaction size is unlikely to be the issue then. Hmm, I’m not familiar enough with what Onyx is doing to reason much further about what’s different there yet. Have you done the basics, like a lein deps :tree check for any dependency conflicts?


bkamphaus: onyx isn't really doing any more than reading from the log api (polling it) and using datomic's transact. that's about it, nothing fancy. i'll check the deps though to be safe :)

Ben Kamphaus14:03:57

If there's a final tx from the transactor logs, it will be logged with a uuid - you can use that against the log API with tx-range to figure out which final transaction the peer made before failing. It's a key in the nested data structure, not something you can look up directly, and you need a reasonable t/tx/inst bound for the tx-range.

Ben Kamphaus14:03:41

On phone now, I can pull up a code example when I get back to a keyboard :)


bkamphaus: awesome, thanks! that's a good idea :)


bkamphaus: code example would be welcome if you can

Ben Kamphaus14:03:42

@greywolve just pulled this out of another project and slightly refactored without testing in isolation (will test it and fix any typos if I get a chance later), so you may have to make a minor correction or two.


thanks so much :)