#datomic
2016-03-12
greywolve08:03:32

just experienced something odd with datomic in production: there was a stream of "transactor unavailable" exceptions in the log, and when i manually repl'd into that peer and tried something like (d/sync conn), i got the same error. we've seen this before, and it never recovers

greywolve08:03:42

have any of you experienced this before?

greywolve08:03:40

we basically have to restart the jvm

bkamphaus11:03:03

@greywolve: do you have metrics/monitoring (or logs you can grep)? One case where this can happen is with extremely large transaction sizes (1MB+).

greywolve11:03:33

we have the transactor logs, and it usually begins with this:

greywolve12:03:09

3-5 of those, and then everything goes to hell later

greywolve12:03:52

our txes are quite small, and we weren't under load when this happened

greywolve12:03:23

it's happened a couple of times now, next time we'll have some flight recorder metrics too

greywolve12:03:36

is there anything i can check the transactor for? that's the only thing we have in the peer logs

greywolve12:03:13

and connection destroyed follows the above:

greywolve12:03:06

and after that the transactor is never available again

greywolve12:03:40

this is our onyx cluster, we have other peers up on our regular servers, and they seem fine

greywolve12:03:06

we haven't run into this issue there

greywolve12:03:16

(also the transactor metrics look perfectly fine throughout this ordeal)

bkamphaus12:03:56

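# usage: metric-grep METRIC STAT -- prints "<timestamp> <value>" for each matching INFO line in *.log
function metric-grep () {
  cat *.log | perl -n -e 'print "$1 $2\n" if /^(.*) INFO .* '"$1"' \{.*?'"$2"' ([0-9]+).*?\}/' | less
}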

bkamphaus12:03:08

metric-grep :TransactionBytes :hi

bkamphaus12:03:50

or check your metrics (max over one minute). just to double check, what’s the largest transaction size?

greywolve12:03:52

datomic.transaction_bytes ?

greywolve12:03:28

0.41k is the highest during that period

greywolve12:03:54

highest over the past day is 12.03k

greywolve12:03:25

trouble started around 8:00am

greywolve12:03:35

we had to restart at ~10:00am

bkamphaus12:03:41

Ok, transaction size is unlikely to be the issue then. Hmm, I’m not familiar enough with what Onyx is doing to reason much further about what's different there yet. Have you done the basics, e.g. a lein deps :tree check for any dependency conflicts?

greywolve13:03:19

bkamphaus: onyx isn't really doing any more than reading from the log api (polling it), and using datomic's transact, that's about it, nothing fancy. i'll check the deps though to be safe :)

bkamphaus14:03:57

If there's a final tx from the transactor logs, it will be logged with a uuid - you can use that against the log API with tx-range to figure out which final transaction the peer made before failing. It's a key in the nested data structure, not something you can look up directly, and you need a reasonable t/tx/inst bound for the tx-range.

bkamphaus14:03:11

^ @greywolve

bkamphaus14:03:41

On phone now, I can pull up a code example when I get back to a keyboard :)

greywolve14:03:56

bkamphaus: awesome, thanks! that's a good idea :)

greywolve14:03:08

bkamphaus: code example would be welcome if you can

bkamphaus14:03:42

@greywolve https://gist.github.com/benkamphaus/7eaa6484a254a14f8f1f just pulled this out of another project and slightly refactored without testing in isolation (will test it and fix any typos if I get a chance later), so you may have to make a minor correction or two.

greywolve14:03:39

thanks so much :)