#onyx
2017-06-20
Travis16:06:24

Hey guys, have any of you seen this before? My guess is yes.

java[13630]: Exception in thread "main" java.lang.Exception: Error starting media driver. This may be due
Jun 20 08:44:50  java[13630]: to a media driver data incompatibility between
Jun 20 08:44:50  java[13630]: versions. Check that no other media driver has
Jun 20 08:44:50  java[13630]: been started and then use -d to delete the directory
Jun 20 08:44:50  java[13630]: on startup
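For reference, here is a minimal Clojure sketch of launching a standalone Aeron media driver with the driver directory deleted on startup, which is roughly what the -d flag in that message refers to. Package names assume a newer Aeron where the driver lives under io.aeron.driver (older releases used uk.co.real_logic.aeron.driver):

(import '(io.aeron.driver MediaDriver MediaDriver$Context))

;; Sketch: clear any stale driver directory before launching so data left
;; behind by an older/incompatible driver can't abort startup.
(defn launch-media-driver []
  (let [ctx (doto (MediaDriver$Context.)
              (.dirDeleteOnStart true))]
    (MediaDriver/launch ctx)))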

michaeldrogalis17:06:02

@camechis Have you recently upgraded?

Travis17:06:23

Still on 0.9.15. Seems to occur when I upgrade my job code.

michaeldrogalis17:06:13

This one’s definitely taken care of in 0.10, though I can’t think of a particular instance of a bug that would cause what you’re seeing. Have you tried clearing out your Aeron directory?

Travis17:06:31

That works fine if I delete the data under

/dev/shm

Travis17:06:43

Just curious, wondering if there are any issues there.

michaeldrogalis17:06:28

There could still be a bug running around on the 0.9 branch involving media driver restarts, but it would be the first we’ve heard of it in a while.

lucasbradstreet18:06:54

@camechis are you using the media driver from lib-onyx? Maybe the onyx/aeron dependency is a different version there.
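For anyone hitting the same thing, the usual check is that onyx and lib-onyx are pinned to matching versions in project.clj so the media driver bundled with lib-onyx runs the same Aeron as the peers. A hedged sketch (the lib-onyx version string here is hypothetical):

;; Keep onyx and lib-onyx in lockstep so both pull the same Aeron.
:dependencies [[org.onyxplatform/onyx "0.9.15"]
               [org.onyxplatform/lib-onyx "0.9.15.0"]] ; lib-onyx version is illustrative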

Travis18:06:53

I have to double check, but this occurs on the same version. No changes.

Travis18:06:50

So: start up, then restart after changing the peer code or just the service code. No dependency changes.

lmergen19:06:14

So, I'm trying to figure out what causes my peers to time out. Basically, I'm doing some pretty heavy processing inside a single function somewhere in the pipeline, and this repeatedly causes a peer to time out on heartbeats, which makes Onyx restart the process. I've done some benchmarking, and it looks like on some occasions this single function can take a few seconds to execute. The server handling this does have enough CPU capacity left on other cores to do "other stuff", so I wouldn't expect a peer to starve because of this. Is this expected behaviour?

lmergen19:06:35

i'm running -rc2

lmergen19:06:42

i would like to note that the task that's timing out is not the heavy calculation task

lucasbradstreet20:06:49

The peer on task pipeline/coerce is actually timing out another peer 801f1651-3c3a-a266-74a7-a2cc8f59b785

lucasbradstreet20:06:20

do you know what task that peer is on?

lucasbradstreet20:06:58

If you’re using a big batch size (e.g. 200), and each segment takes, say, 100ms to complete, you could easily blow your 20s timeout budget processing a single batch

lucasbradstreet20:06:12

so you need to either increase the timeout or reduce the batch size in that case
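A rough sketch of those two knobs, with illustrative values. The liveness-timeout keys are from the 0.10.x peer-config as I recall them, so double-check them against the cheat sheet for your release, and the catalog entry name below is hypothetical:

;; peer-config: give peers a longer liveness budget (values illustrative)
{:onyx.peer/subscriber-liveness-timeout-ms 60000
 :onyx.peer/publisher-liveness-timeout-ms 60000}

;; or shrink the batch on the heavy catalog entry so a single batch
;; can't consume the whole budget
{:onyx/name :pipeline/heavy-task   ; hypothetical task name
 :onyx/batch-size 10}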

lucasbradstreet20:06:24

but I’d want to make sure the peer being timed out (801f1) is actually the one that is blocked

lmergen20:06:43

i actually put batch size 1 everywhere

lmergen20:06:17

just to reduce the surface of things that could influence this

lucasbradstreet20:06:59

yeah, that’s a bit surprising that you’d be seeing a timeout then, assuming that single segment isn’t taking forever.

lmergen20:06:59

i see this, which appears to be telling me what this peer was running: Peer 801f1651-3c3a-a266-74a7-a2cc8f59b785 - Warming up task lifecycle {:id :pipeline/normalize-charset, :name :pipeline/normalize-charset, :ingress-tasks #{:event/in}, :egress-tasks-batch-sizes #:pipeline{:coerce 1}, :egress-tasks #{:pipeline/coerce}}

lucasbradstreet20:06:32

Yep, looks like it’s the pipeline/normalize-charset peer that isn’t responding.

lmergen20:06:49

well, that's not supposed to be the task that's doing the heavy duty processing

lucasbradstreet20:06:50

which seems like an odd one for it to be stuck on, based on the task name

lmergen20:06:51

or i'm crazy

lucasbradstreet20:06:43

Yeah, that’s definitely pretty odd.

lmergen20:06:17

If batch size = 1, does that mean a peer will block forever if the next peer is busy?

lucasbradstreet20:06:24

If you have onyx-peer-http-query set up on this project, you should curl the /metrics endpoint. It should give you a lot of information about how peers are heartbeating.
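For anyone following along, enabling that query server is (as far as I recall) just a couple of keys merged into the peer-config, after which the metrics live at http://<ip>:<port>/metrics; the IP and port here are illustrative:

;; merged into the peer-config before starting the peer group
{:onyx.query/server? true
 :onyx.query.server/ip "0.0.0.0"
 :onyx.query.server/port 8080}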

lucasbradstreet20:06:06

The peer will keep trying to offer its message to the downstream peer, but the offer is non-blocking, so it will continue to heartbeat, etc., in between offers.
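To illustrate the shape of that (a simplified, hypothetical sketch, not Onyx's actual internals): the offer is a non-blocking attempt, and heartbeats are interleaved between retries:

(defn offer-with-heartbeats!
  "Hypothetical sketch: keep offering a segment downstream; when the
   offer is rejected, heartbeat and retry instead of blocking."
  [offer! heartbeat! segment]
  (loop []
    (if (offer! segment)     ; non-blocking attempt
      :offered
      (do (heartbeat!)       ; stay live while back-pressured
          (recur)))))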

lmergen20:06:54

cool, metrics seem to work

lmergen20:06:58

I should hook this up with Riemann.

lmergen20:06:46

since_received_heartbeat_Max{job_name="81b4942_3f5d_cd1f_b342_1012641b187d", job_id="681b4942_3f5d_cd1f_b342_1012641b187d", task="pipeline_coerce", slot_id="2", peer_id="3358dbd7_190b_4c3c_9f3f_acb380785400"} 21519.38363

lmergen20:06:50

that seems fishy

lucasbradstreet20:06:17

there should be another metric for when each peer heartbeats

lucasbradstreet20:06:24

not just when they received one

lucasbradstreet20:06:30

What does that one look like?

lmergen20:06:33

Is that since_heartbeat?

lmergen20:06:49

looks normal, all <500

lucasbradstreet20:06:15

Definitely fishy then. hmm

lmergen20:06:12

I should probably do some benchmarking with tufte around that normalize-charset function

lmergen20:06:28

to rule out that it is actually taking a long time on some weird dataset.
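A minimal tufte sketch for that, assuming the suspect function is a one-argument normalize-charset (name and arity are assumptions here):

(require '[taoensso.tufte :as tufte])

;; Print timing stats for profiled calls.
(tufte/add-basic-println-handler! {})

(defn normalize-charset-profiled [segment]
  (tufte/profile {}
    (tufte/p :normalize-charset
      (normalize-charset segment)))) ; `normalize-charset` is the user's fn (assumed)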

lmergen20:06:37

Thanks for your help @lucasbradstreet, will report back in case this problem persists and I have more data.

lucasbradstreet20:06:06

Cool. Happy to help you look into it further once you have more info

lucasbradstreet20:06:31

I definitely want to make sure we get the heartbeating / liveness detection right