2017-06-20
Channels
- # beginners (106)
- # boot (25)
- # cider (2)
- # cljs-dev (100)
- # cljsjs (1)
- # cljsrn (8)
- # clojure (90)
- # clojure-brasil (1)
- # clojure-dev (7)
- # clojure-greece (2)
- # clojure-italy (4)
- # clojure-madison (1)
- # clojure-russia (15)
- # clojure-serbia (15)
- # clojure-spec (13)
- # clojure-uk (32)
- # clojurescript (88)
- # cursive (19)
- # datascript (13)
- # datomic (32)
- # defnpodcast (1)
- # dirac (43)
- # euroclojure (1)
- # graphql (5)
- # hoplon (11)
- # immutant (1)
- # jobs (6)
- # lein-figwheel (2)
- # liberator (2)
- # luminus (14)
- # lumo (22)
- # off-topic (12)
- # om (9)
- # onyx (49)
- # parinfer (45)
- # precept (4)
- # protorepl (2)
- # reagent (14)
- # ring-swagger (3)
- # sql (1)
- # test-check (58)
- # timbre (3)
- # untangled (86)
Hey guys, have any of you seen this before? My guess is yes
java[13630]: Exception in thread "main" java.lang.Exception: Error starting media driver. This may be due
Jun 20 08:44:50 java[13630]: to a media driver data incompatibility between
Jun 20 08:44:50 java[13630]: versions. Check that no other media driver has
Jun 20 08:44:50 java[13630]: been started and then use -d to delete the directory
Jun 20 08:44:50 java[13630]: on startup
@camechis Have you recently upgraded?
This one’s definitely taken care of in 0.10, though I can’t think of a particular instance of a bug that would cause what you’re seeing. Have you tried clearing out your Aeron directory?
There could still be a bug running around on the 0.9 branch involving media driver restarts, but it would be the first we’ve heard of it in a while.
@camechis are you using the media driver from lib-onyx? Maybe the onyx/aeron dependency is a different version there.
So it was a startup, and then a restart after changing the peer code or just the service code. No deps changes
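In case it helps with clearing things out, here is a minimal sketch of launching an Aeron media driver from Clojure with dir-delete-on-start enabled, so a stale driver directory is wiped at startup. It uses Aeron's Java `MediaDriver` API directly rather than the lib-onyx launcher, and whether the `-d` flag in the error message maps to exactly this option is an assumption.

```clojure
;; Hedged sketch: start a standalone Aeron media driver that deletes any
;; stale Aeron directory on startup, which is one way to clear the
;; "data incompatibility between versions" error. Assumes the Aeron Java
;; API is on the classpath via the Onyx/Aeron deps.
(ns example.media-driver
  (:import [io.aeron.driver MediaDriver MediaDriver$Context]))

(defn launch-media-driver!
  "Launches a media driver, wiping any existing Aeron directory first."
  []
  (MediaDriver/launch
   (doto (MediaDriver$Context.)
     (.dirDeleteOnStart true))))
```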
so, i'm trying to figure out what causes my peers to time out. basically i'm doing some pretty heavy processing inside a single function somewhere in the pipeline, and this repeatedly causes a peer to time out on heartbeats, causing onyx to restart the process… i've done some benchmarking, and it looks like on some occasions this single function can take a few seconds to execute. the server handling this does have enough CPU capacity left on other cores to do "other stuff", so i wouldn't expect a peer to starve because of this. is this expected behaviour?
https://gist.githubusercontent.com/solatis/a316d18be0268127d155039144c72b53/raw/82bdd9e966fae8c850aa4c67119504edcf792713/gistfile1.txt ^ example logs, for completeness' sake
i would like to note that the task that's timing out is not the heavy calculation task
The peer on task pipeline/coerce is actually timing out another peer 801f1651-3c3a-a266-74a7-a2cc8f59b785
do you know what task that peer is on?
If you’re using a big batch size (e.g. 200), and each segment takes, say, 100ms to complete, you could easily blow your 20s timeout budget processing a single batch
so you need to either increase the timeout or reduce the batch size in that case
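To make that arithmetic concrete, here's a quick sketch using the example figures from above; the liveness-timeout key name in the comment is an assumption, so check the Onyx cheat sheet for your version.

```clojure
;; Back-of-the-envelope check: batch size * per-segment latency vs. the
;; liveness timeout. Numbers are the example figures from the thread.
(def batch-size 200)      ;; :onyx/batch-size on the slow task
(def per-segment-ms 100)  ;; ~100ms of processing per segment
(def timeout-ms 20000)    ;; 20s liveness timeout

(* batch-size per-segment-ms)
;; => 20000, i.e. a single batch can eat the whole 20s budget

;; Fixes: shrink the batch on that task ...
;; {:onyx/name :pipeline/normalize-charset, :onyx/batch-size 20, ...}
;; ... or raise the peer liveness timeout in the peer-config (key name is
;; an assumption; verify it for your Onyx version), e.g.
;; {:onyx.peer/subscriber-liveness-timeout-ms 60000, ...}
```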
but I’d want to make sure the peer being timed out (801f1) is actually the one that is blocked
yeah, that’s a bit surprising that you’d be seeing a timeout then, assuming that single segment isn’t taking forever.
i see this, which appears to be telling me what this peer was running: Peer 801f1651-3c3a-a266-74a7-a2cc8f59b785 - Warming up task lifecycle {:id :pipeline/normalize-charset, :name :pipeline/normalize-charset, :ingress-tasks #{:event/in}, :egress-tasks-batch-sizes #:pipeline{:coerce 1}, :egress-tasks #{:pipeline/coerce}}
Yep, looks like it’s the pipeline/normalize-charset peer that isn’t responding.
which seems like an odd one for it to be stuck on, based on the task name
Yeah, that’s definitely pretty odd.
if a batch size = 1, does that mean a peer will block forever if the next peer is busy ?
If you have onyx-peer-http-query setup on this project, you should curl the /metrics endpoint. It should give you a lot of information about how peers are heartbeating
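A quick way to pull those numbers from a REPL, assuming the query server is enabled in the peer-config; the config keys and the 8081 port below are assumptions, so verify them against the onyx-peer-http-query README for your version.

```clojure
;; Sketch: enable onyx-peer-http-query in the peer-config, then scrape /metrics.
(def peer-config
  {;; ... your existing peer-config ...
   :onyx.query/server? true
   :onyx.query.server/ip "0.0.0.0"
   :onyx.query.server/port 8081})

;; Prometheus-style text output, including since_received_heartbeat_Max etc.
(println (slurp "http://localhost:8081/metrics"))
```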
the peer will keep trying to offer its message to the downstream peer, but the offer is non blocking so it will continue to heartbeat, etc in between offers
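Roughly the idea, as an illustrative sketch only (this is not the actual Onyx peer loop): the offer is non-blocking, so the peer can keep heartbeating between failed attempts instead of blocking forever.

```clojure
;; Illustrative only: retry a non-blocking offer to the downstream peer,
;; heartbeating between attempts so liveness detection stays satisfied.
(defn offer-with-heartbeats!
  [offer! heartbeat! segment]
  (loop []
    (heartbeat!)
    (when-not (offer! segment)  ;; returns false when downstream can't take it yet
      (Thread/sleep 1)
      (recur))))
```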
since_received_heartbeat_Max{job_name="81b4942_3f5d_cd1f_b342_1012641b187d", job_id="681b4942_3f5d_cd1f_b342_1012641b187d", task="pipeline_coerce", slot_id="2", peer_id="3358dbd7_190b_4c3c_9f3f_acb380785400"} 21519.38363
there should be another metric for when each peer heartbeats
not just when they received one
what does that one look like?
Should be
Definitely fishy then. hmm
i should probably do some benchmarking with tufte around that normalise-charset function
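For reference, a minimal tufte sketch for timing that function; `normalize-charset` and `sample-segments` here are placeholders for the real function behind :pipeline/normalize-charset and some representative input data.

```clojure
;; Minimal tufte profiling sketch with placeholder function and data.
(require '[taoensso.tufte :as tufte :refer [p profile]])

(tufte/add-basic-println-handler! {}) ;; print profiling summaries to stdout

(defn normalize-charset [segment]
  ;; ... the real charset-normalisation work ...
  segment)

(def sample-segments
  [{:text "héllo"} {:text "wörld"}])

(profile {}
  (doseq [segment sample-segments]
    (p :normalize-charset (normalize-charset segment))))
```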
thanks for your help @lucasbradstreet, will report back in case this problem persists and i have more data
Cool. Happy to help you look into it further once you have more info
I definitely want to make sure we get the heartbeating / liveness detection right