2017-03-05
@jasonbell this is your guy http://www.onyxplatform.org/docs/cheat-sheet/latest/#lifecycle-calls/:lifecycle/handle-exception. You can set the lifecycle with task :all to make sure all tasks restart
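For illustration, a minimal sketch of such a lifecycle entry (the namespace, var name, and the :restart return value are assumptions, not @jasonbell's actual job; see the cheat-sheet link for the full handler contract):

```clojure
;; Hypothetical lifecycle calls map that asks Onyx to restart a task when it
;; throws, applied to every task in the job via :lifecycle/task :all.
(def handle-exception-calls
  ;; Returning :restart is an assumed policy; the cheat sheet documents the
  ;; handler's arguments and the other return values.
  {:lifecycle/handle-exception (constantly :restart)})

;; Added to the job map alongside :workflow and :catalog.
(def lifecycles
  [{:lifecycle/task :all
    :lifecycle/calls :my.app.lifecycles/handle-exception-calls}])
```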
@jasonbell however… that may not be a transitory issue that would be helped by restarting, because your buffers may not be big enough to copy the message after the reboot
You may need to increase the term buffer sizes, which would increase your shm requirements
Looks like you’re either using a batch size > 1, or your segment is > 2MB
Is there a control on the segment size? In this instance I'd expect some of these messages to go over 2MB.
max-message-size is (/ aeron.term.buffer.length 8). In this case you must be using 16MB, so your max message size is 2MB
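Spelling out that arithmetic (assuming the 16MB figure refers to the Aeron term buffer length in bytes):

```clojure
;; Max message size is the term buffer length divided by 8, per the note above.
(defn max-message-size [term-buffer-length-bytes]
  (/ term-buffer-length-bytes 8))

(max-message-size (* 16 1024 1024)) ;; => 2097152 bytes, i.e. 2MB
(max-message-size (* 32 1024 1024)) ;; => 4194304 bytes, i.e. 4MB after doubling the term buffer
```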
Oh right, now that opens up a whole new level of checking to do. But this is very helpful.
@jasonbell by the way I'm adding a task offset metric that plugins can use to report offsets like the Kafka offset. It'll be in the next release which should be out in the next couple days
I've been wanting to add that for a long time and you gave me the extra push
Out of interest, where are you pushing the metrics and how?
I'm mostly wondering if you're using an agent that scrapes the JMX metrics, or if you're polling the /metrics endpoint
Mesos uses the metrics endpoint as its heartbeat, so it should be easy to have another agent query for offsets etc.
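For illustration, a minimal sketch of polling such a /metrics endpoint (the host, port, and response format are assumptions; the chat doesn't say how the endpoint is exposed in this deployment):

```clojure
;; Hypothetical poller using clj-http; returns the raw metrics body or nil.
(require '[clj-http.client :as http])

(defn fetch-metrics
  "Fetches the /metrics endpoint of a peer and returns the response body."
  [host port]
  (try
    (:body (http/get (format "http://%s:%d/metrics" host port)))
    (catch Exception _ nil)))

;; e.g. (fetch-metrics "localhost" 8080) ;; port is an assumption
```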
Though, are any of the Onyx Kafka offsets being written to the __consumer_offsets topic, or is everything handled in the /onyx/[id]/checkpoint nodes?
The latter. The way we manage the offsets doesn't really lend itself to tracking them via Kafka
So my need for offset position in relation to the log becomes all the more important I'm sorry to say 🙂
And a small piece of information in the docs about how it's done. (I saw there was an issue logged for that too).
Absolutely. For context, we need a way to have a consistent snapshot of the offsets in relation to the barrier messages so that we can restore the full job to a consistent state.
Thanks. 0.10.0 still has a few holes in this regard but it's getting there
Thanks for the help @lucasbradstreet
17-03-05 09:40:07 cc8c494c30ec INFO [onyx.messaging.aeron.subscriber:103] - Stopping subscriber [[#uuid "61ab2b5c-4d4c-f6e6-6c28-692d6000846c" :add-fields] -1 {:address "localhost", :port 40200, :aeron/peer-task-id nil}] :subscription 8511
17-03-05 09:40:07 cc8c494c30ec INFO [onyx.messaging.aeron.status-publisher:33] - Closing status pub. {:completed? false, :src-peer-id #uuid "cb34a779-83b0-9502-bf23-b8432faa1eea", :site {:address "localhost", :port 40200, :aeron/peer-task-id nil}, :blocked? false, :pos 79150592, :type :status-publisher, :stream-id 0, :dst-channel "aeron:udp?endpoint=localhost:40200", :dst-peer-id #uuid "c90a997a-9c13-d2f0-ee23-cda3b9c65ea2", :dst-session-id 1407885110, :short-id 1, :status-session-id nil}
17-03-05 09:40:07 cc8c494c30ec INFO [onyx.messaging.aeron.status-publisher:33] - Closing status pub. {:completed? false, :src-peer-id #uuid "cb34a779-83b0-9502-bf23-b8432faa1eea", :site {:address "localhost", :port 40200, :aeron/peer-task-id nil}, :blocked? false, :pos 79152128, :type :status-publisher, :stream-id 0, :dst-channel "aeron:udp?endpoint=localhost:40200", :dst-peer-id #uuid "1fb72f40-08d6-1c52-b5bc-2baabad9d354", :dst-session-id 1407885110, :short-id 2, :status-session-id nil}
17-03-05 09:40:07 cc8c494c30ec INFO [onyx.messaging.aeron.status-publisher:33] - Closing status pub. {:completed? false, :src-peer-id #uuid "cb34a779-83b0-9502-bf23-b8432faa1eea", :site {:address "localhost", :port 40200, :aeron/peer-task-id nil}, :blocked? false, :pos 79152704, :type :status-publisher, :stream-id 0, :dst-channel "aeron:udp?endpoint=localhost:40200", :dst-peer-id #uuid "c822fd7e-5fe6-8aa5-58ea-89b8b386a530", :dst-session-id 1407885110, :short-id 0, :status-session-id nil}
17-03-05 09:40:07 cc8c494c30ec INFO [onyx.messaging.aeron.status-publisher:33] - Closing status pub. {:completed? false, :src-peer-id #uuid "3a534613-e92b-b78e-23d5-95cba6191ccd", :site {:address "localhost", :port 40200, :aeron/peer-task-id nil}, :blocked? false, :pos 79153088, :type :status-publisher, :stream-id 0, :dst-channel "aeron:udp?endpoint=localhost:40200", :dst-peer-id #uuid "cb34a779-83b0-9502-bf23-b8432faa1eea", :dst-session-id 1407885110, :short-id 8, :status-session-id nil}
17-03-05 09:40:07 cc8c494c30ec INFO [onyx.messaging.aeron.status-publisher:33] - Closing status pub. {:completed? false, :src-peer-id #uuid "6a4ba641-8931-c438-a83e-260b4907f68f", :site {:address "localhost", :port 40200, :aeron/peer-task-id nil}, :blocked? false, :pos 79153088, :type :status-publisher, :stream-id 0, :dst-channel "aeron:udp?endpoint=localhost:40200", :dst-peer-id #uuid "cb34a779-83b0-9502-bf23-b8432faa1eea", :dst-session-id 1407885110, :short-id 3, :status-session-id nil}
17-03-05 09:40:07 cc8c494c30ec INFO [onyx.peer.virtual-peer:49] - Stopping Virtual Peer cb34a779-83b0-9502-bf23-b8432faa1eea
Okay, so the job happily processes the Kafka queue and then seems to idle and never picks up when new messages come in (and I know they are constantly coming in).
I’ll be honest, the peer-task-id is vestigial and isn’t currently used, but I wanted to use it again in the future so I left it in. I think I will just remove it.
On why it isn’t picking up new messages, I’m not sure.
These messages are logged after you stop the peer after it idles for a while?
17-03-05 09:40:07 cc8c494c30ec INFO [onyx.peer.virtual-peer:49] - Stopping Virtual Peer cb34a779-83b0-9502-bf23-b8432faa1eea
I’m wondering why it’s being stopped here
Right. That’s good to know. No exceptions in the logs? It shouldn’t stop unless it’s either being timed out or an exception is being thrown somewhere
There's 5GB for the Mesos container and 2.5GB --shm-size, so it's got room to breathe
The fact that the peer is being stopped at 09:40:07 suggests something happened and it is no longer idling
The exception flow conditions wouldn't stop the job; they should just log and carry on.
Me too. Do you have onyx-dashboard set up? You could check what the cluster coordination log shows activity-wise
@lucasbradstreet FYI, this is where everything just stops for no reason
17-03-05 08:59:44 cc8c494c30ec INFO [onyx.thing.workflow.shared:33] - ***** GZIP CSV BATCH *****
17-03-05 08:59:44 cc8c494c30ec INFO [onyx.peer.task-lifecycle:782] - Job 61ab2b5c-4d4c-f6e6-6c28-692d6000846c {:job-id #uuid "61ab2b5c-4d4c-f6e6-6c28-692d6000846c", :job-hash "80db2587326bbaf7d370d058dac52628cdf891489031eedf06442e9ccc08160"} - Task {:id :cheapest-flight, :name :cheapest-flight, :ingress-tasks #{:add-fields}, :egress-tasks #{:out-processed :out-error}} - Peer 6a4ba641-8931-c438-a83e-260b4907f68f - Peer timed out with no heartbeats. Emitting leave cluster. {:fn :leave-cluster, :peer-parent #uuid "6a4ba641-8931-c438-a83e-260b4907f68f", :args {:id #uuid "f66b5d79-47c0-a5ef-cf3a-f44423b7ae0a", :group-id #uuid "bb610b02-a1c7-a755-c7fd-181025b72bf3"}}
17-03-05 08:59:44 cc8c494c30ec INFO [onyx.messaging.aeron.status-publisher:33] - Closing status pub. {:completed? false, :src-peer-id #uuid "f1967ad4-b74b-2341-1a64-c3b0bcbac830", :site {:address "localhost", :port 40200, :aeron/peer-task-id nil}, :blocked? false, :pos 536096, :type :status-publisher, :stream-id 0, :dst-channel "aeron:udp?endpoint=localhost:40200", :dst-peer-id #uuid "f66b5d79-47c0-a5ef-cf3a-f44423b7ae0a", :dst-session-id 1407885110, :short-id 5, :status-session-id 1407885113}
17-03-05 08:59:44 cc8c494c30ec INFO [onyx.peer.task-lifecycle:452] - Job 61ab2b5c-4d4c-f6e6-6c28-692d6000846c {:job-id #uuid "61ab2b5c-4d4c-f6e6-6c28-692d6000846c", :job-hash "80db2587326bbaf7d370d058dac52628cdf891489031eedf06442e9ccc08160"} - Task {:id :add-fields, :name :add-fields, :ingress-tasks #{:in}, :egress-tasks #{:out-raw :cheapest-flight :out-error}} - Peer f66b5d79-47c0-a5ef-cf3a-f44423b7ae0a - Fell out of task lifecycle loop
@jasonbell that makes more sense. Ok so what is happening there is that the peer is being timed out because it hasn't heartbeated in X ms. The peer should then be stopped and will either reboot (or not if it's dead)
see http://www.onyxplatform.org/docs/cheat-sheet/latest/#peer-config/:onyx.peer/publisher-liveness-timeout-ms and http://www.onyxplatform.org/docs/cheat-sheet/latest/#peer-config/:onyx.peer/subscriber-liveness-timeout-ms
if it’s possible that it takes over 5s to process that segment, it’ll be timed out
5s is probably too low of a default. I thought the default was 10
oh it is 10
No worries. Let me know what you end up finding. I'm off to sleep
@lucasbradstreet just to let you know, extending the heartbeat timeouts has worked. I'll be keeping a close eye on it over the next 24h. Thanks for the help earlier.
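For reference, a minimal sketch of what extending those liveness timeouts in the peer config might look like, using the keys from the cheat-sheet links above (the 60000 ms values and the tenancy id are illustrative assumptions, not the settings actually used here):

```clojure
;; Hypothetical peer config fragment; the rest of the usual peer
;; configuration is elided.
(def peer-config
  {:onyx/tenancy-id "dev-tenancy" ;; hypothetical
   :onyx.peer/publisher-liveness-timeout-ms 60000
   :onyx.peer/subscriber-liveness-timeout-ms 60000})
```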