#onyx
2017-03-05
lucasbradstreet08:03:38

@jasonbell this is your guy http://www.onyxplatform.org/docs/cheat-sheet/latest/#lifecycle-calls/:lifecycle/handle-exception. You can set the lifecycle with task :all to make sure all tasks restart
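
A minimal sketch of such a lifecycle entry, assuming the four-argument handle-exception signature described in the linked cheat sheet (the function and namespace names here are placeholders):

;; Restart the peer on any exception thrown in any task.
(defn restart-on-exception
  [event lifecycle lifecycle-name exception]
  ;; return :restart to reboot the peer, :kill to kill the job,
  ;; or :defer to let another lifecycle decide
  :restart)

(def restart-calls
  {:lifecycle/handle-exception restart-on-exception})

;; attach it to every task in the job via :lifecycle/task :all
(def lifecycles
  [{:lifecycle/task :all
    :lifecycle/calls ::restart-calls}])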

jasonbell08:03:16

thank you, especially since it's a Sunday

lucasbradstreet08:03:18

@jasonbell however… that may not be a transitory issue that would be helped by restarting, because your buffers may not be big enough to copy the message after the reboot

lucasbradstreet08:03:25

You may need to increase the term buffer sizes, which would increase your shm requirements

lucasbradstreet08:03:56

Looks like you’re either using a batch size > 1, or your segment is > 2MB

jasonbell08:03:47

Batch size is 1

jasonbell08:03:19

Is there a control on the segment size as in this instance I'd expect some of these messages to go over 2mb.

lucasbradstreet08:03:53

max-message-size is (/ aeron.term.buffer.length 8). In this case you must be using 16MB, so your max message size is 2MB
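
Worked out at a REPL, that relationship (with the 16MB term buffer length assumed above) looks like:

;; max message size is one eighth of the Aeron term buffer length
(def aeron-term-buffer-length (* 16 1024 1024)) ; 16MB, as assumed above

(/ aeron-term-buffer-length 8)
;; => 2097152 bytes, i.e. 2MB; a larger segment would need a larger
;; aeron.term.buffer.length, and correspondingly more /dev/shm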

jasonbell08:03:19

Oh right, now that opens up a whole new level of checking to do. But this is very helpful.

lucasbradstreet08:03:17

@jasonbell by the way I'm adding a task offset metric that plugins can use to report offsets like the Kafka offset. It'll be in the next release which should be out in the next couple days

jasonbell08:03:46

That's brilliant news, thanks

lucasbradstreet08:03:39

I've been wanting to add that for a long time and you gave me the extra push

lucasbradstreet08:03:43

Out of interest, where are you pushing the metrics and how?

jasonbell08:03:05

Just raw logs at the moment but that will end up going to ELK

jasonbell08:03:22

Right now my issue is offset position and consumer group lag.

lucasbradstreet08:03:27

I'm mostly wondering if you're using an agent that scrapes the JMX metrics, or if you're polling the /metrics endpoint

jasonbell08:03:59

Mesos uses the metrics endpoint as its heartbeat, so it should be easy to have another agent query for offset etc

jasonbell08:03:47

Though are any of the Onyx Kafka offsets being written to the __consumer_offsets topic, or is everything handled in the /onyx/[id]/checkpoint nodes?

lucasbradstreet08:03:35

The latter. The way we manage the offsets doesn't really lend itself to managing the offsets via Kafka

jasonbell08:03:04

So my need for offset position in relation to the log becomes all the more important I'm sorry to say 🙂

jasonbell08:03:45

And a small piece of information in the docs about how it's done. (I saw there was an issue logged for that too).

jasonbell08:03:56

great work though, I'm very impressed.

lucasbradstreet08:03:51

Absolutely. For context we need a way to have a consistent snapshot of the offsets in relation to the barrier messages so that we can restore the full job state to a consistent state.

lucasbradstreet08:03:21

Thanks. 0.10.0 still has a few holes in this regard but it's getting there

jasonbell08:03:06

All redeployed so I'll let it catch up and see how it looks later.

jasonbell09:03:24

@lucasbradstreet

17-03-05 09:40:07 cc8c494c30ec INFO [onyx.messaging.aeron.subscriber:103] - Stopping subscriber [[#uuid "61ab2b5c-4d4c-f6e6-6c28-692d6000846c" :add-fields] -1 {:address "localhost", :port 40200, :aeron/peer-task-id nil}] :subscription 8511
17-03-05 09:40:07 cc8c494c30ec INFO [onyx.messaging.aeron.status-publisher:33] - Closing status pub. {:completed? false, :src-peer-id #uuid "cb34a779-83b0-9502-bf23-b8432faa1eea", :site {:address "localhost", :port 40200, :aeron/peer-task-id nil}, :blocked? false, :pos 79150592, :type :status-publisher, :stream-id 0, :dst-channel "aeron:udp?endpoint=localhost:40200", :dst-peer-id #uuid "c90a997a-9c13-d2f0-ee23-cda3b9c65ea2", :dst-session-id 1407885110, :short-id 1, :status-session-id nil}
17-03-05 09:40:07 cc8c494c30ec INFO [onyx.messaging.aeron.status-publisher:33] - Closing status pub. {:completed? false, :src-peer-id #uuid "cb34a779-83b0-9502-bf23-b8432faa1eea", :site {:address "localhost", :port 40200, :aeron/peer-task-id nil}, :blocked? false, :pos 79152128, :type :status-publisher, :stream-id 0, :dst-channel "aeron:udp?endpoint=localhost:40200", :dst-peer-id #uuid "1fb72f40-08d6-1c52-b5bc-2baabad9d354", :dst-session-id 1407885110, :short-id 2, :status-session-id nil}
17-03-05 09:40:07 cc8c494c30ec INFO [onyx.messaging.aeron.status-publisher:33] - Closing status pub. {:completed? false, :src-peer-id #uuid "cb34a779-83b0-9502-bf23-b8432faa1eea", :site {:address "localhost", :port 40200, :aeron/peer-task-id nil}, :blocked? false, :pos 79152704, :type :status-publisher, :stream-id 0, :dst-channel "aeron:udp?endpoint=localhost:40200", :dst-peer-id #uuid "c822fd7e-5fe6-8aa5-58ea-89b8b386a530", :dst-session-id 1407885110, :short-id 0, :status-session-id nil}
17-03-05 09:40:07 cc8c494c30ec INFO [onyx.messaging.aeron.status-publisher:33] - Closing status pub. {:completed? false, :src-peer-id #uuid "3a534613-e92b-b78e-23d5-95cba6191ccd", :site {:address "localhost", :port 40200, :aeron/peer-task-id nil}, :blocked? false, :pos 79153088, :type :status-publisher, :stream-id 0, :dst-channel "aeron:udp?endpoint=localhost:40200", :dst-peer-id #uuid "cb34a779-83b0-9502-bf23-b8432faa1eea", :dst-session-id 1407885110, :short-id 8, :status-session-id nil}
17-03-05 09:40:07 cc8c494c30ec INFO [onyx.messaging.aeron.status-publisher:33] - Closing status pub. {:completed? false, :src-peer-id #uuid "6a4ba641-8931-c438-a83e-260b4907f68f", :site {:address "localhost", :port 40200, :aeron/peer-task-id nil}, :blocked? false, :pos 79153088, :type :status-publisher, :stream-id 0, :dst-channel "aeron:udp?endpoint=localhost:40200", :dst-peer-id #uuid "cb34a779-83b0-9502-bf23-b8432faa1eea", :dst-session-id 1407885110, :short-id 3, :status-session-id nil}
17-03-05 09:40:07 cc8c494c30ec INFO [onyx.peer.virtual-peer:49] - Stopping Virtual Peer cb34a779-83b0-9502-bf23-b8432faa1eea

jasonbell09:03:08

Okay so the job happily processes the Kafka queue and then seems to idle and never picks up when new messages come in (and I know they are constantly coming in).

jasonbell09:03:25

The :aeron/peer-task-id nil bothers me a little bit.

lucasbradstreet09:03:12

I’ll be honest, the peer-task-id is vestigial and isn’t currently used, but I wanted to use it again in the future so I left it in. I think I will just remove it.

lucasbradstreet09:03:02

On why it isn’t picking up new messages, I’m not sure.

jasonbell09:03:51

i'll keep looking

lucasbradstreet09:03:04

These messages are logged after you stop the peer after it idles for a while?

jasonbell09:03:23

the peer is running

lucasbradstreet09:03:40

17-03-05 09:40:07 cc8c494c30ec INFO [onyx.peer.virtual-peer:49] - Stopping Virtual Peer cb34a779-83b0-9502-bf23-b8432faa1eea

lucasbradstreet09:03:46

I’m wondering why it’s being stopped here

jasonbell09:03:47

I didn't stop it

lucasbradstreet09:03:41

Right. That’s good to know. No exceptions in the logs? It shouldn’t stop unless it’s either being timed out or an exception is being thrown somewhere

jasonbell09:03:02

it's perfectly well behaved, just not doing anything

jasonbell09:03:58

There's 5GB for the Mesos container and 2.5GB --shm-size so it's got room to breathe
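
For context, that container sizing amounts to something like the following docker run flags if the peer were launched with Docker directly (the image name is a placeholder):

docker run --memory=5g --shm-size=2560m my-onyx-peer-image   # 2560m = 2.5GB shm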

lucasbradstreet09:03:19

The fact that the peer is being stopped at 09:40:07 suggests something happened and it is no longer idling

jasonbell09:03:18

output tasks are to s3, just in case that's helpful

jasonbell09:03:32

Kafka in, S3 out. A couple of transforms in the middle.

jasonbell09:03:02

The exception flow conditions wouldn't stop the job, they should just log and carry on.
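
A rough sketch of that kind of exception flow condition, routing thrown exceptions to the :out-error task seen in the logs above (the predicate, post-transform, and namespaces are placeholders, and argument lists may differ by Onyx version):

(defn exception? [event old-segment exception all-new]
  (instance? java.lang.Throwable exception))

(defn transform-error [event segment exception]
  ;; replace the failing segment with an error record for :out-error
  {:error (.getMessage ^Throwable exception)
   :original segment})

(def error-flow-conditions
  [{:flow/from :cheapest-flight
    :flow/to [:out-error]
    :flow/predicate ::exception?
    :flow/thrown-exception? true
    :flow/short-circuit? true
    :flow/post-transform ::transform-error}])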

jasonbell09:03:20

The handle-exception lifecycle against :all is holding up

jasonbell09:03:31

So I'm a little perplexed.

lucasbradstreet09:03:29

Me too. Do you have onyx-dashboard set up? You could check what the cluster coordination log shows activity-wise

jasonbell09:03:56

no, I should deploy that under docker though.

jasonbell09:03:10

will look at that later on

jasonbell11:03:29

@lucasbradstreet FYI, this is where everything just stops for no reason

17-03-05 08:59:44 cc8c494c30ec INFO [onyx.thing.workflow.shared:33] - ***** GZIP CSV BATCH *****
17-03-05 08:59:44 cc8c494c30ec INFO [onyx.peer.task-lifecycle:782] - Job 61ab2b5c-4d4c-f6e6-6c28-692d6000846c {:job-id #uuid "61ab2b5c-4d4c-f6e6-6c28-692d6000846c", :job-hash "80db2587326bbaf7d370d058dac52628cdf891489031eedf06442e9ccc08160"} - Task {:id :cheapest-flight, :name :cheapest-flight, :ingress-tasks #{:add-fields}, :egress-tasks #{:out-processed :out-error}} - Peer 6a4ba641-8931-c438-a83e-260b4907f68f - Peer timed out with no heartbeats. Emitting leave cluster. {:fn :leave-cluster, :peer-parent #uuid "6a4ba641-8931-c438-a83e-260b4907f68f", :args {:id #uuid "f66b5d79-47c0-a5ef-cf3a-f44423b7ae0a", :group-id #uuid "bb610b02-a1c7-a755-c7fd-181025b72bf3"}}
17-03-05 08:59:44 cc8c494c30ec INFO [onyx.messaging.aeron.status-publisher:33] - Closing status pub. {:completed? false, :src-peer-id #uuid "f1967ad4-b74b-2341-1a64-c3b0bcbac830", :site {:address "localhost", :port 40200, :aeron/peer-task-id nil}, :blocked? false, :pos 536096, :type :status-publisher, :stream-id 0, :dst-channel "aeron:udp?endpoint=localhost:40200", :dst-peer-id #uuid "f66b5d79-47c0-a5ef-cf3a-f44423b7ae0a", :dst-session-id 1407885110, :short-id 5, :status-session-id 1407885113}
17-03-05 08:59:44 cc8c494c30ec INFO [onyx.peer.task-lifecycle:452] - Job 61ab2b5c-4d4c-f6e6-6c28-692d6000846c {:job-id #uuid "61ab2b5c-4d4c-f6e6-6c28-692d6000846c", :job-hash "80db2587326bbaf7d370d058dac52628cdf891489031eedf06442e9ccc08160"} - Task {:id :add-fields, :name :add-fields, :ingress-tasks #{:in}, :egress-tasks #{:out-raw :cheapest-flight :out-error}} - Peer f66b5d79-47c0-a5ef-cf3a-f44423b7ae0a - Fell out of task lifecycle loop

jasonbell11:03:05

I thought flow conditions shouldn't stop the whole process

lucasbradstreet11:03:05

@jasonbell that makes more sense. Ok so what is happening there is that the peer is being timed out because it hasn't heartbeated in X ms. The peer should then be stopped and will either reboot (or not if it's dead)

jasonbell11:03:53

cheers, I'll have a read

lucasbradstreet11:03:58

if it’s possible that it takes over 5s to process that segment it’ll be timed out

lucasbradstreet11:03:06

5s is probably too low of a default. I thought the default was 10

jasonbell11:03:24

default is 10, nothing should be taking that long

jasonbell11:03:28

I'll look into it

lucasbradstreet11:03:42

No worries. Let me know what you end up finding. I'm off to sleep

jasonbell11:03:19

okay, thanks for the help.

jasonbell20:03:29

@lucasbradstreet just to let you know, extending the heartbeat timeouts has worked. I'll be keeping a close eye on it over the next 24h. Thanks for the help earlier.
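
For reference, the timeouts being extended live in the peer-config; a sketch with assumed key names and values (check the cheat sheet for the exact keys and defaults in your Onyx version):

;; assumed peer-config keys; verify against the cheat sheet for 0.10.x
(def peer-config-timeouts
  {:onyx.peer/heartbeat-ms 1000
   :onyx.peer/subscriber-liveness-timeout-ms 20000
   :onyx.peer/publisher-liveness-timeout-ms 20000})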