#onyx
2017-11-12
lmergen09:11:26

the number of virtual peers required is equal to the number of tasks in a job, right (assuming all tasks have max-peers=1)?

lmergen09:11:50

i'm trying to figure out why certain jobs are not started, even when there should be plenty of peers available

lucasbradstreet09:11:02

That’s correct.
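
For reference, a minimal sketch of the peer arithmetic being confirmed here; the catalog entries and function name are made up:

;; With :onyx/max-peers 1 on every task, one virtual peer is needed per
;; task, so the minimum peer count for a job is simply the task count.
;; These catalog entries are hypothetical.
(def catalog
  [{:onyx/name :read-input   :onyx/max-peers 1}
   {:onyx/name :process      :onyx/max-peers 1}
   {:onyx/name :write-output :onyx/max-peers 1}])

(defn min-virtual-peers
  "Minimum virtual peers needed when every task is capped at one peer."
  [catalog]
  (count catalog))

(min-virtual-peers catalog) ;; => 3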

lmergen09:11:10

i have an integration test that is flaky, and it's because in some circumstances onyx is not starting all 3 jobs that are part of the test. adding an additional set of virtual peers solves the problem. guess i'll continue debugging replica state for a bit, trying to figure out whether there are any hints in there. i -think- it's triggered when a few jobs are killed first, so maybe they are not being cleaned up properly.

lucasbradstreet09:11:29

That sounds like a good idea. Let me know how you go

lmergen09:11:34

any hints on what i want to be looking for ? i have the job ids i expect to have running, and they're all in :allocations as expected. so that looks good.

lucasbradstreet09:11:26

Are they allocated peers in :allocations?

lmergen09:11:56

yes, all three jobs are in :allocations as expected

lmergen09:11:46

so that might mean that the actual job is crashing

lmergen09:11:58

but i'm not seeing it

lucasbradstreet10:11:40

Nothing in onyx.log? I’m off to sleep. Sorry I can’t help more

lmergen10:11:46

no problem, i'll keep figuring things out 🙂

lmergen10:11:45

interesting, apparently the issue resolves itself after 300 seconds

lmergen10:11:35

which is accompanied by a few {:died ...} log messages

lmergen10:11:06

hmmm i'm seeing a lot of 'unavailable network image' messages

lmergen11:11:46

so ultimately, the peer times out because of no heartbeats

lmergen11:11:59

i can see quite a few "peer X fell out of task lifecycle loop" messages

lmergen11:11:02

that is concerning

jasonbell11:11:41

Do you have lifecycle events in your catalog?

jasonbell11:11:20

Especially

(defn handle-exception [event lifecycle lifecycle-phase e]
  (println "Caught exception: " e)
  (println "Returning :restart, indicating that this task should restart.")
  :restart)
That one has beaten me up a few times.
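
For context, a minimal sketch of wiring a handler like the one above into a job via the usual :lifecycles mechanism; the task and var names here are hypothetical:

;; Sketch: attach the handle-exception handler above to a task through
;; a lifecycle calls map. :my-task and the var names are made up.
(def restart-calls
  {:lifecycle/handle-exception handle-exception})

(def lifecycles
  [{:lifecycle/task :my-task
    :lifecycle/calls ::restart-calls}])

;; The lifecycles vector is then added to the job map alongside
;; :workflow, :catalog, etc.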

lmergen11:11:34

other than the ones i have to set for kafka, nope

jasonbell11:11:05

Okay, if there’s a specific peer where it’s falling apart, it’s worth having the lifecycles there; if the task fails then it will halt and the whole peer needs restarting.

jasonbell11:11:15

I’m assuming the Kafka messages are small (i.e. < 1MB)

lmergen11:11:39

yeah, they're in the order of several kbs each

jasonbell11:11:58

So the Aeron default buffer size is plenty then.

lmergen11:11:23

yep, but i'm still getting the 'unavailable network image' messages -- even right after starting a whole cluster from clean

lmergen11:11:11

might be related to me running aeron in a dev environment (as in, not a separate process for the aeron media driver) ?

lmergen11:11:21

i'll add that exception handler there anyway

lmergen11:11:37

i feel i'm completely in the dark as to what is actually blocking

jasonbell11:11:00

I always run Aeron as a separate process (there’s an example of that in the Docker run scripts)
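
A sketch of the peer-config side of that, assuming the standard Onyx Aeron keys; the bind address is a placeholder:

;; Sketch: disable the embedded Aeron media driver so a standalone
;; driver process (e.g. io.aeron.driver.MediaDriver started separately)
;; handles the messaging instead.
{:onyx.messaging/impl :aeron
 :onyx.messaging/bind-addr "localhost"
 :onyx.messaging.aeron/embedded-driver? false}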

lmergen11:11:32

yeah, i should

jasonbell11:11:33

Yeah, i had that problem too at the start; it can take quite a bit of tracing. I ended up doing my own logging with timbre.

lmergen11:11:29

yeah combined with the jmx metrics it helps to build a full view of what's going on

lmergen11:11:07

but i should probably better instrument my tasks

jasonbell11:11:09

It’s also worth trawling through the logs to see if the heartbeat is timing out.

lmergen11:11:19

yes the heartbeat is timing out

lmergen11:11:25

it's detected after 5 minutes

jasonbell11:11:31

:onyx.peer/subscriber-liveness-timeout-ms
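
For reference, a hedged sketch of where those liveness timeouts sit in the peer-config; the values below are illustrative, not recommendations:

;; Sketch: liveness timeouts in the peer-config. Values are
;; illustrative; check the defaults for the Onyx version in use.
{:onyx.peer/subscriber-liveness-timeout-ms 20000
 :onyx.peer/publisher-liveness-timeout-ms 20000}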

lmergen11:11:44

and then things seem to recover

jasonbell11:11:11

That’s as much as I can think of right now, I only mention these as I’m wading through my slides for ClojureX 🙂

jasonbell11:11:21

“How I Bled All Over Onyx” 🙂

lmergen11:11:29

ah nice, you're giving a talk there ?

lmergen11:11:08

unfortunately can't attend then, but i wish you all the luck 🙂

jasonbell11:11:14

most kind, thank you

lmergen14:11:00

@lucasbradstreet well if you have some spare time, i would love it if you could take a look for a few minutes. i'm lost as to what's actually going on. i have a "clean" session with timbre debug level logs and the accompanying onyx logs where a 30s heartbeat timeout + killing of some peers was necessary for the job to actually start processing.

lucasbradstreet19:11:09

Hi @lmergen, happy to help when you have some time.

eelke21:11:39

@lucasbradstreet In the amazon-s3 plugin there is a TODO: "Need some way to control batch sizes. batch-timeout is not supported in ABS currently." Do you know when this will be done? I am indeed having issues controlling the batch size. It is pretty important, since querying s3 data is more efficient when objects have a certain (quite large) size

lucasbradstreet22:11:55

I’m running out for a bit, but if you could write up how you would like it to work, that’d be good. I assume you mean the output plugin?

lucasbradstreet22:11:09

We’ve been using windowing and triggers to buffer up large objects before emitting them. I think it’s a preferable approach, but we could possibly move some of the logic into the output plugin to make it easy

eelke14:11:44

Do you maybe have an example? I am now doing this:

:flow-conditions [{:flow/from :batch-it
                   :flow/to [:write-s3]
                   :flow/short-circuit? true
                   :flow/predicate ::triggered?}]
:windows [{:window/id :collect-segments
           :window/task :batch-it
           :window/type :fixed
           :window/window-key :event/date-time
           :window/range [1 :hour]
           :window/aggregation :onyx.windowing.aggregation/conj}]
:triggers [{:trigger/id :emit-part
            :trigger/window-id :collect-segments
            :trigger/post-evictor [:all]
            :trigger/on :onyx.triggers/segment
            :trigger/fire-all-extents? true
            :trigger/threshold [5000 :elements]
            :trigger/emit ::send!}]

eelke14:11:54

With 500 or 1000 elements this works pretty well. But I am getting 'org.apache.curator.CuratorConnectionLossException' and eventually 'Log subscriber closed due to disconnection from ZooKeeper' exceptions when attempting to create bigger batches.

eelke14:11:38

You have some tips and tricks? 😉

lucasbradstreet18:11:41

@U258C7RB2 that’s exactly what I meant, however I have some tips 🙂

lucasbradstreet18:11:51

1. if you use the latest onyx 0.12 alphas you can actually put the window on the output task. This helps get around issues where you would be messaging very large segments downstream, as the task will be communicating with itself.

lucasbradstreet18:11:13

2. You must be using the ZooKeeper checkpoint implementation, which is only meant for testing, and can’t handle windows over 1MB.

eelke08:11:30

Hey Lucas. thank you for these tips 🙂

eelke08:11:48

With 2 you mean it is better to use s3?

eelke08:11:08

:onyx.peer/storage :s3
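
A hedged sketch of what switching the checkpoint storage to S3 could look like in the peer-config; the bucket and region are placeholders, and the exact keys should be checked against the cheat sheet for the version in use:

;; Sketch: S3 checkpoint storage instead of the test-only ZooKeeper
;; implementation. Bucket and region are placeholders.
{:onyx.peer/storage :s3
 :onyx.peer/storage.s3.bucket "my-checkpoint-bucket"
 :onyx.peer/storage.s3.region "eu-west-1"}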

lucasbradstreet08:11:10

Sleep time, good luck!

eelke08:11:59

sleep well

eelke15:11:56

I can make it work, but the batches do not seem to grow bigger than ±5000 elements. What would a window and trigger together with an output task look like?

eelke08:11:16

Good evening (for you). I appreciate the reactions 🙂. I am almost there; just a bit more and I will be happy with the (onyx) job. Do you maybe have an answer to the latest question?

eelke08:11:07

I can show you the whole job

eelke13:11:58

Am I right to think that a trigger on an output task will cause a doubling of the segments? Since you cannot use a flow condition to ignore un-triggered segments?

eelke15:11:19

Never mind the batch size, I needed to increase the :onyx.messaging/term-buffer-size.segment and the shm-size
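
For reference, a sketch of that tuning; the buffer size below is illustrative only, and the Docker --shm-size needs to be raised to match so Aeron has enough room in /dev/shm:

;; Sketch: enlarge the Aeron term buffer used for segment messaging.
;; 16MB is an illustrative value, not a recommendation.
{:onyx.messaging/term-buffer-size.segment (* 16 1024 1024)}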

eelke15:11:34

Still curious about the output task window issue

lucasbradstreet18:11:31

Hi, yes, placing it on the output task will prevent flow conditions, but you could choose to filter those out in your :trigger/emit function

lucasbradstreet18:11:03

It’ll also sort out your term-buffer-size issues because you will never have to worry about really big chunks being messaged between peers.
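
A hedged sketch of doing that filtering inside the emit function; the argument list follows the usual :trigger/emit arity, and the keep? predicate is made up:

;; Sketch: a :trigger/emit function that filters what reaches the
;; output. extent-state is the window contents (a vector, given a conj
;; aggregation); keep? is a hypothetical predicate.
(defn send! [event window trigger state-event extent-state]
  (let [keep? (fn [segment] (some? (:event/date-time segment)))]
    {:batch (filterv keep? extent-state)}))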

eelke11:11:55

No, because the segments will pass through the :trigger/emit function only once, and the segments will go to the output task without going through :trigger/emit

lucasbradstreet22:11:54

Oh, I remember, there is one other thing you need to do to make this work

lucasbradstreet22:11:57

You need to use the new :onyx/type :reduce. It’s not documented yet as it’s still a testing feature, but it should be official pretty soon. The reduce type does not send regular segments to the output plugin - only the segments emitted from windows.
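
A hedged sketch of what such a catalog entry might have looked like; :onyx/type :reduce was undocumented at the time, so the surrounding keys and names are assumptions:

;; Sketch: an intermediate reduce task carrying the window, so only the
;; trigger-emitted segments flow on to the output task. Names and batch
;; size are hypothetical.
{:onyx/name :batch-it
 :onyx/type :reduce
 :onyx/max-peers 1
 :onyx/batch-size 1000}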

lucasbradstreet00:11:33

I keep seeing an unread message alert here, but no new messages ever show up. Ping me via PM if you’ve sent anything.