This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2016-12-22
Channels
- # adventofcode (1)
- # beginners (172)
- # boot (47)
- # cider (7)
- # cljs-dev (30)
- # cljsrn (43)
- # clojure (180)
- # clojure-dusseldorf (1)
- # clojure-greece (1)
- # clojure-italy (3)
- # clojure-russia (41)
- # clojure-spec (67)
- # clojure-uk (101)
- # clojurescript (128)
- # core-async (4)
- # cursive (13)
- # datomic (29)
- # devcards (5)
- # emacs (19)
- # events (1)
- # hoplon (38)
- # lein-figwheel (1)
- # luminus (8)
- # midje (1)
- # off-topic (47)
- # om (10)
- # onyx (23)
- # protorepl (1)
- # re-frame (11)
- # reagent (7)
- # ring (3)
- # ring-swagger (9)
- # rum (6)
- # sql (5)
- # untangled (4)
okay, thank you
I'm having issues getting my job to start. Right now, I'm seeing things like "Not enough virtual peers have warmed up to start the task yet, backing off and trying again..."
But it just spits out those messages repeatedly and doesn't seem to be doing any sort of backing off. Is there some setting I need to tweak?
I have 8 tasks, so I've been trying 8+ peers
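(For context: Onyx won't start a job until there's at least one virtual peer available per task, so 8 tasks need 8+ peers. A minimal sketch of starting them, assuming Onyx's public API and a `peer-config` map defined elsewhere:)

```clojure
(require '[onyx.api])

;; Assumes `peer-config` is an Onyx peer-config map defined elsewhere.
(def peer-group (onyx.api/start-peer-group peer-config))

;; Start 8 virtual peers -- at least one per task in the job.
(def v-peers (onyx.api/start-peers 8 peer-group))
```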
it starts just fine if I swap out the kafka plugin reader stuff with core-async
but right now I'm curious as to why the logging isn't actually backing off
got it
Any ideas why the log messages aren't actually backing off and retrying?
@stephenmhopper any better?
@stephenmhopper hmm, that’s weird if it spits out those messages repeatedly. It should only happen if a log message is being applied to the cluster coordination log
@stephenmhopper that implies that something might be changing in your cluster a lot (if it’s just peers starting up, that’s fine)
Yeah, I'm not sure exactly what the issue was. I updated `:onyx.peer/job-not-ready-back-off` to be less aggressive. Right now, I'm doing development in a REPL, but ZK, Kafka, and BookKeeper are all running locally in Docker. It's possible that Docker was running out of memory (I had only allocated 2GB of RAM for all three to share). I updated it to 3GB and everyone seems to be happy now. But I also killed the containers entirely and recreated them. It's also possible that the BookKeeper data was somehow corrupted, as the new container couldn't start up until I removed the old BookKeeper mounted volume
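(A hedged sketch of the setting being discussed: `:onyx.peer/job-not-ready-back-off` is an Onyx peer-config key giving the wait, in milliseconds, before a peer re-checks whether its job's tasks are ready. The map below is a partial peer-config fragment; the value and the ZooKeeper address are illustrative, not defaults:)

```clojure
{:zookeeper/address "127.0.0.1:2181"
 ;; Wait 2000 ms between "not enough virtual peers" retries
 ;; (illustrative value -- tune to taste).
 :onyx.peer/job-not-ready-back-off 2000}
```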
Yeah, I could see you having some join churning going on if you were using a lot of peers
@stephenmhopper That message is typically emitted when a peer is beginning a task but can’t make initial connections, so it’s retrying.