#onyx
2017-10-09
eriktjacobsen18:10:08

We are seeing an odd issue: we killed a running job with onyx.api/kill-job, which returned true, but the job still shows up in the list of running jobs. We restarted the jar and it still shows as running. We tried to submit a new job and it says not enough peers (presumably because the first job is still alive). Currently at a bit of a loss on how to debug or get rid of that job, short of wiping our zookeeper cluster.
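
For context, a minimal sketch of the call being described here, with placeholder values (the ZooKeeper address and job-id below are illustrative, not from this conversation):

    (require '[onyx.api])

    ;; Placeholder client config: the ZooKeeper address is made up;
    ;; :onyx/tenancy-id must match the tenancy the job was submitted under.
    (def peer-config
      {:zookeeper/address "zk.example.com:2181"
       :onyx/tenancy-id "1"})

    ;; Placeholder: job-id is the UUID that submit-job returned.
    (def job-id #uuid "00000000-0000-0000-0000-000000000000")

    ;; kill-job writes a :kill-job entry to the coordination log and returns;
    ;; peers only stop the job once they read that entry, which is why a killed
    ;; job can still show up as running for a while.
    (onyx.api/kill-job peer-config job-id)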

eriktjacobsen18:10:50

Tried running gc; it's been running for 15+ minutes.
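
gc goes through the same client config. Per the Onyx docs it compacts the replica and deletes old log entries from ZooKeeper, which is why it can run for a long time when the log is large; a one-line sketch reusing the placeholder peer-config above:

    ;; Returns only after the log has been compacted and old entries deleted
    ;; from ZooKeeper, so expect it to be slow when the log has grown huge.
    (onyx.api/gc peer-config)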

lucasbradstreet18:10:24

Did you kill job on the right tenancy-id?

eriktjacobsen19:10:08

It appears we only have one, a string "1". It's hardcoded in our peer-config so I don't think it would change, but is there a way to find the tenancy-id of a running job?

lucasbradstreet19:10:13

You have to follow the log on that tenancy.

lucasbradstreet19:10:48

I just thought you might have kill-job’d on another tenancy from somewhere else.

eriktjacobsen19:10:13

there are 6 jobs that show up from killed-jobs, so it seems that the kill-job functionality has been working on the cluster, except for this zombie job

lucasbradstreet19:10:40

Interesting. So the job is still in the :jobs key?
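
One way to see what the cluster believes is to follow the log for that tenancy and fold the entries into a replica, roughly as in the documented log-subscriber pattern. This is only a sketch; the exact return shape of subscribe-to-log (the :replica key here) should be checked against the Onyx version in use:

    (require '[clojure.core.async :refer [chan <!!]]
             '[onyx.api]
             '[onyx.extensions :as extensions])

    ;; Sketch: stream log entries for the tenancy in peer-config, apply each one
    ;; to the replica, and print which jobs are live vs. killed. Runs until
    ;; interrupted.
    (def ch (chan 100))
    (def subscription (onyx.api/subscribe-to-log peer-config ch))

    (loop [replica (:replica subscription)]
      (let [entry (<!! ch)
            new-replica (extensions/apply-log-entry entry replica)]
        (println (:fn entry)
                 {:jobs (:jobs new-replica)
                  :killed-jobs (:killed-jobs new-replica)})
        (recur new-replica)))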

eriktjacobsen19:10:25

The job just disappeared and killed-jobs cleared as well, possibly as a result of the 20-minute gc finishing. We have 5 m4.xlarge instances for our zookeeper, but conceivably zookeeper could have been having problems at the time; we don't have good monitoring on our zookeeper performance yet.

lucasbradstreet19:10:38

Yeah, that was definitely a case of gc. The fact that the gc took 20 mins might mean that your log has grown huge

lucasbradstreet19:10:00

and maybe that kill-job hadn’t been read yet

eriktjacobsen19:10:01

Interesting. The kill-job was submitted perhaps 30 minutes before we tried the gc. Is there a command that would check the size of our log and what range should it be in? Thanks for the debug help.

michaeldrogalis19:10:14

There’s not at the moment. You could ls /onyx/<tenancy-id>/log on your ZooKeeper CLI, I think
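
If you'd rather do that check from the REPL than from the ZooKeeper CLI, here is a rough interop sketch against the plain ZooKeeper Java client; the connect string is a placeholder and the /onyx/1/log path assumes the tenancy-id "1" mentioned above:

    (import '[org.apache.zookeeper ZooKeeper Watcher])

    ;; Count the children of the tenancy's log node, which is what `ls` in the
    ;; ZooKeeper CLI lists.
    (let [zk (ZooKeeper. "127.0.0.1:2181" 5000
                         (reify Watcher
                           (process [_ _])))]
      (try
        (println "tenancies:"   (.getChildren zk "/onyx" false))
        (println "log entries:" (count (.getChildren zk "/onyx/1/log" false)))
        (finally
          (.close zk))))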

michaeldrogalis19:10:34

That sequence of events sounds plausible anyway

eriktjacobsen20:10:10

So it looks like that zk command shows 188,122 items in the log. Is that a lot?

michaeldrogalis20:10:53

That’s quite high, yes. We’ve been clipping our log in Pyroclast at around 10,000 entries maximum.

michaeldrogalis20:10:14

We also rotate our tenancy on every deployment, and transfer our jobs over via resume points.
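
A rough sketch of that rotation pattern, using the resume-point functions from the Onyx 0.10 docs as I recall them; the tenancy-ids and exact arities here are illustrative and worth verifying against the resume-points documentation:

    ;; Illustrative only: carry a job's state from the old tenancy to a new one
    ;; by resuming from its latest snapshot coordinates. `job` is the same job
    ;; map you would normally pass to submit-job.
    (defn resubmit-on-new-tenancy
      [peer-config job job-id old-tenancy new-tenancy]
      (let [old-config   (assoc peer-config :onyx/tenancy-id old-tenancy)
            new-config   (assoc peer-config :onyx/tenancy-id new-tenancy)
            coordinates  (onyx.api/job-snapshot-coordinates old-config old-tenancy job-id)
            resume-point (onyx.api/build-resume-point job coordinates)]
        (onyx.api/submit-job new-config (assoc job :resume-point resume-point))))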

lucasbradstreet20:10:46

gc is another fine approach, though preferably before it gets to this point. Seems like you had a problem with peers rebooting themselves a lot along the way

eriktjacobsen20:10:42

I'm going to assume we are doing something horribly wrong, because this is a dev instance with very little test data flowing through it. That size of 188k seems to be after attempting to run gc (it's possible the gc never finished; I think the ssh session timed out after 30m). Could you expand on how we might go about clipping our log as a temporary fix until we can diagnose why we're creating so many entries? This was after only 3 days, so we'll start rotating the tenancy, but it doesn't look like that alone would prevent this.

lucasbradstreet20:10:33

You definitely have something crashing in your jobs. Check your onyx.log file

eriktjacobsen23:10:17

Going to ask a question that seems to have been asked 5+ times, but here we go: We are seeing this:

Warning: space is running low in /dev/shm (tmpfs) threshold=167,772,160 usable=49,143,808
and
io.aeron.exceptions.RegistrationException: Insufficient usable storage for new log of length=50332096 in /dev/shm (tmpfs)
However, we are not using Docker, and our /dev/shm has plenty of space at 4 GB:
tmpfs           3.9G   57M  3.9G   2% /dev/shm
I can verify that aeron-root exists at /dev/shm/aeron-root, so it seems to be writing to the correct place. I'm also getting a lot of [onyx.messaging.aeron.subscriber:60] - Unavailable network image messages that may be related. I've seen that the suggested solutions are to bump up the shm size (it's already 4 GB) and possibly to lower -Daeron.term.buffer.length? I'm running 11 virtual peers on a 2-core machine. Is there an option that limits the java process from using the full space in /dev/shm that needs to be overridden?
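
A back-of-the-envelope note on that error, not from the thread itself: the usable=49,143,808 in the warning means that at the moment of the error /dev/shm had only about 47 MB free, even if df shows it mostly empty later, and the length=50332096 in the RegistrationException is consistent with Aeron's default 16 MiB term buffers, since each log file is three term buffers plus a small metadata section. Every publication or image therefore costs roughly 48 MB of /dev/shm, so with many publications and images across 11 virtual peers a 4 GB tmpfs can plausibly fill up. Lowering -Daeron.term.buffer.length (a power of two) shrinks each log proportionally:

    ;; Back-of-the-envelope only; the 448-byte metadata figure is inferred from
    ;; the error message rather than taken from the Aeron source.
    (let [default-term-length (* 16 1024 1024)]
      (+ (* 3 default-term-length) 448))
    ;; => 50332096, matching "new log of length=50332096" above.
    ;;
    ;; Lowering the term length, e.g. passing -Daeron.term.buffer.length=2097152
    ;; to the JVM(s) running the peers and the media driver, would cut each log
    ;; from ~48 MB to ~6 MB.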