2017-10-09
Channels
- # aleph (16)
- # bangalore-clj (1)
- # beginners (57)
- # cider (4)
- # clara (1)
- # cljs-dev (25)
- # cljsrn (12)
- # clojure (76)
- # clojure-dusseldorf (2)
- # clojure-italy (41)
- # clojure-russia (4)
- # clojure-spec (3)
- # clojure-uk (122)
- # clojurescript (101)
- # cursive (8)
- # data-science (30)
- # datomic (2)
- # emacs (2)
- # figwheel (10)
- # fulcro (53)
- # garden (5)
- # gorilla (6)
- # hoplon (1)
- # jobs (1)
- # juxt (14)
- # leiningen (12)
- # om (1)
- # om-next (1)
- # onyx (21)
- # pedestal (40)
- # perun (5)
- # portkey (2)
- # re-frame (16)
- # reagent (1)
- # ring-swagger (3)
- # rum (6)
- # shadow-cljs (239)
- # spacemacs (10)
- # specter (9)
- # uncomplicate (2)
- # unrepl (1)
- # vim (13)
- # yada (16)
We are seeing an odd issue: we killed a running job with onyx.api/kill-job, which returned true, but the job still shows up in the list of running jobs. We restarted the jar and it still shows as running. We tried to submit a new job and it says there are not enough peers (presumably because the first job is still alive). We're currently at a bit of a loss on how to debug or get rid of that job, short of wiping our ZooKeeper cluster.
Tried running gc; it's been running for 15+ minutes.
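As a point of reference, a minimal sketch of the calls being discussed, assuming a peer-config like the one used to start the peers and a placeholder job-id; exact arities and behaviour may vary between Onyx versions:

(require '[onyx.api])

;; The same kind of config used to start the peers; tenancy-id is the hardcoded "1".
(def peer-config
  {:zookeeper/address "zk1:2181"
   :onyx/tenancy-id   "1"})

;; Placeholder for the job id originally returned by onyx.api/submit-job.
(def job-id #uuid "00000000-0000-0000-0000-000000000000")

;; Writes a kill-job entry to the cluster log; returning true means the entry
;; was written, not that every peer has already played it back.
(onyx.api/kill-job peer-config job-id)

;; Garbage-collects log entries that the peers have already played back.
(onyx.api/gc peer-config)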
Did you kill the job on the right tenancy-id?
It appears we only have one, a string "1". It's hardcoded in our peer-config so I don't think it would change, but is there a way to find the tenancy-id of a running job?
You have to follow the log on that tenancy.
I just thought you might have kill-job’d on another tenancy from somewhere else.
There are 6 jobs that show up under killed-jobs, so it seems that the kill-job functionality has been working on the cluster, except for this zombie job.
Interesting. So the job is still in the :jobs key?
The job just disappeared, and killed-jobs cleared as well, possibly as the result of the 20-minute gc finishing. We have 5 m4.xlarges for our ZooKeeper ensemble, but conceivably ZooKeeper could have been having problems at the time; we don't have good monitoring on our ZooKeeper performance yet.
Yeah, that was definitely a case of gc. The fact that the gc took 20 mins might mean that your log has grown huge
and maybe that kill-job hadn’t been read yet
Interesting. The kill-job was submitted perhaps 30 minutes before we tried the gc. Is there a command that would check the size of our log and what range should it be in? Thanks for the debug help.
There's not at the moment. You could ls /onyx/<tenancy-id>/log on your ZooKeeper CLI, I think.
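Concretely, from the stock ZooKeeper CLI that could look something like the snippet below, assuming the tenancy-id really is "1"; stat reports numChildren, which gives the entry count without printing every child the way a bare ls would:

zkCli.sh -server <zk-host>:2181
stat /onyx/1/log
ls /onyx/1/log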
That sequence of events sounds plausible anyway
So it looks like that zk command shows 188,122 items in the log. Is that a lot?
That's quite high, yes. We've been clipping our log in Pyroclast to around 10,000 entries at most.
We also rotate our tenancy on every deployment and transfer our jobs over via resume points.
gc is another fine approach, though preferably run before it gets to this point. It seems like you had a problem with peers rebooting themselves a lot along the way.
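A rough sketch of what rotating the tenancy per deployment can look like in the peer-config; the DEPLOY_SHA environment variable and the "myapp-" prefix are purely illustrative, and running jobs would still need to be carried over via resume points as mentioned above:

;; Derive the tenancy from the deployment instead of hardcoding "1",
;; so each deploy starts against a fresh, small log.
(def deploy-id (or (System/getenv "DEPLOY_SHA") "dev"))

(def peer-config
  {:zookeeper/address "zk1:2181,zk2:2181,zk3:2181"
   :onyx/tenancy-id   (str "myapp-" deploy-id)
   ;; ...rest of the peer configuration unchanged
   })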
I'm going to assume we are doing something horribly wrong, because this is a dev instance with very little test data flowing through it. That size of 188k seems to be after attempting to run gc (it's possible the gc never finished; I think the ssh session timed out after 30 minutes). Could you expand on how we might go about clipping our log as a temporary fix until we can diagnose why we're creating so many entries? This was after only 3 days, so we'll start rotating the tenancy, but it doesn't look like that alone would prevent this.
You definitely have something crashing in your jobs. Check your onyx.log file
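If it helps, a quick way to scan the peer log for crash loops; the filename and pattern here are just a starting point, so adjust them for whatever your logging config actually writes:

grep -n -i "exception" onyx.log | head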
Going to ask a question that seems to have been asked 5+ times, but here we go. We are seeing this:
Warning: space is running low in /dev/shm (tmpfs) threshold=167,772,160 usable=49,143,808
and
io.aeron.exceptions.RegistrationException: Insufficient usable storage for new log of length=50332096 in /dev/shm (tmpfs)
However, we are not using Docker, and our /dev/shm has plenty of space at 4 GB:
tmpfs 3.9G 57M 3.9G 2% /dev/shm
I can verify that aeron-root exists at /dev/shm/aeron-root, so it seems to be writing to the correct place. I'm also getting a lot of
[onyx.messaging.aeron.subscriber:60] - Unavailable network image
which may be related. I've seen that the usual solutions are to bump up the shm size (ours is already 4 GB) and possibly to lower -Daeron.term.buffer.length. I'm running 11 virtual peers on a 2-core machine. Is there an option that limits how much of /dev/shm the java process can use, one that would need to be overridden?
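For context, the term buffer override mentioned above is just a JVM system property on whichever process runs the Aeron media driver (and on the peers, if the driver is embedded). The command and the 2 MB value below are purely illustrative; Aeron allocates roughly three term buffers per log, which is why the error above asks for ~48 MB with the default 16 MB term length:

# example only: launch with a smaller term buffer (must be a power of two)
java -Daeron.term.buffer.length=2097152 -cp your-uberjar.jar your.peer.launcher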