#onyx
2017-10-10
lucasbradstreet00:10:42

term buffer length is the best place to start

lucasbradstreet00:10:10

hmm, but you do have space in /dev/shm

lucasbradstreet00:10:13

so that’s really weird

lucasbradstreet00:10:21

I think this is related to your big log

lucasbradstreet00:10:20

Could it have the wrong permissions to write to it? The whole picture doesn’t make a lot of sense.

eriktjacobsen00:10:32

There seems to be some thrashing component. Most of the time it is as I posted:

tmpfs           3.9G  105M  3.8G   3% /dev/shm

but I just caught it like this (this is with -Daeron.term.buffer.length=33554432):

tmpfs           3.9G  3.9G   48M  99% /dev/shm

and errors are triggered, but then it cleans itself right back up.

lucasbradstreet00:10:48

Yeah, that makes a lot of sense. I would look out for other errors in your logs

eriktjacobsen00:10:45

The odd thing is that things run fine on the local machine, using internal ZooKeeper and processing thousands of messages a minute, but when we deploy and use a 5-node ZooKeeper, the Aeron buffer seems to balloon until the job dies. This is our first go at running this in the server environment, so we're not sure which logs are normal. The logs are littered with low /dev/shm space, unavailable network image, and some peers not responding to heartbeat (despite being on a single node), but nothing stands out.

lucasbradstreet00:10:06

Does your job have a lot of tasks? Or lots of peers on each task? Seems like a lot of channels are being opened, which may be helped by reducing the term buffer size
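(A sketch of how that reduction might be applied: Aeron reads its term buffer length from the aeron.term.buffer.length system property shown earlier, so it can be lowered on the JVM running the Aeron media driver, e.g. via Leiningen's :jvm-opts. The 2 MB value below is an assumption — term buffer lengths must be a power of two, and the right size depends on your message sizes and rates.)

;; project.clj fragment — illustrative value only
:jvm-opts ["-Daeron.term.buffer.length=2097152"]  ; 2 MB per term buffer instead of the 32 MB tried above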

eriktjacobsen00:10:11

9 tasks, 11 virtual peers:

{:workflow [[:in :conform-health-check-msg]
            [:conform-health-check-msg :latest-status]
            [:latest-status :update-state-graph]
            [:in :update-state-graph]
            [:update-state-graph :distribute-statuses]
            [:distribute-statuses :save-component-status]
            [:save-component-status :out]
            [:update-state-graph :build-v0-json]
            [:build-v0-json :v0-json-out]]

lucasbradstreet00:10:06

K that’s not too bad

chrisblom08:10:17

is it possible to get the job name in a lifecycle call?

lellis14:10:28

Hi all, I have a question about ZooKeeper and peer state: if I have 5 ZooKeeper machines and 1 of them goes down, do the peers stop working?

gardnervickers14:10:07

@lellis No, ZK will remain available as long as there’s a cluster majority, so with 5 ZK nodes you can lose 2 and still remain available (a majority of 5 is 3, so up to 2 failures are tolerated).

chrisblom14:10:30

in the docs, it states that for a perf boost you can disable assertions, but i can't get it to work

chrisblom14:10:53

in the repl i can run (set! *assert* false)

chrisblom14:10:24

but when i run it as a jar, i get:

Caused by: java.lang.IllegalStateException: Can't change/establish root binding of: *assert* with set

chrisblom14:10:52

what's the proper way to disable assertions?

michaeldrogalis16:10:20

@chrisblom (get-in event [:onyx.core/task-information :job-metadata]) should do it I think
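(A minimal sketch of using that key path from a lifecycle, assuming the standard Onyx lifecycle calling convention; log-job-meta and :job-name are hypothetical names — whatever was submitted as the job's metadata is what comes back.)

;; hypothetical lifecycle fn that reads the job metadata before each task starts
(defn log-job-meta [event lifecycle]
  (let [job-meta (get-in event [:onyx.core/task-information :job-metadata])]
    (println "before-task-start for job:" (:job-name job-meta)))
  {})

(def calls
  {:lifecycle/before-task-start log-job-meta})

;; referenced from the job as:
;; {:lifecycle/task :all, :lifecycle/calls :my.ns/calls}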

michaeldrogalis16:10:00

Not sure about assertions - I’ve never seen Clojure throw that exception before.

michaeldrogalis16:10:04

Anyone else know?

lucasbradstreet16:10:31

@chrisblom one sec, I’ll get you the assert answer

lucasbradstreet17:10:23

@chrisblom are you uberjar’ing?

chrisblom17:10:10

@lucasbradstreet yes, only AOTing the main ns

chrisblom17:10:17

btw, i've also tried (alter-var-root #'*assert* (constantly false)), this throws something like java.lang.IllegalStateException: Can't change/establish root binding of: *assert*

chrisblom17:10:32

strange, as both approaches work in the repl

lucasbradstreet18:10:41

@chrisblom set :global-vars {*assert* false} in the profile where you uberjar

lucasbradstreet18:10:05

@chrisblom it’s enough for it to be set during the uberjar build; where things are going wrong is that you can’t set it at startup from inside the uberjar
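(A sketch of where that setting lives, following the advice above; my.main is a hypothetical namespace standing in for the AOT'd main ns mentioned earlier.)

;; project.clj fragment
:main my.main
:profiles {:uberjar {:global-vars {*assert* false}  ; asserts are compiled out during the uberjar build
                     :aot [my.main]}}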