#onyx
2017-03-16
hunter00:03:58

@michaeldrogalis or @lucasbradstreet : http://pastebin.com/5JNsXuEA this is the onyx.log output from running 0.10.x beta7 ... my job is constantly starting and stopping tasks in the registered topology ... as a result segments do not run through the complete workflow

lucasbradstreet00:03:56

@hunter is this with a persistent ZooKeeper? Could you bump the tenancy-id and try it again?
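A minimal sketch of what bumping the tenancy-id looks like in practice: give :onyx/tenancy-id a fresh value in the peer (and env) config before restarting, so no stale cluster state under the old tenancy in ZooKeeper is replayed. The address and other keys shown are illustrative placeholders, not hunter's actual config.

;; Sketch only: fresh tenancy per run; values are placeholders.
(def peer-config
  {:zookeeper/address "127.0.0.1:2181"
   :onyx/tenancy-id (str (java.util.UUID/randomUUID)) ; bump this between runs
   :onyx.peer/job-scheduler :onyx.job-scheduler/greedy
   :onyx.messaging/impl :aeron
   :onyx.messaging/bind-addr "localhost"})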

hunter00:03:36

yes, a ZooKeeper running in a separate JVM

hunter00:03:48

i have changed the tenancy-id many times and seen the same result

hunter00:03:48

@lucasbradstreet to be clear, the log is from a locally running copy of Onyx, with a ZooKeeper running in a separate JVM (installed via Ubuntu) and a fresh tenancy-id

hunter00:03:32

i was able to reproduce the same results on 2 different computers at least 5 times

hunter00:03:48

changing tenancy-ids each time

hunter00:03:30

@lucasbradstreet btw, the onyx-datomic 0.10 beta7 tx-log watcher is working great for me now
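For readers following along, the tx-log watcher being discussed is the onyx-datomic read-log input task. A rough catalog-entry sketch from memory; treat the key names and values as assumptions to be checked against the onyx-datomic README for 0.10 beta7, not a verbatim copy of hunter's job.

;; Rough sketch of an onyx-datomic tx-log (read-log) catalog entry.
{:onyx/name :read-tx-log
 :onyx/plugin :onyx.plugin.datomic/read-log
 :onyx/type :input
 :onyx/medium :datomic
 :datomic/uri "datomic:dev://localhost:4334/my-db" ; placeholder URI
 :checkpoint/key "tx-log-checkpoint"               ; where the log offset is stored
 :checkpoint/force-reset? false
 :onyx/max-peers 1
 :onyx/batch-size 20
 :onyx/doc "Streams transactions from the Datomic tx-log"}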

lucasbradstreet00:03:43

Glad to hear it.

lucasbradstreet00:03:13

@hunter one thing that jumps out at me in that log is how many peers you’re starting on a single node

lucasbradstreet00:03:23

I could see the jobs becoming starved with that many peers

lucasbradstreet00:03:36

ah yep, it’s due to peer heartbeating "17-03-16 00:11:05 ehr1 INFO [onyx.peer.task-lifecycle:782] - Job 04595f59-72d5-2117-636c-19b6c88188a2 {:job-id #uuid "04595f59-72d5-2117-636c-19b6c88188a2", :job-hash "8867d04f2c2c6d66ce1ec2fdb27387b8f9246685191d8e05943d59a6fd2d60"} - Task {:id :payload-to-config, :name :payload-to-config, :ingress-tasks #{:parley-HL7v2-serializer}, :egress-tasks #{:dux-http}} - Peer 3d3955cd-b6c4-5034-d175-5b9d5e18b092 - Peer timed out with no heartbeats. Emitting leave cluster. {:fn :leave-cluster, :peer-parent #uuid "3d3955cd-b6c4-5034-d175-5b9d5e18b092", :args {:id #uuid "60f6fcfa-6248-d059-e5fc-86b7545be06b", :group-id #uuid "e9d0e3c8-1491-957a-5e9c-7c0558143cf6"}} "

hunter00:03:07

@lucasbradstreet that might have been the problem ... we were having issues getting the balance right on the number of peers to run on a single node. in 0.9.15 that number is fine without causing starvation

lucasbradstreet00:03:26

we now have the peers heartbeat to each other, rather than just to ZooKeeper, so it’s possible that’s why you’re getting more timeouts now. You could try increasing these http://www.onyxplatform.org/docs/cheat-sheet/latest/#/search/liveness
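The liveness settings referenced there are ordinary peer-config keys. A sketch of raising them, with the key names recalled from the 0.10 cheat sheet (verify against the linked page) and base-peer-config standing in for whatever config is already in use; the 20000 ms figures simply echo the new defaults mentioned below.

;; Sketch: raise heartbeat/liveness timeouts so busy peers are not
;; evicted prematurely. Key names per the 0.10 cheat sheet (verify there).
(def peer-config
  (merge base-peer-config
         {:onyx.peer/heartbeat-ms 1000
          :onyx.peer/subscriber-liveness-timeout-ms 20000
          :onyx.peer/publisher-liveness-timeout-ms 20000}))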

lucasbradstreet00:03:41

but my intuition is that it’s too many peers for a single node, assuming each peer will be doing much work

hunter01:03:16

@lucasbradstreet it worked. each peer isn’t doing much, but starting with a fresh tenancy-id, etc. and 30 peers (quad-core, 16GB machine) it performs as expected
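For completeness, the peer count in question is just the n-peers argument when booting virtual peers with the standard onyx.api calls. A minimal sketch, assuming env-config and peer-config are defined as in the earlier snippets; 30 is the count that worked here, not a general recommendation.

;; Sketch: one peer group per JVM, a bounded number of virtual peers.
(require '[onyx.api])

(def env        (onyx.api/start-env env-config))
(def peer-group (onyx.api/start-peer-group peer-config))
(def v-peers    (onyx.api/start-peers 30 peer-group)) ; keep n-peers modest per node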

hunter01:03:28

thank you very much, this was driving us crazy

lucasbradstreet01:03:29

Yeah, that’s probably a big part of it

lucasbradstreet01:03:23

I’ve actually changed the default timeouts to 20,000ms for the next release, to prevent these sorts of cases from coming up quite so often.

lucasbradstreet01:03:37

it’s higher than what I would want to use myself, but it’s probably safer overall

yonatanel10:03:08

From your experience, when passing messages/segments/events around, do you use a flat map with application-level and transport-level (meta) data side by side, or does the transport-level envelope wrap a nested app-level map?
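To make the two shapes concrete, here is an illustrative pair of segments; every key below is hypothetical and only meant to contrast the two layouts.

;; 1. Flat map: transport/meta keys live alongside app-level keys.
{:message-id "abc-123"
 :received-at 1489622400000
 :source "hl7-gateway"
 :patient-id 42
 :event :admit}

;; 2. Envelope: transport-level wrapper with the app-level map nested inside.
{:meta    {:message-id "abc-123"
           :received-at 1489622400000
           :source "hl7-gateway"}
 :payload {:patient-id 42
           :event :admit}}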