2017-03-16
Channels
- # aws-lambda (3)
- # beginners (20)
- # boot (201)
- # cljs-dev (45)
- # cljsrn (9)
- # clojars (19)
- # clojure (141)
- # clojure-china (2)
- # clojure-dev (11)
- # clojure-greece (6)
- # clojure-italy (1)
- # clojure-new-zealand (1)
- # clojure-romania (1)
- # clojure-russia (55)
- # clojure-spec (58)
- # clojure-taiwan (1)
- # clojure-uk (97)
- # clojure-ukraine (40)
- # clojurescript (77)
- # core-async (5)
- # core-typed (1)
- # cursive (35)
- # datomic (9)
- # jobs (2)
- # jobs-rus (25)
- # juxt (8)
- # lein-figwheel (14)
- # luminus (24)
- # mount (16)
- # off-topic (56)
- # om (36)
- # onyx (22)
- # pedestal (3)
- # perun (14)
- # re-frame (111)
- # reagent (5)
- # remote-jobs (6)
- # ring-swagger (3)
- # slack-help (1)
- # specter (17)
- # unrepl (12)
- # untangled (56)
@michaeldrogalis or @lucasbradstreet: http://pastebin.com/5JNsXuEA this is the onyx.log output from running 0.10.x beta7 ... my job is constantly starting and stopping tasks in the registered topology ... as a result, segments do not make it through the complete workflow
@hunter is this with a persistent ZooKeeper? Could you bump the tenancy-id and try it again?
@lucasbradstreet to be clear, the log is from a locally running copy of Onyx, with ZooKeeper running in a separate JVM (installed from Ubuntu packages) and a fresh tenancy-id
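For context, "bumping the tenancy-id" usually just means supplying a new :onyx/tenancy-id in the env and peer configs so the peers ignore any stale cluster state stored under the old tenancy in ZooKeeper. A minimal sketch (the surrounding keys are illustrative, not the poster's actual config):

```clojure
;; Sketch only: a fresh :onyx/tenancy-id isolates these peers/jobs from any
;; state left in ZooKeeper under the previous tenancy.
(def tenancy-id (str (java.util.UUID/randomUUID)))

(def env-config
  {:onyx/tenancy-id tenancy-id
   :zookeeper/address "127.0.0.1:2181"})

(def peer-config
  {:onyx/tenancy-id tenancy-id
   :zookeeper/address "127.0.0.1:2181"
   :onyx.messaging/impl :aeron
   :onyx.peer/job-scheduler :onyx.job-scheduler/greedy})
```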
@lucasbradstreet btw, the onyx-datomic 0.10 beta7 tx-log watcher is working great for me now
Glad to hear it.
@hunter one thing that jumps out at me in that log is how many peers you’re starting on a single node
I could see the jobs becoming starved with that many peers
ah yep, it’s due to peer heartbeating "17-03-16 00:11:05 ehr1 INFO [onyx.peer.task-lifecycle:782] - Job 04595f59-72d5-2117-636c-19b6c88188a2 {:job-id #uuid "04595f59-72d5-2117-636c-19b6c88188a2", :job-hash "8867d04f2c2c6d66ce1ec2fdb27387b8f9246685191d8e05943d59a6fd2d60"} - Task {:id :payload-to-config, :name :payload-to-config, :ingress-tasks #{:parley-HL7v2-serializer}, :egress-tasks #{:dux-http}} - Peer 3d3955cd-b6c4-5034-d175-5b9d5e18b092 - Peer timed out with no heartbeats. Emitting leave cluster. {:fn :leave-cluster, :peer-parent #uuid "3d3955cd-b6c4-5034-d175-5b9d5e18b092", :args {:id #uuid "60f6fcfa-6248-d059-e5fc-86b7545be06b", :group-id #uuid "e9d0e3c8-1491-957a-5e9c-7c0558143cf6"}} "
@lucasbradstreet that might have been the problem ... we were having issues getting the balance right on the number of peers to run on a single node. in 0.9.15 that number is fine without causing starvation
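For reference, the virtual-peer count being discussed is simply the number passed to onyx.api/start-peers; a minimal sketch, assuming a peer-config map like the one sketched above:

```clojure
(require '[onyx.api :as onyx])

;; Sketch: the number of virtual peers on a node is whatever you pass to
;; start-peers, so "too many peers on one box" is tuned here.
(def peer-group (onyx/start-peer-group peer-config))
(def v-peers (onyx/start-peers 30 peer-group)) ;; e.g. 30 peers on a single node
```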
we now have the peers heartbeat to each other, rather than just to ZooKeeper, so it’s possible that’s why you’re getting more timeouts now. You could try increasing these http://www.onyxplatform.org/docs/cheat-sheet/latest/#/search/liveness
but my intuition is that it’s too many peers for a single node, assuming each peer will be doing much work
@lucasbradstreet it worked. each peer isn’t doing much, but starting with a fresh tenancy-id, etc. and 30 peers (quad-core, 16GB machine) it performs as expected
Yeah, that’s probably a big part of it
I’ve actually changed the default timeouts to 20,000ms for the next release, to prevent these sorts of cases from coming up quite so much.
it’s higher than what I would want to use myself, but it’s probably safer overall
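As a minimal sketch of the suggestion above (assuming the peer-config map from the earlier sketch): the liveness settings live in the peer config, the key names below are the ones I'd expect from the linked 0.10 cheat sheet (worth verifying there), and 20000 ms mirrors the new default just mentioned.

```clojure
;; Sketch: raise the heartbeat/liveness timeouts so heavily loaded peers are
;; not declared dead and evicted quite so quickly. Key names should be checked
;; against the cheat sheet; values are illustrative.
(def tuned-peer-config
  (assoc peer-config
         :onyx.peer/subscriber-liveness-timeout-ms 20000
         :onyx.peer/publisher-liveness-timeout-ms 20000))
```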
From your experience, when passing messages/segments/events around, do you use a flat map of application-level plus transport-level (meta) data, or do you have the transport-level envelope wrap a nested app-level map?
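For concreteness, the two shapes being contrasted might look roughly like this (the keys are hypothetical, purely to illustrate the trade-off):

```clojure
;; Option A: flat segment, transport/meta keys alongside app-level keys
(def flat-segment
  {:message/id      (java.util.UUID/randomUUID) ;; transport-level
   :message/sent-at 1489612908                  ;; transport-level
   :patient/name    "Jane Doe"                  ;; app-level
   :patient/dob     "1980-01-01"})              ;; app-level

;; Option B: transport-level envelope wrapping a nested app-level map
(def enveloped-segment
  {:message/id      (java.util.UUID/randomUUID)
   :message/sent-at 1489612908
   :payload         {:patient/name "Jane Doe"
                     :patient/dob  "1980-01-01"}})
```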
ooh, the iterative stuff coming up sounds good https://clojurians.slack.com/archives/onyx/p1489612908608389