This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2015-11-16
Channels
- # admin-announcements (9)
- # beginners (112)
- # boot (223)
- # cbus (10)
- # cider (19)
- # clara (2)
- # cljs-dev (81)
- # cljsjs (3)
- # cljsrn (45)
- # clojure (239)
- # clojure-conj (12)
- # clojure-poland (2)
- # clojure-russia (56)
- # clojure-taiwan (1)
- # clojurescript (57)
- # cursive (28)
- # datomic (5)
- # events (14)
- # immutant (1)
- # jobs (1)
- # ldnclj (8)
- # off-topic (28)
- # om (80)
- # onyx (121)
- # re-frame (10)
- # sneer-br (1)
- # spacemacs (40)
- # yada (44)
I accidentally deleted the whole directory. After redeploying everything via Fabric: no errors, all good.
Not the way I wanted to solve the problem, but everything works now. I think maybe I had corrupted data in ZK.
Cool. Glad to hear. Let me know if it pops up again
@robert-stuttaford: what version of onyx-metrics are you using? I think you’re probably using a version that has a particularly bad memory leak in it. It might take you a while to hit it because the effect is moderately small, but you will definitely hit it.
lucasbradstreet: [org.onyxplatform/onyx-metrics "0.7.10" :exclusions [org.onyxplatform/onyx]]
Cool, one sec
Did you release new one?
I’m about to. Main problem is that you’re on 0.7 so I might need to back port it for now
K, looks like you’re susceptible to it.
I’ll push out a new 0.7 release for you. I wouldn’t recommend you upgrade to 0.8 quite yet, though things are looking pretty good there
lucasbradstreet: is there no backward compatibility between 0.8 and 0.7?
You mean using metrics 0.8 on 0.7?
So, it's not simple to use 0.8 instead of 0.7.10?
Well, it’d involve a bigger upgrade and we just released 0.8
So you might want to wait a few days before getting onto Onyx 0.8
whereas I think you should probably get this metrics fix in ASAP
Let me know when you finish the back port
We’re pretty sure 0.8 is good, but we’re just going to do a full performance test first
Will do
lucasbradstreet: thanks!
Try 0.7.10.1
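For reference, the dependency coordinate from earlier in the conversation, bumped to the patched release, would look something like this in project.clj:

```clojure
;; project.clj dependency bumped to the back-ported leak fix
[org.onyxplatform/onyx-metrics "0.7.10.1" :exclusions [org.onyxplatform/onyx]]
```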
should we upgrade?
we have indeed been struggling with memory leak type issues
You should upgrade to 0.8.0 soon, but I think you should hold off for a bit
So I pushed a new metrics release which fixes this issue
is “Try 0.7.10.1” a response to me?
also, hello hope you had a great weekend!
To lowl4tency, and you heh
oh. right. that’s both of us, then.
Sorry about that issue
ok. i’ll push 0.7.10.1 right away. thank you!
ok. the fix is live. will keep a beady eye on it today
thank you @lucasbradstreet !
Cool, no worries. I’ve been trying to track this issue down all day
what was it?
My big scale benchmarks were slowing down 😕
I was tracking completions on non input tasks, and the timestamps that I was putting in maps never got cleared (because they never got completed, seeing as they were not input tasks)
So for every segment we were tracking a new timestamp that never got cleared
It actually exhibited itself as Aeron publications ending up closed. So I was looking in the wrong place for ages 😞
Of course that was only happening because GCs were taking a long time
programming!
Indeedly
workload dependent, we’ve had to restart every couple of days or so. just haven’t had the headspace to profile it ourselves. given that we’re shipping a shit-ton of metrics to riemann, my guess is this is why
Throughput is the main factor
I’m guessing this is your problem though
Please let me know if you keep having to restart at all
we’re removing the cron job now and will watch it closely
yes, i’m pretty sure we suffered it. our input task is given every datomic transaction we have.
lucasbradstreet: thanks a lot for quick fix
My bad on the memory leak 👻
15-Nov-16 12:37:46 wh01.c.tunlld-01.internal INFO [onyx.peer.task-lifecycle] - [2158b39b-c4c5-45ff-a79e-52f81fa83e46] Peer chose not to start the task yet. Backing off and retrying...
15-Nov-16 12:37:46 wh01.c.tunlld-01.internal INFO [onyx.peer.task-lifecycle] - [a817f591-2d84-4fba-a015-0837cedc1622] Peer chose not to start the task yet. Backing off and retrying...
15-Nov-16 12:37:46 wh01.c.tunlld-01.internal INFO [onyx.peer.task-lifecycle] - [c76b12c6-a192-4b34-8a09-d75b313bf9e0] Peer chose not to start the task yet. Backing off and retrying...
15-Nov-16 12:37:46 wh01.c.tunlld-01.internal INFO [onyx.peer.task-lifecycle] - [66f084f5-f530-437e-a7f1-aec697838fd8] Peer chose not to start the task yet. Backing off and retrying...
15-Nov-16 12:37:46 wh01.c.tunlld-01.internal INFO [onyx.peer.task-lifecycle] - [4f818531-1ea0-45e6-9c37-cecdbb272441] Peer chose not to start the task yet. Backing off and retrying...
Are any retries logged by onyx-metrics?
It’ll look like: 15-Nov-16 18:14:09 lbpro INFO [onyx.lifecycle.metrics.timbre] - Metrics: {:job-id #uuid "5df1edbf-fe57-4905-96ae-0be6e24b0923", :task-id #uuid "97faf3bb-0718-463e-8a4b-e5428f5b6ffc", :task-name :inc1, :peer-id #uuid "496df455-f926-4d00-89de-31051c239d97", :service "[:inc1] 1s_retry-segment-rate", :window "1s", :metric :retry-rate, :value 0.0, :tags ["retry_segment_rate_1s" "onyx" ":inc1" "your-workflow-name"]}
If you use timbre logging
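A minimal sketch of wiring onyx-metrics up with Timbre output, so retry-rate entries like the one above show up in your logs. The key names here are an assumption based on the onyx-metrics README of this era and vary between releases, so verify them against the version you actually depend on:

```clojure
;; Hypothetical lifecycle entry for onyx-metrics with Timbre logging.
;; Key names are an assumption; check the onyx-metrics README for your release.
{:lifecycle/task :all                         ; instrument every task in the job
 :lifecycle/calls :onyx.lifecycle.metrics.metrics/calls
 :metrics/sender-fn :onyx.lifecycle.metrics.timbre/timbre-sender
 :metrics/workflow-name "your-workflow-name"  ; appears in the :tags of each entry
 :lifecycle/doc "Logs throughput/retry metrics via Timbre"}
```

With this in the job's lifecycles, grepping the Timbre log for `:metric :retry-rate` shows whether any segments are being retried.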
Sorry, I can’t help you much more until you show whether :metric :retry-rate always has :value 0.0
Retries are the main reason why you’d be getting double executed tasks
I assume you mean that functions are called twice for a given segment?
15-Nov-16 12:37:51 wh02.c.tunlld-01.internal INFO [hyper.onyx.desy] - received message:
15-Nov-16 12:37:51 wh02.c.tunlld-01.internal INFO [hyper.onyx.desy] - received message:
15-Nov-16 12:37:51 wh02.c.tunlld-01.internal INFO [hyper.onyx.desy] - received message: hi
15-Nov-16 12:37:51 wh02.c.tunlld-01.internal INFO [hyper.onyx.desy] - received message: 7
15-Nov-16 12:37:51 wh02.c.tunlld-01.internal INFO [hyper.onyx.desy] - received message: test
15-Nov-16 12:37:51 wh02.c.tunlld-01.internal INFO [hyper.onyx.desy] - received message: hi
15-Nov-16 12:37:51 wh02.c.tunlld-01.internal INFO [hyper.onyx.desy] - received message: new
15-Nov-16 12:37:51 wh02.c.tunlld-01.internal INFO [hyper.onyx.desy] - received message: 7
15-Nov-16 12:37:51 wh02.c.tunlld-01.internal INFO [hyper.onyx.desy] - received message: test
15-Nov-16 12:37:51 wh02.c.tunlld-01.internal INFO [hyper.onyx.desy] - received message: life
15-Nov-16 12:37:51 wh02.c.tunlld-01.internal INFO [hyper.onyx.desy] - received message: new
15-Nov-16 12:37:51 wh02.c.tunlld-01.internal INFO [hyper.onyx.desy] - received message: life
15-Nov-16 12:37:51 wh02.c.tunlld-01.internal INFO [hyper.onyx.functions.sample-functions] - seg:
15-Nov-16 12:37:51 wh02.c.tunlld-01.internal INFO [hyper.onyx.functions.sample-functions] - seg: hi
15-Nov-16 12:37:51 wh02.c.tunlld-01.internal INFO [hyper.onyx.functions.sample-functions] - seg: 7
15-Nov-16 12:37:51 wh02.c.tunlld-01.internal INFO [hyper.onyx.functions.sample-functions] - seg: test
15-Nov-16 12:37:51 wh02.c.tunlld-01.internal INFO [hyper.onyx.functions.sample-functions] - seg: new
15-Nov-16 12:37:51 wh02.c.tunlld-01.internal INFO [hyper.onyx.functions.sample-functions] - seg: life
15-Nov-16 12:37:51 wh02.c.tunlld-01.internal INFO [hyper.onyx.functions.sample-functions] - seg:
15-Nov-16 12:37:51 wh02.c.tunlld-01.internal INFO [hyper.onyx.functions.sample-functions] - seg: hi
15-Nov-16 12:37:51 wh02.c.tunlld-01.internal INFO [hyper.onyx.functions.sample-functions] - seg: 7
15-Nov-16 12:37:51 wh02.c.tunlld-01.internal INFO [hyper.onyx.functions.sample-functions] - seg: test
15-Nov-16 12:37:51 wh02.c.tunlld-01.internal INFO [hyper.onyx.functions.sample-functions] - seg: new
15-Nov-16 12:37:51 wh02.c.tunlld-01.internal INFO [hyper.onyx.functions.sample-functions] - seg: life
there are no ‘:metric :retry-rate’ related log entries at all. is this a kafka plugin specific setup error?
onyx-metrics
I guess Lucas meant this https://github.com/onyx-platform/onyx-metrics
You should add this library to your project
Just set up onyx-metrics with timbre logging so that you can look at your logs and see whether any retries happen
Correct
You can use the dashboard if you want but for now just using the timbre logging and inspecting your logs will be enough
Could not find artifact org.onyxplatform:onyx-metrics:jar:0.8.0.1 in central (https://repo1.maven.org/maven2/)
Could not find artifact org.onyxplatform:onyx-metrics:jar:0.8.0.1 in clojars (https://clojars.org/repo/)
That’s weird. Try 0.8.0.3
Will update the page
Must have been a release process issue
Should have auto updated
Sorry about that
Metrics: {:job-id #uuid "1f423693-ccd1-4f4e-85c8-baa277a7ff84", :task-id #uuid "a802c18e-7a94-4fba-81c6-f110e4551501", :task-name :my-dent, :peer-id #uuid "c4bc13d3-bbe1-43cb-9d42-165dd9bddf83", :service "[:my-dent] 1s_retry-segment-rate", :window "1s", :metric :retry-rate, :value 0.0, :tags ["retry_segment_rate_1s" "onyx" ":my-dent" "your-workflow-name"]}
Always 0?
Ok that's weird then
So my theory was that your messages weren't being acked and were being retried
Yeah I believe it. You really need to look at the point where there's some throughput though
Any retries will be slightly before that
Yeah. So initially throughput will be positive. At some point later, for the second call, throughput will be positive again. I want to know whether retries were positive just prior too
Same error as last time huh
Sorry, that's my cue for sleep :/
We can work through it tomorrow at an earlier time
3am here :)
Is your main concern the error above or the multiple function calls?
Ok, so a few explanations make sense to me. First: you're reading messages and they're not getting acked. If this is happening then you'll get replays after :onyx/pending-timeout, which is 60s
By default
Second alternative is that you're submitting jobs multiple times, potentially using the same group-id (which is how the checkpoint gets committed)
Other thing I can think of is that you have duplicate messages on the input medium (i.e. you're accidentally writing your input messages to your Kafka topic twice)
Those are the main things that I think could go wrong