Fork me on GitHub
#onyx
<
2015-11-16
>
yusup06:11:44

I accidently deleted the whole directory .After redeploying everything very fabric. No errors. all good.

yusup06:11:12

Not the way, I wanted to solve the problem, but everything works now. I think maybe I have corrupted data in ZK.

lucasbradstreet06:11:45

Cool. Glad to hear. Let me know if it pops up again

lucasbradstreet08:11:21

@robert-stuttaford: what version of onyx-metrics are you using? I think you’re probably using a version that has a particularly bad memory leak in it. It might take you a while to hit it because the effect is moderately small, but you will definitely hit it.

Kira Sotnikov08:11:43

lucasbradstreet: [org.onyxplatform/onyx-metrics "0.7.10" :exclusions [org.onyxplatform/onyx]]

Kira Sotnikov08:11:02

Did you release new one? simple_smile

lucasbradstreet08:11:27

I’m about to. Main problem is that you’re on 0.7 so I might need to back port it for now

lucasbradstreet08:11:06

K, looks like you’re susceptible to it.

lucasbradstreet08:11:28

I’ll push out a new 0.7 release for you. I wouldn’t recommend you upgrade to 0.8 quite yet, though things are looking pretty good there

Kira Sotnikov08:11:59

lucasbradstreet: do you have not a backward compability 0.8 and 0.7?

lucasbradstreet08:11:16

You mean using metrics 0.8 on 0.7?

Kira Sotnikov08:11:21

So, it's not simple to use 0.8 instead 0.7.10?

lucasbradstreet08:11:39

Well, it’d involve a bigger upgrade and we just released 0.8

lucasbradstreet08:11:58

So you might want to wait a few days before getting onto Onyx 0.8

lucasbradstreet08:11:10

whereas I think you should probably get this metrics fix in ASAP

Kira Sotnikov08:11:39

Let me know when you finish back port simple_smile

lucasbradstreet08:11:43

We’re pretty sure 0.8 is good, but we’re just going to do a full performance test first

Kira Sotnikov08:11:59

lucasbradstreet: thanks!

robert-stuttaford09:11:37

should we upgrade?

robert-stuttaford09:11:54

we have indeed been struggling with memory leak type issues

lucasbradstreet09:11:57

You should upgrade to 0.8.0 soon, but I think you should hold off for a bit

lucasbradstreet09:11:06

So I pushed a new metrics release which fixes this issue

robert-stuttaford09:11:33

is “Try 0.7.10.1” a response to me?

robert-stuttaford09:11:38

also, hello simple_smile hope you had a great weekend!

lucasbradstreet09:11:39

To lowl4tency, and you heh

robert-stuttaford09:11:48

oh. right. that’s both of us, then.

lucasbradstreet09:11:13

Sorry about that issue

robert-stuttaford09:11:21

ok. i’ll push 0.7.10.1 right away. thank you!

robert-stuttaford09:11:35

ok. the fix is live. will keep a beady eye on it today

lucasbradstreet09:11:58

Cool, no worries. I’ve been trying to track this issue down all day

lucasbradstreet09:11:06

My big scale benchmarks were slowing down 😕

lucasbradstreet09:11:35

I was tracking completions on non input tasks, and the timestamps that I was putting in maps never got cleared (because they never got completed, seeing as they were not input tasks)

lucasbradstreet09:11:45

So for every segment we were tracking a new timestamp that never got cleared

lucasbradstreet09:11:16

It actually exhibited itself as Aeron publications ending up closed. So I was looking in the wrong place for ages 😞

lucasbradstreet09:11:32

Of course that was only happening because GCs were taking a long time

robert-stuttaford09:11:13

workload dependent, we’ve had to restart every couple of days or so. just haven’t had the headspace to profile it ourselves. given that we’re shipping a shit-ton of metrics to riemann, my guess is this is why

lucasbradstreet09:11:53

Throughput is the main factor

lucasbradstreet09:11:01

I’m guessing this is your problem though

lucasbradstreet09:11:09

Please let me know if you keep having to restart at all

robert-stuttaford09:11:17

we’re removing the cron job now and will watch it closely

robert-stuttaford09:11:02

yes, i’m pretty sure we suffered it. our input task is given every datomic transaction we have.

Kira Sotnikov09:11:52

lucasbradstreet: thanks a lot for quick fix

lucasbradstreet10:11:02

My bad on the memory leak 👻

yusup12:11:38

15-Nov-16 12:37:46 wh01.c.tunlld-01.internal INFO [onyx.peer.task-lifecycle] - [2158b39b-c4c5-45ff-a79e-52f81fa83e46] Peer chose not to start the task yet. Backing off and retrying... 15-Nov-16 12:37:46 wh01.c.tunlld-01.internal INFO [onyx.peer.task-lifecycle] - [a817f591-2d84-4fba-a015-0837cedc1622] Peer chose not to start the task yet. Backing off and retrying... 15-Nov-16 12:37:46 wh01.c.tunlld-01.internal INFO [onyx.peer.task-lifecycle] - [c76b12c6-a192-4b34-8a09-d75b313bf9e0] Peer chose not to start the task yet. Backing off and retrying... 15-Nov-16 12:37:46 wh01.c.tunlld-01.internal INFO [onyx.peer.task-lifecycle] - [66f084f5-f530-437e-a7f1-aec697838fd8] Peer chose not to start the task yet. Backing off and retrying... 15-Nov-16 12:37:46 wh01.c.tunlld-01.internal INFO [onyx.peer.task-lifecycle] - [4f818531-1ea0-45e6-9c37-cecdbb272441] Peer chose not to start the task yet. Backing off and retrying...

yusup12:11:35

sorry to mention it again. I am having double executed tasks again.

yusup12:11:43

There is no errors.

lucasbradstreet12:11:34

Are any retries logged by onyx-metrics?

yusup12:11:04

I havenot set up onyx-metrics .

lucasbradstreet12:11:20

It’ll look like: 15-Nov-16 18:14:09 lbpro INFO [onyx.lifecycle.metrics.timbre] - Metrics: {:job-id #uuid "5df1edbf-fe57-4905-96ae-0be6e24b0923", :task-id #uuid "97faf3bb-0718-463e-8a4b-e5428f5b6ffc", :task-name :inc1, :peer-id #uuid "496df455-f926-4d00-89de-31051c239d97", :service "[:inc1] 1s_retry-segment-rate", :window "1s", :metric :retry-rate, :value 0.0, :tags ["retry_segment_rate_1s" "onyx" ":inc1" "your-workflow-name"]}

lucasbradstreet12:11:23

If you use timbre logging

lucasbradstreet12:11:39

Sorry, I can’t help you much more until you show that :metric :retry-rate, :value 0.0 is always 0.0

lucasbradstreet12:11:54

Retries are the main reason why you’d be getting double executed tasks

lucasbradstreet12:11:04

I assume you mean that functions are called twice for a given segment?

yusup12:11:23

actually received twice.

yusup12:11:30

by kafka plugin

yusup12:11:02

15-Nov-16 12:37:51 wh02.c.tunlld-01.internal INFO [hyper.onyx.desy] - received message: 15-Nov-16 12:37:51 wh02.c.tunlld-01.internal INFO [hyper.onyx.desy] - received message: 15-Nov-16 12:37:51 wh02.c.tunlld-01.internal INFO [hyper.onyx.desy] - received message: hi 15-Nov-16 12:37:51 wh02.c.tunlld-01.internal INFO [hyper.onyx.desy] - received message: 7 15-Nov-16 12:37:51 wh02.c.tunlld-01.internal INFO [hyper.onyx.desy] - received message: test 15-Nov-16 12:37:51 wh02.c.tunlld-01.internal INFO [hyper.onyx.desy] - received message: hi 15-Nov-16 12:37:51 wh02.c.tunlld-01.internal INFO [hyper.onyx.desy] - received message: new 15-Nov-16 12:37:51 wh02.c.tunlld-01.internal INFO [hyper.onyx.desy] - received message: 7 15-Nov-16 12:37:51 wh02.c.tunlld-01.internal INFO [hyper.onyx.desy] - received message: test 15-Nov-16 12:37:51 wh02.c.tunlld-01.internal INFO [hyper.onyx.desy] - received message: life 15-Nov-16 12:37:51 wh02.c.tunlld-01.internal INFO [hyper.onyx.desy] - received message: new 15-Nov-16 12:37:51 wh02.c.tunlld-01.internal INFO [hyper.onyx.desy] - received message: life 15-Nov-16 12:37:51 wh02.c.tunlld-01.internal INFO [hyper.onyx.functions.sample-functions] - seg: 15-Nov-16 12:37:51 wh02.c.tunlld-01.internal INFO [hyper.onyx.functions.sample-functions] - seg: hi 15-Nov-16 12:37:51 wh02.c.tunlld-01.internal INFO [hyper.onyx.functions.sample-functions] - seg: 7 15-Nov-16 12:37:51 wh02.c.tunlld-01.internal INFO [hyper.onyx.functions.sample-functions] - seg: test 15-Nov-16 12:37:51 wh02.c.tunlld-01.internal INFO [hyper.onyx.functions.sample-functions] - seg: new 15-Nov-16 12:37:51 wh02.c.tunlld-01.internal INFO [hyper.onyx.functions.sample-functions] - seg: life 15-Nov-16 12:37:51 wh02.c.tunlld-01.internal INFO [hyper.onyx.functions.sample-functions] - seg: 15-Nov-16 12:37:51 wh02.c.tunlld-01.internal INFO [hyper.onyx.functions.sample-functions] - seg: hi 15-Nov-16 12:37:51 wh02.c.tunlld-01.internal INFO [hyper.onyx.functions.sample-functions] - seg: 7 15-Nov-16 12:37:51 wh02.c.tunlld-01.internal INFO [hyper.onyx.functions.sample-functions] - seg: test 15-Nov-16 12:37:51 wh02.c.tunlld-01.internal INFO [hyper.onyx.functions.sample-functions] - seg: new 15-Nov-16 12:37:51 wh02.c.tunlld-01.internal INFO [hyper.onyx.functions.sample-functions] - seg: life

yusup12:11:28

hyper.onyx.desy is deserialization function .

yusup12:11:31

there is no ':metric :retry-rate, :value 0.0 is always 0.0’ related log entries. is this a kafka plugin specific setup error?

yusup12:11:30

you mean , I should setup onyx-metrics on onyx-dashboard ?

Kira Sotnikov12:11:08

You should include this library to your code

lucasbradstreet12:11:09

Just setup onyx-metrics with timbre logging so that you can look at your logs and see whether any retries happen

lucasbradstreet12:11:32

You can use the dashboard if you want but for now just using the timbre logging and inspecting your logs will be enough

yusup13:11:09

Could not find artifact org.onyxplatform:onyx-metrics:jar:0.8.0.1 in central (https://repo1.maven.org/maven2/) Could not find artifact org.onyxplatform:onyx-metrics:jar:0.8.0.1 in clojars (https://clojars.org/repo/)

yusup13:11:41

github page needs an update.

lucasbradstreet13:11:43

That’s weird. Try 0.8.0.3

lucasbradstreet13:11:46

Will update the page

lucasbradstreet13:11:50

Must have been a release process issue

lucasbradstreet13:11:58

Should have auto updated

lucasbradstreet13:11:09

Sorry about that

yusup13:11:43

it’s ok . I am the sorry one with endless questions. simple_smile

yusup19:11:26

Metrics: {:job-id #uuid "1f423693-ccd1-4f4e-85c8-baa277a7ff84", :task-id #uuid "a802c18e-7a94-4fba-81c6-f110e4551501", :task-name :my-dent, :peer-id #uuid "c4bc13d3-bbe1-43cb-9d42-165dd9bddf83", :service "[:my-dent] 1s_retry-segment-rate", :window "1s", :metric :retry-rate, :value 0.0, :tags ["retry_segment_rate_1s" "onyx" ":my-dent" "your-workflow-name"]}

yusup19:11:51

and its printing this kind of info quite frequently .

lucasbradstreet19:11:52

Ok that's weird then

lucasbradstreet19:11:14

So my theory was that your messages weren't being acked and we're being retried

lucasbradstreet19:11:06

Yeah I believe it. You really need to look at the point where there's some throughput though

lucasbradstreet19:11:22

Any retries will be slightly before that

yusup19:11:48

you mean not 0?

yusup19:11:15

point where :value not 0?

lucasbradstreet19:11:35

Yeah. So initially throughput will be positive. At the some point later for the second call, throughput will be positive again. I want to know whether retries were positive just prior too

yusup19:11:48

I had this error on one of the nodes, the other one is free of errors..

lucasbradstreet19:11:06

Same error as last time huh

lucasbradstreet19:11:18

Sorry, that's my cue for sleep :/

lucasbradstreet19:11:30

We can work through it tomorrow at an earlier time

yusup19:11:53

what can I do on my part to investigate ?

yusup19:11:06

kinda the same time zone here too.

lucasbradstreet19:11:53

Is your main concern the error above or the multiple function calls?

yusup19:11:55

multiple function calls

yusup19:11:08

and sometimes as you see , it hangs .

yusup19:11:59

I will check my code and logs tonight . we can talk tomorrow when you have time.

lucasbradstreet19:11:55

Ok, so a few things make sense to me. Either you're reading messages and they're not getting acked. If this is happening then you'll get replays after onyx/pending-timeout which is 60s

lucasbradstreet19:11:12

Second alternative is you're submitting jobs multiple times. Potentially using the same group-Id (which is how it commits the checkpoint)

lucasbradstreet19:11:09

Other thing I can think of is that you have multiple results on the input medium (i.e. You're accidentally writing your input messages to your Kafka topic twice)

lucasbradstreet19:11:19

Those are the main things that I think could go wrong

yusup19:11:43

ok. I will go through one by one. Thanks

yusup19:11:17

I think it is probably caused by the second point you mentioned.