#onyx
2017-10-28
Travis19:10:18

@lucasbradstreet I have some flight recordings. Anything I should be looking for?

Travis19:10:43

I took one of the peer and one of the media driver (separate runs), up until the job was killed

lucasbradstreet19:10:09

the first thing I’d have a look at is the memory tab, under GC times

lucasbradstreet19:10:45

I didn’t realise that your job is actually being killed. That suggests you might need a handle-exception lifecycle to stop it from being killed when it throws an exception (depending on the exception)
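
A minimal sketch of such a handle-exception lifecycle (the namespace and names below are illustrative, not from this conversation):

```clojure
(ns my-app.lifecycles)  ;; hypothetical namespace, for illustration only

;; :lifecycle/handle-exception receives the event map, the lifecycle map, the
;; lifecycle phase where the exception was thrown, and the exception itself.
;; Returning :restart reboots the task/peer instead of killing the whole job;
;; :defer passes the decision to the next lifecycle, and :kill kills the job.
(def handle-exception-calls
  {:lifecycle/handle-exception
   (fn [event lifecycle lifecycle-name exception]
     ;; As noted above, the right return value depends on the exception.
     :restart)})

;; Attach the calls to every task via the job's lifecycles vector:
(def lifecycles
  [{:lifecycle/task :all
    :lifecycle/calls :my-app.lifecycles/handle-exception-calls}])
```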

lucasbradstreet19:10:03

Any idea what the exception was in this case?

Travis20:10:56

the one we keep seeing is the media driver timeout, and this one: onyx.messaging.aeron.utils:76] - Error stopping publication io.aeron.exceptions.RegistrationException: Unknown publication: 73

lucasbradstreet20:10:16

Ah right, still that one. That’s still a more fundamental issue

lucasbradstreet20:10:23

OK, let’s get into those JFRs then

Travis20:10:50

peer or driver?

lucasbradstreet20:10:52

let’s do peer first

lucasbradstreet20:10:20

if you want to send them to me we can step through them

Travis20:10:32

sure, what’s the best way to send them to you?

lucasbradstreet20:10:47

you can PM the files to me

lucasbradstreet20:10:51

there are a couple of largish pauses, but nothing too brutal

Travis20:10:57

yeah, that’s what I was thinking as well

lucasbradstreet20:10:42

Is it possible that it’s only getting 1GB of heap?

lucasbradstreet20:10:24

If you switch to the memory tab and look at the green line

Travis20:10:51

I wouldn’t rule anything out but that is interesting

lucasbradstreet20:10:26

yeah, the maximum heap size concerns me

Travis20:10:30

I have been attempting to run with the new Cgroup memory option

lucasbradstreet20:10:45

Ah, so it should adjust based on how much memory the container gets
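
For reference, the “new Cgroup memory option” on JDK 8u131+ usually means the experimental UseCGroupMemoryLimitForHeap flag. A sketch of passing it to the peer JVM via Leiningen, assuming the peers are launched that way (the project name and versions below are assumptions, not from this conversation):

```clojure
;; project.clj sketch: passing the experimental JDK 8 cgroup flags to the peer JVM
(defproject peer-app "0.1.0-SNAPSHOT"            ;; hypothetical project name
  :dependencies [[org.clojure/clojure "1.8.0"]]  ;; version is an assumption
  :jvm-opts ["-XX:+UnlockExperimentalVMOptions"  ;; required to enable the flag below
             "-XX:+UseCGroupMemoryLimitForHeap"  ;; size the heap from the container's cgroup limit
             "-XX:MaxRAMFraction=2"])            ;; allow up to half the container memory for the heap
```

With these flags the maximum heap is derived from the container’s memory limit rather than the host’s RAM, which is why a too-small container limit can show up as a ~1GB max heap in the flight recording.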

lucasbradstreet20:10:02

I’m not sure how that would change what you see in flight recording then

Travis20:10:37

and I am deploying with these kube settings on the Peer

Travis20:10:44

resources:
  limits:
    memory: 3106M
    cpu: "1"
  requests:
    memory: 1500M
    cpu: "1"

Travis20:10:49

been playing with the numbers

lucasbradstreet20:10:52

that’s my best guess based on what I’m seeing. The pauses don’t seem big enough to cause it though

lucasbradstreet20:10:05

there’s not that much happening cpu wise

lucasbradstreet20:10:09

if you could increase the RAM and CPUs they get to test, it might be worth it. It’s not being given a lot of resources, so it might need some tuning to work under these settings

Travis20:10:51

Cool, it’s definitely a place to start. I have had a more complex version of this job running fine before in Mesos and on a VM on 0.9.x, so I would imagine it has something to do with this