#onyx
2015-11-22
robert-stuttaford18:11:04

as you can see, retries went up and then flat-lined, but the rest of the system went into a tailspin. notice chat_links, percent_complete, export_variables numbers; they’re in the millions of milliseconds

robert-stuttaford18:11:10

the time period in that graph is the last two days

lucasbradstreet19:11:02

Both batch latency and complete latency are spiking?

robert-stuttaford19:11:32

the purples are 50% batch and the blues are 90% batch

robert-stuttaford19:11:01

the yellow top right is complete latency, but as you can see from the flat increases, we didn’t actually receive data for those for a long time

robert-stuttaford19:11:13

until the restarts to attempt to fix things

lucasbradstreet19:11:22

Yeah the fact that they're all getting bigger and bigger definitely suggests some kind of memory issue.

lucasbradstreet19:11:53

Ok, I'll give the metrics another look. You're certain you're using onyx-metrics 0.7.10.1?

robert-stuttaford19:11:15

we’ll be implementing your JMC suggestions tomorrow

lucasbradstreet19:11:39

Ok. Hmm. I pushed millions of segments per second through metrics for hours and didn't have any issues, but it didn't hit any retries.

lucasbradstreet19:11:07

Ok, a flight recorder dump will tell us for sure

robert-stuttaford19:11:59

just checked; riemann and zk both totally ok during this time

lucasbradstreet19:11:25

Good to know. Figured Riemann was fine since the stats were coming through

robert-stuttaford19:11:45

we’re going to stress-test the system two ways: as it is now, and with metrics disabled. we’ll assess the transaction log’s timestamps in both cases to see what makes it in and in what mean timespan

robert-stuttaford19:11:13

since input tx timestamp -> tag input tx timestamp gives us a very clear measure
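[Editor's note] A minimal sketch of the comparison described above, assuming each run can be reduced to pairs of (input tx timestamp, tag tx timestamp). The function name `mean_timespan` and all sample values are hypothetical, not taken from the real transaction log.

```python
from datetime import datetime, timedelta
from statistics import mean


def mean_timespan(pairs):
    """Return (count, mean delay in seconds) for (input tx, tag tx) timestamp pairs."""
    spans = [(tag - inp).total_seconds() for inp, tag in pairs]
    return len(spans), mean(spans)


# Hypothetical timestamps for illustration only, not real measurements.
t0 = datetime(2015, 11, 22, 18, 0, 0)
with_metrics = [(t0, t0 + timedelta(seconds=4)), (t0, t0 + timedelta(seconds=6))]
without_metrics = [(t0, t0 + timedelta(seconds=1)), (t0, t0 + timedelta(seconds=3))]

# How many transactions made it in, and the mean timespan, for each run.
print("with metrics:", mean_timespan(with_metrics))
print("without metrics:", mean_timespan(without_metrics))
```

Running it once per stress test gives the "what makes it in and in what mean timespan" numbers side by side.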

robert-stuttaford19:11:17

forgive me, but i can’t seem to locate the JMC settings you recommended. could I ask you to share them again?

lucasbradstreet19:11:29

Good idea. The flight recorder files will also make it very clear whether it's a memory leak problem. You should start seeing GCs take longer and longer in mission control
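[Editor's note] One rough way to put a number on "longer and longer", assuming the GC pause durations have already been read off the recording by hand. The helper `gc_pause_slope` and the sample values are hypothetical; this is a sketch, not a Mission Control feature.

```python
def gc_pause_slope(pauses_ms):
    """Least-squares slope of pause duration (ms) against collection index."""
    n = len(pauses_ms)
    if n < 2:
        raise ValueError("need at least two GC pauses")
    mean_x = (n - 1) / 2
    mean_y = sum(pauses_ms) / n
    cov = sum((i - mean_x) * (p - mean_y) for i, p in enumerate(pauses_ms))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


# Hypothetical pause durations in milliseconds, for illustration only.
steady = [40, 42, 39, 41, 40]
growing = [40, 80, 120, 160, 200]

print(gc_pause_slope(steady))   # close to zero: pauses are not trending
print(gc_pause_slope(growing))  # clearly positive: each GC takes longer than the last
```

A slope that stays clearly positive over a long recording would be consistent with the memory leak theory; a slope near zero would point elsewhere.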

lucasbradstreet19:11:41

Sure. One moment

lucasbradstreet19:11:52

"-XX:+UnlockCommercialFeatures" "-XX:+FlightRecorder" "-XX:+UnlockDiagnosticVMOptions" "-XX:StartFlightRecording=duration=1080s,filename=localrecording.jfr"

lucasbradstreet19:11:08

You could choose a longer duration
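[Editor's note] A small sketch of the same four options with the duration made configurable. The helper `jfr_options` is hypothetical; the flags themselves are exactly the ones quoted above, with only the `duration` value changed.

```python
def jfr_options(duration_seconds, filename="localrecording.jfr"):
    """Build the JVM options quoted above with a caller-chosen recording duration."""
    return [
        "-XX:+UnlockCommercialFeatures",
        "-XX:+FlightRecorder",
        "-XX:+UnlockDiagnosticVMOptions",
        f"-XX:StartFlightRecording=duration={duration_seconds}s,filename={filename}",
    ]


# Example: record for 24 hours instead of 1080 seconds (an arbitrary choice).
for opt in jfr_options(24 * 60 * 60):
    print(opt)
```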

robert-stuttaford19:11:07

so, it looks like we’ve encountered a new situation, because it held for a full business week this time

robert-stuttaford19:11:20

whereas previously we were only managing 48 hrs or so

lucasbradstreet19:11:30

Yeah, I suspect we fixed the main issue but there’s more

robert-stuttaford19:11:56

ok. i don’t know how quickly we’ll have actionable intel for you, but we’re going to put someone on it this week until we do

lucasbradstreet19:11:48

There’s also another setting to give you more heap statistics, but I don’t know how to set up the profile and run it on an external machine. See http://stackoverflow.com/questions/19056826/java-mission-control-heap-profile

lucasbradstreet19:11:50

I’ve used it before

robert-stuttaford19:11:52

and we’ll go back to daily restarts until we’ve identified the issue

lucasbradstreet19:11:08

I’ll see if I can trigger it with metrics locally

robert-stuttaford19:11:18

would you need us to do what you’ve just linked?

lucasbradstreet19:11:41

It’d help, but it can probably wait until we confirm it’s a memory leak issue, which would just need the other settings

robert-stuttaford19:11:51

ok. we’ll start with the first set

lucasbradstreet19:11:58

The other recording will at least tell us that it’s taking longer and longer to GC

robert-stuttaford19:11:24

thanks Lucas. more info soon :)

lucasbradstreet19:11:34

No problem. We'll figure it out. I can't see anything immediate in metrics. It may be more likely that the new issue is in the datomic input plugin at this stage. We'll see though.

lucasbradstreet19:11:22

Just to check, no exceptions were logged?