2015-11-22
as you can see, retries went up and then flat-lined, but the rest of the system went into a tailspin. notice the chat_links, percent_complete, export_variables numbers; they’re in the millions of milliseconds
the time period in that graph is the last two days
Both batch latency and complete latency are spiking?
the purples are 50th-percentile batch latency and the blues are 90th
the yellow top right is complete latency, but as you can see from the flat increases, we didn’t actually receive data for those for a long time
until the restarts to attempt to fix things
Yeah the fact that they're all getting bigger and bigger definitely suggests some kind of memory issue.
Ok, I'll give the metrics another look. You're certain you're using onyx-metrics 0.7.10.1?
absolutely
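(For reference, a minimal project.clj sketch for pinning that version; the org.onyxplatform group id is assumed from Onyx's usual Maven coordinates, not confirmed in this thread:)
```
;; project.clj excerpt (sketch): pin onyx-metrics explicitly so the
;; deployed version is unambiguous. Group id assumed, not confirmed.
:dependencies [[org.onyxplatform/onyx-metrics "0.7.10.1"]]
```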
we’ll be implementing your JMC suggestions tomorrow
Ok. Hmm. I pushed millions of segments per second through metrics for hours and didn't have any issues, but it didn't hit any retries.
Ok, a flight recorder dump will tell us for sure
just checked; riemann and zk both totally ok during this time
Good to know. Figured Riemann was fine since the stats were coming through
we’re going to stress-test the system two ways: as it is now, and with metrics disabled. we’ll assess the transaction log’s timestamps in both cases to see what makes it in and what the mean timespan is
since input tx timestamp -> tag input tx timestamp gives us a very clear measure
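(A rough Clojure sketch of that measurement, assuming Datomic; how input and tag transactions get paired up is schema-specific and hypothetical here:)
```
(require '[datomic.api :as d])

;; Sketch: mean millis between each input tx and the tx that tagged it,
;; read off the :db/txInstant recorded on the transaction entities.
;; `pairs` is a seq of [input-tx-eid tag-tx-eid]; producing it depends
;; on our schema and is left out.
(defn tx-millis [db tx-eid]
  (.getTime ^java.util.Date (:db/txInstant (d/entity db tx-eid))))

(defn mean-latency-ms [db pairs]
  (let [deltas (for [[in tag] pairs]
                 (- (tx-millis db tag) (tx-millis db in)))]
    (double (/ (reduce + 0 deltas) (max 1 (count deltas))))))
```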
forgive me, but i can’t seem to locate the JMC settings you recommended. could I ask you to share them again?
Good idea. The flight recorder files will also make it very clear whether it's a memory leak problem. You should start seeing GCs take longer and longer in mission control
Sure. One moment
"-XX:+UnlockCommercialFeatures" "-XX:+FlightRecorder" "-XX:+UnlockDiagnosticVMOptions" "-XX:StartFlightRecording=duration=1080s,filename=localrecording.jfr"
You could choose a longer duration
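(Putting those same flags into project.clj with a longer window might look like this sketch; the 4h figure is arbitrary:)
```
;; project.clj excerpt (sketch): same flags as above, longer recording.
:jvm-opts ["-XX:+UnlockCommercialFeatures"
           "-XX:+FlightRecorder"
           "-XX:+UnlockDiagnosticVMOptions"
           ;; 1080s above; 4h here to cover a longer failure window
           "-XX:StartFlightRecording=duration=4h,filename=localrecording.jfr"]
```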
so, it looks like we’ve encountered a new situation, because it held for a full business week this time
whereas previously we were only managing 48 hrs or so
Yeah, I suspect we fixed the main issue but there’s more
ok. i don’t know how quickly we’ll have actionable intel for you, but we’re going to put someone on it this week until we do
There’s also another setting to give you more heap statistics, but I don’t know how to set up the profile and run it on an external machine; see http://stackoverflow.com/questions/19056826/java-mission-control-heap-profile
I’ve used it before
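(A hedged sketch of that variant: the JDK's built-in `profile` template adds allocation detail, while full heap statistics may need a custom .jfc exported from JMC, as the link describes; `myheap.jfc` below is a hypothetical path:)
```
;; project.clj excerpt (sketch): record against a settings template.
:jvm-opts ["-XX:+UnlockCommercialFeatures"
           "-XX:+FlightRecorder"
           ;; settings=profile ships with the JDK; swap in a custom
           ;; template (e.g. settings=/path/to/myheap.jfc) for Heap Statistics
           "-XX:StartFlightRecording=duration=4h,settings=profile,filename=heap.jfr"]
```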
and we’ll go back to daily restarts until we’ve identified the issue
No worries
I’ll see if I can trigger it with metrics locally
would you need us to do what you’ve just linked?
It’d help, but it can probably wait until we confirm it’s a memory leak issue, which would just need the other settings
ok. we’ll start with the first set
The other recording will at least tell us that it’s taking longer and longer to GC
thanks Lucas. more info soon
No problem. We'll figure it out. I can't see anything immediate in metrics. It may be more likely that the new issue is in the datomic input plugin at this stage. We'll see though.
Just to check, no exceptions were logged?