2015-11-24
hey @lucasbradstreet i think i may have done something untoward in our timbre setup
nevermind, i see that it’s all in /var/log/syslog now
@robert-stuttaford: ah, may have overwritten the existing configuration rather than merging?
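(For context, the overwrite-vs-merge distinction in timbre looks like this; a minimal sketch, where my-syslog-appender stands in for whatever appender was being added:)

    (require '[taoensso.timbre :as timbre])

    ;; set-config! replaces the whole config map, clobbering any
    ;; appenders that were already registered:
    (timbre/set-config! {:appenders {:syslog my-syslog-appender}})

    ;; merge-config! deep-merges into the existing config instead,
    ;; keeping everything that's already there:
    (timbre/merge-config! {:appenders {:syslog my-syslog-appender}})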
hi @lucasbradstreet, if we see this message a lot: https://github.com/onyx-platform/onyx-metrics/blob/master/src/onyx/lifecycle/metrics/riemann.clj#L30, and also see lots of these:
15-Nov-24 06:23:27 production-HighstormStack-77PF3UB8TROI-i-bba8320b WARN [onyx.lifecycle.metrics.riemann] -
java.lang.Thread.run                                       Thread.java:  745
java.util.concurrent.ThreadPoolExecutor$Worker.run  ThreadPoolExecutor.java:  617
java.util.concurrent.ThreadPoolExecutor.runWorker   ThreadPoolExecutor.java: 1142
java.util.concurrent.FutureTask.run                    FutureTask.java:  266
...
clojure.core/binding-conveyor-fn/fn                           core.clj: 1916
onyx.lifecycle.metrics.riemann/riemann-sender/fn           riemann.clj:   21
onyx.lifecycle.metrics.riemann/riemann-sender/fn/fn        riemann.clj:   24
riemann.client/send-event                                   client.clj:   72
com.aphyr.riemann.client.RiemannClient.sendEvent     RiemannClient.java:  115
com.aphyr.riemann.client.RiemannClient.sendMessage   RiemannClient.java:  110
com.aphyr.riemann.client.TcpTransport.sendMessage     TcpTransport.java:  259
com.aphyr.riemann.client.TcpTransport.sendMessage     TcpTransport.java:  289
java.io.IOException: no channels available
does that mean that our riemann instance is under-provisioned?
(on the logging, it’s quite possible. i reverted to the previous setup for now)
we had a comedy of errors last night. ZK failed because it couldn’t write anything to disk. it was because we filled the disk up with a LOT of these riemann errors
of course, the other problem is that we didn’t actually assign all the ssd hard drive space to the OS, which caused us to hit this problem very quickly
Hmm. Maybe I shouldn't log every one of those.
It should've taken 5s per message to time out
is this an actual error, or something we can ignore? we shouldn’t get these warnings at all, ideally?
we get many of these per second
Ideally you shouldn't get them at all, though I'm sure you'll occasionally hit a retry which is why we resend.
I guess you have one for each task running
So 15 tasks at 5 second timeouts would mean 3 per second
27 peers total, now
6ish then
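(The back-of-envelope behind those numbers: each sender blocks for the 5 s timeout before it can fail again, so the worst-case warning rate is peers divided by the timeout:)

    ;; 27 peers, each timing out at most once per 5s window
    (/ 27 5.0)  ;=> 5.4, i.e. the "6ish" warnings per second above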
is it because riemann is already saturated?
trying to understand where the bottleneck is
It shouldn't be. It's not sending all that many messages
ok. so it’s probably more likely a network issue between the nodes
Or Riemann having problems
ok. is it perhaps possible to squash some of that logging in the next release of -metrics, please?
Yes, I think you're right that it only compounds the problem
how would you change this?
log less often, or catch and log a normal message, or a combination?
It's a little tricky unless we log a periodic message with retry stats instead
that sounds good
The question then: do I send those stats to Riemann? :p
i’m guessing not
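(One possible shape for that coalescing, a sketch rather than the actual onyx-metrics change: count failed sends in a counter and log a single summary line per interval:)

    (require '[taoensso.timbre :as timbre])

    (def timeout-count (java.util.concurrent.atomic.AtomicLong. 0))

    (defn note-timeout!
      "Called wherever a Riemann send times out, instead of logging there."
      []
      (.incrementAndGet timeout-count))

    (defn start-summary-logger!
      "Logs one coalesced warning per second covering all timeouts."
      []
      (future
        (loop []
          (Thread/sleep 1000)
          (let [n (.getAndSet timeout-count 0)]
            (when (pos? n)
              (timbre/warn "Riemann send timeouts in the last 1s:" n)))
          (recur))))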
Is it working better now?
It shouldn't be a load issue because benches show Riemann can handle 100K/sec and we're only pushing out 8ish per second
it is working better now
we’re down to 25 max-pending now
Seems reasonable since segments can take a second+ for some tasks
no shortage of learning here!
Our defaults aren't very appropriate for that
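(For reference, max-pending is set on the input task's catalog entry; a sketch in which everything except :onyx/max-pending is a placeholder:)

    {:onyx/name :in
     :onyx/plugin :my.input.plugin/reader   ;; hypothetical input plugin
     :onyx/type :input
     :onyx/medium :my-medium                ;; placeholder
     :onyx/max-pending 25    ;; at most 25 segments in flight from this input
     :onyx/batch-size 20
     :onyx/doc "Input task with a lowered max-pending"}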
if we have a latency spike on a task with zero throughput, what could that mean?
Most likely is that it isn't 0 throughput
ah, lies, damn lies, and statistics?
12K is probably too high to be caused by a GC pause
Maybe that Riemann stat timed out and was retried later. Not sure
Or the throughput is being rounded off? Would it show 10 instead of XK in that case?
we often see fractions
i’ve got a 0.01k
so this would have been less than 5
bizarre
ok. will stop pestering you until we have done our homework on our code. probably chat later on
Oh, I might have an idea about what it is
I asked you a while back about the number of peers you use per task. The reason is that we weren’t segmenting metrics by the peer id as well as the task name, so two peers could end up outputting the same statistics, which undercounts throughput
It’s fixed in the 0.8 metrics
ahhh. yes. we’re on multiple peers for this task now: 4
For example, if you have two peers on the same task, :in, and both peers output a throughput of 10, then the task will report 10 rather than 20
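(Concretely, the difference might look like this; the event shapes are illustrative, not the exact onyx-metrics schema:)

    ;; Pre-0.8: both peers on :in report under the same service name,
    ;; so the two events collapse and the task shows 10.
    {:service "[:in] throughput" :metric 10}
    {:service "[:in] throughput" :metric 10}

    ;; 0.8: events are segmented by peer id as well, so they can be
    ;; summed per task and the task correctly shows 20. Peer ids here
    ;; are made up.
    {:service "[:in] peer-1 throughput" :metric 10}
    {:service "[:in] peer-2 throughput" :metric 10}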
@robert-stuttaford: timeout logging has been coalesced in the latest metrics
@robert-stuttaford: we’ll be releasing onyx-0.8.1 soon, which might be a good point for you to upgrade everything
how soon? we’re on a hangout talking about upgrading literally right now haha
Next couple days?
I could probably even do so tomorrow if it makes a difference
I'm going to release an alpha today
ok. we’re going to be busy with code for a couple days anyway. i’m happy to do it when you have a release ready
No worries.
found a good way to watch for input latency
the shaded area is the transactor cloudwatch metric for transaction count over time
if we see a significant delay, or significant increase in that delay, we know that our max-pending is too low
Ah that's very cool
we’ve also split out the graphs for the input
Yeah, I think that's a good idea. The input task is the most important to keep an eye on
@lucasbradstreet: on the riemann issue, would increasing :metrics/buffer-capacity from 10k help?
@robert-stuttaford: which issue? Probably not though.
the one where riemann errors gatecrash the onyx log
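(For context, :metrics/buffer-capacity lives in the metrics lifecycle map; in this sketch every key other than :metrics/buffer-capacity is an assumption about the schema of the version in use:)

    {:lifecycle/task :all                                    ;; assumed
     :lifecycle/calls :onyx.lifecycle.metrics.metrics/calls  ;; assumed entry point
     :metrics/buffer-capacity 10000   ;; the 10k default mentioned above
     :lifecycle/doc "Metrics lifecycle with the default buffer capacity"}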
@lucasbradstreet: looking ahead to 0.8.1. the splitting of metrics; does this mean we’d have totally independent metrics per peer, allowing us to graph them separately?
Ah, I coalesced them into a single info message per second with a timeout count
The message doesn’t include the payload now
So it shouldn’t fill up the logs as much, either
yes, you’d have totally independent metrics per peer, and then you can aggregate them by task if you like
that’s awesome!
i’d like to graph them separately to begin with
yeah, it’ll make it easier to make sure the work is being split up
upgrading looks straightforward for us. just need to fix the ports for aeron, and update our datadog dash once we have metrics
Yep, should be.
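(On the Aeron ports, the relevant knobs live in the peer-config; a sketch with placeholder values, assuming the 0.8-era keys:)

    {:onyx.messaging/impl :aeron
     :onyx.messaging/bind-addr "10.0.0.1"          ;; placeholder address
     :onyx.messaging/peer-port-range [40200 40260] ;; ports peers may claim
     :onyx.messaging/peer-ports [40199]}           ;; explicit extra ports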
I released the alpha (and a few other alphas to fix up issues in our automated plugin release process)
props for standardising the changelog locations in each repo
We’ll do a full release tomorrow or the next day
Is there a list of companies using Onyx?
I'm hoping to sell some people on using it, but that's always easier with a pitch along the lines of "look at these other companies that are successfully using it"