2015-11-24
hey @lucasbradstreet i think i may have done something untoward in our timbre setup
nevermind, i see that it’s all in /var/log/syslog now
@robert-stuttaford: ah, may have overwritten the existing configuration rather than merging?
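(For context, the overwrite-vs-merge distinction in timbre looks like this; a minimal sketch, where my-syslog-appender stands in for whatever appender was being added:)

    (require '[taoensso.timbre :as timbre])

    ;; set-config! replaces the whole config map, clobbering any
    ;; appenders that were already registered:
    (timbre/set-config! {:appenders {:syslog my-syslog-appender}})

    ;; merge-config! deep-merges into the existing config instead,
    ;; keeping everything that's already there:
    (timbre/merge-config! {:appenders {:syslog my-syslog-appender}})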
hi @lucasbradstreet, if we see this message a lot: https://github.com/onyx-platform/onyx-metrics/blob/master/src/onyx/lifecycle/metrics/riemann.clj#L30, and also see lots of these:
15-Nov-24 06:23:27 production-HighstormStack-77PF3UB8TROI-i-bba8320b WARN [onyx.lifecycle.metrics.riemann] -
java.lang.Thread.run                                       Thread.java:  745
java.util.concurrent.ThreadPoolExecutor$Worker.run  ThreadPoolExecutor.java:  617
java.util.concurrent.ThreadPoolExecutor.runWorker   ThreadPoolExecutor.java: 1142
java.util.concurrent.FutureTask.run                    FutureTask.java:  266
...
clojure.core/binding-conveyor-fn/fn                           core.clj: 1916
onyx.lifecycle.metrics.riemann/riemann-sender/fn           riemann.clj:   21
onyx.lifecycle.metrics.riemann/riemann-sender/fn/fn        riemann.clj:   24
riemann.client/send-event                                   client.clj:   72
com.aphyr.riemann.client.RiemannClient.sendEvent     RiemannClient.java:  115
com.aphyr.riemann.client.RiemannClient.sendMessage   RiemannClient.java:  110
com.aphyr.riemann.client.TcpTransport.sendMessage     TcpTransport.java:  259
com.aphyr.riemann.client.TcpTransport.sendMessage     TcpTransport.java:  289
java.io.IOException: no channels available
does that mean that our riemann instance is under-provisioned?
(on the logging, it’s quite possible. i reverted to the previous setup for now)
we had a comedy of errors last night. ZK failed because it couldn’t write anything to disk. it was because we filled the disk up with a LOT of these riemann errors
of course, the other problem is that we didn’t actually assign all the ssd hard drive space to the OS, which caused us to hit this problem very quickly
Hmm. Maybe I shouldn't log every one of those.
It should've taken 5s per message to time out
is this an actual error, or something we can ignore? we shouldn’t get these warnings at all, ideally?
we get many of these per second
Ideally you shouldn't get them at all, though I'm sure you'll occasionally hit a retry which is why we resend.
I guess you have one for each task running
So 15 tasks at 5 second timeouts would mean 3 per second
27 peers total, now
6ish then
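(The back-of-envelope behind those numbers: each sender blocks for the 5 s timeout before it can fail again, so the worst-case warning rate is peers divided by the timeout:)

    ;; 27 peers, each timing out at most once per 5s window
    (/ 27 5.0)  ;=> 5.4, i.e. the "6ish" warnings per second above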
is it because riemann is already saturated?
trying to understand where the bottleneck is
It shouldn't be. It's not sending all that many messages
ok. so it’s probably more likely a network issue between the nodes
Or Riemann having problems
ok. is it perhaps possible to squash some of that logging in the next release of -metrics, please?
Yes, I think you're right that it only compounds the problem
how would you change this?
log less often, or catch and log a normal message, or a combination?
It's a little tricky unless we log a periodic message with retry stats instead
that sounds good
The question then: do I send those stats to Riemann? :p
i’m guessing not
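(One possible shape for that coalescing, a sketch rather than the actual onyx-metrics change: count failed sends in a counter and log a single summary line per interval:)

    (require '[taoensso.timbre :as timbre])

    (def timeout-count (java.util.concurrent.atomic.AtomicLong. 0))

    (defn note-timeout!
      "Called wherever a Riemann send times out, instead of logging there."
      []
      (.incrementAndGet timeout-count))

    (defn start-summary-logger!
      "Logs one coalesced warning per second covering all timeouts."
      []
      (future
        (loop []
          (Thread/sleep 1000)
          (let [n (.getAndSet timeout-count 0)]
            (when (pos? n)
              (timbre/warn "Riemann send timeouts in the last 1s:" n)))
          (recur))))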
Is it working better now?
It shouldn't be a load issue because benches show Riemann can handle 100K/sec and we're only pushing out 8ish per second
it is working better now
we’re down to 25 max-pending now
Seems reasonable since segments can take a second+ for some tasks
no shortage of learning here!
Our defaults aren't very appropriate for that
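(For reference, max-pending is set on the input task's catalog entry; a sketch in which everything except :onyx/max-pending is a placeholder:)

    {:onyx/name :in
     :onyx/plugin :my.input.plugin/reader   ;; hypothetical input plugin
     :onyx/type :input
     :onyx/medium :my-medium                ;; placeholder
     :onyx/max-pending 25    ;; at most 25 segments in flight from this input
     :onyx/batch-size 20
     :onyx/doc "Input task with a lowered max-pending"}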
if we have a latency spike on a task with zero throughput, what could that mean?
Most likely is that it isn't 0 throughput
ah, lies, damn lies, and statistics?
12K is probably too high to be caused by a GC pause
Maybe that Riemann stat timed out and was retried later. Not sure
Or the throughput is being rounded off? Would it show 10 instead of XK in that case?
we often see fractions
i’ve got a 0.01k
so this would have been less than 5
bizarre
ok. will stop pestering you until we have done our homework on our code. probably chat later on
Oh, I might have an idea about what it is
I asked you a while back about the number of peers you use per task. The reason is that we weren’t segmenting metrics by the peer id as well as the task name, so two peers could end up outputting the same statistics, which undercounts throughput
It’s fixed in the 0.8 metrics
ahhh. yes. we’re on multiple peers for this task now: 4
For example, if you have two peers on the same task, :in, and both peers output a throughput of 10, then the task will report 10 rather than 20
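(Concretely, the difference might look like this; the event shapes are illustrative, not the exact onyx-metrics schema:)

    ;; Pre-0.8: both peers on :in report under the same service name,
    ;; so the two events collapse and the task shows 10.
    {:service "[:in] throughput" :metric 10}
    {:service "[:in] throughput" :metric 10}

    ;; 0.8: events are segmented by peer id as well, so they can be
    ;; summed per task and the task correctly shows 20. Peer ids here
    ;; are made up.
    {:service "[:in] peer-1 throughput" :metric 10}
    {:service "[:in] peer-2 throughput" :metric 10}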
@robert-stuttaford: timeout logging has been coalesced in the latest metrics
@robert-stuttaford: we’ll be releasing onyx-0.8.1 soon, which might be a good point for you to upgrade everything
how soon? we’re on a hangout talking about upgrading literally right now haha
Next couple days?
I could probably even do so tomorrow if it makes a difference
I'm going to release an alpha today
ok. we’re going to be busy with code for a couple days anyway. i’m happy to do it when you have a release ready
No worries.
found a good way to watch for input latency
the shaded area is the transactor cloudwatch metric for transaction count over time
if we see a significant delay, or significant increase in that delay, we know that our max-pending is too low
Ah that's very cool
we’ve also split out the graphs for the input
Yeah, I think that's a good idea. The input task is the most important to keep an eye on
@lucasbradstreet: on the riemann issue, would increasing :metrics/buffer-capacity from 10k help?
@robert-stuttaford: which issue? Probably not though.
the one where riemann errors gatecrash the onyx log
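(For context, :metrics/buffer-capacity lives in the metrics lifecycle map; in this sketch every key other than :metrics/buffer-capacity is an assumption about the schema of the version in use:)

    {:lifecycle/task :all                                    ;; assumed
     :lifecycle/calls :onyx.lifecycle.metrics.metrics/calls  ;; assumed entry point
     :metrics/buffer-capacity 10000   ;; the 10k default mentioned above
     :lifecycle/doc "Metrics lifecycle with the default buffer capacity"}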
@lucasbradstreet: looking ahead to 0.8.1. the splitting of metrics; does this mean we’d have totally independent metrics per peer, allowing us to graph them separately?
Ah, I coalesced them into a single info message per second with a timeout count
The message doesn’t include the payload now
So it shouldn’t fill up the logs as much, either
yes, you’d have totally independent metrics per peer, and then you can aggregate them by task if you like
that’s awesome!
i’d like to graph them separately to begin with
yeah, it’ll make it easier to make sure the work is being split up
upgrading looks straightforward for us. just need to fix the ports for aeron, and update our datadog dash once we have metrics
Yep, should be.
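(On the Aeron ports, the relevant knobs live in the peer-config; a sketch with placeholder values, assuming the 0.8-era keys:)

    {:onyx.messaging/impl :aeron
     :onyx.messaging/bind-addr "10.0.0.1"          ;; placeholder address
     :onyx.messaging/peer-port-range [40200 40260] ;; ports peers may claim
     :onyx.messaging/peer-ports [40199]}           ;; explicit extra ports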
I released the alpha (and a few other alphas to fix up issues in our automated plugin release process)
props for standardising the changelog locations in each repo
We’ll do a full release tomorrow or the next day
Is there a list of companies using Onyx?
I'm hoping to sell some people on using it, but that's always easier with a pitch along the lines of "look at these other companies that are successfully using it"