Fork me on GitHub
#onyx
<
2015-11-30
>
robert-stuttaford09:11:27

@lucasbradstreet: curious - is a metric that shows the current pending segment count available? if not, is it possible?

robert-stuttaford09:11:55

that’s the gas pedal we’re using to control whether we retry or not. we’ve found a number of ways to optimise our code, but at some point, we’re going to reach the scaling limits of our system again. thinking of ways to predict outages rather than just respond to them after-the-fact, and i’m thinking that monitoring the pending count might help

lucasbradstreet09:11:28

I’d really like to do that as part of metrics. The main problem that I can see is that it’ll depend on all the input plugins putting the pending-messages map in the same place

lucasbradstreet09:11:56

Actually, I think I have a better idea

lucasbradstreet09:11:13

We just use a counter that is increased as we read a batch, and decremented as we ack and retry

lucasbradstreet09:11:10

I’ve been wanting to do this for a long time

lucasbradstreet10:11:08

Something else you might want to do is monitor how many txes the log reader is behind the current basis-t

lucasbradstreet10:11:03

I would suggest doing that as a measure of txes rather than time because you’d need perfect clock sync

lucasbradstreet10:11:41

Tho I guess you could look up the time of your current tx and compare it to the latest tx, then both timestamps would be transactor timestamps

lucasbradstreet10:11:51

@robert-stuttaford: I'll implement a pending count and I'd accept a patch for the latter

lucasbradstreet10:11:16

@robert-stuttaford: how’s the 0.8.0 conversion going? Are you in prod yet?

robert-stuttaford10:11:54

yes, i want to monitor that time gap as well. we’ve been co-plotting read-log throughput with cloudwatch transaction sum, but they rarely correlate at all

robert-stuttaford10:11:49

0.8.2 is looking good. we’re still in stress-testing. hope to be shipping the new code tomorrow

robert-stuttaford10:11:23

really just want to establish that the new code is indeed faster, and by how much

robert-stuttaford10:11:42

(i’m 100% certain it is faster, but not at all sure by how much)

lucasbradstreet10:11:02

No worries, just not trying to rush you at all. It’s just good to know when a version is in real use.

lucasbradstreet10:11:15

I’ll try to get the pending count done tonight so you can try to include it in your next release

lucasbradstreet10:11:58

assuming the technique I mentioned will work

robert-stuttaford10:11:12

the inc-when-read, dec-when-ack approach is good

robert-stuttaford10:11:21

we need to look into how we might do the second one

lucasbradstreet10:11:03

There are a few different ways. Happy to chat with you about them when you start thinking about doing it

robert-stuttaford10:11:30

the read-log plugin would have to do it, right. if, on reading a batch from the tx-log, it could log a metric of how many txes there are beyond it’s ‘to’ value, we could monitor that count

robert-stuttaford10:11:02

e.g. 25 new txes, it reads 15, the count is 10

lucasbradstreet10:11:39

I’m thinking what we do is track it via the commit loop, which will track the highest completely acked tx

lucasbradstreet10:11:43

Which we already do

lucasbradstreet10:11:32

Then periodically (in another thread or something), we get the basis-t from the db and compare it to the highest completely acked tx, and output a metric to the channel used by the metrics plugin

lucasbradstreet10:11:46

You could do both highest acked and highest read if you want

lucasbradstreet10:11:12

(- latest-basis-t highest-acked) is a measure of how far you’ve completed

lucasbradstreet10:11:34

(- latest-basis-t highest-read) is a measure of how far you’ve started

robert-stuttaford10:11:29

i’d be totally ok with just the first number - how far we’ve completed

robert-stuttaford10:11:46

as the ones in ‘the middle’ would be covered by the pending count we’ve already discussed

lucasbradstreet16:11:13

@robert-stuttaford: onyx-metrics 0.8.2.4 has been released with support for pending message count

robert-stuttaford16:11:17

looks like we have some dashboard changes too - . > _

lucasbradstreet16:11:29

Yep, that's right

robert-stuttaford16:11:53

@lucasbradstreet: i’d like you and @michaeldrogalis to consider adding bounded retry

robert-stuttaford16:11:08

i worry that we’re always going to be on the back foot, the way it’s set up now

robert-stuttaford16:11:01

is it possible to do so in a manner that makes it optional?

robert-stuttaford16:11:25

the reason why is whenever we encounter retries, we end up having to restart and lower our max-pending

robert-stuttaford16:11:51

and, in that situation, the process almost never exits gracefully

robert-stuttaford16:11:12

meaning a kill -9 and then a manual cleanup to find and re-process the stuff that was busy

lucasbradstreet16:11:13

I think the best way to do it would be via a lifecycle call that’s similar to the restart fn call. Then you can decide whether to keep it, and if not write it out somewhere

robert-stuttaford16:11:33

that’s great, if we can say that it happens after N retries

lucasbradstreet16:11:48

Yeah, you’d code some kind of monitoring into it

lucasbradstreet16:11:58

It’d be up to you though

lucasbradstreet16:11:04

I’m also not sold on the idea yet

robert-stuttaford16:11:05

when we have retries, our cpu is totally hosed (understandably) but it never really recovers

lucasbradstreet16:11:15

You should probably increase your pending-timeout

lucasbradstreet16:11:36

If you’re just getting data that is going unprocessed because it’s hard, you have a pending-timeout problem

lucasbradstreet16:11:55

(or just a general performance problem)

robert-stuttaford17:11:14

here’s the thing though, it feels like a couple retries immediately compounds the problem

lucasbradstreet17:11:33

Yeah, I understand

robert-stuttaford17:11:33

that is, the CPU is ok until it starts retrying, and then very quickly it’s poked

robert-stuttaford17:11:38

know what i mean?

lucasbradstreet17:11:45

because the pipeline starts backing up and up and up

lucasbradstreet17:11:04

That probably means that you can barely handle even the max-pending you have now

robert-stuttaford17:11:06

look at how sharply these mountains form

robert-stuttaford17:11:20

that’s a 24hr time period

robert-stuttaford17:11:34

yeah. i’ll agree to that.

lucasbradstreet17:11:49

OK, here’s how we should look at it

lucasbradstreet17:11:54

You’re basically using core.async messaging

lucasbradstreet17:11:58

Since you’re on a single box

robert-stuttaford17:11:10

so the pending count will help a lot. because if it approaches max-pending at all, we’re headed for trouble

lucasbradstreet17:11:29

Which means that your only source of retries is that you’re going too slow for all the messages that are currently being processed

robert-stuttaford17:11:55

ok, that makes sense

lucasbradstreet17:11:55

(because retries are mostly because you could lose messages along the way, and you’re not because you have a perfect medium)

lucasbradstreet17:11:23

So that means max-pending is probably too high, or pending-timeout is probably too low, or some mixture of the two

lucasbradstreet17:11:01

For example, if you increase max-pending, you probably need to increase pending-timeout to give all the segments more time because there’s going to be more queuing

robert-stuttaford17:11:11

the big mountain was at 250, the small mountain at 50 max-pending

robert-stuttaford17:11:35

ok. to increase max-timeout we also need to increase ack-timeout

lucasbradstreet17:11:38

If you increase pending-timeout to a big enough figure, you won’t get a retry storm

lucasbradstreet17:11:56

but you will get things starting to back up if your tasks are slow

lucasbradstreet17:11:39

We increased the default ack-daemon-timeout to 480 seconds in 0.8.0

lucasbradstreet17:11:52

it was dumb to have it equivalent to the pending timeout as it gives you no room at run time

robert-stuttaford17:11:27

ah, i was just about to ask

robert-stuttaford17:11:28

:onyx.messaging/ack-daemon-timeout

michaeldrogalis17:11:24

@robert-stuttaford: A lot of the pain you're experiencing comes from the fact that your cluster is 1 machine. Normally I'd say add more machines to off-set the load. I know why that's currently a problem, but that's the source of a lot of the friction that's coming up.

robert-stuttaford17:11:26

ok. i think if we can alert on pending reaching some % of max-pending .. e.g. 60% or 80%, once we’ve established sensible workaday performance baselines, that’ll help a lot

lucasbradstreet17:11:47

The other thing you need to consider is what part of your job is realtime and which part is not

lucasbradstreet17:11:13

For example, your slow links calculations, could maybe be persisted and calculated in another job with different constraints

lucasbradstreet17:11:33

thus decoupling the hard work from the realtime user connected work

robert-stuttaford17:11:39

hey Michael! yeah. the constraint that created that situation has been lifted. we just need to get to stability and then we’ll start work on going multiple instance

lucasbradstreet17:11:49

Yeah, they bumped their machine simple_smile

robert-stuttaford17:11:51

we’ve actually factored most of the work in the links task out

michaeldrogalis17:11:11

Oh nice. Did you end up increasing your Datomic peer limit then?

robert-stuttaford17:11:15

we did bump the machine, but we also have more Datomic peers now

lucasbradstreet17:11:09

@robert-stuttaford: "we’ve actually factored most of the work in the links task out”. Do you see the issues in this code?

robert-stuttaford17:11:10

we have two tasks (move scheduler to separate app, externalise aeron) and i have one question: if we multi-peer a task, there’s no ordering guarantees right?

robert-stuttaford17:11:24

yes. one ridiculous one, and a couple subtle ones

lucasbradstreet17:11:24

There’s no ordering guarantees regardless

robert-stuttaford17:11:37

ok. i think we need to take advantage of windows and triggers

lucasbradstreet17:11:42

sorry, I mean do you see the retry issue in that code?

lucasbradstreet17:11:51

Once you’ve decoupled it, improved it

robert-stuttaford17:11:08

oh, no, not at all

lucasbradstreet17:11:26

OK, so I guess what you’re asking for is a way for things to degrade more gracefully

lucasbradstreet17:11:31

Until you can fix the underlying issues?

robert-stuttaford17:11:43

the new code is tearing through the same work much faster than the current prod. all this discussion is in preparation for next time when we hit our next scaling issue

lucasbradstreet17:11:45

I think most of the issues you’ve hit are underlying issues

lucasbradstreet17:11:58

However, I do understand why you’d want a way for things to degrade more gracefully

michaeldrogalis17:11:02

We can work on better backpressure strategies, but we're definitely not going to do bounded retry. That has to come at the user level. Onyx guarantees that it's going to process the data corpus to completion - completion being defined by your app.

lucasbradstreet17:11:32

Yes, I forgot to mention that we have a new backpressure mechanism in the works

lucasbradstreet17:11:08

The old one only really worked with high throughput workloads

robert-stuttaford17:11:15

that’s totally cool, Michael, if i have a way to decide what to do at each retry - that i can know it’s a retry

lucasbradstreet17:11:20

You have a low throughput, high CPU usage workload

michaeldrogalis17:11:28

I get what you need though, @robert-stuttaford. It's going to be a combination of continued tuning, understanding the fault tolerance model, and us having a few more helpful options.

michaeldrogalis17:11:45

@robert-stuttaford: That's sensible. I'll see what we can do there.

lucasbradstreet17:11:55

Yeah, we’ll think about it some more

robert-stuttaford17:11:00

btw, aside from the mild panic, this is actually supremely fascinating

robert-stuttaford17:11:15

my respect for you and all the people who build such systems grows every day

michaeldrogalis17:11:36

Hah. Yeah, tuning is hard. I've never had an easy time doing this as the user of any system.

robert-stuttaford17:11:57

35 degrees celcius here 😓

michaeldrogalis17:11:07

Well, feel free to let us know if you can think of any more helpful metrics. I'm almost always in favor of adding more visibility.

robert-stuttaford17:11:46

cool! the one Lucas added earlier should help a crapton

lucasbradstreet17:11:59

I think the new planned backpressure mechanism will sort probably help your case too

lucasbradstreet17:11:00

We currently use the same one as Twitter's Heron but I think it’s very particular to certain workloads and requires tuning

michaeldrogalis17:11:15

@robert-stuttaford: If I were in your shoes, I would use a CloudWatch Alarm to spin up more machines for Onyx when pending-segments-count goes over a certain threshold.

lucasbradstreet17:11:57

Yeah, something like that is the end game

michaeldrogalis17:11:51

What exactly sets off the retries for you guys? Slower than expected Datomic transactions?

robert-stuttaford17:11:20

slower than expected queries that create transactions

robert-stuttaford17:11:27

once the txes reach the :output it’s fast

robert-stuttaford17:11:39

queries -> magic -> transaction data

lucasbradstreet17:11:52

Then once the queries were slow they’d end up retrying, which ends up with bigger batches

michaeldrogalis17:11:55

Ah. Yup, the answer is scaling that specific piece when things get bad.

lucasbradstreet17:11:59

The bigger batches would then definitely take too long

lucasbradstreet17:11:25

I’m not so sure scaling would solve that issue though. The issue there is partially due to too large batch sizes

robert-stuttaford17:11:25

yeah. so we increased the peer count on those, but very likely didn’t dial the max-pending down enough

lucasbradstreet17:11:57

It’ll help slightly that there’s less CPU contention

michaeldrogalis17:11:05

@lucasbradstreet: If more peers were reading from the upstream tasks, they'd get fewer segments in their batch since they're concurrently racing to read.

lucasbradstreet17:11:35

You’re mostly right, except for the fact that from A->B, a whole batch -> new batch is sent to B, regardless of how big it is

lucasbradstreet17:11:51

You’re right in that the next batch will probably be sent to another peer

lucasbradstreet17:11:20

The problem is that if you get the batches starting to be big enough that when the whole batch is sent to the next peer

lucasbradstreet17:11:35

And it’s too big to be processed in less than pending-timeout then you’re screwed

robert-stuttaford17:11:06

that sounds familiar 😄

lucasbradstreet17:11:25

We have an issue for this

michaeldrogalis17:11:31

We could stripe the segments over the peer set to alleviate that.

lucasbradstreet17:11:33

We now have write queues, so the next step should be pretty easy

lucasbradstreet17:11:47

We just need to chunk it

michaeldrogalis17:11:10

Totally got this under control, @robert-stuttaford 😛

lucasbradstreet17:11:13

I’ll put it on an 0.8 milestone, because I think it’s a priority

lucasbradstreet17:11:00

@michaeldrogalis is right in that you’re probably oversubscribing your tasks vs your number of cores, and all the retries are going to cause all your cores to be overly active

robert-stuttaford17:11:26

27 peers, 8 cores

robert-stuttaford17:11:32

very likely oversubscribed

robert-stuttaford17:11:48

the code fixes will buy us the time we need to go multi-instance

lucasbradstreet17:11:50

Yeah, under normal circumstances it’s fine but once you start getting your bad behaviour all of them will start being used at a single time

lucasbradstreet17:11:10

Yep, I’d still suggest increasing your pending-timeout

lucasbradstreet17:11:38

Note that it normally comes at a cost of fault tolerance latency

lucasbradstreet17:11:52

i.e. a message gets lost and then takes a long time to end up being processed

lucasbradstreet17:11:19

I don’t think this is a problem for you currently because you’re not on a lossy medium, but it may have costs later, so I would probably comment the entry in the task

robert-stuttaford17:11:22

if it prevents retry storms

robert-stuttaford17:11:45

and allows us to eventually be consistent without manual effort, i prefer it

lucasbradstreet17:11:05

Yep, I would also monitor how badly you’re lagging behind the latest basis-t

robert-stuttaford18:11:12

we realised that datadog puts statsd on every instance, so we’re going to use that instead of riemann. Nikita is already using it for C3 metrics, it works great

lucasbradstreet18:11:00

You mentioned that. Have you written the sender function for that yet?

lucasbradstreet18:11:06

It should be really easy

robert-stuttaford18:11:16

not yet. was waiting for Nikita to do his 😄

lucasbradstreet18:11:22

Ha, good idea 😄

robert-stuttaford18:11:23

@lucasbradstreet, @michaeldrogalis: odd one: Capacity must be a positive power of 2 + TRAILER_LENGTH: capacity=1048448 when starting peer group

robert-stuttaford18:11:43

seen this before?

robert-stuttaford18:11:46

we have an odd number of peers (27)

lucasbradstreet18:11:01

you have to trash your aeron dir

lucasbradstreet18:11:12

It should be listed a few lines up

lucasbradstreet18:11:23

They’ve started versioning their data recently, so it should be more obvious

robert-stuttaford18:11:27

just murder the whole /dev/shm/aeron-deploy ?

lucasbradstreet18:11:04

I would suggest adding rm -rf that dir to your startup script

lucasbradstreet18:11:29

We best effort delete it on shutdown but when you kill -9 there’s not much we can do

lucasbradstreet18:11:16

We can’t do it on startup because there can be multiple JVMs using the same media driver

lucasbradstreet18:11:22

But you’ll only have once

robert-stuttaford18:11:41

startup script, you mean instance startup?

lucasbradstreet18:11:51

Whatever starts your peers

robert-stuttaford18:11:10

ok. we have an upstart script that starts our jar, and the jar starts the peers on startup

lucasbradstreet18:11:10

Because presumably you’ve started your peers, killed them, and started them again on this machine

lucasbradstreet18:11:47

Actually, deleting after you kill the peers is probably the right approach

lowl4tency18:11:06

peer == app with onyx?

lowl4tency18:11:12

ok, will add to start script this removing

robert-stuttaford19:11:34

@lucasbradstreet: is this how the metrics are named now?

lucasbradstreet19:11:11

Whoa, the double _ is weird

robert-stuttaford19:11:22

i presume these ids change on every startup

robert-stuttaford19:11:52

this going to make dashboards hella hard to manage

lucasbradstreet19:11:52

I can see where this is going

lucasbradstreet19:11:11

In Grafana things are more dynamic than that

robert-stuttaford19:11:17

trying to see if i can wildcard it

lucasbradstreet19:11:31

Yeah, the problem is that we need some way to segment the individual peers

robert-stuttaford19:11:06

chatting to ddog support about it

lucasbradstreet19:11:09

Otherwise when :read-log sends two 5000 measurements, we don’t know whether it’s saying that it’s 5000 throughput, and it’s still 5000 throughput

robert-stuttaford19:11:10

we can’t be the first

lucasbradstreet19:11:20

or whether it’s two peers outputting 5000 throughput each

lucasbradstreet19:11:58

Cool, I’m happy to make changes there as long as we can have two peers outputting 5000 = 5000 + 5000

robert-stuttaford19:11:05

yup, totally understand that

robert-stuttaford19:11:32

i’m guessing numbering them in a monotonic and deterministic fashion is not so easy

robert-stuttaford19:11:38

esp considering there might be 1000s

lucasbradstreet19:11:51

I suspect this is the answer

lucasbradstreet19:11:17

We currently assign tags to the metrics, but I have no idea how you get at them

robert-stuttaford19:11:36

INFO [2015-11-30 13:59:12,554] defaultEventExecutorGroup-2-1 - riemann.config - #riemann.codec.Event{:host wheee, :service [:semaphore/send-email]_c15c58e3-bf85-4671-9354-741f9f1a0f9a 60s_throughput, :state ok, :description nil, :metric 0.0, :tags [throughput_60s onyx :semaphore/send-email highstorm], :time 1448909952, :ttl nil}

lucasbradstreet19:11:04

yep, those tags look good

lucasbradstreet19:11:31

So, you probably need to figure out how to get at them in datadog

robert-stuttaford19:11:12

for reference, what does grafana let you do?

lucasbradstreet19:11:27

I have no idea how to get at the tags in grafana either

lucasbradstreet19:11:35

It’s getting better but honestly I’m not a fan

lucasbradstreet19:11:48

I was considering trying out datadog

robert-stuttaford19:11:43

"The solution here would be to submit all data under the same metric name” -cry-

robert-stuttaford19:11:57

“and add the UUID as a tag for this metric” hmm?

lucasbradstreet19:11:12

That’s kinda weird. It’s basically the reverse

robert-stuttaford19:11:47

would it work for the problem you’re solving with the uuids in the metric name?

lowl4tency19:11:37

as far as i understand ddog allows create graphs only by metric names and it allows to separate graphs with same metric name by tags

robert-stuttaford19:11:40

an aside: pending messages count is working simple_smile

robert-stuttaford19:11:02

yup lowl4tency i’m on with datadog support on irc. that’s what they just said too

lowl4tency19:11:09

For example metric A: allows create 2 separate graphs with tags env:prod and env:staging

lucasbradstreet19:11:21

Blah. As far as I understand that’s the opposite of how Grafana works

lowl4tency19:11:37

robert-stuttaford: hah, you are quicker simple_smile

robert-stuttaford19:11:50

i’m doing nothing else right now heh

lucasbradstreet19:11:18

With the datadog method, I don’t know how it adds everything up. I guess you give it a period to aggregate over?

lucasbradstreet19:11:26

So, say you have 5 measurements over 1 second

lucasbradstreet19:11:30

They’re summed?

robert-stuttaford19:11:09

#datadog on freenode if you want to watch simple_smile

lowl4tency19:11:24

Hm, one way are in my mind, push datadog api for create a graph with actual name of metric with start of app

lowl4tency19:11:46

New name become with each start of app?

robert-stuttaford19:11:49

yeah that could work but it creates big time dependencies between our systems

lowl4tency19:11:01

It's a huge uglyhack 😞

robert-stuttaford19:11:02

highstorm shouldn’t even know datadog exists

lucasbradstreet19:11:54

Whereas I think Grafana may get the most recent within some time limit for each service with some sub metric, and sums them

lucasbradstreet19:11:35

OK, I think our main problem is we were just using a shitty version of Grafana

lucasbradstreet19:11:47

I see group by time period in the latest version

lucasbradstreet19:11:05

@michaeldrogalis is pushing our benchmark suite forward now, which means a later version of grafana

lucasbradstreet19:11:22

It looks like we can group by 1s and them sum, which is exactly what we want

lucasbradstreet19:11:06

We don’t have it up and running yet, but I’m happy to revert the [task-name]_peer-id changes

lucasbradstreet19:11:10

and just put the peer-id in the tags

robert-stuttaford19:11:21

that is a tremendous relief, thank you Lucas

lucasbradstreet19:11:31

I’m glad we have users

robert-stuttaford19:11:33

"Ok, in the API case, if more than one datapoint is submitted under the same name with the exact same tag set during the same split second 9:21 MartinDdog only the last datapoint will be stored"

lucasbradstreet19:11:49

same split second? i.e. two measurements in one second?

lucasbradstreet19:11:53

Yeah, that’s bad

robert-stuttaford19:11:19

well riemann is pushing on a beat so it’d always do this

lucasbradstreet19:11:28

yeah, but say you have two :in tasks

lucasbradstreet19:11:49

not same service

robert-stuttaford19:11:49

if they’re tagged, we control how they’re combined

robert-stuttaford19:11:57

"So regarding your question, if the 5 values are sent under different tags set, if you use the sum space-aggregator (i.e. query sum:your_metric{your_scope}) the 5 values will be added together"

lucasbradstreet19:11:46

right so if you have two values, on service task-name, tags: [task-name peer-1] [task-name peer-2]

lucasbradstreet19:11:50

in the same second

lucasbradstreet19:11:57

you should have two unique values to aggregate

robert-stuttaford19:11:16

once i learn how, heh

robert-stuttaford19:11:00

ok. so should we avg or sum them?

lucasbradstreet19:11:10

sum for throughput

lucasbradstreet19:11:21

Average or independently plot for latency

robert-stuttaford19:11:32

we’ve had avg for both

lucasbradstreet19:11:05

Hah, it must have just finished the deploy

lucasbradstreet19:11:25

@robert-stuttaford: I just had to deal with that /dev/shm issue myself

lucasbradstreet19:11:44

rm -rf on media-driver startup is the right approach

lucasbradstreet19:11:00

Since you don’t run an external media driver since you’re on a single machine

robert-stuttaford19:11:04

yeah, we’re putting it in before the jar start

lucasbradstreet19:11:06

you can just do it before you start the peers

lowl4tency19:11:56

robert-stuttaford: i'll send a PR shortly

lowl4tency19:11:33

lucasbradstreet: rm -rf /dev/shm/aeron-deploy or rm -rf /dev/shm/aeron-deploy/* better?

robert-stuttaford19:11:56

the whole folder is fine

robert-stuttaford19:11:43

Lucas, thank you SO much for all your help today. truly appreciated. i must go and rest now. seeing double!

lucasbradstreet19:11:08

Alrighty, let us know how you go when it’s all up and running simple_smile

lucasbradstreet19:11:25

Thanks for the constructive feedback

robert-stuttaford19:11:33

certainly will do. now just have to get a test server metrics dash running and smoke-test it all

robert-stuttaford19:11:19

we’re still getting retries at a low max-pending, but it’s recovering from them

robert-stuttaford19:11:34

which tells me we’re just doing way too much work and asking it to do too many at a time

lucasbradstreet19:11:00

pending-timeout will help with the retries but it’ll still cause lag

lucasbradstreet19:11:24

at the end of the day you need to make your tasks faster or scale up your boxes to cope with it

lowl4tency20:11:26

lucasbradstreet: 15-Nov-30 15:21:18 http://highstorm.cognician.com INFO [onyx.peer.task-lifecycle] - [370d49e7-1925-453d-96ff-28cffb28f114] Not enough virtual peers have warmed up to start the task yet, backing off and trying again...

lowl4tency20:11:45

could you describe this? I've got these messages and lost riemann ._.

lucasbradstreet20:11:59

something is very wrong

lucasbradstreet20:11:04

Do you see any other exceptions?

lowl4tency20:11:54

15-Nov-30 15:23:46 http://highstorm.cognician.com WARN [onyx.lifecycle.metrics.riemann] - Lost riemann connection 45.55.69.56 5555 java.lang.Thread.run Thread.java: 745 java.util.concurrent.ThreadPoolExecutor$Worker.run ThreadPoolExecutor.java: 617 java.util.concurrent.ThreadPoolExecutor.runWorker ThreadPoolExecutor.java: 1142 java.util.concurrent.FutureTask.run FutureTask.java: 266 ... clojure.core/binding-conveyor-fn/fn core.clj: 1916 onyx.lifecycle.metrics.riemann/riemann-sender/fn riemann.clj: 36 onyx.lifecycle.metrics.riemann/riemann-sender/fn/fn riemann.clj: 39 riemann.client/send-event client.clj: 72 com.aphyr.riemann.client.RiemannClient.sendEvent RiemannClient.java: 115 com.aphyr.riemann.client.RiemannClient.sendMessage RiemannClient.java: 110 com.aphyr.riemann.client.TcpTransport.sendMessage TcpTransport.java: 259 com.aphyr.riemann.client.TcpTransport.sendMessage TcpTransport.java: 289 java.io.IOException: no channels available

lowl4tency20:11:02

Lots like this

lucasbradstreet20:11:09

Is that IP correct?

lowl4tency20:11:13

Quick google says it's network issue

lucasbradstreet20:11:20

Yeah, seems like it can’t connect

lowl4tency20:11:26

But I've got riemann zk and onyx on same host

lowl4tency20:11:44

Port are available, I can connect via telnet

lucasbradstreet20:11:56

Hmm, that was my next suggestion

lowl4tency20:11:58

And I've tried localhost and interface ip for connects

lucasbradstreet20:11:16

That’s very weird

lowl4tency20:11:20

Riemann config was not changed

lucasbradstreet20:11:24

telnetted from same machine?

lucasbradstreet20:11:37

I mean, the machine the peers are running on

lowl4tency20:11:50

[email protected]:~$ telnet localhost 5555 Trying 127.0.0.1... Connected to localhost. Escape character is '^]'.

lucasbradstreet20:11:02

Try: 45.55.69.56

lowl4tency20:11:08

zk, onyx and riemann. All staff are on one machine simple_smile

lowl4tency20:11:22

[email protected]:~$ telnet 45.55.69.56 5555 Trying 45.55.69.56... Connected to 45.55.69.56. Escape character is '^]'.

lucasbradstreet20:11:28

Hmm, it should have logged more of the exception

lowl4tency20:11:34

15-Nov-30 15:23:46  WARN [onyx.lifecycle.metrics.riemann] - Lost riemann connection 45.55.69.56 5555
                                                 java.lang.Thread.run                      Thread.java:  745
                   java.util.concurrent.ThreadPoolExecutor$Worker.run          ThreadPoolExecutor.java:  617
                    java.util.concurrent.ThreadPoolExecutor.runWorker          ThreadPoolExecutor.java: 1142
                                  java.util.concurrent.FutureTask.run                  FutureTask.java:  266
                                                                  ...
                                  clojure.core/binding-conveyor-fn/fn                         core.clj: 1916
                     onyx.lifecycle.metrics.riemann/riemann-sender/fn                      riemann.clj:   36
                  onyx.lifecycle.metrics.riemann/riemann-sender/fn/fn                      riemann.clj:   39
                                            riemann.client/send-event                       client.clj:   72
                     com.aphyr.riemann.client.RiemannClient.sendEvent               RiemannClient.java:  115
                   com.aphyr.riemann.client.RiemannClient.sendMessage               RiemannClient.java:  110
                    com.aphyr.riemann.client.TcpTransport.sendMessage                TcpTransport.java:  259
                    com.aphyr.riemann.client.TcpTransport.sendMessage                TcpTransport.java:  277
                        org.jboss.netty.channel.AbstractChannel.write             AbstractChannel.java:  248
                               org.jboss.netty.channel.Channels.write                    Channels.java:  671
                               org.jboss.netty.channel.Channels.write                    Channels.java:  704
                 com.aphyr.riemann.client.TcpHandler.handleDownstream                  TcpHandler.java:   60
                               org.jboss.netty.channel.Channels.write                    Channels.java:  686
                               org.jboss.netty.channel.Channels.write                    Channels.java:  725
org.jboss.netty.handler.codec.oneone.OneToOneEncoder.handleDownstream             OneToOneEncoder.java:   59
        org.jboss.netty.handler.codec.oneone.OneToOneEncoder.doEncode             OneToOneEncoder.java:   66
        org.jboss.netty.handler.codec.protobuf.ProtobufEncoder.encode             ProtobufEncoder.java:   68
                  com.google.protobuf.AbstractMessageLite.toByteArray         AbstractMessageLite.java:   64
                        com.aphyr.riemann.Proto$Msg.getSerializedSize                       Proto.java: 4195
             com.google.protobuf.CodedOutputStream.computeMessageSize           CodedOutputStream.java:  628
        com.google.protobuf.CodedOutputStream.computeMessageSizeNoTag           CodedOutputStream.java:  865
                      com.aphyr.riemann.Proto$Event.getSerializedSize                       Proto.java: 2102
         com.google.protobuf.UnmodifiableLazyStringList.getByteString  UnmodifiableLazyStringList.java:   68
                com.google.protobuf.LazyStringArrayList.getByteString         LazyStringArrayList.java:  187
                 com.google.protobuf.LazyStringArrayList.asByteString         LazyStringArrayList.java:  231
java.lang.ClassCastException: java.util.UUID cannot be cast to [B
         java.io.IOException: Write failed.

lucasbradstreet20:11:56

Ok, that’s fay more helpful

lucasbradstreet20:11:11

I’ll have a fix for your in a couple mins

lowl4tency20:11:44

lucasbradstreet: thanks a lot!

lucasbradstreet20:11:40

It’s going through our release process now

lowl4tency20:11:04

All tests are passed

lucasbradstreet20:11:09

We really need to test against a real riemann instance

lucasbradstreet20:11:23

I mocked that part out and it doesn’t coerce things to strings

lucasbradstreet20:11:17

@lowl4tency: let me know if you hit any other issues

lowl4tency20:11:01

Gotcha, let me build and try

lowl4tency20:11:33

lucasbradstreet: onyx-metrics?

lowl4tency20:11:07

Just FYI: seems it works pretty good

lowl4tency20:11:17

My metrics are alive again

lucasbradstreet20:11:31

Remember to aggregate over 1s periods now

lowl4tency21:11:55

I afraid I'm out of the contex

lucasbradstreet21:11:41

Ah, Rob will know what to do. Basically you’re going to get multiple 1s throughput messages within a 1s time block

lucasbradstreet21:11:03

For throughput these measurements should be added together

lucasbradstreet21:11:09

Also for retries

lucasbradstreet21:11:20

For latency they should either be averaged or graphed independently

lowl4tency21:11:18

Ok, I guess we will discuss it tomorrow simple_smile

lucasbradstreet21:11:29

Yeah, as long as it’s submitting metrics

lowl4tency21:11:32

Now all works, will tune it simple_smile

lowl4tency21:11:53

15-Nov-30 16:27:16 http://highstorm.cognician.com INFO [onyx.lifecycle.metrics.metrics] - Message send timeout count: 1 15-Nov-30 16:27:16 http://highstorm.cognician.com INFO [onyx.lifecycle.metrics.metrics] - Message send timeout count: 1 15-Nov-30 16:27:16 http://highstorm.cognician.com INFO [onyx.lifecycle.metrics.metrics] - Message send timeout count: 1 15-Nov-30 16:27:16 http://highstorm.cognician.com INFO [onyx.lifecycle.metrics.metrics] - Message send timeout count: 1 15-Nov-30 16:27:16 http://highstorm.cognician.com INFO [onyx.lifecycle.metrics.metrics] - Message send timeout count: 1

lowl4tency21:11:15

hm. timeout, but in rieman logs I'm watching metrics

lowl4tency21:11:29

does it mean I can miss several metrics?

lucasbradstreet21:11:45

It will retry them until the buffer is full at which point it will drop them

lucasbradstreet21:11:15

If you don't see them constantly then you're probably safe

lucasbradstreet21:11:55

I could probably improve the message to make it more obvious that it was retried

lucasbradstreet21:11:21

Are you getting timeouts reasonably consistently? Are you using a Riemann instance on the same network?

lucasbradstreet21:11:58

I think we should probably be batching our Riemann sends too

lucasbradstreet21:11:12

Anyway, occasional timeouts are ok as long, as long as they're not happening every second (there's a sender for each task so they can happen many times a second though)

lowl4tency22:11:15

Actually it's riemann on same machine simple_smile

lucasbradstreet22:11:13

Yeah, you really shouldn't be getting timeouts there

lucasbradstreet22:11:24

Let me know if they're consistent. I'm off to sleep though