#onyx
2015-12-01
robert-stuttaford06:12:45

@lucasbradstreet: thanks so much for pushing that last fix!

robert-stuttaford09:12:49

one thing i noticed is that all the individual task throughput numbers appear to be zeroed out

robert-stuttaford09:12:03

the input is still reporting though

robert-stuttaford09:12:30

we’re only tracking max latency now, and those are coming through as well

robert-stuttaford09:12:04

yeah, pretty sure something’s off

robert-stuttaford09:12:32

the system is processing fine, and the input task metrics are fine - i see complete-latency, throughput, pending count. the rest of the task latencies are flat, and the rest of the task throughputs are flat

robert-stuttaford09:12:11

system is definitely working though

robert-stuttaford09:12:13

not sure if it’s just doing nothing of consequence or if there’s an actual change to worry about

lucasbradstreet09:12:18

Tried it out here, what it outputs looks fairly correct

lucasbradstreet09:12:07

service names for throughput look like this: "[:in] 60s_throughput"
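
A minimal riemann.config sketch of how service names like these could be matched and logged; the regex and logging here are illustrative, not Robert’s actual config:

;; riemann.config — match the "[:task-name] 60s_throughput" services,
;; index them, and log the metric value.
(let [index (index)]
  (streams
    (where (service #"60s_throughput$")
      index
      (fn [event]
        (info "onyx throughput" (:service event) (:metric event))))))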

robert-stuttaford09:12:21

we do have the right ones because we are seeing numbers

robert-stuttaford09:12:27

they’re just … flat :simple_smile:

lucasbradstreet09:12:59

Could it have to do with grouping?

robert-stuttaford09:12:19

deploy at vertical line

robert-stuttaford09:12:03

we’re not using any grouping at all

robert-stuttaford09:12:09

got a meeting, chat soon :simple_smile:

lucasbradstreet09:12:55

Hmm, weird, all looks pretty fine. OK, catch you.

greywolve09:12:31

"For latency they should either be averaged or graphed independently" -> probably always want the latter (and the max there) ?

greywolve09:12:50

that way at least you aren't lying to yourself, hehe

thomas14:12:34

I had a generic Onyx question....

thomas14:12:05

can Onyx be used for what some people call Complex Event Processing?

thomas14:12:17

ie. take action X when event Y happens?

thomas14:12:37

and what about take action X when event Y happened just now and event Z happened 5 minutes ago?

thomas14:12:00

(the 5 minutes being an arbitrary choice here)

robert-stuttaford14:12:34

windows and triggers

lucasbradstreet14:12:46

Sure. You would use our new windowing/state management features to maintain state in tasks that would allow you to determine that
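
A rough sketch of what that looks like for “take action X when Y happens now and Z happened within the last 5 minutes”: a sliding window plus a per-segment trigger. All names are illustrative, and the exact trigger keys and sync-function signature vary across Onyx versions, so treat this as a shape rather than a recipe:

;; Keep a sliding five-minute window of events on the detection task.
(def windows
  [{:window/id          :recent-events
    :window/task        :detect-pattern
    :window/type        :sliding
    :window/aggregation :onyx.windowing.aggregation/conj
    :window/window-key  :event-time
    :window/range       [5 :minutes]
    :window/slide       [1 :minute]}])

;; Fire on every new segment; the sync function sees the accumulated window
;; contents and can decide whether Y-plus-recent-Z holds, then take action X.
(def triggers
  [{:trigger/window-id  :recent-events
    :trigger/refinement :accumulating
    :trigger/on         :segment
    :trigger/threshold  [1 :elements]
    :trigger/sync       ::maybe-take-action-x!}])

(defn maybe-take-action-x!
  ;; Argument list follows the windowing docs of this era; check your version.
  [event window trigger state-event window-state]
  ;; window-state is the vector of conj'd segments for this window extent;
  ;; inspect it for Y-now / Z-recently and perform action X here.
  (println "window contents:" window-state))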

robert-stuttaford14:12:23

lucas, does Onyx offer any durability for the state implicit in such windows?

robert-stuttaford14:12:38

ie. could you put everything in ZK so that you’re resilient to deploys and other restarts?

robert-stuttaford14:12:07

(my questions make huge assumptions about what ZK can actually do)

lucasbradstreet14:12:38

That's what BookKeeper is for. When you define your state management functions you basically emit a state transition update which is written to a BookKeeper log

thomas14:12:26

sounds good.

lucasbradstreet14:12:10

You don't have to know anything about bookkeeper though. You write functions to generate the state transition, and to apply them to a given state

lucasbradstreet14:12:32

Onyx manages all the fault tolerance / writing / recovery etc
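
Roughly what that pair of functions looks like under the custom-aggregation keys, as I read the docs of this period (verify against your Onyx version): one produces a state-transition value, which Onyx journals to BookKeeper; the other applies it, both live and during recovery replay.

;; Produce a transition entry from an incoming segment.
(defn create-state-update
  [window state segment]
  [:conj segment])

;; Apply a (possibly replayed) transition entry to the current state.
(defn apply-state-update
  [window state [op segment]]
  (case op
    :conj (conj (or state []) segment)))

(def collect-segments
  {:aggregation/init                (fn [window] [])
   :aggregation/create-state-update create-state-update
   :aggregation/apply-state-update  apply-state-update})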

robert-stuttaford15:12:14

@lucasbradstreet: the graphs post-deploy have been super boring and flat :simple_smile:

thomas15:12:32

Sounds very good, thanks for the info and for the fantastic work you guys put into it.

lucasbradstreet15:12:24

@robert-stuttaford: ha! Excellent. Are you getting non-zero throughput figures yet?

robert-stuttaford15:12:49

still can’t be sure it’s actually a problem. haven’t had sufficient load to be totally sure

lucasbradstreet15:12:31

Yeah, but throughput should surely be non-zero from time to time?

lucasbradstreet15:12:51

Have you seen complete latency drop much with your improvements? Just curious

robert-stuttaford15:12:21

INFO [2015-12-01 15:09:08,896] defaultEventExecutorGroup-2-1 - riemann.config - #riemann.codec.Event{:host production-HighstormStack-77PF3UB8TROI-i-e4f5ed53, :service [:chats/chat-response-count] 60s_throughput, :state ok, :description nil, :metric 0.0, :tags [throughput_60s onyx :chats/chat-response-count highstorm 7389dda2-75af-4c64-be90-8cbf019d88d5], :time 1448982562, :ttl nil}
INFO [2015-12-01 15:09:09,651] defaultEventExecutorGroup-2-2 - riemann.config - #riemann.codec.Event{:host production-HighstormStack-77PF3UB8TROI-i-e4f5ed53, :service [:chats/chat-response-count] 60s_throughput, :state ok, :description nil, :metric 0.0, :tags [throughput_60s onyx :chats/chat-response-count highstorm c78767b4-89cb-417f-8b5b-7d86f0ecadc5], :time 1448982563, :ttl nil}
INFO [2015-12-01 15:09:10,164] defaultEventExecutorGroup-2-2 - riemann.config - #riemann.codec.Event{:host production-HighstormStack-77PF3UB8TROI-i-e4f5ed53, :service [:chats/chat-response-count] 60s_throughput, :state ok, :description nil, :metric 0.0, :tags [throughput_60s onyx :chats/chat-response-count highstorm cda00880-ab85-4da1-a582-cf60b2ad1e4b], :time 1448982563, :ttl nil}
INFO [2015-12-01 15:09:10,499] defaultEventExecutorGroup-2-1 - riemann.config - #riemann.codec.Event{:host production-HighstormStack-77PF3UB8TROI-i-e4f5ed53, :service [:chats/chat-response-count] 60s_throughput, :state ok, :description nil, :metric 0.0, :tags [throughput_60s onyx :chats/chat-response-count highstorm 7389dda2-75af-4c64-be90-8cbf019d88d5], :time 1448982564, :ttl nil}

robert-stuttaford15:12:34

complete latency is way lower

lucasbradstreet15:12:35

There has to be something wrong if segments are being processed but throughput still reads zero

robert-stuttaford15:12:31

as you can see, the system is doing stuff

lucasbradstreet15:12:01

Oh so you are seeing throughput in some tasks

robert-stuttaford15:12:19

the only one reporting throughput is the :input task

lucasbradstreet15:12:44

That's a bit worrying.

robert-stuttaford15:12:28

and a random sample of the rest

lucasbradstreet15:12:55

I’m also confused by the pending message count in input. It never goes above 0.8

lucasbradstreet15:12:44

I’m kinda worried that none of it is working

robert-stuttaford15:12:52

here’s the pending using SUM instead of AVG

robert-stuttaford15:12:44

we are seeing little latency spikes - we’re plotting max- rather than 99_9th-

lucasbradstreet15:12:03

thing is I don’t see how it could spike to 0.4

lucasbradstreet15:12:08

for pending message count

lucasbradstreet15:12:19

1 is the minimum non-zero value it could jump to

michaeldrogalis15:12:52

I love the smell of metrics debugging in the morning

robert-stuttaford15:12:15

i’m watching tail -f /var/log/riemann/riemann.log | grep pending to see if i can see any non-zero reports

lucasbradstreet15:12:15

I’m worried that you always have 0 throughput on everything other than your input task

robert-stuttaford15:12:51

ok i just saw a pending of 1

lucasbradstreet15:12:53

It seems like metrics is doing something wrong or your flow conditions are dropping everything, or your tasks are returning empty vectors
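
For context, hypothetical examples of the last two suspects: a flow condition whose predicate never passes drops every segment between tasks, and a task function returning an empty vector emits nothing downstream; either one leaves downstream throughput at zero.

;; 1. Flow condition that never matches: everything routed :process -> :transact is dropped.
(def flow-conditions
  [{:flow/from      :process
    :flow/to        [:transact]
    :flow/predicate ::never?}])

(defn never? [event old-segment segment all-new-segments]
  false)

;; 2. Task function that swallows its input: no segments reach downstream tasks.
(defn swallow-everything [segment]
  [])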

lucasbradstreet15:12:04

ok for that one it might just be averaging over time a bit

robert-stuttaford15:12:07

so it’s datadog that’s averaging the numbers somehow

robert-stuttaford15:12:39

well, when i watch the logs, all the work is performed

lucasbradstreet15:12:56

OK, it could just be a datadog presentation issue

lucasbradstreet15:12:11

You may need to group by in your charts

robert-stuttaford15:12:42

2015-12-01 15:20:38.628 INFO  - cognician.highstorm.tracks - => [chats/change-choice] t 30443532 inst 15-12-01 15:20:21.890 [:chat.event/event-type :chat.event.type/change-choice]
2015-12-01 15:20:38.851 INFO  - cognician.highstorm.datomic.onyx - Transacted for :chat.event/event-type t 30443532 . [:bundle-stats/completed-chats :bundle-stats/completed-cogs :bundle-stats/last-active :bundle-stats/response-count :chat/complete :chat/percent-complete :chat/response-count :cog-stats/completed-chats :cog-stats/last-active :cog-stats/max-percent-complete :cog-stats/response-count :group/completed-chats]

robert-stuttaford15:12:46

it’s definitely working

lucasbradstreet15:12:59

Cool, what events got sent to riemann?

robert-stuttaford15:12:58

for a 1s throughput for one of the tasks involved in producing that transaction

lucasbradstreet15:12:33

What about 60s throughput? If that’s all 0s then something is definitely wrong. You’ll need to look at both peer-ids though

robert-stuttaford15:12:41

we’ve got a pretty good stream of source events right now, but it never goes above 0

lucasbradstreet15:12:45

Because one peer wouldn’t have processed it

robert-stuttaford15:12:59

i see all 3 peers’ uuids

robert-stuttaford15:12:09

tail -f /var/log/riemann/riemann.log | grep percent | grep 60s_through

lucasbradstreet15:12:40

Ok. I guess something is wrong with metrics then. Super confused. Are they all output tasks?

robert-stuttaford15:12:17

only transact is an output, and read_log is an input. the rest are POPFs (plain ole pure functions)
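
For anyone unfamiliar with the term, a “POPF” task is just a catalog entry of :onyx/type :function pointing at a pure segment-transforming var — a hypothetical example, not one of Robert’s actual tasks:

;; Pure function: segment in, segment out, no I/O.
(defn enrich-chat [segment]
  (assoc segment :chat/processed? true))

(def catalog-entry
  {:onyx/name       :chats/enrich-chat
   :onyx/fn         ::enrich-chat
   :onyx/type       :function
   :onyx/batch-size 20
   :onyx/doc        "Adds a processed flag to each chat segment"})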

lucasbradstreet15:12:26

OK, I’ll look into it

robert-stuttaford15:12:46

cool. it’s not urgent for us, the metrics that work right now are sufficient for us to know the system is working

lucasbradstreet15:12:52

Yeah, input task is the main one

robert-stuttaford15:12:52

and there’s always tail

robert-stuttaford15:12:11

we’ll be doing the statsd thing tomorrow as well

robert-stuttaford15:12:42

as ever, thank you, Lucas :simple_smile:

lucasbradstreet15:12:41

You’re welcome

lucasbradstreet19:12:53

@robert-stuttaford: perils of giving you late night metrics fixes. I introduced a bug last night where throughput wasn’t recorded for non-input tasks. I guess your new stats really are that good!

lucasbradstreet19:12:33

@robert-stuttaford: I’ve released metrics 0.8.2.7. Feel free not to upgrade until you get all that statsd work done.

lucasbradstreet19:12:15

For the statsd work, I’d suggest you implement a sender thread function in your own code, and iterate on it until you think it’s ready. That way you’re not dependent on me for releases. Then send me a PR and I’ll add it to onyx-metrics
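
A minimal sketch of the kind of sender-thread function being suggested here, prototyped in user code; the metric-map shape and the core.async channel hand-off are assumptions about how you might wire it up, not onyx-metrics’ actual interface:

(ns my.app.statsd-sender
  (:require [clojure.core.async :as a])
  (:import [java.net DatagramPacket DatagramSocket InetAddress]))

(defn start-statsd-sender!
  "Consumes metric maps like {:service \"...\" :value 42} from ch and emits
  statsd gauge lines over UDP until the channel is closed."
  [ch host port]
  (let [socket (DatagramSocket.)
        addr   (InetAddress/getByName host)]
    (a/thread
      (loop []
        (when-let [{:keys [service value]} (a/<!! ch)]
          ;; Note: real service names (e.g. "[:in] 60s_throughput") would need
          ;; sanitising into statsd-safe metric names first.
          (let [line  (str service ":" value "|g")
                bytes (.getBytes line "UTF-8")]
            (.send socket (DatagramPacket. bytes (alength bytes) addr (int port))))
          (recur)))
      (.close socket))))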