2015-12-01
@lucasbradstreet: thanks so much for pushing that last fix!
@lucasbradstreet: new system is live
Excellent :)
one thing i noticed is that all the individual task throughput numbers appear to be zeroed out
the input is still reporting though
we’re only tracking max latency now, and those are coming through as well
yeah, pretty sure something’s off
the system is processing fine, and the input task metrics are fine - i see complete-latency, throughput, pending count. the rest of the task latencies are flat, and the rest of the task throughputs are flat
system is definitely working though
not sure if it’s just doing nothing of consequence or if there’s an actual change to worry about
Tried it out here, what it outputs looks fairly correct
service names for throughput look like this: "[:in] 60s_throughput"
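for reference, a minimal riemann.config stream that would pick up those service names might look roughly like this (a sketch only, assuming the stock where/service matchers and the default info logger):
; log every Onyx 60s throughput event as it arrives
(streams
  (where (service #".*60s_throughput$")
    #(info "onyx throughput" (:service %) (:metric %))))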
we do have the right ones because we are seeing numbers
they’re just … flat
Could it have to do with grouping?
deploy at vertical line
we’re not using any grouping at all
got a meeting, chat soon
Hmm, weird, all looks pretty fine. OK, catch you.
"For latency they should either be averaged or graphed independently" -> probably always want the latter (and the max there) ?
Yeah I agree
Yes, absolutely
and what about take action X when event Y happened just now and event Z happened 5 minutes ago?
windows and triggers
Sure. You would use our new windowing/state management features to maintain state in tasks that would allow you to determine that
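roughly: a window over the last 5 minutes plus a trigger whose sync fn checks for both events. a sketch only — the task and fn names here are made up, and the exact map keys vary by Onyx version:
; sliding 5-minute window of events on a hypothetical :detect-pattern task
(def windows
  [{:window/id          :recent-events
    :window/task        :detect-pattern
    :window/type        :sliding
    :window/aggregation :onyx.windowing.aggregation/conj
    :window/window-key  :event-time
    :window/range       [5 :minutes]
    :window/slide       [1 :minute]}])

; fire on every segment and hand the accumulated window contents to a sync fn
; that decides whether event Y just happened and event Z happened 5 minutes ago
(def triggers
  [{:trigger/window-id  :recent-events
    :trigger/refinement :accumulating
    :trigger/on         :segment
    :trigger/threshold  [1 :elements]
    :trigger/sync       ::take-action-when-y-and-z}])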
lucas, does Onyx offer any durability for the state implicit in such windows?
ie. could you put everything in ZK so that you’re resilient to deploys and other restarts?
(my questions make huge assumptions about what ZK can actually do)
That's what bookkeeper is for. When you define your state management functions you basically emit a state transition update which is written to a bookkeeper log
You don't have to know anything about bookkeeper though. You write functions to generate the state transition, and to apply them to a given state
Onyx manages all the fault tolerance / writing / recovery etc
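concretely, a custom aggregation is just a map of those two functions plus an init — a toy counting example, with key names approximate for the Onyx version in use:
(defn count-init [window] 0)

; produces the state transition that Onyx writes to the BookKeeper log
(defn count-create-state-update [window state segment]
  [:incr 1])

; applies a logged transition to the current state, both live and on recovery
(defn count-apply-state-update [window state [op n]]
  (case op
    :incr (+ state n)))

(def count-aggregation
  {:aggregation/init                count-init
   :aggregation/create-state-update count-create-state-update
   :aggregation/apply-state-update  count-apply-state-update})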
@lucasbradstreet: the graphs post deploy have been super boring and flat
Sounds very good, thanks for the info and for the fantastic work you guys put into it.
No worries
@robert-stuttaford: ha! Excellent. Are you getting non-zero throughput figures yet?
not yet
still can’t be sure it’s actually a problem. haven’t had sufficient load to be totally sure
Yeah, but throughput should surely be non-zero from time to time?
Have you seen complete latency drop much with your improvements? Just curious
INFO [2015-12-01 15:09:08,896] defaultEventExecutorGroup-2-1 - riemann.config - #riemann.codec.Event{:host production-HighstormStack-77PF3UB8TROI-i-e4f5ed53, :service [:chats/chat-response-count] 60s_throughput, :state ok, :description nil, :metric 0.0, :tags [throughput_60s onyx :chats/chat-response-count highstorm 7389dda2-75af-4c64-be90-8cbf019d88d5], :time 1448982562, :ttl nil}
INFO [2015-12-01 15:09:09,651] defaultEventExecutorGroup-2-2 - riemann.config - #riemann.codec.Event{:host production-HighstormStack-77PF3UB8TROI-i-e4f5ed53, :service [:chats/chat-response-count] 60s_throughput, :state ok, :description nil, :metric 0.0, :tags [throughput_60s onyx :chats/chat-response-count highstorm c78767b4-89cb-417f-8b5b-7d86f0ecadc5], :time 1448982563, :ttl nil}
INFO [2015-12-01 15:09:10,164] defaultEventExecutorGroup-2-2 - riemann.config - #riemann.codec.Event{:host production-HighstormStack-77PF3UB8TROI-i-e4f5ed53, :service [:chats/chat-response-count] 60s_throughput, :state ok, :description nil, :metric 0.0, :tags [throughput_60s onyx :chats/chat-response-count highstorm cda00880-ab85-4da1-a582-cf60b2ad1e4b], :time 1448982563, :ttl nil}
INFO [2015-12-01 15:09:10,499] defaultEventExecutorGroup-2-1 - riemann.config - #riemann.codec.Event{:host production-HighstormStack-77PF3UB8TROI-i-e4f5ed53, :service [:chats/chat-response-count] 60s_throughput, :state ok, :description nil, :metric 0.0, :tags [throughput_60s onyx :chats/chat-response-count highstorm 7389dda2-75af-4c64-be90-8cbf019d88d5], :time 1448982564, :ttl nil}
0.0 all the way
complete latency is way lower
There has to be something wrong if segments are being processed but throughput still reads zero
mostly 0.5s
as you can see, the system is doing stuff
Oh so you are seeing throughput in some tasks
the only one reporting throughput is the :input task
That's a bit worrying.
here’s :output
and a random sample of the rest
I’m also confused by the pending message count in input. It never goes above 0.8
I’m kinda worried that none of it is working
here’s the pending using SUM instead of AVG
above?
we are seeing little latency spikes - we’re plotting max- rather than 99_9th-
thing is I don’t see how it could spike to 0.4
for pending message count
1 is the minimum non-zero value it could jump to
I love the smell of metrics debugging in the morning
i’m watching tail -f /var/log/riemann/riemann.log | grep pending
to see if i can see any non-zero reports
I’m worried that you always have 0 throughput on everything other than your input task
yeah, me too
ok i just saw a pending of 1
It seems like either metrics is doing something wrong, your flow conditions are dropping everything, or your tasks are returning empty vectors
and of 2
ok for that one it might just be averaging over time a bit
so it’s datadog that’s averaging the numbers somehow
well, when i watch the logs, all the work is performed
OK, it could just be a datadog presentation issue
You may need to group by in your charts
2015-12-01 15:20:38.628 INFO - cognician.highstorm.tracks - => [chats/change-choice] t 30443532 inst 15-12-01 15:20:21.890 [:chat.event/event-type :chat.event.type/change-choice]
2015-12-01 15:20:38.851 INFO - cognician.highstorm.datomic.onyx - Transacted for :chat.event/event-type t 30443532 . [:bundle-stats/completed-chats :bundle-stats/completed-cogs :bundle-stats/last-active :bundle-stats/response-count :chat/complete :chat/percent-complete :chat/response-count :cog-stats/completed-chats :cog-stats/last-active :cog-stats/max-percent-complete :cog-stats/response-count :group/completed-chats]
it’s definitely working
Cool, what events got sent to riemann?
For that task
for a 1s throughput for one of the tasks involved in producing that transaction
What about 60s throughput. If that’s all 0s then something is definitely wrong. You’ll need to look at both peer-ids though
we’ve got a pretty good stream of source events right now, but it never goes above 0
Because one peer wouldn’t have processed it
i see all 3 peers’ uuids
tail -f /var/log/riemann/riemann.log | grep percent | grep 60s_through
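one way to check this outside datadog is to sum the 60s throughput across peer ids per task in riemann itself, alongside your existing streams — a sketch only, assuming riemann.folds is available to require:
(require '[riemann.folds :as folds])

; group events by service (one per task), sum the per-peer metrics over each 60s window
(streams
  (where (service #".*60s_throughput$")
    (by [:service]
      (fixed-time-window 60
        (smap folds/sum
          #(info "summed throughput" (:service %) (:metric %)))))))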
Ok. I guess something is wrong with metrics then. Super confused. Are they all output tasks?
only transact is an output, and read_log is an input. the rest are POPFs (plain ole pure functions)
OK, I’ll look into it
cool. it’s not urgent for us, the metrics that work right now are sufficient for us to know the system is working
Yeah, input task is the main one
and there’s always tail
we’ll be doing the statsd thing tomorrow as well
as ever, thank you, Lucas
Ah cool
You’re welcome
@robert-stuttaford: perils of giving you late night metrics fixes. I introduced a bug last night where throughput wasn’t recorded for non-input tasks. I guess your new stats really are that good!
@robert-stuttaford: I’ve released metrics 0.8.2.7. Feel free not to upgrade until you get all that statsd work done.
For the statsd work, I’d suggest you implement a sender thread function in your own code, and iterate on it until you think it’s ready. That way you’re not dependent on me for releases. Then send me a PR and I’ll add it to onyx-metrics
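fwiw, a rough shape for that sender thread — a hedged sketch only; the channel and event shape onyx-metrics hands you may differ, and the :service/:metric keys here are assumptions:
(require '[clojure.core.async :as a])
(import '(java.net DatagramSocket DatagramPacket InetAddress))

; drains metric events from a core.async channel and ships them to statsd
; as gauge lines over UDP; exits and closes the socket when the channel is closed
(defn statsd-sender-thread [ch host port]
  (let [socket (DatagramSocket.)
        addr   (InetAddress/getByName host)]
    (a/thread
      (loop []
        (when-let [{:keys [service metric]} (a/<!! ch)]
          (let [line  (str service ":" metric "|g")
                bytes (.getBytes line "UTF-8")]
            (.send socket (DatagramPacket. bytes (count bytes) addr (int port))))
          (recur)))
      (.close socket))))

; e.g. (statsd-sender-thread metrics-ch "localhost" 8125)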