2015-11-23
@lucasbradstreet: zero exceptions
@lucasbradstreet: hi, question about :onyx/input-retry-timeout vs :onyx/pending-timeout
the first is default 1000, the second is 60000. is it possible that we’re getting retries because we’re missing that first 1000ms window?
So the way it works is that the messages are put into buckets of pending-timeout/input-retry-timeout time periods
So 60 buckets by default. Every input-retry-timeout a bucket will expire.
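(A quick sketch of the arithmetic behind that, using the default values mentioned above; the bucket mechanism itself is internal to the input plugins, so this is illustration only:)
```clojure
;; Illustration only, not plugin source. Pending messages are spread across
;; (pending-timeout / input-retry-timeout) buckets, and one bucket expires
;; every input-retry-timeout.
(let [pending-timeout     60000  ; :onyx/pending-timeout default, ms
      input-retry-timeout 1000]  ; :onyx/input-retry-timeout default, ms
  (/ pending-timeout input-retry-timeout))
;; => 60 buckets
```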
ok, i see. the reason i ask is because we have this situation right now:
retries a lot sooner than a minute, from what i can tell
You say that because you match up the read log throughput with when the retries happen?
right after the vertical line on the graph is a jvm restart
we had retries very soon after the system came back up. are retries somehow stateful in ZK?
forgive potentially silly questions - i’m a little anxious
No problem.
Definitely seems odd to immediately get retries. Could've been within a minute on that graph though?
You shouldn't get them immediately. I could see getting them after the first minute if the system reads in too much initially
it’s quite possible
What's the scale on the input retries graph? Is that 8, as in actually 8?
that’s 8 retries
right now we’re at a batch count of 250, 100ms batch timeout. two tasks in particular always have very high latencies when the system becomes stressed, but those high numbers aren’t reflected in the INPUT complete latency numbers
my anxiety comes from not understanding how this situation can arise
Yeah I don't understand it either. Very curious about what is happening in chat links
is it that chat_links is in some sort of retry loop?
and it just keeps doing more and more work without ever finishing anything?
Oh I know why complete latency isn't reflected. Chat links takes too long. Then they get retried before completion
But retries aren't really included in the complete latency statistics, tho it should probably be maintained (currently no good way to do this)
Yes that seems to make sense
ok. so our current :onyx/pending-timeout value, the 60s default is, for whatever reason, too low
Retry loop would make sense. You could increase the pending-timeout to give it more time to finish but you should probably figure out what chat links is doing
And maybe reduce the batch size
are retries at the task or the workflow layer?
So at least some finish
Input task layer. Part of every plugin
Reduce batch size for chat links
I mean, is it all calculation in chat links or is it hitting up something IO related?
so in our case it’d be something like: input > prepare > tracks > chat links FAIL → input > prepare > tracks > chat links FAIL → input > prepare > tracks > chat links FAIL
it’s all datomic queries + producing a transaction for commit-tx to eventually do
Ah. Unless you're using the batching functionality to do multiple queries at once you should probably reduce the batch size then. And maybe optimise the query.
If you're doing the query from onyx/fn then batching it with a high batch size could be hurting you by preventing any of the segments making it through in time
ok, so with a batch size of 250 we’re asking it to do 250 concurrent sets of queries in 60s
seems obvious now 😩
If you're doing it in onyx/fn you're asking it to do 250 sequential queries within 60 seconds
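(Roughly what that looks like: a minimal sketch of a per-segment Datomic query inside an :onyx/fn. The namespace, connection URI, query and attribute names below are made up, not the real chat_links code:)
```clojure
(ns my.app.tasks ;; hypothetical namespace
  (:require [datomic.api :as d]))

(def conn (d/connect "datomic:dev://localhost:4334/my-db")) ;; made-up URI

;; Hypothetical :onyx/fn for a task like chat_links. The peer works through
;; its batch sequentially, so at ~2s per query a batch of 250 blows well
;; past a 60s :onyx/pending-timeout.
(defn chat-links [segment]
  (let [db    (d/db conn)
        links (d/q '[:find ?url
                     :in $ ?msg-id
                     :where
                     [?m :message/id  ?msg-id]
                     [?m :message/url ?url]]
                   db (:message/id segment))]
    (assoc segment :tx-data (mapv (fn [[url]] {:link/url url}) links))))
```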
no wonder
You may be better off by increasing the number of peers dedicated to that task
And reducing the batch size
Oh boy I'm here for once when the action is happening. Hi guys. 😄
That'll give you a bit of parallelism
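(In catalog terms that advice looks something like the entry below; the task name, fn and numbers are illustrative only:)
```clojure
;; Illustrative catalog entry; :chat-links, the fn and the numbers are made up.
{:onyx/name          :chat-links
 :onyx/fn            :my.app.tasks/chat-links
 :onyx/type          :function
 :onyx/batch-size    50    ;; smaller batches so some segments finish in time
 :onyx/batch-timeout 100
 :onyx/n-peers       2}    ;; dedicate more peers to the task for parallelism
```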
i’m starting with tripling the timeout and reducing the batch to 50 again. healthy first, then fast
Sounds reasonable
And maybe see if you can improve those queries later
Oliver is stress-testing from tomorrow. we’ll optimise what work we do for what events, and how efficiently we do that work, once we have that data
End diagnosis looks like you hit some load, so things queue up and you hit a big batch, which isn't finished before the pending timeout, which causes retries and it all happens all over again
@lucasbradstreet: i set an :onyx/pending-timeout of 180000, and when i restarted the system, it didn’t read from the datomic log at all
removing the value from the catalog, it reads from the log again
@robert-stuttaford: input retry timeout of 1000 still?
Oh I know why
It should have thrown an exception about http://www.onyxplatform.org/cheat-sheet.html#peer-config/:onyx.messaging/ack-daemon-timeout
Because they interact and ack-daemon-timeout has to be bigger than all the pending timeouts in your input tasks
ok. didn’t get any exceptions in Yeller
Hmm k. You're using the yeller timbre appender?
i don’t think so
we have a flow condition set up, and a Thread/setDefaultUncaughtExceptionHandler set up as well
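(For reference, a default uncaught-exception handler in Clojure looks roughly like this; it only sees exceptions that escape a thread, which is why exceptions Onyx catches and logs never reach it or Yeller:)
```clojure
;; Sketch of a default uncaught exception handler wired to an error reporter.
;; `report-to-yeller!` is a placeholder for whatever reporting fn you use.
(Thread/setDefaultUncaughtExceptionHandler
  (reify Thread$UncaughtExceptionHandler
    (uncaughtException [_ thread ex]
      (report-to-yeller! {:thread (.getName thread) :exception ex}))))
```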
Hmm. We catch a lot of exceptions and log them
For example you submit a job that has had parameters like the above one and we'll catch it and log it
clojure.lang.ExceptionInfo: Pending timeout cannot be greater than acking daemon timeout
clojure.lang.ExceptionInfo: Pending timeout cannot be greater than acking daemon timeout
clojure.lang.ExceptionInfo: Pending timeout cannot be greater than acking daemon timeout
there they are
ok. is it possible to have these go to Yeller? it sounds like it might be?
what’s the largest i could make this timeout without violating the acking daemon limit?
60000. You need to increase the acking daemon timeout via the peer config
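(Sketch of that constraint; only the relevant peer-config key is shown, everything else omitted:)
```clojure
;; Partial peer-config sketch. :onyx.messaging/ack-daemon-timeout must be >=
;; the largest :onyx/pending-timeout used by any input task, otherwise job
;; submission fails with the exception above.
{:onyx.messaging/ack-daemon-timeout 180000}
;; ...which would then allow a catalog entry with :onyx/pending-timeout 180000.
```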
ok, thank you
If you add the above dependency and then configure the log config as discussed here http://onyx.readthedocs.org/en/latest/user-guide/logging/ then they should end up in Yeller
is there a trick to display all sections of http://www.onyxplatform.org/cheat-sheet.html at the same time, to make search easy?
Unfortunately not atm. I'll add that though. I'm also adding tags soon
I'm going to discuss making the acking daemon timeout bigger with Mike. There should probably be some headroom for jobs to choose a bigger pending-timeout at runtime without a redeploy
great. we’ll stick with 60s for now
Yeah I think reducing the batch size will help reduce the pathological behaviour anyway. I assume that means you're not doing a redeploy to make these changes then?
we did have to, as I had to parameterise the pending-timeout
regarding the logging thing - is there a specific configuration setting to add to enable the yeller appenders?
-facepalm- of course, https://github.com/yeller/yeller-timbre-appender#setup this stuff, right
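(A rough sketch of wiring that up; the yeller appender namespace, constructor name and options below are assumptions, so check the README linked above for the real setup, and :onyx.log/config is the key described in the Onyx logging guide:)
```clojure
;; Sketch only: `make-yeller-appender` and its options are assumed here --
;; consult https://github.com/yeller/yeller-timbre-appender#setup for the
;; actual constructor. Onyx accepts a Timbre config map via :onyx.log/config.
(require '[yeller.timbre-appender :as yeller])

(def peer-config
  {;; ...rest of the peer config elided...
   :onyx.log/config
   {:appenders {:yeller (yeller/make-yeller-appender
                          {:token "YOUR-YELLER-TOKEN"})}}})
```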
@lucasbradstreet: increasing the peer-count, this is via min-peers and max-peers, or more simply n-peers, right? so to double the concurrency for pressured tasks, simply give them :onyx/n-peers 2
(and make sure the (onyx/start-peers count group) call receives enough peers to satisfy it)
correct
though parallelism is probably a better word
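(i.e. something like the following, where 12 is a made-up total covering whatever the per-task peer requirements add up to across the job:)
```clojure
;; Hedged sketch: the number passed to start-peers must cover the sum of the
;; per-task peer requirements (n-peers / min-peers) across the whole job.
;; peer-config is assumed defined elsewhere as usual.
(require '[onyx.api])

(def peer-group (onyx.api/start-peer-group peer-config))
(def v-peers    (onyx.api/start-peers 12 peer-group))
```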
At least it’s making some progress?
yes, it’s keeping up
i’m busy increasing the peer-count for those 2 laggard tasks
when you increase the peer-count, you should consider reducing the batch-size for the task prior to it
The reason is that if you have a task that has a batch of 50, and it produces 500 segments, all 500 will be sent to a single task, because we don’t currently chop up the outgoing batch
so if you have a prior task that generates a lot of outgoing segments in one batch, then you may still end up with all your segments on a single peer
we have plans to change how this works, but it’s something to keep in mind
ok. we don’t have that situation, i think
we almost always pluck out one datom from a transaction, and that turns into a segment. we then run it through several tasks, each one appending some tx-data for commit-tx at the end, and each one using datomic’s d/with with the upstream tx-data to perform queries
no problem then
the only task to receive more than 1 item is the last one; commit-tx
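(From that description, the job is shaped roughly like the workflow below; the task names are the ones mentioned in this conversation, the exact edges are guessed, and the d/with helper is just a sketch of "query against the upstream tx-data":)
```clojure
;; Guessed workflow shape; only the task names come from this conversation.
(def workflow
  [[:read-log         :prepare]
   [:prepare          :tracks]
   [:tracks           :chat-links]
   [:tracks           :percent-complete]
   [:chat-links       :commit-tx]
   [:percent-complete :commit-tx]])

;; Sketch of querying against a speculative db that includes the tx-data
;; accumulated by upstream tasks (datomic.api/with).
(require '[datomic.api :as d])
(defn upstream-db [conn segment]
  (:db-after (d/with (d/db conn) (:tx-data segment))))
```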
What we currently do generally works but it was still worth mentioning
for sure
Yeah, commit-tx is the one which really gets the benefit from batching so that’s good anyway
600,000 ms for a batch of 250 means your queries are taking 2+ seconds per segment
is that reasonable?
I realise you’re in triage mode currently
so no need to respond right away
it’s possible. it seems unlikely, but it’s possible
the task is clearly doing too much
pretty sure it’s in a retry-nado again, with a batch-size of 50
Yeah, seems like either you need more computing power behind it, or it could be written better, or you need to revise what it’s doing and when it’s doing it
same task exhibiting the high batch latency?
it just seems so damn odd that so little work can stress it so much
the regular spikes in word_count, response_count, init_stats are interesting
but yeah, the two culprits are chat_links and percent_complete
mmm, it’s 100K now, but that’s clearly too high still
thinking about it
100K = 100s
and 100K is around 1/5th of what you had before
with 250
so it’s clearly still too high
current batch + timeout gives each task 12.5 seconds to do its thing
which is way plenty enough
Agreed
But, you were getting 600K, as in 600s per batch
for chat_links
which is over 2s per segment
sorry, this batch count is now at 50
Yeah, so you reduced batch size from 250 to 50, so assuming full batches
it should take 1/5 the time if the batch_latency holds right
Which is roughly what you’re seeing at 600K -> 100K
which is still too high for the segments to be finished in time
with the amount of time that these queries are taking you could probably reduce batch size to 1 for that task and you wouldn’t have any perf impact
your main cost is the query you’re doing in that function
the 60s throughput numbers are 20, 30
hmm ok
reducing batch size further may help somewhat there then but in that case your chat_links performance is abysmal
it is abysmal
we should monitor average batch sizes too
it’s actually retrying the same 30 over and over
are you hitting any hot keys?
Or more generally, ids/queries that have a lot of activity where the queries might take longer
I’d say reduce batch size on that task to 10 and increase the acking-daemon-timeout
in your peer config
and figure out what’s going on with that task. Parallelism will help a bit but I’d want to figure out whether I could improve perf for that task first
thanks Lucas
we’ve got our homework to do
your advice is, as always, tremendously helpful and much appreciated
no worries. I probably should’ve pointed out your particular tasks as the issue earlier. After the metrics issue, I was a bit too focused on issues in our code this time around 😄
haha yes i know that feeling
that’s part of the issue - knowing where the issue actually is!
we’re slowly building up a flow chart of sorts
for possible things to look for
one thing i do think might be useful is a way to limit the number of times segments are retried
because now the only way to deal with these 30 is to restart
and hope that they make it through
(until such time as we can do all the longer term fixing)
Yeah, I’m hesitant to do that by default because it means data loss
but I do understand why you might want that in the short term
if we could specify a max retry count, that would help us to keep a stable system up for the rest of the data
ah, yes, true
I think some kind of exponential backoff might be better
in our world, we already have all the data. we’d just find and re-queue missed txes
though even that is troublesome
we can’t suffer data loss (thank datomic!)
yeah, with kafka it’s similar
at least you can replay that part of the log
we have various improvements in the works that will make this work better
for example, we’re going to track the times that segments are put on their queues, so if queues lag and segments are already being retried we’ll skip them
i’m sure that’d help a lot
it’ll make things recover a lot better in the case of retry storms
we’ll get there
so, you mention data loss. wouldn’t users of less immutable input queues lose data anyway, due to also having to restart?
most of the plugins have some level of durability
for example, with kafka, we checkpoint where we are in the log, so after a restart we start back up there
with sql we checkpoint column ranges to ZK as we finish them. This is okay, though the DB isn’t immutable so things may change underneath us
nothing we can do about that though
ok - so how does infinite retry vs finite retry play into that?
with finite retry we’d essentially have to checkpoint that the entry was finished
I mean even with datomic you’d still have to do some programming work to fill in the gaps with finite retries
we give you that for free with the plugin if you use infinite retries
it’s the same with kafka
are retries somehow stateful in ZK?
just restarted now, and got 5 retries almost immediately
so, the way it works is that when we ack messages as complete
just waiting for another metrics tick and then i’ll share a graph
we update an index in ZK which is basically the highest fully acked log id
so say you have txes 1-1000 and 1-994, 997, 998 are all acked
then we update in ZK that up to 994 has been acked
if you restart at this point then we set the log reader plugin to start at 995
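(A small sketch of that bookkeeping, not the plugin’s actual code:)
```clojure
;; The checkpoint written to ZK is the highest id such that every id at or
;; below it has been acked. Illustration only, not onyx-datomic source.
(defn highest-fully-acked [acked-ids start-id]
  (loop [id start-id]
    (if (contains? acked-ids (inc id))
      (recur (inc id))
      id)))

;; Lucas's example: txes 1-1000 exist, 1-994 plus 997 and 998 are acked.
(highest-fully-acked (into #{997 998} (range 1 995)) 0)
;; => 994, so after a restart the read-log task resumes at 995
```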
ok. i think that’s probably exactly what happened now
what did you change?
just a restart
no config changes
What’s your priority? getting past these hard entries?
yes. presumably having them retrying is reducing our capability to process novelty
OK, if you have any control over your job submission data without a redeploy, then reduce max-pending really low for a bit
I mean if you want to sort out your temporary issues
that would force the system to process a very small workload at a time, right?
backpressuring all the way to the read-log
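(Sketched as a read-log catalog entry; the plugin keyword and the numbers are assumptions, and the datomic-specific keys are omitted:)
```clojure
;; Illustrative input entry only; the :onyx/plugin value and numbers are
;; assumed, and the datomic-specific keys (:datomic/uri etc.) are left out.
{:onyx/name            :read-log
 :onyx/plugin          :onyx.plugin.datomic/read-log
 :onyx/type            :input
 :onyx/medium          :datomic
 :onyx/max-peers       1
 :onyx/batch-size      10
 :onyx/max-pending     50      ;; temporarily very low => hard backpressure
 :onyx/pending-timeout 60000}
```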
ok, giving that a go
my system is constipated 😞
Haha, yeah. I realised that maybe you just want to fix your issue first
yes, thanks
ok, now we have flat-lined retries and chat-links
waiting another minute for another tick to be sure
Cool, you’ll want to increase it again at some point otherwise latency for your users might go up
It’ll be invisible metrics wise because unread log entries won’t contribute to the stats
instead of going to 1000, i’m going back up to 250
for max-pending
at least until we can get more cores and peers underneath it all
that’s reasonable
with how long your tasks are taking and the fact that you’re on core.async (so latency is super low), you could probably reduce it even further and not lose out on throughput
makes sense overall though
do you have any Onyxen in production, Lucas?
if so, i’m curious what you’re using for alerts ala http://pagerduty.com etc
nothing currently. No time for side projects at the moment 😞
@lucasbradstreet: just confirming that the default peer count per task is 1, assuming there are enough peers to go around?
If you have exactly the same number of peers as tasks then yes. Exactly one per task. Otherwise it'll be based on the scheduler and max-peers/min-peers
perfect, thanks
i want to keep our config nimble, so i’m going to set defaults for the lot, and specifics where needed
what’s more idiomatic - to set max-peers, or n-peers? e.g. for our input or send-email tasks
n-peers is more specific. I would generally use n-peers unless you want some scheduler flexibility. For your input task you can only use one peer regardless so it doesn't make much difference.
I'd make the number of peers you start a parameter, or make it high enough to cover some run time flexibility
thanks. does it make sense to provide 3 n-peers, or should it scale 1, 2, 4?
i want to give the ones that had moderate pressure 2, and the two that had large pressure 3 or 4
You can use any number assuming there aren't any plugin restrictions (e.g. when using input plugins)
fyi, @lucasbradstreet Read log tasks must set :onyx/max-peers 1
got yeller+timbre logging in, an 8 core box with 14gb ram for the JVM process, and distributed 10 additional peers into the tasks that were most impacted
Ah with n-peers? Ok. I should improve the validation for that plugin
we’ll continue to do daily restarts until we’ve made some headway with you on the leak
n-peers is a new option
Alrighty
but those changes should dramatically raise our performance ceiling
interesting… in these graphs, we’re on more ram and cores, but not yet on additional peers. see the throughput spike top right? that’s 200 txes a min, and it barely blinked
You were oversubscribed before
So that kinda makes sense
i guess our earlier perf issues that led to retries had to do with the size of the datasets involved in the queries for those particular events
well, i think we’re in a much, much better place now
thank you for your patience, Lucas. it takes a little time for the things that are probably very clear to you to sink in over here, sometimes
it’s going to be a tremendously interesting week as we work on the queries and sim-test
I'm glad it's working out. I'll definitely be interested in how your query improvements go