#onyx
2015-11-23
robert-stuttaford08:11:52

@lucasbradstreet: zero exceptions 🙂

robert-stuttaford12:11:38

@lucasbradstreet: hi 🙂 question about :onyx/input-retry-timeout

robert-stuttaford12:11:47

vs :onyx/pending-timeout

robert-stuttaford12:11:13

the first is default 1000, the second is 60000. is it possible that we’re getting retries because we’re missing that first 1000ms window?

lucasbradstreet12:11:25

So the way it works is that pending messages are put into pending-timeout/input-retry-timeout buckets, each covering one input-retry-timeout period

lucasbradstreet12:11:00

So 60 buckets by default. Every input-retry-timeout a bucket will expire.
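A sketch of where those two knobs live on an input catalog entry, using the defaults under discussion (the plugin and medium keys are as in the onyx-datomic plugin, but treat the whole fragment as illustrative):

```clojure
;; Input catalog entry (sketch). With these defaults, pending segments are
;; spread across 60000 / 1000 = 60 buckets; every :onyx/input-retry-timeout ms
;; one bucket expires and its still-unacked segments are retried.
{:onyx/name :read-log
 :onyx/plugin :onyx.plugin.datomic/read-log
 :onyx/type :input
 :onyx/medium :datomic
 :onyx/pending-timeout 60000     ;; default: retry a segment still unacked after 60s
 :onyx/input-retry-timeout 1000  ;; default: expire one bucket every 1s
 :onyx/batch-size 250}
```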

robert-stuttaford12:11:29

ok, i see. the reason i ask is because we have this situation right now:

robert-stuttaford12:11:00

retries a lot sooner than a minute, from what i can tell

lucasbradstreet12:11:14

You say that because you match up the read log throughput with when the retries happen?

robert-stuttaford12:11:41

right after the vertical line on the graph is a jvm restart

robert-stuttaford12:11:12

we had retries very soon after the system came back up. are retries somehow stateful in ZK?

robert-stuttaford12:11:07

forgive potentially silly questions - i’m a little anxious 🙂

lucasbradstreet12:11:17

Definitely seems odd to immediately get retries. Could've been within a minute on that graph though?

lucasbradstreet12:11:40

You shouldn't get them immediately. I could see getting them after the first minute if the system reads in too much initially

robert-stuttaford12:11:48

it’s quite possible

lucasbradstreet12:11:21

What's the scale on the input retries graph? Is that 8, as in actually 8?

robert-stuttaford12:11:40

that’s 8 retries

robert-stuttaford12:11:44

right now we’re at a batch count of 250, 100ms batch timeout. two tasks in particular always have very high latencies when the system becomes stressed, but those high numbers aren’t reflected in the INPUT complete latency numbers

robert-stuttaford12:11:17

my anxiety comes from not understanding how this situation can arise

lucasbradstreet12:11:42

Yeah I don't understand it either. Very curious about what is happening in chat links

robert-stuttaford12:11:51

is it that chat_links is in some sort of retry loop?

robert-stuttaford12:11:04

and it just keeps doing more and more work without ever finishing anything?

lucasbradstreet12:11:27

Oh I know why complete latency isn't reflected. Chat links takes too long. Then they get retried before completion

lucasbradstreet12:11:52

But retries aren't really included in the complete latency statistics, though they probably should be tracked (currently no good way to do this)

lucasbradstreet12:11:03

Yes that seems to make sense

robert-stuttaford12:11:41

ok. so our current :onyx/pending-timeout value, the 60s default is, for whatever reason, too low

lucasbradstreet12:11:48

Retry loop would make sense. You could increase the pending-timeout to give it more time to finish but you should probably figure out what chat links is doing

lucasbradstreet12:11:01

And maybe reduce the batch size
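The two changes being suggested can be sketched as a catalog fragment; the task and function names mirror the conversation, but the `:my.app/chat-links` fn is hypothetical and the exact values are illustrative:

```clojure
;; Sketch: give in-flight segments more time before retry (input task only)
;; and shrink the batch on the slow processing task.
[{:onyx/name :read-log
  :onyx/plugin :onyx.plugin.datomic/read-log
  :onyx/type :input
  :onyx/medium :datomic
  :onyx/pending-timeout 180000  ;; 3x the 60s default; needs acking-daemon headroom
  :onyx/batch-size 250}

 {:onyx/name :chat-links
  :onyx/fn :my.app/chat-links   ;; hypothetical fn name
  :onyx/type :function
  :onyx/batch-size 50           ;; down from 250: less work per pending-timeout window
  :onyx/batch-timeout 100}]
```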

robert-stuttaford12:11:03

are retries at the task or the workflow layer?

lucasbradstreet12:11:07

So at least some finish

lucasbradstreet12:11:21

Input task layer. Part of every plugin

lucasbradstreet12:11:32

Reduce batch size for chat links

lucasbradstreet12:11:51

Is it all calculation in chat links or is it hitting up something IO related

robert-stuttaford12:11:54

so in our case it’d be something like: input > prepare > tracks > chat links FAIL, input > prepare > tracks > chat links FAIL, input > prepare > tracks > chat links FAIL, and so on

robert-stuttaford12:11:16

it’s all datomic queries + producing a transaction for commit-tx to eventually do

lucasbradstreet12:11:16

Ah. Unless you're using the batching functionality to do multiple queries at once you should probably reduce the batch size then. And maybe optimise the query.

lucasbradstreet12:11:59

If you're doing the query from onyx/fn then batching it with a high batch size could be hurting you by preventing any of the segments making it through in time

robert-stuttaford12:11:21

ok, so with a batch size of 250 we’re asking it to do 250 concurrent sets of queries in 60s

robert-stuttaford12:11:06

seems obvious now 😩

lucasbradstreet12:11:26

If you're doing it in onyx/fn you're asking it to do 250 sequential queries within 60 seconds

lucasbradstreet12:11:05

You may be better off by increasing the number of peers dedicated to that task

lucasbradstreet12:11:09

And reducing the batch size

michaeldrogalis12:11:18

Oh boy I'm here for once when the action is happening. Hi guys. 😄

lucasbradstreet12:11:36

That'll give you a bit of parallelism

robert-stuttaford12:11:53

i’m starting with tripling the timeout and reducing the batch to 50 again. healthy first, then fast

lucasbradstreet12:11:04

Sounds reasonable

lucasbradstreet12:11:13

And maybe see if you can improve those queries later

robert-stuttaford12:11:55

Oliver is stress-testing from tomorrow. we’ll optimise what work we do for what events, and how efficiently we do that work, once we have that data

lucasbradstreet12:11:04

End diagnosis looks like you hit some load, so things queue up and you hit a big batch, which isn't finished before the pending timeout, which causes retries, and then it all happens over again

robert-stuttaford13:11:12

@lucasbradstreet: i set an :onyx/pending-timeout of 180000, and when i restarted the system, it didn’t read from the datomic log at all

robert-stuttaford13:11:24

removing the value from the catalog, it reads from the log again

lucasbradstreet13:11:09

@robert-stuttaford: input retry timeout of 1000 still?

lucasbradstreet13:11:58

Because they interact and ack-daemon-timeout has to be bigger than all the pending timeouts in your input tasks

robert-stuttaford13:11:16

ok. didn’t get any exceptions in Yeller

lucasbradstreet14:11:39

Hmm k. You're using the yeller timbre appender?

robert-stuttaford14:11:26

i don’t think so

robert-stuttaford14:11:44

we have a flow condition set up, and a Thread/setDefaultUncaughtExceptionHandler set up as well

lucasbradstreet14:11:18

Hmm. We catch a lot of exceptions and log them

lucasbradstreet14:11:46

For example you submit a job that has had parameters like the above one and we'll catch it and log it

robert-stuttaford14:11:06

clojure.lang.ExceptionInfo: Pending timeout cannot be greater than acking daemon timeout

robert-stuttaford14:11:29

ok. is it possible to have these go to Yeller? it sounds like it might be?

robert-stuttaford14:11:05

what’s the largest i could make this timeout without violating the acking daemon limit?

lucasbradstreet14:11:27

60000. You need to increase the acking daemon timeout via the peer config
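A sketch of that peer-config change; the `:onyx.messaging/ack-daemon-timeout` key name is from memory of the Onyx cheat sheet, so verify it against your version, and the surrounding keys are just placeholders for an existing config:

```clojure
;; Peer config fragment (sketch). The acking daemon timeout must be larger
;; than the biggest :onyx/pending-timeout used by any input task.
{:onyx/id onyx-id                           ;; your existing cluster id
 :onyx.messaging/impl :aeron
 :onyx.messaging/ack-daemon-timeout 480000  ;; headroom well above a 180s pending-timeout
 :onyx.peer/job-scheduler :onyx.job-scheduler/greedy}
```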

lucasbradstreet14:11:27

If you add the above dependency and then configure the log config as discussed here http://onyx.readthedocs.org/en/latest/user-guide/logging/
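Very roughly, that wiring looks like the following. The `:onyx.log/config` key is the one described in the logging guide linked above; `make-yeller-appender` and its `:token` option are stand-ins (check the Yeller Timbre appender's README for the real constructor and options):

```clojure
;; Peer/env config fragment (sketch). :onyx.log/config takes a Timbre config
;; map; make-yeller-appender is a hypothetical name for the real appender
;; constructor from the Yeller Timbre library.
{:onyx.log/config
 {:appenders
  {:yeller (make-yeller-appender {:token "YOUR-YELLER-API-KEY"})}}}
```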

robert-stuttaford14:11:30

is there a trick to display all sections of http://www.onyxplatform.org/cheat-sheet.html at the same time, to make search easy?

lucasbradstreet14:11:57

Unfortunately not atm. I'll add that though. I'm also adding tags soon

lucasbradstreet14:11:04

I'm going to discuss making the acking daemon timeout bigger with Mike. There should probably be some headroom for jobs to choose a bigger pending-timeout at runtime without a redeploy

robert-stuttaford14:11:52

great. we’ll stick with 60s for now

lucasbradstreet14:11:20

Yeah I think reducing the batch size will help reduce the pathological behaviour anyway. I assume that means you're not doing a redeploy to make these changes then?

robert-stuttaford14:11:03

we did have to, as I had to parameterise the pending-timeout

robert-stuttaford14:11:32

regarding the logging thing - is there a specific configuration setting to add to enable the yeller appenders?

robert-stuttaford14:11:57

@lucasbradstreet: increasing the peer-count, this is via min-peers and max-peers, or more simply n-peers, right? so to double the concurrency for pressured tasks, simply give them :onyx/n-peers 2 (and make sure the (onyx/start-peers count group) receives enough to satisfy it)
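That reading can be sketched as a catalog entry plus the peer-count bookkeeping; the fn name is hypothetical and the numbers are only examples:

```clojure
;; Sketch: pin two peers to a pressured task via :onyx/n-peers.
{:onyx/name :chat-links
 :onyx/fn :my.app/chat-links  ;; hypothetical fn name
 :onyx/type :function
 :onyx/n-peers 2              ;; exactly two peers, i.e. 2x task parallelism
 :onyx/batch-size 50}

;; ...and start enough peers overall to satisfy every task's requirement:
;; (onyx.api/start-peers total-required-peers peer-group)
```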

lucasbradstreet14:11:17

though parallelism is probably a better word

lucasbradstreet14:11:30

At least it’s making some progress?

robert-stuttaford14:11:52

yes, it’s keeping up

robert-stuttaford14:11:07

i’m busy increasing the peer-count for those 2 laggard tasks

lucasbradstreet14:11:20

when you increase the peer-count, you should consider reducing the batch-size for the task prior to it

lucasbradstreet14:11:14

The reason is that if you have a task that has a batch of 50, and it produces 500 segments, all 500 will be sent to a single peer, because we don’t currently chop up the outgoing batch

lucasbradstreet14:11:39

so if you have a prior task that generates a lot of outgoing segments in one batch, then you may still end up with all your segments on a single peer

lucasbradstreet14:11:51

we have plans to change how this works, but it’s something to keep in mind

robert-stuttaford14:11:05

ok. we don’t have that situation, i think

robert-stuttaford14:11:14

we almost always pluck out one datom from a transaction, and that turns into a segment. we then run it through several tasks, each one appending some tx-data for commit-tx at the end, and each one using datomic’s d/with with the upstream tx-data to perform queries

robert-stuttaford14:11:35

the only task to receive more than 1 item is the last one; commit-tx

lucasbradstreet14:11:55

What we currently do generally works but it was still worth mentioning

lucasbradstreet14:11:09

Yeah, commit-tx is the one which really gets the benefit from batching so that’s good anyway

lucasbradstreet15:11:32

600,000 ms for a batch of 250 means your queries are taking 2+ seconds per segment

lucasbradstreet15:11:34

is that reasonable?

lucasbradstreet15:11:47

I realise you’re in triage mode currently

lucasbradstreet15:11:53

so no need to respond right away

robert-stuttaford15:11:52

it’s possible. it seems unlikely, but it’s possible

robert-stuttaford15:11:58

the task is clearly doing too much

robert-stuttaford15:11:40

pretty sure it’s in a retry-nado again, with a batch count of 50

lucasbradstreet15:11:45

Yeah, seems like either you need more computing power behind it, or it could be written better, or you need to revise what it’s doing and when it’s doing it

lucasbradstreet15:11:14

same task exhibiting the high batch latency?

robert-stuttaford15:11:32

it just seems so damn odd that so little work can stress it so much

robert-stuttaford15:11:08

the regular spikes in word_count, response_count, init_stats are interesting

robert-stuttaford15:11:18

but yeah, the two culprits are chat_links and percent_complete

lucasbradstreet15:11:21

mmm, it’s 100K now, but that’s clearly too high still

lucasbradstreet15:11:23

thinking about it

lucasbradstreet15:11:39

and 100K is around 1/5th of what you had before

lucasbradstreet15:11:51

so it’s clearly still too high

robert-stuttaford15:11:04

current batch + timeout gives each task 12.5 seconds to do its thing

robert-stuttaford15:11:19

which is way more than enough

lucasbradstreet15:11:38

But, you were getting 600K, as in 600s per batch

lucasbradstreet15:11:54

which is over 2s per segment

robert-stuttaford15:11:07

sorry, this batch count is now at 50

lucasbradstreet15:11:25

Yeah, so you reduced batch size from 250 to 50, so assuming full batches

lucasbradstreet15:11:38

it should take 1/5 the time if the batch_latency holds right

lucasbradstreet15:11:53

Which is roughly what you’re seeing at 600K -> 100K

lucasbradstreet15:11:15

which is still too high for the segments to be finished in time

lucasbradstreet15:11:48

with the amount of time that these queries are taking you could probably reduce batch size to 1 for that task and you wouldn’t have any perf impact

lucasbradstreet15:11:57

your main cost is the query you’re doing in that function

robert-stuttaford15:11:02

the 60s throughput numbers are 20, 30

lucasbradstreet15:11:53

reducing batch size further may help somewhat there then but in that case your chat_links performance is abysmal

lucasbradstreet15:11:05

we should monitor average batch sizes too

robert-stuttaford15:11:14

it’s actually retrying the same 30 over and over

lucasbradstreet15:11:28

are you hitting any hot keys?

lucasbradstreet15:11:02

Or more generally, ids/queries that have a lot of activity where the queries might take longer

lucasbradstreet15:11:30

I’d say reduce batch size on that task to 10 and increase the acking-daemon-timeout

lucasbradstreet15:11:34

in your peer config

lucasbradstreet15:11:09

and figure out what’s going on with that task. Parallelism will help a bit but I’d want to figure out whether I could improve perf for that task first

robert-stuttaford15:11:47

we’ve got our homework to do

robert-stuttaford15:11:04

your advice is, as always, tremendously helpful and much appreciated

lucasbradstreet15:11:06

no worries. I probably should’ve pointed to your particular tasks as the issue earlier. After the metrics issue, I was a bit too focused on issues in our code this time around 😄

robert-stuttaford15:11:30

haha yes i know that feeling

robert-stuttaford15:11:49

thats part of the issue - knowing where the issue actually is!

lucasbradstreet15:11:27

we’re slowly building up a flow chart of sorts

lucasbradstreet15:11:34

for possible things to look for

robert-stuttaford15:11:27

one thing i do think might be useful is a way to limit the amount of times segments are retried

robert-stuttaford15:11:38

because now the only way to deal with these 30 is to restart

robert-stuttaford15:11:42

and hope that they make it through

robert-stuttaford15:11:57

(until such time as we can do all the longer term fixing)

lucasbradstreet15:11:12

Yeah, I’m hesitant to do that by default because it means data loss

lucasbradstreet15:11:20

but I do understand why you might want that in the short term

robert-stuttaford15:11:36

if we could specify a max retry count, that would help us to keep a stable system up for the rest of the data

lucasbradstreet15:11:48

I think some kind of exponential backoff might be better

robert-stuttaford15:11:58

in our world, we already have all the data. we’d just find and re-queue missed txes

lucasbradstreet15:11:02

though even that is troublesome

robert-stuttaford15:11:16

we can’t suffer data loss (thank datomic!)

lucasbradstreet15:11:29

yeah, with kafka it’s similar

lucasbradstreet15:11:35

at least you can replay that part of the log

lucasbradstreet15:11:27

we have various improvements in the works that will make this work better

lucasbradstreet15:11:18

for example, we’re going to track the times that segments are put on their queues, so if queues lag and segments are already being tried we’ll skip them

robert-stuttaford15:11:34

i’m sure that’d help a lot

lucasbradstreet15:11:35

it’ll make things recover a lot better in the case of retry storms

lucasbradstreet15:11:16

we’ll get there

robert-stuttaford15:11:27

so, you mention data loss. wouldn’t users of less immutable input queues lose data anyway, due to also having to restart?

lucasbradstreet15:11:05

most of the plugins have some level of durability

lucasbradstreet15:11:30

for example, with kafka, we checkpoint where we are in the log so after restart we start back up there

lucasbradstreet15:11:04

sql we checkpoint column ranges to ZK as we finish them. This is okay, though the DB isn’t immutable so things may change underneath us

lucasbradstreet15:11:09

nothing we can do about that though

robert-stuttaford15:11:32

ok - so how does infinite retry vs finite retry play into that?

lucasbradstreet15:11:08

with finite retry we’d essentially have to checkpoint that the entry was finished

lucasbradstreet15:11:25

I mean even with datomic you’d still have to do some programming work to fill in the gaps with finite retries

lucasbradstreet15:11:39

we give you that for free with the plugin if you use infinite retries

lucasbradstreet15:11:46

it’s the same with kafka

robert-stuttaford15:11:41

are retries somehow stateful in ZK?

robert-stuttaford15:11:03

just restarted now, and got 5 retries almost immediately

lucasbradstreet15:11:05

so, the way it works is that when we ack messages as complete

robert-stuttaford15:11:16

just waiting for another metrics tick and then i’ll share a graph

lucasbradstreet15:11:26

we update an index in ZK which is basically the highest fully acked log id

lucasbradstreet15:11:49

so say you have txes 1-1000 and 1-994, 997, 998 are all acked

lucasbradstreet15:11:01

then we update in ZK that up to 994 has been acked

lucasbradstreet15:11:16

if you restart at this point then we set the log reader plugin to start at 995

robert-stuttaford15:11:34

ok. i think that’s probably exactly what happened now

lucasbradstreet15:11:15

what did you change?

robert-stuttaford15:11:23

no config changes

lucasbradstreet15:11:15

What’s your priority? getting past these hard entries?

robert-stuttaford15:11:45

yes. presumably having them retrying is reducing our capability to process novelty

lucasbradstreet15:11:17

OK, if you have any control over your job submission data without a redeploy, then reduce max-pending really low for a bit

lucasbradstreet15:11:59

I mean if you want to sort out your temporary issues

robert-stuttaford15:11:14

that would force the system to process a very small workload at a time, right?

robert-stuttaford15:11:27

backpressuring all the way to the read-log
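That throttle is an input-task setting; a minimal sketch, assuming the onyx-datomic read-log plugin keys and an illustrative value:

```clojure
;; Sketch: throttle the input task while digging out of a retry storm.
{:onyx/name :read-log
 :onyx/plugin :onyx.plugin.datomic/read-log
 :onyx/type :input
 :onyx/medium :datomic
 :onyx/max-pending 50  ;; far below the default: only 50 segments in flight at once
 :onyx/batch-size 50}
```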

robert-stuttaford15:11:37

ok, giving that a go

robert-stuttaford15:11:50

my system is constipated 😞

lucasbradstreet15:11:02

Haha, yeah. I realised that maybe you just want to fix your issue first

robert-stuttaford15:11:55

ok, now we have flat-lined retries and chat-links

robert-stuttaford15:11:02

waiting another minute for another tick to be sure

lucasbradstreet15:11:36

Cool, you’ll want to increase it again at some point otherwise latency for your users might go up

lucasbradstreet15:11:02

It’ll be invisible metrics wise because unread log entries won’t contribute to the stats

robert-stuttaford15:11:25

instead of going to 1000, i’m going back up to 250

robert-stuttaford15:11:46

at least until we can get more cores and peers underneath it all

lucasbradstreet15:11:53

that’s reasonable

lucasbradstreet15:11:30

with how long your tasks are taking and the fact that you’re on core.async (so latency is super low), you could probably reduce it even further and not lose out on throughput

lucasbradstreet15:11:38

makes sense overall though

robert-stuttaford15:11:07

do you have any Onyxen in production, Lucas?

robert-stuttaford15:11:47

if so, i’m curious what you’re using for alerts ala http://pagerduty.com etc

lucasbradstreet15:11:16

nothing currently. No time for side projects at the moment 😞

robert-stuttaford18:11:14

@lucasbradstreet: just confirming that the default peer count per task is 1, assuming there are enough peers to go around?

lucasbradstreet18:11:41

If you have exactly the same number of peers as tasks then yes. Exactly one per task. Otherwise it'll be based on the scheduler and max-peers/min-peers

robert-stuttaford18:11:31

i want to keep our config nimble, so i’m going to set defaults for the lot, and specifics where needed

robert-stuttaford18:11:08

what’s more idiomatic - to set max-peers, or n-peers? e.g. for our input or send-email tasks

lucasbradstreet18:11:38

n-peers is more specific. I would generally use n-peers unless you want some scheduler flexibility. For your input task you can only use one peer regardless so it doesn't make much difference.

lucasbradstreet18:11:09

I'd make the number of peers you start a parameter, or make it high enough to cover some run time flexibility

robert-stuttaford18:11:51

thanks 🙂 does it make sense to provide 3 n-peers, or should it scale 1, 2, 4?

robert-stuttaford18:11:22

i want to give the ones that had moderate pressure 2, and the two that had large pressure 3 or 4

lucasbradstreet18:11:45

You can use any number assuming there aren't any plugin restrictions (using input plugins)

robert-stuttaford18:11:49

fyi, @lucasbradstreet: “Read log tasks must set :onyx/max-peers 1” 🙂

robert-stuttaford19:11:52

got yeller+timbre logging in, an 8 core box with 14gb ram for the JVM process, and distributed 10 additional peers into the tasks that were most impacted

lucasbradstreet19:11:47

Ah with n-peers? Ok. I should improve the validation for that plugin

robert-stuttaford19:11:56

we’ll continue to do daily restarts until we’ve made some headway with you on the leak

lucasbradstreet19:11:59

n-peers is a new option

robert-stuttaford19:11:18

but those changes should dramatically raise our performance ceiling

robert-stuttaford19:11:16

interesting… in these graphs, we’re on more ram and cores, but not yet on additional peers. see the throughput spike top right? that’s 200 txes a min, and it barely blinked

lucasbradstreet19:11:48

You were oversubscribed before

lucasbradstreet19:11:55

So that kinda makes sense

robert-stuttaford19:11:24

i guess our earlier perf issues that led to retries had to do with the size of the datasets involved in the queries for those particular events

robert-stuttaford19:11:18

well, i think we’re in a much, much better place now

robert-stuttaford19:11:59

thank you for your patience, Lucas. it takes a little time, sometimes, for the things that are probably very clear to you to sink in over here 🙂

robert-stuttaford19:11:38

it’s going to be a tremendously interesting week as we work on the queries and sim-test

lucasbradstreet19:11:17

I'm glad it's working out. I'll definitely be interested in how your query improvements go