the first is default 1000, the second is 60000. is it possible that we’re getting retries because we’re missing that first 1000ms window?
So the way it works is that the messages are put into buckets of pending-timeout/input-retry-timeout time periods
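(For reference, a minimal sketch of where those two settings live, on an input task's catalog entry; the task name, plugin, and values are illustrative, not taken from the job being discussed.)

;; illustrative Datomic log-reader catalog entry
{:onyx/name :read-log
 :onyx/plugin :onyx.plugin.datomic/read-log
 :onyx/type :input
 :onyx/medium :datomic
 :onyx/batch-size 50
 :onyx/input-retry-timeout 1000 ;; retry-bucket granularity, default 1000 ms
 :onyx/pending-timeout 60000    ;; how long a segment may stay pending before retry, default 60000 ms
 :onyx/doc "reads the Datomic transaction log"}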
You say that because you match up the read log throughput with when the retries happen?
we had retries very soon after the system came back up. are retries somehow stateful in ZK?
Definitely seems odd to immediately get retries. Could've been within a minute on that graph though?
You shouldn't get them immediately. I could see getting them after the first minute if the system reads in too much initially
right now we’re at a batch count of 250, 100ms batch timeout. two tasks in particular always have very high latencies when the system becomes stressed, but those high numbers aren’t reflected in the INPUT complete latency numbers
Yeah I don't understand it either. Very curious about what is happening in chat links
and it just keeps doing more and more work without ever finishing anything?
Oh I know why complete latency isn't reflected. Chat links takes too long. Then they get retried before completion
But retries aren't really included in the complete latency statistics, tho it should probably be maintained (currently no good way to do this)
ok. so our current :onyx/pending-timeout value, the 60s default is, for whatever reason, too low
Retry loop would make sense. You could increase the pending-timeout to give it more time to finish but you should probably figure out what chat links is doing
so in our case it’d be something like input > prepare > tracks > chat links FAIL input > prepare > tracks > chat links FAIL input > prepare > tracks > chat links FAIL
it’s all datomic queries + producing a transaction for commit-tx to eventually do
Ah. Unless you're using the batching functionality to do multiple queries at once you should probably reduce the batch size then. And maybe optimise the query.
If you're doing the query from onyx/fn then batching it with a high batch size could be hurting you by preventing any of the segments making it through in time
ok so with a batch size of 250 we’re asking it to do 250 concurrent sets of queries in 60s
If you're doing it in onyx/fn you're asking it to do 250 sequential queries within 60 seconds
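(back of the envelope: for 250 sequential queries to finish inside the 60 s pending timeout, each one has to average under 60000 / 250 = 240 ms)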
You may be better off by increasing the number of peers dedicated to that task
i’m starting with tripling the timeout and reducing the batch to 50 again. healthy first, then fast
Oliver is stress-testing from tomorrow. we’ll optimise what work we do for what events, and how efficiently we do that work, once we have that data
End diagnosis looks like you hit some load, so things queue up and you hit a big batch, which isn't finished before the pending timeout, which causes retries and it all happens all over again
@lucasbradstreet: i set a pending-timeout of 180000, and when i restarted the system, it didn’t read from the datomic log at all
It should have thrown an exception about http://www.onyxplatform.org/cheat-sheet.html#peer-config/:onyx.messaging/ack-daemon-timeout
Because they interact and ack-daemon-timeout has to be bigger than all the pending timeouts in your input tasks
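(A minimal sketch of that relationship, with illustrative values; other required peer-config keys are omitted. The ack daemon timeout lives in the peer-config, and every input task's :onyx/pending-timeout in the catalog must not exceed it.)

;; peer-config (illustrative)
{:zookeeper/address "127.0.0.1:2188"
 :onyx.peer/job-scheduler :onyx.job-scheduler/balanced
 :onyx.messaging/impl :aeron
 :onyx.messaging/ack-daemon-timeout 480000}

;; input-task catalog entry
;; :onyx/pending-timeout 180000 ;; must not be greater than ack-daemon-timeout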
we have a flow condition set up, and a Thread/setDefaultUncaughtExceptionHandler set up as well
For example, if you submit a job that has parameters like the above one, we'll catch it and log it
clojure.lang.ExceptionInfo: Pending timeout cannot be greater than acking daemon timeout
what’s the largest i could make this timeout without violating the acking daemon limit?
If you add the above dependency and then configure the log config as discussed here http://onyx.readthedocs.org/en/latest/user-guide/logging/
is there a trick to display all sections of http://www.onyxplatform.org/cheat-sheet.html at the same time, to make search easy?
I'm going to discuss making the acking daemon timeout bigger with Mike. There should probably be some headroom for jobs to choose a bigger pending-timeout at runtime without a redeploy
Yeah I think reducing the batch size will help reduce the pathological behaviour anyway. I assume that means you're not doing a redeploy to make these changes then?
regarding the logging thing - is there a specific configuration setting to add to enable the yeller appenders?
-facepalm- of course, https://github.com/yeller/yeller-timbre-appender#setup this stuff, right
@lucasbradstreet: increasing the peer-count, this is via min-peers and max-peers, or more simply n-peers, right? so to double the concurrency for pressured tasks, simply give them :onyx/n-peers 2 (and make sure the (onyx/start-peers count group) receives enough to satisfy it)
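(A sketch of what that looks like; the task name, function, and counts are illustrative.)

;; catalog entry for a pressured task
{:onyx/name :chat-links
 :onyx/fn :my.app.tasks/chat-links ;; hypothetical function
 :onyx/type :function
 :onyx/batch-size 10
 :onyx/n-peers 2}

;; and enough virtual peers started to cover the sum of n-peers across all tasks
(require '[onyx.api])
(def peer-group (onyx.api/start-peer-group peer-config)) ;; peer-config defined elsewhere
(def v-peers (onyx.api/start-peers 12 peer-group))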
when you increase the peer-count, you should consider reducing the batch-size for the task prior to it
The reason is that if you have a task that has a batch of 50, and it produces 500 segments, all 500 will be sent to a single task, because we don’t currently chop up the outgoing batch
so if you have a prior task that generates a lot of outgoing segments in one batch, then you may still end up with all your segments on a single peer
we almost always pluck out one datom from a transaction, and that turns into a segment. we then run it through several tasks, each one appending some tx-data for commit-tx at the end, and each one using datomic’s d/with with the upstream tx-data to perform queries
Yeah, commit-tx is the one which really gets the benefit from batching so that’s good anyway
600,000 ms for a batch of 250 means your queries are taking 2+ seconds per segment
Yeah, seems like either you need more computing power behind it, or it could be written better, or you need to revise what it’s doing and when it’s doing it
the regular spikes in word_count, response_count, init_stats are interesting
with the amount of time that these queries are taking you could probably reduce batch size to 1 for that task and you wouldn’t have any perf impact
reducing batch size further may help somewhat there then but in that case your chat_links performance is abysmal
Or more generally, ids/queries that have a lot of activity where the queries might take longer
I’d say reduce batch size on that task to 10 and increase the acking-daemon-timeout
and figure out what’s going on with that task. Parallelism will help a bit but I’d want to figure out whether I could improve perf for that task first
no worries. I probably should’ve pointed to our particular tasks as the issue earlier. After the metrics issue, I was a bit too focused on issues in our code this time around 😄
one thing i do think might be useful is a way to limit the amount of times segments are retried
if we could specify a max retry count, that would help us to keep a stable system up for the rest of the data
in our world, we already have all the data. we’d just find and re-queue missed txes
for example, we’re going to track the times that segments are put on their queues, so if queues lag and segments are already being tried we’ll skip them
so, you mention data loss. wouldn’t users of less immutable input queues lose data anyway, due to also having to restart?
for example, with kafka, we checkpoint where we are in the log so after a restart we start back up there
sql we checkpoint column ranges to ZK as we finish them. This is okay, though the DB isn’t immutable so things may change underneath us
with finite retry we’d essentially have to checkpoint that the entry was finished
I mean even with datomic you’d still have to do some programming work to fill in the gaps with finite retries
if you restart at this point then we set the log reader plugin to start at 995
yes. presumably having them retrying is reducing our capability to process novelty
OK, if you have any control over your job submission data without a retry, then reduce max-pending really low for a bit
that would force the system to process a very small workload at a time, right?
Cool, you’ll want to increase it again at some point otherwise latency for your users might go up
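(For reference, a sketch with illustrative values: :onyx/max-pending caps how many segments the input task will have in flight at once, so dropping it throttles how much work the cluster takes on while you work through the backlog, and you raise it again afterwards.)

;; input-task catalog entry (illustrative)
{:onyx/name :read-log
 :onyx/type :input
 :onyx/batch-size 10
 :onyx/max-pending 50 ;; default 10000; raise again once the backlog is cleared
 :onyx/pending-timeout 180000}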
It’ll be invisible metrics wise because unread log entries won’t contribute to the stats
with how long your tasks are taking and the fact that you’re on core.async (so latency is super low), you could probably reduce it even further and not lose out on throughput
if so, i’m curious what you’re using for alerts ala http://pagerduty.com etc
@lucasbradstreet: just confirming that the default peer count per task is 1, assuming there are enough peers to go around?
If you have exactly the same number of peers as tasks then yes. Exactly one per task. Otherwise it'll be based on the scheduler and max-peers/min-peers
i want to keep our config nimble, so i’m going to set defaults for the lot, and specifics where needed
what’s more idiomatic - to set max-peers, or n-peers? e.g. for our input or send-email tasks
n-peers is more specific. I would generally use n-peers unless you want some scheduler flexibility. For your input task you can only use one peer regardless so it doesn't make much difference.
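(For reference, the shorthand relationship between the two:)

;; on a catalog entry, this
{:onyx/n-peers 2}
;; is equivalent to
{:onyx/min-peers 2
 :onyx/max-peers 2}
;; whereas :onyx/max-peers on its own leaves the scheduler free to assign fewer peers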
I'd make the number of peers you start a parameter, or make it high enough to cover some run time flexibility
i want to give the ones that had moderate pressure 2, and the two that had large pressure 3 or 4
You can use any number, assuming there aren't any plugin restrictions (as with input plugins)
got yeller+timbre logging in, an 8-core box with 14GB of RAM for the JVM process, and distributed 10 additional peers into the tasks that were most impacted
we’ll continue to do daily restarts until we’ve made some headway with you on the leak
interesting… in these graphs, we’re on more ram and cores, but not yet on additional peers. see the throughput spike top right? that’s 200 txes a min, and it barely blinked
i guess our earlier perf issues that led to retries had to do with the size of the datasets involved in the queries for those particular events
thank you for your patience, Lucas. it takes a little time for the things that are probably very clear to you to sink in over here, sometimes
it’s going to be a tremendously interesting week as we work on the queries and sim-test