This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2016-03-18
Onyx users may be interested in this post, since we build our aggregations in Onyx in a somewhat similar way, using BookKeeper (which Manhattan also uses) https://blog.twitter.com/2016/strong-consistency-in-manhattan
How do we feel about this? https://github.com/onyx-platform/onyx/commit/ab9c24e201524dba3926d67ddc4d370d72e87b26 Basically this gives the task component a way to know why it’s being shut down, so that things like triggers know whether the task is just being re-scheduled or whether the job has been completed (maybe you only want to write the final value to a database, not write just because the peer lost its connection to ZK or a new job was submitted). I’d get rid of the varargs on start-new-lifecycle and always supply an event. I’m mostly looking for advice about whether you like the strategy. In a full change I'd add the field to the record.
The only other way I can see to do this is to communicate the cause via channels, but Onyx has several channels to stop the task, and it seems way better to do this via the component
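A rough sketch of the idea, in Python rather than Onyx's Clojure: the component is handed the stop reason, and only a job completion flushes the final value. Every name here (`StopReason`, `TaskComponent`, `on_final_value`) is hypothetical and made up for illustration; none of it is taken from the commit.

```python
from enum import Enum


class StopReason(Enum):
    # Hypothetical reasons, named for illustration only.
    JOB_COMPLETED = "job-completed"
    RESCHEDULED = "rescheduled"


class TaskComponent:
    """Toy task that sums segments and is told why it is being stopped."""

    def __init__(self, on_final_value):
        self.state = 0
        self.on_final_value = on_final_value  # stand-in for a trigger's sync

    def process(self, segment):
        self.state += segment

    def stop(self, reason):
        # Write the value out only when the job has actually completed,
        # not when the task is merely being moved to another peer.
        if reason is StopReason.JOB_COMPLETED:
            self.on_final_value(self.state)


written = []
task = TaskComponent(written.append)
for segment in [1, 2, 3]:
    task.process(segment)

task.stop(StopReason.RESCHEDULED)
after_reschedule = list(written)   # nothing written yet
task.stop(StopReason.JOB_COMPLETED)
```

Passing the reason into `stop` keeps the decision in one place, which is the appeal over threading it through several stop channels.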
This is currently something that is a problem with jepsening Onyx. Currently I just use one peer for the task with a trigger on it, and then write out the timestamp with the trigger call. We only look at the trigger call with the latest timestamp when checking whether the state is correct. This wouldn’t be a problem if we knew that we were calling the trigger during job completion.
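The workaround described above could look roughly like this. The record shape (a `timestamp` and a `state` per trigger write) is an assumption for the sketch, not the actual onyx-jepsen format.

```python
def latest_trigger_write(writes):
    """Return the trigger write with the greatest timestamp, or None if empty."""
    return max(writes, key=lambda w: w["timestamp"], default=None)


# Hypothetical trigger writes read back after a test run.
writes = [
    {"timestamp": 100, "state": {"count": 3}},
    {"timestamp": 250, "state": {"count": 6}},
    {"timestamp": 180, "state": {"count": 5}},
]

final = latest_trigger_write(writes)
```

Only `final["state"]` would then be compared against the expected state; the earlier writes are ignored.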
The main issue that I can see is that we don’t have a two-phase job completion, so we can’t be sure that the job completion trigger gets made successfully. For this to happen, I think we would need to have a signal that is sent out after seal-output, to signify that all the triggers have been successfully called.
You can imagine that happening where the job is completed, but the peer on the task writing the trigger out is then partitioned from where it’s writing out to. The job is still “complete” but the trigger never successfully completed.
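One way to picture the two-phase version, as a toy sketch only: the job is sealed first, and it is marked complete only once every peer has acknowledged that its completion trigger succeeded. `CompletionCoordinator` and its methods are invented for this example and do not correspond to anything in Onyx's scheduler.

```python
class CompletionCoordinator:
    """Toy two-phase job completion: seal first, complete after all acks."""

    def __init__(self, peer_ids):
        self.pending_acks = set(peer_ids)
        self.phase = "running"

    def seal_output(self):
        # Phase 1: input is exhausted; peers should now fire their triggers.
        self.phase = "sealing"

    def ack_trigger(self, peer_id):
        # Phase 2: a peer reports that its completion trigger was written.
        if self.phase != "sealing":
            raise RuntimeError("acks are only valid after seal_output")
        self.pending_acks.discard(peer_id)
        if not self.pending_acks:
            self.phase = "complete"


coordinator = CompletionCoordinator(["peer-a", "peer-b"])
coordinator.seal_output()
coordinator.ack_trigger("peer-a")
# peer-b is partitioned from its output, so the job is not yet complete.
phase_while_partitioned = coordinator.phase
coordinator.ack_trigger("peer-b")
phase_after_all_acks = coordinator.phase
```

With a single phase, the job would already count as complete at the point where `peer-b` is still partitioned.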
Does this become a non-problem with ABS where every peer failure results in replaying the state from upstream?
In the current way we do things, I was under the assumption that if a peer is partitioned during its :sync fn then those segments are retried
Maybe the two phase part becomes a non-issue. How we signal job completion is still a problem.
Ok, so the problem is if you only want to write out your final state value on job completion
i.e. you’ve processed all your segments, and now the overall scheduler has figured out that your job has completed, now you want to write the value out
I think this will be a problem whether you’re using ABS or not. Imagine batches of segments come in where | is barrier. | 1 2 3 | 3 4 5 | | | | | | :decide-to-complete-job
Ok I see what you mean
It’s just a safe implementation for a job-complete trigger where the end is unknown
During :decide-to-complete we have a failure, the rest of the peers have already completed and cleaned up
Our scheduler can’t handle that problem because it does the job completion in a single phase
Yea I see
It’s like a 2-phase commit problem
Anyway, the commit I posted doesn’t solve this issue
It just solves how to signal to the task/peer that it’s being shut down via a job completion
Rather than a re-scheduling
Yea the root of the problem would require coordination among all the peers
I haven’t really thought enough about how to solve that problem. First I just want to know whether the peer is being stopped because the job is complete or not
That will be a lot better than what we currently have
Yea totally
But yes, I think it’ll require some coordination to solve the greater issue
I mean having that info available (why a job is moved/canceled) is going to be useful regardless
Yes, it’s definitely closer to the ideal. I think it’s achievable before 0.9.0 too, which is why I’m pushing it now.
And it's immediately useful to you, lucasbradstreet? You can use it in the jepsen testing?
Well, it would save me from having to read back all the trigger writes and grab the one with the greatest timestamp.
Mostly I like that I could test it with jepsen 😛
Personally, I think users generally shouldn’t care to write out if a vpeer was rescheduled, only if it meets their normal criteria or if the job was completed.
We should generally try to minimise rescheduling of those peers anyway, but writing something out when the state is just going to be replayed on another peer and continue there is generally not the main concern
@gardnervickers: this is the same underlying issue https://github.com/onyx-platform/onyx/issues/292
In your example above, where | is barrier. | 1 2 3 | 3 4 5 | | | | | | :decide-to-complete-job,
wouldn’t the last peer not clear the upstream buffers if it’s partitioned before :decide-to-complete-job?
Sorry I know this doesn’t help our current problem but just thinking about it
It would, but say you only want to actually call the trigger at the end of the job
Not, say, at the end of every barrier
I realise that could be impossible to do exactly once. I’m just aiming for at least once
Sorry about the onyx-jepsen walk through. I smashed my head really hard into a wall yesterday 😕
Heh, there were some clothes hanging in the doorway and my depth perception decided to screw with me 😛
I adjusted the font on the website to make the documentation more readable: http://www.onyxplatform.org/docs/user-guide/latest/testing-onyx-jobs.html
Let me know what y'all think.