
Onyx users may be interested in this post, since we build our aggregations in Onyx in a somewhat similar way, using BookKeeper (which Manhattan also uses)


How do we feel about this? Basically this gives the task component a way to tell why it's being shut down, so that things like triggers know whether the task is just being re-scheduled or whether the job has been completed (maybe you only want to write the final value to a database on completion, not every time the peer loses its connection to ZK or a new job is submitted). I'd get rid of the varargs on start-new-lifecycle and always supply an event. I'm more looking for advice about whether you like the strategy. In a full change I'd add the field to the record.
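To make the idea concrete, here's a rough sketch (Python, purely illustrative; `StopCause`, `TaskLifecycle`, etc. are made-up names, not Onyx API) of a component that is told why it's stopping:

```python
from enum import Enum

class StopCause(Enum):
    JOB_COMPLETED = "job-completed"  # job finished; safe to flush final state
    RESCHEDULED = "rescheduled"      # peer is moving; state will be replayed

class TaskLifecycle:
    """Hypothetical task component whose stop receives the shutdown cause."""
    def __init__(self, state):
        self.state = state
        self.final_writes = []

    def stop(self, cause):
        # Only a completed job should trigger the final database write;
        # a reschedule means the state will be rebuilt on another peer.
        if cause is StopCause.JOB_COMPLETED:
            self.final_writes.append(self.state)

task = TaskLifecycle(state=42)
task.stop(StopCause.RESCHEDULED)    # no write: just a reschedule
task.stop(StopCause.JOB_COMPLETED)  # final write happens
```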


The only other way I can see to do this is to communicate the cause via channels, but Onyx has several channels that can stop the task, and it seems much better to do this via the component


This is currently a problem when Jepsen-testing Onyx. Right now I just use one peer for the task with a trigger on it, and write out the timestamp with each trigger call. We only look at the trigger call with the latest timestamp when checking whether the state is correct. This wouldn't be a problem if we knew we were calling the trigger during job completion.
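The checker side of that workaround looks roughly like this (Python sketch; the data shapes are hypothetical, not what onyx-jepsen actually stores):

```python
# Each trigger invocation writes a (timestamp, state) pair. Because a
# reschedule can also fire the trigger, the checker keeps only the write
# with the greatest timestamp when validating the final state.
writes = [(3, {"count": 7}), (9, {"count": 12}), (5, {"count": 10})]
latest_ts, latest_state = max(writes, key=lambda w: w[0])
# latest_state is what gets compared against the expected model
```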


The main issue that I can see is that we don't have two-phase job completion, so we can't be sure that the job completion trigger gets made successfully. For this to happen, I think we would need a signal that is sent out after seal-output, to signify that all the triggers have been successfully called.


You can imagine that happening where the job is completed, but the peer on the task writing the trigger out is then partitioned from wherever it's writing to. The job is still “complete”, but the trigger never completed successfully.
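A two-phase version of completion might look something like this (Python sketch with made-up names; the point is just that the job is only marked complete after every completion trigger is acknowledged):

```python
class Peer:
    """A peer that may be partitioned from its trigger's output target."""
    def __init__(self, partitioned=False):
        self.partitioned = partitioned

    def fire_completion_trigger(self):
        # Phase 1: a partitioned peer cannot write its trigger output.
        return not self.partitioned

class Coordinator:
    """Hypothetical two-phase completion: mark the job complete only once
    every peer acknowledges its completion trigger succeeded."""
    def __init__(self, peers):
        self.peers = peers
        self.complete = False

    def try_complete(self):
        acks = [p.fire_completion_trigger() for p in self.peers]
        if all(acks):
            self.complete = True   # phase 2: commit the completion
        return self.complete       # False -> retry completion later

coord = Coordinator([Peer(), Peer(partitioned=True)])
coord.try_complete()  # stays incomplete: one trigger never succeeded
```

If a peer is partitioned, its trigger never acks, the job stays incomplete, and completion can be retried once the partition heals.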


Does this become a non-problem with ABS where every peer failure results in replaying the state from upstream?


In the current way we do things, I was under the assumption that if a peer is partitioned during its :sync fn then those segments are retried


Maybe the two-phase part becomes a non-issue. How we signal job completion is still a problem.


Ok, so the problem is if you only want to write out your final state value on job completion


i.e. you’ve processed all your segments, and now the overall scheduler has figured out that your job has completed, now you want to write the value out


I think this will be a problem whether you’re using ABS or not. Imagine batches of segments come in where | is barrier. | 1 2 3 | 3 4 5 | | | | | | :decide-to-complete-job
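In ABS terms, that notation can be read like this sketch (Python; `|` delimits barriers, and state is snapshotted at each barrier — illustrative only, not Onyx's actual implementation):

```python
# Segments arrive in barrier-delimited epochs. State is snapshotted at
# each barrier, so a crash before :decide-to-complete-job rolls back to
# the last barrier snapshot rather than losing the job's segments.
stream = ["|", 1, 2, 3, "|", 3, 4, 5, "|", "|", ":decide-to-complete-job"]
snapshot, running = 0, 0
for item in stream:
    if item == "|":
        snapshot = running           # barrier: snapshot the running state
    elif item == ":decide-to-complete-job":
        pass                         # completion decided after last barrier
    else:
        running += item              # accumulate segment values
# a failure during :decide-to-complete-job restarts from `snapshot`
```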


Ok I see what you mean


It’s just a safe implementation for a job-complete trigger where the end is unknown


During :decide-to-complete we have a failure, the rest of the peers have already completed and cleaned up


Our scheduler can’t handle that problem because it does the job completion in a single phase


It’s like a 2-phase commit problem


Anyway, the commit I posted doesn’t solve this issue


It just solves how to signal to the task/peer that it’s being shutdown via a job completion


Rather than a re-scheduling


Yea the root of the problem would require coordination among all the peers


I haven’t really thought enough about how to solve that problem. First I just want to know whether the peer is being stopped because the job is complete or not


That will be a lot better than what we currently have


But yes, I think it’ll require some coordination to solve the greater issue


I mean having that info available (why a job is moved/canceled) is going to be useful regardless


Yes, it’s definitely closer to the ideal. I think it’s achievable before 0.9.0 too, which is why I’m pushing it now.


And it's immediately useful to you, lucasbradstreet? You can use it in the jepsen testing?


Well, it would save me from having to read back all the trigger writes and grab the one with the greatest timestamp.


Mostly I like that I could test it with jepsen 😛


Personally, I think users generally shouldn't want to write out when a vpeer is rescheduled, only when it meets their normal criteria or the job has completed.


We should generally try to minimise rescheduling of those peers anyway, but writing something out when the state is just going to be replayed on another peer isn't really the main concern


In your example above, where | is barrier: | 1 2 3 | 3 4 5 | | | | | | :decide-to-complete-job, wouldn't the last peer leave the upstream buffers uncleared if it's partitioned before :decide-to-complete-job?


Sorry, I know this doesn't help our current problem, but I'm just thinking about it


It would, but say you only want to actually call the trigger at the end of the job


Not, say, at the end of every barrier


I realise that could be impossible to do exactly once. I’m just aiming for at least once
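An at-least-once final write is basically retry-until-ack, accepting possible duplicates (Python sketch, made-up names; the sink write needs to be idempotent or deduplicated downstream, e.g. by reading back the write with the greatest timestamp):

```python
def write_final_value(value, sink, attempts):
    """At-least-once final write: retry until the sink acknowledges.
    `attempts` simulates ack outcomes; a lost ack causes a duplicate
    write, which is why the sink must tolerate duplicates."""
    for acked in attempts:
        sink.append(value)   # the write itself may succeed even if the
        if acked:            # ack is lost, producing a duplicate
            return True
    return False             # never acked: caller keeps retrying

sink = []
# first ack is lost, second succeeds -> value lands at least once
write_final_value("final-state", sink, attempts=[False, True])
```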


Sorry about the onyx-jepsen walk through. I smashed my head really hard into a wall yesterday 😕


Heh, there were some clothes hanging in the doorway and my depth perception decided to screw with me 😛


I adjusted the font on the website to make the documentation more readable:


Let me know what y'all think.