
lucas / michael: Could either of you talk briefly about how updating code interacts with jobs, when to use resume points, etc. For instance, if I fix a bug in my business logic that doesn't change the onyx job definition, is best practice just killing all my peer java processes, redeploy a new jar, and just turn the peers back on, keeping the same job running throughout? Are the only times I should be using the resume-points is if I actually need to change the structure of my onyx job definition? Are there any timeouts or anything to be aware of with how long a job can stay in zk without peers, does it stay frozen indefinitely?


I'll write up an answer in the AM - I'll get these all under the FAQ, they're good questions.


i'm having a bit of trouble getting all the onyx log messages


specifically, it appears that after a :submit-job, i'm not receiving anything else anymore


(at least, using subscribe-to-log)


i'm trying to figure out whether i'm at fault here


i know for a fact that a job is running, but my process that's monitoring the onyx log doesn't receive anything else anymore after submit-job


is that expected behavior ?


There won’t necessarily be any other messages unless something changes about the job


The submit job message is enough for all of the peers to know what to do and start the job


okay, so it's not possible to tell, just by observing the log, which jobs are actually running?


You have to actually apply the log entries


Via the same log application method as the peers will use


aha, that sounds like the thing i was unaware of


what will applying the log do in this case ? trying to understand the flow here


will it then start communicating with the peers directly ?


just looking up a good example


old version since I got to the docs via google 😛


so, just for my understanding


applying the log will basically start communicating with the peers directly, to figure out the events that are not stored in the onyx log ?


basically, the current cluster state is a state machine which is built by applying the log entries in turn. This ensures that every node in your cluster has the same view of the cluster


is that correct ?


applying the log just applies the log command to the state machine


to materialize the view of the cluster


so if i apply the log, i make progress in the state machine


if you don’t apply the log entries, you won’t know what the current cluster state is


sorry, that sentence of mine didn't make sense


i figured that this log application would occur behind the scenes


yeah, it does happen in the peer-group, but if you want to get a view of the cluster from outside this has to take place somewhere else.
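

The playback loop described here can be sketched roughly like this, following the log-subscriber example from the Onyx docs (the ZooKeeper address and tenancy id are placeholders, and exact namespaces may vary by Onyx version):

```clojure
(require '[clojure.core.async :refer [chan <!!]]
         '[onyx.api]
         '[onyx.extensions :as extensions])

;; Placeholder connection settings - these must match the cluster
;; you want to observe.
(def peer-config
  {:zookeeper/address "127.0.0.1:2181"
   :onyx/tenancy-id "my-tenancy"})

;; subscribe-to-log hands back the starting replica and feeds
;; subsequent log entries onto the channel.
(def ch (chan 100))
(def subscription (onyx.api/subscribe-to-log peer-config ch))

;; Fold each entry into the replica state machine, exactly as the
;; peers do, to materialize the current view of the cluster.
(loop [replica (:replica subscription)]
  (let [entry (<!! ch)
        new-replica (extensions/apply-log-entry entry replica)]
    ;; inspect new-replica here, e.g. (:jobs new-replica)
    (recur new-replica)))
```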


yep, makes sense


since it’s easier for you to just do it yourself than reach into the peer group


thanks for your help!


no worries, good luck. Looking forward to seeing what you come up with. We have some similar code to do log -> job status in the DB. We tried to keep the flow one way otherwise it gets very tough to reason about.


essentially log playback -> DB for the job status


exactly what i'm working on right now


then if you want to change the job status, send a kill-job command to the cluster


and wait for that to go back through the playback flow to get the new status
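

In code, that round trip might look something like this (`job-id` is a placeholder; the point is that the call only writes a log entry, and the status change is observed later during playback):

```clojure
;; Ask the cluster to kill the job. This just appends a :kill-job
;; entry to the log; it does not update any local state directly.
(onyx.api/kill-job peer-config job-id)

;; The DB row for the job is NOT updated here. The :kill-job entry
;; flows back through the log subscription, and the playback loop
;; updates the job's status when it applies that entry - keeping
;; the flow strictly one way: log playback -> DB.
```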


Cool, hope it turns out well. Onyx needs something like this


hit me up if you have any more questions


sure thing


how does the replica figure out which jobs are allocated where ?


since the allocation is not explicitly part of the submit-job log entry, i assume that allocation is always deterministic and as such is correctly reproduced in all replicas. is this correct?


Yes it’s completely deterministic.
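

For example, given a replica map produced by log playback, the allocations are just data in the map (the keys shown are illustrative of the replica's shape):

```clojure
;; The replica nests allocations as job-id -> task-id -> peer-ids,
;; so looking up who is running a task is an ordinary get-in.
(get-in replica [:allocations job-id task-id])
;; a vector of the peer ids currently allocated to that task
```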


The only exception /can/ be if you switch from one onyx version to another. When we migrate to new onyx versions we just figure out what the end submit/kill job results are and migrate the jobs that aren’t killed. That way we don’t need it to play back exactly the same


looks like it's that function that does all the allocation logic, right ?


Is this just for interest to understand how it’s all working?


well, more like i want to know what i'm doing 🙂


Ok cool :). Just making sure since I thought I might not understand your problem well enough


this replica state reproducing with so little information felt a bit like "magic"


Definitely good to understand what’s going on :)


Yeah it is a bit heh


well it makes sense, once you realise that it's all deterministic


It’s not so crazy once you understand it but it’s kinda novel for it to all work like that


We had to set a seed on the scheduler so it would make the same decisions everywhere


This is what we mean when we call Onyx masterless, although we do still lean on ZooKeeper for the rest, so it's not completely masterless


One nice thing with this is that you can find out the state of the cluster at any time, by looking at the replica state at that time


onyx-dashboard allows you to step through it, which can be useful when debugging


well, thanks for the info


good night


@eriktjacobsen I'll repaste your question down here:

lucas / michael:  Could either of you talk briefly about how updating code interacts with jobs, when to use resume points, etc. For instance, if I fix a bug in my business logic that doesn't change the onyx job definition, is best practice just killing all my peer java processes, redeploy a new jar, and just turn the peers back on, keeping the same job running throughout? Are the only times I should be using the resume-points is if I actually need to change the structure of my onyx job definition? Are there any timeouts or anything to be aware of with how long a job can stay in zk without peers, does it stay frozen indefinitely?


The first thing to get a mental picture of is the division between behavior (your concrete Clojure code and functions that do real things) and wiring instructions (the Onyx job that stitches all the Clojure code together at runtime). When you stand up a peer, all you've done is plug that peer into the network to listen to the log for instructions, and possibly open a network connection to another peer when it's going to run a task for a job. This peer has the behavior bundled into its jar - it doesn't know anything about your specific Onyx jobs.

When you submit an Onyx job, you're putting those instructions about how to use your behavioral code into ZooKeeper, which goes into the log, which is eventually read by a peer. The submission of the job happens away from the cluster - these instructions are received at runtime.

If you have a bug in your business logic, you'll definitely need to reboot the peer with the new code. Whether you stop and start your job is a completely different matter that depends on whether or not you need to replay all your data to correct the error. It also depends on whether it's okay for your domain to possibly have two peers running at the same time, one with the incorrect code and one with the correct code.

Resume points are used for transferring state between jobs. So in the case that you do need to restart your job, but you want to keep your progress so far - window contents, Kafka offsets, and so forth - you use resume points to literally resume previous progress.
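

Roughly, the resume-point flow looks like this in the 0.10+ API (`new-job`, `job-id`, and `tenancy-id` are placeholders; check the resume-points docs for your exact Onyx version):

```clojure
;; Grab the coordinates of the old job's latest checkpoint, build a
;; resume point that maps that state onto the new job, and submit.
(let [coordinates (onyx.api/job-snapshot-coordinates peer-config tenancy-id job-id)
      resume-point (onyx.api/build-resume-point new-job coordinates)]
  (onyx.api/submit-job peer-config
                       (assoc new-job :resume-point resume-point)))
```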


There are no inherent limitations on how long a job can wait before peers pick it up for work. It's simply a key in a map waiting for a peer to be scheduled on it. See the above conversation with @lmergen about how the log works.


I also gave a talk a few years ago on this called Inside Onyx that you can find on YouTube that explains the log. None of that has changed


it takes a bit of mental gymnastics to wrap your head around it


i think a large part of this design stems from the fact it’s masterless


It's a log architecture being used in a place where it's never been used before. Storm/Flink/Spark/Dataflow all use architectures that rely on a centralized coordinator. To the best of my knowledge Onyx is the first to do this.


So the latest on the Media Driver saga on GKE. We did lots of debugging to determine whether the peers could communicate over UDP, and that all seems to work fine. Our last attempt for the day was to set the media driver into dedicated thread mode. Currently the idle job has been running for 10 minutes, where before it wouldn't make it past 1 minute. Any thoughts on the media driver thread mode?


Thanks for the explanation michael, clears things up


If it's surviving, that's indicative of the peer being noisy with respect to resource usage. Dedicated thread mode for Aeron will do just that


@camechis if dedicated mode is making it better it makes me wonder if this job is complex/large and is opening a lot of channels. It would also partially explain why it’s working in single peer mode.


@camechis how many peers in this job?


10 per peer


only about 7 needed


Ok that isn’t the cause then


The media driver and the peers might just be fighting for resources then


i have 5 physical nodes in my cluster with about 30 gigs of available memory, and not a whole lot of CPU being used since everything is pretty idle


@lucasbradstreet, @camechis shared with me that they're individually limited through K8s resource constraints, more than our test env.


4 cores each. I deployed 2 peers in this cluster


Yes - whereas we provide a single upper bound on both containers, they are providing individual bounds.


That should be enough machine resources at least, though the limits might apply. I dunno.


Limits for CPU are not strict either, they're a scaling factor in relation to other containers running on the node.


I think you should get a flight recording of the peers and of Aeron


any pointers on how that works in kubernetes

Using those JVM opts, you can kubectl port-forward to 1099, then run Mission Control to get a flight recorder file.


I should say you should only do this for testing/non-permanent prod purposes, as Flight Recorder is commercial in JDK 8. It's opened up in OpenJDK/JDK 9 thankfully
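

The specific opts referenced above aren't shown here, but a typical set for JDK 8 would look roughly like this (a config sketch; the port, pod name, and the disabled auth/SSL are assumptions suitable only for a temporary debugging session):

```shell
# JVM opts: enable Flight Recorder (commercial on JDK 8) and expose
# JMX on a fixed port so it works through kubectl port-forward.
JAVA_OPTS="-XX:+UnlockCommercialFeatures -XX:+FlightRecorder \
  -Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=1099 \
  -Dcom.sun.management.jmxremote.rmi.port=1099 \
  -Djava.rmi.server.hostname=127.0.0.1 \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false"

# Forward the port from the pod, then attach Mission Control locally.
kubectl port-forward <peer-pod-name> 1099:1099
```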


oh nice, doesn't sound too hard


It’s really great


I can't commit to anything for at least the next 2 months; however, I'm curious: are you guys actively working on spec integration? I see it, but it looks like it's focused on speccing onyx itself rather than integrating it with jobs. It seems like an analogous data structure to workflow that lists specs for messages passing between two components, something in the catalog to say what specs a task accepts and emits, and being able to put a spec directly into a flow condition instead of having to wrap it in an opaque predicate function would be really nice, and would also let you generate diagrams / documentation for your job that actually shows the types of data flowing between each node. Thoughts?


@eriktjacobsen I had considered that before Spec came out. I think that would make a really nice supporting library - it adds an additional checker to your runtime