Hi all! Is there an example of how I can init and submit jobs in production mode?


I think that all depends on what production actually is, but the onyx-template is a good starting point.


@lellis There’s really no difference between development and production ‘modes’ other than it’s more typical to run ZooKeeper in memory while developing. Otherwise they’re exactly the same.
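A minimal sketch of that point, assuming the env-config keys `:zookeeper/server?`, `:zookeeper.server/port`, `:zookeeper/address`, and `:onyx/tenancy-id` as they appear in the Onyx docs (older releases may use different key names, so check your version). The tenancy id, host names, and ports are hypothetical placeholders. Only the ZooKeeper settings differ between the two modes.

```clojure
;; Hypothetical values: tenancy id, addresses, and ports are placeholders.
(def base-config
  {:onyx/tenancy-id "my-app"
   :zookeeper/address "127.0.0.1:2188"})

(defn env-config
  "Development starts an in-memory ZooKeeper; production points at an
  externally managed ensemble. Everything else stays the same."
  [mode]
  (if (= mode :dev)
    (assoc base-config
           :zookeeper/server? true
           :zookeeper.server/port 2188)
    (assoc base-config
           :zookeeper/server? false
           :zookeeper/address "zk1.example.com:2181")))
```

Calling `(env-config :dev)` or `(env-config :prod)` yields the map you would hand to the environment startup code.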


I know this question is a bit broad, but even gut feelings / generalizations would be helpful: if you had a Clojure / Datomic webapp with basic queuing / background job needs, and aspiring to have big data needs (but not there yet), does Onyx seem like a good fit? Is it overkill?


@camdez Good fit, yes. Overkill? Maybe, would need to know a lot more to make a sound decision. onyx-local-rt is a good intermediate stepping stone.


@michaeldrogalis: I don’t want to take up your time, but can you give me the 30 second version of what other factors you’d consider? I’m unfortunately out of my depth here.


onyx-local-rt does look great for a migration path.


@camdez Sure. Operations is probably the biggest one. Can your team run a distributed system and keep it up? Do you have experience with monitoring and diagnosing metrics? Are you willing to read the user guide and learn about the backpressure model? Can you set up log aggregation to collect log files from remote servers? These are the basics that are needed to run any kind of distributed system. There’s more, but most teams admittedly don’t have the time or experience or money to tackle these.


A more direct question might be “how bad is it if it goes down?” If it’s “really bad”, Onyx is a good choice because we spend a lot of time engineering for fault tolerance. If it’s not a big deal, rolling your own thing might be fine.


I rarely link to Hacker News, but this was a good summary for microservices, and most of it is applicable to distributed processing:


(The top post, that is)


@michaeldrogalis: sweet, thanks. You hit the nail on the head. Operations is precisely where I fall short and feel less confident. It isn’t that bad if it goes down. Slightly surprised to hear you suggest rolling my own distributed queuing system. I was pondering how I could lean on more managed options (SQS, for example) to avoid the operations needs.


@camdez Er, yeah let me rephrase that. Don’t build that stuff yourself - you have the right idea. Use lighter technologies that already exist.


local-rt is stateless/threadless etc though, so it should help you get the modeling portion of your problem down pat.


@michaeldrogalis: gotcha. Cool, I really appreciate the advice. Today I’m handling most of my needs with durable-queues, so the separation is already in place. Just starting to explore scaling that up / broader needs. Onyx seems so flexible that it’s hard (for me, the uninitiated) to know which of the problems it can solve are actually the problems I should use it to solve.


learn-onyx is a good way to find the boundaries.


We go through all the major features in stages.


I’ve also been a bit surprised that Clojure doesn’t have an obvious background job story like (e.g.) Sidekiq or Delayed::Job in the Ruby world. Unless I’m missing it.


I think some of that is Rich’s influence. He’s been an advocate for uni-directional dataflow systems for a long time, and I guess that’s rubbed off on anyone thinking about implementing actors or similar.


@michaeldrogalis: if I’m reading you correctly, the preference is for ~queues over scanning a jobs table as the mechanism for triggering background work, right? That seems sensible. I think the bigger issue for a lot of users trying to put a system together is monitoring / handling failed tasks. The popular Ruby projects for background jobs make those pieces quite easy.


@camdez For the most part, designs that center around a jobs table are mutability, clear as day. Those things often come with the baggage of serious race condition problems. Can you give an example of something that’s easy with Sidekiq that you’re having trouble finding an analog for?


@michaeldrogalis Your comment RE “mutability” intrigues me. My first thoughts for background jobs are side-effecting, but not mutating local data (maybe that is what you mean?). Sending emails, processing images, generating reports… I wouldn’t say I’m having trouble finding an analog, but it’s handy to have an off-the-shelf solution making it easy to see what has run, what’s queued, what failed (and why), restart failed tasks, etc.


Does that make sense? I may not be explaining clearly.


@camdez Certainly, the actual work done by the job is often side effecting. My comment was directed toward the implementation of the parts that manage the work that gets done. If all tasks are tracked in a single, flat place (a database table, for example), you inherently have limited power to reason about history. The problem gets more complicated when you want to chain jobs together, where the status of one job affects another. This is mostly a critique of hand-rolled job queues, by the way. The Sidekiq developers clearly put a lot of work into their project.


@michaeldrogalis Oh, no doubt, jobs tables tend to be mutating. Though they need not be.


To your second message, I often turn to a workflow engine when I need to answer those kinds of questions. The tasks you mentioned are heavier, and are likely a better fit for something like Amazon SWF - which checkpoints progress at every step. This penalizes throughput, but decreases the cost of recovering from a failure.


@michaeldrogalis Great, thanks again for the suggestions. Nice chatting with you.


OK, so let me explain my situation: I want to init Onyx and jobs inside my web app's -main. Am I totally wrong, or is it possible to do that? I'm trying to do this, but the code stops after start/peers.


@lellis Start your peers off on their own resources. Keep them up persistently. Invoke onyx.api/submit-job only from the web server.


That is, don’t start and stop the peers before and after every job. So long as sufficient resources are given, Onyx will handle the work load and transition between jobs according to its scheduler.
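One way to picture the split, as a hedged sketch rather than the recommended wiring: the web process never starts peers, it only submits. The handler below takes the submit function as an argument so the snippet runs without Onyx on the classpath; in a real app you would presumably pass `onyx.api/submit-job`. The name `make-job-handler`, the response shape, and the assumption that the submit result carries a `:job-id` key are all illustrative.

```clojure
(defn make-job-handler
  "Returns a request handler that submits `job` via `submit-fn`.
  `submit-fn` stands in for onyx.api/submit-job (injected here so the
  sketch is runnable on its own). Peers are assumed to be running
  elsewhere, started once and kept up."
  [submit-fn peer-config job]
  (fn [_request]
    (let [result (submit-fn peer-config job)]
      ;; Assumption: the submit result is a map containing :job-id.
      {:status 202
       :body {:job-id (:job-id result)}})))
```

The peers live in a separate long-running process, so a web server restart does not interrupt jobs that are already running.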


Can someone help walk me through this snippet from the basic example task in the Onyx template? I don't recall seeing anything like it in the learn-onyx repo:

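;; Assumes the namespace requires [schema.core :as s].

(defn inc-in-segment
  "A specialized version of update-in that increments a key in a segment"
  [ks segment]
  (update-in segment ks inc))

;; Schema for the extra task-map key this bundle expects.
(def IncKeyTask
  {::inc-key [s/Keyword]})

(s/defn inc-key
  ;; Two-argument arity: build the task bundle from a name and options.
  ([task-name :- s/Keyword task-opts]
   {:task {:task-map (merge {:onyx/name task-name
                             :onyx/type :function
                             :onyx/fn ::inc-in-segment
                             :onyx/params [::inc-key]}
                            task-opts)}
    :schema {:task-map IncKeyTask}})
  ;; Three-argument arity: put the key path into the options, then delegate.
  ([task-name :- s/Keyword
    ks :- [s/Keyword]
    task-opts]
   (inc-key task-name (merge {::inc-key ks} task-opts))))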


@stephenmhopper, are you familiar with task bundles?


@colinhicks That's helpful. Thank you!


@stephenmhopper Task bundles are just an organizational pattern that helps you to compose and validate your Onyx tasks.


Yeah, the confusion for me was that I don't remember doing anything with them in the learn-onyx stuff. Luckily, it seems like the learning curve is low


@stephenmhopper To summarize quickly, Onyx uses “vertical partitioning” of its components, e.g. all flow conditions in a vector, all windows in a vector, etc., regardless of which task they belong to. Task bundles are the “horizontal partitioning” way of doing it. That is, task A is a map of {:windows … :flow-conditions …} only for task A. There are some schema checks too, but that’s it.


Nice to have both


Important to note that Onyx only accepts the former. Task bundles are an idiom that the developer him/herself unrolls to the vertical style before submission.
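A rough sketch of that unrolling, assuming bundles shaped like `{:task {:task-map … :windows … :flow-conditions …}}` as described above. `unroll-bundles` is a hypothetical helper written for illustration, not a function from Onyx or the template, and it skips schema validation entirely.

```clojure
(defn unroll-bundles
  "Folds horizontally partitioned task bundles into the vertical job map
  that Onyx accepts: one :catalog vector, one :windows vector, and one
  :flow-conditions vector across all tasks."
  [workflow bundles]
  (reduce (fn [job {:keys [task]}]
            (-> job
                (update :catalog conj (:task-map task))
                ;; (into v nil) returns v, so bundles may omit these keys.
                (update :windows into (:windows task))
                (update :flow-conditions into (:flow-conditions task))))
          {:workflow workflow
           :catalog []
           :windows []
           :flow-conditions []}
          bundles))
```

The resulting map is in the vertical style, so it is the kind of value you would pass on for submission.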


I'm working on creating a workflow for ingesting data from various sources and dumping it to a Postgres database. Sounds simple, right? I'm not sure how to design the thing to work with Onyx. My first, relatively naive, design looked something like this:
1. Message about the job to start shows up on a queue (core.async channel, most likely).
2. Onyx job picks up the message and inserts a record into the database indicating that the job has started.
3. The next Onyx task picks up the message and starts pulling data from the specified data source (e.g. local file, third party database, third party API, etc.). Because the data is unlikely to fit entirely into memory, each row / record is forwarded to the next task.
4. Rows / records are transformed arbitrarily (this is the "T" step in ETL) and forwarded to the next task.
5. Rows / records are dumped to a database.
6. After all rows / records have been processed, or after the job fails, the record in the database from step 2 is updated to indicate the final status of the job as well as some other stats (total number of records processed, etc.).


My main questions are: (1) What's the best way to do tracking on an entire job for historical purposes? (2) Individual Onyx tasks can emit sequences of segments (instead of a single segment) in order to send multiple messages to the next task. However, since not all of my segments will fit in memory at once, I'm guessing that I'll instead have to find some way to set up whatever input source I'm reading from (local file, third party database, third party API, etc.) as an Onyx input and not merely as a task inside of a job, no? Does this mean that I need separate jobs? One for doing the job tracking, another for actually reading and processing the data, and another for closing the loop on the job tracking?
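For reference, question (2) relies on a task function returning a sequence of segments instead of a single one. A small sketch of that shape, with hypothetical key names (`:job-id`, `:rows`, `:row`); it only covers the case where one incoming segment holds a batch small enough to sit in memory, which is exactly the limitation raised in the question.

```clojure
(defn explode-batch
  "Takes one segment holding a small batch of rows and returns a sequence
  of segments, one per row. Each element of the returned sequence is
  treated as its own segment by the next task."
  [segment]
  (map (fn [row]
         {:job-id (:job-id segment)
          :row row})
       (:rows segment)))
```

For sources too large for this approach, the reading would have to happen in an input task rather than in a function like this one.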


@stephenmhopper will get a response back to you in a bit. Bit focused on something over here at the moment


@stephenmhopper: so from what I understand you mostly want to be able to track the status of jobs, such as what jobs are running, completed, killed?


We have ways to subscribe to the coordination log that Onyx peers use to coordinate / schedule work, and you can use this subscription to monitor what is happening with the cluster / jobs. You could do this from the same node that is doing the job submission, so you can tie the job IDs to the actual work that was submitted.


If that sounds roughly like what you want I can point you to some resources


@lucasbradstreet I might actually be interested in that info, lol


You can also use onyx peer http query, but that is generally intended to be run on the peer nodes themselves so it isn't really great for such a service


Great, was just thinking about figuring out how to know if a job dies or not and being able to monitor that