
hey guys, great project! I’m wondering if there is a tutorial for setting up the basic environment/bare necessities in order to make onyx run.


I’ve been looking at guides with not too much success


@zcaudate I can help out with that


oh that’s great


We have the onyx template


lein new onyx-app <app-name> -- +docker will get you a dockerized onyx cluster


it streams data into Kafka, then does some light processing, and writes the results to MySQL


@gardnervickers: thanks… trying it now


There are some large-ish changes we’re going to put out in a couple days around organizing things a bit better.


But if you run into any issues, let me know


the docker instance is great


Yup, the one thing is the Kafka setup is a little janky: sometimes the DNS won't resolve, and there's no error handling as it's just curl'ing the update stream right into Kafka


I have a twitter plugin that’s almost ready that we will replace that with so people can get up and running quickly writing Onyx jobs


that will be great! also… I’ve been watching the talks… are there any examples of UIs that are built on top of onyx?


it seems like a workflow ui would be very easy to fit on top of the system


But @michaeldrogalis recently did some work on a REST server for viewing the cluster state, and we’re actively trying to get time/resources to work on some cool visualizations.


The cluster state read from ZooKeeper, called the "replica", has a TON of useful information. Visualizing active workflows would be really great


One idea that was floated was to make a graph of executing tasks and show a heatmap of latency on top


Since it’s all data-driven, building up workflows from a (java|clojure)script frontend fits really well.
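To make the "it's all data" point concrete: an Onyx workflow is just a vector of [from to] task pairs, which is why a frontend can build one up as plain data. The task names below are illustrative, echoing the Kafka-to-MySQL template mentioned earlier:

```clojure
;; An Onyx workflow is an EDN vector of [from to] task-name pairs.
;; A (Java|Clojure)Script UI can construct or edit this like any
;; other data structure; task names here are hypothetical.
(def workflow
  [[:read-kafka :process]
   [:process :write-mysql]])
```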


yeah exactly


it’ll be great to have that sort of control over process


@michaeldrogalis: i got to the bottom of my issue - nothing to do with onyx in the end - some misconfiguration of the kafka mesos framework was causing the broker logs to be stored on ephemeral container storage, with the predictable consequences when the kafka cluster got restarted


Mmm. Not a recipe for a good time. Good to hear that we don't need to fix anything.


@lucasbradstreet: i'm looking on the bright side - much easier to fix me having been dumb than an occasional race condition across many components :)


@zcaudate: Infrastructure aside, there's also this if you need a hand learning:


early days but here’s where I’m deploying onyx log UI:


focussing on the log viewport on the left hand side atm. Going to add some more info per entry: the peer-id and time


the idea is to click the log entries and see a visualisation of the replica state


or have it in scrolling mode somehow if you want realtime visualisations


That'd be really great. Should make it easy to remotely diagnose problems in a cluster with an easy-to-read UI.


I'm on board with this, but we should think about the ways that the current dashboard is failing and what to do about them, since it's kinda looking like a dashboard rewrite


I'm ok with it because the current dashboard is a bit of a mess


I had a play around with lib-onyx earlier. I see you’ve decided to encapsulate consuming from the channel and only present the latest replica state. Apart from add-watching the state atom I couldn’t think of an easy way of streaming the events, so I’m still using the onyx api
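For reference, the add-watch approach mentioned above could be sketched like this; `replica-atom` and the handler names are illustrative, assuming lib-onyx keeps the latest replica in an atom:

```clojure
;; Stream replica-state transitions by watching the atom holding the
;; latest replica, rather than consuming the channel directly.
;; `replica-atom` is a hypothetical atom updated by lib-onyx.
(defn watch-replica! [replica-atom on-change]
  (add-watch replica-atom ::log-ui
             (fn [_key _ref old-state new-state]
               (when (not= old-state new-state)
                 (on-change old-state new-state)))))
```

A watch fires on every `swap!`/`reset!`, so `on-change` sees each state transition in order.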


@lucasbradstreet: so the idea is to have one mega dashboard? I guess that makes more sense


I'm not sure but I'm already seeing overlap here


@lsnape @lucasbradstreet I don't think one huge dashboard is the way to go. I think what Lucas is saying is that the existing Dashboard lets you see the log entries in the same way that you're starting out now, but it doesn't let you do much more than that. So I think the point he was making is trying to figure out what the underlying use case is there.


@michaeldrogalis: gotcha. So what I was aiming for is a way of seeing peer-to-task allocation. The log entries are really just a way of indexing and navigating between the states.


shall I hold fire for now until we have a better idea of what to build?


@lsnape That's pretty much what I had in mind to build. @lucasbradstreet I don't see much overlap if it's just the list of entries being used for navigation. You more or less need to see what transitions are happening for it to be useful.


Ok, so close to what the console dashboard is doing?


@lucasbradstreet: Closer to that, yeah - and more featureful. And I think that might be a good thing to move into the browser anyway. Thoughts?


My main concern with the web interface dashboards is that they really need to be set up as part of deployment, with the web port open to the outside. I think that's why the current dashboard doesn't get all that much use (that and we don't keep the jars up to date)


The console dashboard is easy. You can run it over ssh on your servers if you need; otherwise you just need access to the ZK port. I guess that's true of the web dashboard too, though.


I think I mostly want to figure out how to make sure it gets used


The console will eventually be limited by what you can display. I kind of agree that there's an advantage to having the replica viewer on the command line, but for visualizing what the scheduler is doing I think the browser wins out


I get that there’s more work involved for someone to deploy and serve up a web dashboard. As a user of Onyx I wouldn’t really want to do this more than once i.e. have more than one dashboard


something else to maintain, that’s one thing to consider


Maybe we just need to make the deployment story easier than it is for the current Dashboard. I would 100% take the time to deploy the tool in question if it existed. The value is very high


Yeah, the thing that makes me hesitate is that the dashboard currently streams the log entries into itself. So it seems like a lot of overlap if what this is trying to achieve is better visualisation


The current dashboard can do stuff like dump logs too, which would have to be rewritten


@robert-stuttaford probably has the best insight for what would make a tool like that easiest to deploy in the wild, and what would make it most useful.


I could see providing some parse multimethods along with a ClojureScript wrapping component to make jumping into developing this stuff quick.


I think it needs to be easy to configure, and easy to automatically download the matching version. Our docker images with tagged versions help, but not everyone uses those


It's a good point. Really what we're asking is: how do we make the entire development experience smoother, from writing your application all the way to deploying it and understanding what's happening in the data center.


Definitely. So I guess there are two main issues: what's needed, and how do we make it easy enough that it'll get used.


I think everyone agrees that a visualization of which peers are on which tasks is needed, right?


Yes most definitely


Also which hosts they're on


Cool. So, that's at least a direction everyone agrees on. @lsnape


Part of what I wanted to see in the replica query code, is queries to see what tasks are running on a host, with their task names (not just the ids)
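A sketch of the kind of host query being described. Note the replica shape assumed here (`:allocations` mapping job → task → peers, and `:peer-sites` mapping peer → host) is an illustration, not the real replica schema, and `task-id->name` is a hypothetical id-to-name lookup:

```clojure
;; List the tasks running on a given host, with task names resolved
;; from a lookup map. The replica keys used (:allocations,
;; :peer-sites) are assumptions made for illustration.
(defn tasks-on-host [replica task-id->name host]
  (for [[job tasks] (:allocations replica)
        [task-id peers] tasks
        peer peers
        :when (= host (get-in replica [:peer-sites peer :host]))]
    {:job job :task (task-id->name task-id) :peer peer}))
```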


But I think we should consider putting it in the dashboard or making this a rewrite


That'd be fine to me. I just don't want to bog any contributors down with another project. We could merge them together later to take the burden off them. I dunno, maybe easier said than done though.


I don't think it'll be too hard to merge later. I don't want to bog it down either


I guess if they're both using different front end techniques it could get messy


I'm actually pretty ok with scrapping the current dashboard


Mmkay, cool. Sounds like we're in agreement there. The tool is valuable, it can/should overlap/overtake the dashboard (even with some merging on our own), and we need to make deployment smoother so that it sees more usage.


@michaeldrogalis @lucasbradstreet sounds good, albeit slightly more ambitious :)


@lsnape: Continue to keep the scope small, really whatever you want to work on, we'll guide the merge and make sure your component fits right in.


hi - can anyone point me in the direction of examples of using onyx to do data joins?


in particular I am looking for something that aids with a non-equivalence join: a join based on a pattern match


@aaelony: Can the data set that you're joining on fit in memory? That's what it more or less always comes down to with how you approach it


In this case, the smaller dataset can probably fit. It's on the order of 10k rows


basically a listing of problem words to match up with other inbound text


just want to see if the words occur (or not) in the inbound text


@aaelony: Nah, that's a little different. Flow conditions control routing of segments between tasks in the workflow. I don't think we have a join example, but I'll take the time to make one in the next week or so because it comes up now and again. You'll basically want two input tasks, A and B, to merge into C. So your workflow would look like [[:a :c] [:b :c]], where C could use an atom as your join space. You'd want to use group-by-key to make sure segments with the same key get routed to the same machine. Does that make sense?
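A minimal sketch of the layout just described: two inputs merging into a join task that keeps an in-memory join space in an atom, with group-by-key routing same-keyed segments to the same peer. All task names and the `:id` key are illustrative, and other required catalog keys are elided for brevity:

```clojure
;; Two input tasks :a and :b merging into :c, as described above.
(def workflow [[:a :c] [:b :c]])

;; In-memory join space at :c. Segments sharing an :id accumulate
;; here; grouping (below) keeps each :id on one peer.
(def join-space (atom {}))

(defn join-segment [segment]
  (let [joined (swap! join-space update (:id segment) merge segment)]
    (get joined (:id segment))))

;; Catalog entry fragment for :c showing the grouping. Keys like
;; :onyx/flux-policy and batch sizing are sketched, not exhaustive.
(def join-task
  {:onyx/name :c
   :onyx/fn ::join-segment
   :onyx/type :function
   :onyx/group-by-key :id
   :onyx/flux-policy :kill
   :onyx/batch-size 20})
```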


We don't have a great story for joins on huge batch data sets; there's still some manual footwork you'd have to do. But for streaming joins it's pretty standard to either retain messages in memory or use stable storage, depending on what properties your application needs.


yes, that's great. It's also okay to assume that the smaller :a fits in memory, but the larger :b is quite big


@aaelony: One option you have is to preload data set A into memory via a lifecycle and simply do the workflow [[:big-input-data-set :join-task]]
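The lifecycle-based preload might look like the sketch below; `load-data-set-a` and the event-map key are hypothetical names, with `:lifecycle/before-task-start` as the hook that merges data into the task's event map before it starts:

```clojure
;; Hypothetical loader for the small data set A (the ~10k problem
;; words); stands in for reading from a file or database.
(defn load-data-set-a []
  {"badword" :flagged})

;; Runs once before the join task starts; the returned map is merged
;; into the task's event map, so task functions can reach the data.
(defn inject-data-set-a [event lifecycle]
  {:join/data-set-a (load-data-set-a)})

(def data-set-calls
  {:lifecycle/before-task-start inject-data-set-a})

;; Lifecycle entry wiring the calls map to the join task; in a real
;; job :lifecycle/calls is a keyword resolved to the var above.
(def lifecycles
  [{:lifecycle/task :join-task
    :lifecycle/calls ::data-set-calls}])
```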


cool, I will read up on that


@aaelony: Certainly. Happy to answer any questions along the way.


Thanks as well for the quick response :)


btw, slack auto-correct is super aggressive... it literally changes what I have written