#onyx
2016-02-26
zcaudate07:02:29

hey guys, great project! I’m wondering if there is a tutorial for setting up the basic environment/bare necessities in order to make onyx run.

zcaudate07:02:55

I’ve been looking at guides without much success

gardnervickers07:02:10

@zcaudate I can help out with that

zcaudate07:02:28

oh that’s great

gardnervickers07:02:32

We have the onyx template

gardnervickers07:02:03

lein new onyx-app <app-name> -- +docker will get you a dockerized Onyx cluster
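For anyone following along, a minimal session might look like the sketch below; the app name is made up, and the docker-compose step assumes the +docker profile generates a compose file (check the generated README):

```
lein new onyx-app my-meetup-app -- +docker   # scaffold a dockerized Onyx project
cd my-meetup-app
docker-compose up                            # assumption: +docker emits a docker-compose.yml
```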

gardnervickers07:02:23

it streams data from http://meetup.com into kafka, then does some light processing, and writes the results to mysql

zcaudate07:02:38

@gardnervickers: thanks… trying it now

gardnervickers07:02:29

There are some large-ish changes we’re going to put out in a couple days around organizing things a bit better.

gardnervickers07:02:39

But if you run into any issues, let me know

zcaudate07:02:01

the docker instance is great

gardnervickers07:02:37

Yup, the one thing is the meetup.com->Kafka setup is a little janky. Sometimes the DNS won’t resolve, and there’s no error handling, since it’s just curl’ing the http://meetup.com update stream right into Kafka

gardnervickers07:02:32

I have a Twitter plugin that’s almost ready, which will replace that so people can get up and running quickly writing Onyx jobs

zcaudate07:02:07

that will be great! also… I’ve been watching the talks… are there any examples of UIs that are built on top of onyx?

zcaudate07:02:45

it seems like a workflow ui would be very easy to fit on top of the system

gardnervickers07:02:29

Not that I know of, but @michaeldrogalis recently did some work on a REST server for viewing the cluster state, and we’re actively trying to find the time/resources to work on some cool visualizations.

gardnervickers07:02:16

The cluster state read from ZooKeeper, called the “replica”, has a TON of useful information. Visualizing active workflows would be really great
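As a rough illustration of what’s in there: the replica is just a Clojure map. The exact keys vary by Onyx version, and the values below are invented placeholders, not real cluster data:

```clojure
;; Illustrative shape of the replica map -- keys vary by Onyx version,
;; and these values are placeholders, not taken from a real cluster.
{:jobs        [job-id]                                    ; submitted job ids
 :peers       [peer-id]                                   ; live virtual peers
 :allocations {job-id {task-id [peer-id]}}                ; which peers run which tasks
 :peer-sites  {peer-id {:address "10.0.0.1" :port 40200}} ; where each peer lives
 :peer-state  {peer-id :active}}                          ; per-peer status
```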

gardnervickers07:02:38

One idea that was floated was to make a graph of executing tasks and show a heatmap of latency on top

gardnervickers07:02:53

Since it’s all data-driven, building up workflows from a (java|clojure)script frontend fits really well.
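To make that concrete: an Onyx workflow is just a vector of [from to] edges, so a frontend only needs to assemble plain data like the sketch below (task names are hypothetical) and submit it along with the rest of the job.

```clojure
;; A workflow is plain data: a vector of [upstream-task downstream-task]
;; edges. Task names here are made up for illustration.
(def workflow
  [[:read-meetup-events :extract-fields]
   [:extract-fields     :write-to-mysql]])
```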

zcaudate07:02:11

yeah exactly

zcaudate07:02:46

it’ll be great to have that sort of control over the process

mccraigmccraig11:02:01

@michaeldrogalis: i got to the bottom of my issue - nothing to do with onyx in the end - some misconfiguration of the kafka mesos framework was causing the broker logs to be stored on ephemeral container storage, with the predictable consequences when the kafka cluster got restarted

lucasbradstreet11:02:15

Mmm. Not a recipe for a good time. Good to hear that we don't need to fix anything.

mccraigmccraig12:02:04

@lucasbradstreet: i'm looking on the bright side - much easier to fix me having been dumb than an occasional race condition across many components :simple_smile:

michaeldrogalis15:02:08

@zcaudate: Infrastructure aside, there's also this if you need a hand learning: https://github.com/onyx-platform/learn-onyx

lsnape15:02:38

early days but here’s where I’m deploying onyx log UI: https://secret-chamber-21526.herokuapp.com/

lsnape15:02:33

focusing on the log viewport on the left-hand side atm. Going to add some more info per entry: the peer-id and time

lsnape15:02:24

the idea is to click the log entries and see a visualisation of the replica state

lsnape15:02:48

or have it in scrolling mode somehow if you want realtime visualisations

michaeldrogalis15:02:40

That'd be really great. Should make it easy to remotely diagnose problems in a cluster with an easy-to-read UI.

lucasbradstreet15:02:45

I'm on board with this, but we should think about the ways that the current dashboard is failing and what to do about them, since it's kinda looking like a dashboard rewrite

lucasbradstreet15:02:01

I'm ok with it because the current dashboard is a bit of a mess

lsnape15:02:00

I had a play around with lib-onyx earlier. I see you’ve decided to encapsulate consuming from the channel and only present the latest replica state. Apart from add-watching the state atom, I couldn’t think of an easy way of streaming the events, so I’m still using the Onyx API
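A minimal sketch of that add-watch approach, assuming lib-onyx holds the latest replica in an atom (the atom below is a stand-in for whatever lib-onyx maintains):

```clojure
(require '[clojure.core.async :as async])

(def replica-atom (atom {}))  ; stand-in for the atom lib-onyx keeps updated

(def replica-events (async/chan (async/sliding-buffer 100)))

;; Turn replica updates into a stream of events for the UI.
(add-watch replica-atom :ui-stream
           (fn [_key _ref old-replica new-replica]
             (when (not= old-replica new-replica)
               (async/put! replica-events new-replica))))
```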

lsnape15:02:41

@lucasbradstreet: so the idea is to have one mega dashboard? I guess that makes more sense

lucasbradstreet15:02:42

I'm not sure but I'm already seeing overlap here

michaeldrogalis15:02:44

@lsnape @lucasbradstreet I don't think one huge dashboard is the way to go. I think what Lucas is saying is that the existing dashboard lets you see the log entries in the same way you're starting out with now, but it doesn't let you do much more than that. So the point he was making is that we should figure out what the underlying use case is there.

lsnape15:02:21

@michaeldrogalis: gotcha. So what I was aiming for is a way of seeing peer-to-task allocation. The log entries are really just a way of indexing and navigating between the states.

lsnape15:02:00

shall I hold fire for now until we have a better idea of what to build?

michaeldrogalis15:02:14

@lsnape That's pretty much what I had in mind to build. @lucasbradstreet I don't see much overlap if it's just the list of entries being used for navigation. You more or less need to see what transitions are happening for it to be useful.

lucasbradstreet15:02:26

Ok, so close to what the console dashboard is doing?

michaeldrogalis15:02:24

@lucasbradstreet: Closer to that, yeah - and more featureful. And I think that might be a good thing to move into the browser anyway. Thoughts?

lucasbradstreet15:02:24

My main concern with the web interface dashboards is that they really need to be set up as part of deployment, with the web port open to the outside. I think that's why the current dashboard doesn't get all that much use (that, and we don't keep the jars up to date)

lucasbradstreet15:02:10

The console dashboard is easy. You can run it over ssh on your servers if you need; otherwise you just need access to the ZK port. I guess that's true of the web dashboard too, though.

lucasbradstreet15:02:58

I think I mostly want to figure out how to make sure it gets used

michaeldrogalis15:02:20

The console will eventually be limited by what you can display. I kind of agree that there's an advantage to having the replica viewer on the command line, but for visualizing what the scheduler is doing, I think the browser wins out

lsnape15:02:45

I get that there’s more work involved for someone to deploy and serve up a web dashboard. As a user of Onyx I wouldn’t really want to do this more than once, i.e. have more than one dashboard

lsnape15:02:43

something else to maintain, that’s one thing to consider

michaeldrogalis16:02:19

Maybe we just need to make the deployment story easier than it is for the current Dashboard. I would 100% take the time to deploy the tool in question if it existed. The value is very high

lucasbradstreet16:02:12

Yeah, the thing that makes me hesitate is that the dashboard currently streams the log entries into itself. So it seems like a lot of overlap if what this is trying to achieve is better visualisation

lucasbradstreet16:02:33

The current dashboard can do stuff like dump logs too, which would have to be rewritten

michaeldrogalis16:02:23

@robert-stuttaford probably has the best insight for what would make a tool like that easiest to deploy in the wild, and what would make it most useful.

gardnervickers16:02:26

I could see providing some om.next parse multimethods along with a ClojureScript wrapping component to make jumping into developing this stuff quick.
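Presumably something along these lines; the query keys and app-state layout below are assumptions, not an existing API:

```clojure
(ns dashboard.parser
  (:require [om.next :as om]))

;; Hypothetical om.next read multimethod resolving replica-related
;; query keys out of the client-side app state.
(defmulti read-replica (fn [_env key _params] key))

(defmethod read-replica :replica/allocations
  [{:keys [state]} key _params]
  {:value (get @state key)})

(defmethod read-replica :default
  [{:keys [state]} key _params]
  {:value (get @state key)})

(def parser (om/parser {:read read-replica}))
```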

lucasbradstreet16:02:15

I think it needs to be easy to configure, and easy to automatically download the matching version. Our docker images with tagged versions help, but not everyone uses those

michaeldrogalis16:02:56

It's a good point. Really what we're asking is, how do we make the entire development experience smoother - from writing your application all the way to deploying it and understanding what's happening in the data center.

lucasbradstreet16:02:49

Definitely. So I guess there are two main issues: what's needed, and how do we make it easy enough that it'll get used.

michaeldrogalis16:02:12

I think everyone agrees that a visualization of which peers are on which tasks is needed, right?

lucasbradstreet16:02:15

Yes most definitely

lucasbradstreet16:02:27

Also which hosts they're on

michaeldrogalis16:02:52

Cool. So, that's at least a direction everyone agrees on. @lsnape

lucasbradstreet16:02:53

Part of what I wanted to see in the replica query code is queries to see what tasks are running on a host, with their task names (not just the ids)
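A sketch of what such a query could look like, assuming the replica shape sketched earlier; `task-id->name` stands in for a lookup built from the job's catalog:

```clojure
;; Sketch: which tasks are running on a given host, with readable names.
;; Assumes the replica keys sketched earlier; `task-id->name` is a
;; hypothetical lookup built from the job's catalog.
(defn tasks-on-host [replica task-id->name host]
  (let [peers-here (set (for [[peer site] (:peer-sites replica)
                              :when (= host (:address site))]
                          peer))]
    (for [[job-id task->peers] (:allocations replica)
          [task-id peer-ids]   task->peers
          :when (some peers-here peer-ids)]
      {:job-id    job-id
       :task-id   task-id
       :task-name (task-id->name task-id)})))
```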

lucasbradstreet16:02:18

But I think we should consider putting it in the dashboard or making this a rewrite

michaeldrogalis16:02:58

That'd be fine by me. I just don't want to bog any contributors down with another project. We could merge them together later to take the burden off them. I dunno, maybe easier said than done though.

lucasbradstreet16:02:30

I don't think it'll be too hard to merge later. I don't want to bog it down either

lucasbradstreet16:02:10

I guess if they're both using different front end techniques it could get messy

lucasbradstreet16:02:08

I'm actually pretty ok with scrapping the current dashboard

michaeldrogalis16:02:04

Mmkay, cool. Sounds like we're in agreement there. The tool is valuable, it can/should overlap/overtake the dashboard (even with some merging on our own), and we need to make deployment smoother so that it sees more usage.

lsnape19:02:17

@michaeldrogalis @lucasbradstreet sounds good, albeit slightly more ambitious :simple_smile:

michaeldrogalis19:02:54

@lsnape: Continue to keep the scope small - really, whatever you want to work on - and we'll guide the merge and make sure your component fits right in.

aaelony20:02:18

hi - can anyone point me in the direction of examples of using onyx to do data joins?

aaelony20:02:50

in particular I am looking for something that helps with a non-equivalence join, i.e. a join based on a pattern match

michaeldrogalis20:02:55

@aaelony: Can the data set that you're joining on fit in memory? That's what it more or less always comes down to with how you approach it

aaelony20:02:49

In this case, the smaller dataset can probably fit. It's on the order of 10k rows

aaelony20:02:18

basically a listing of problem words to match up with other inbound text

aaelony20:02:51

just want to see if the words occur (or not) in the inbound text

michaeldrogalis20:02:19

@aaelony: Nah, that's a little different. Flow conditions control routing of segments between tasks in the workflow. I don't think we have a join example, but I'll take the time to make one in the next week or so because it comes up now and again. You'll basically want two input tasks, A and B, to merge into C. So your workflow would look like [[:a :c] [:b :c]], where C could use an atom as your join space. You'd want to use group-by-key to make sure segments with the same key get routed to the same machine. Does that make sense?
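A rough sketch of that shape; the task names, the :user-id join key, and the atom-based join space are all illustrative, and no eviction or fault tolerance is handled:

```clojure
;; Two inputs merging into one join task.
(def workflow
  [[:a :c]
   [:b :c]])

;; On task :c's catalog entry, :onyx/group-by-key :user-id routes
;; segments with the same key to the same peer.

(def join-space (atom {}))  ; in-memory join buffer, keyed by :user-id

(defn join-segments [{:keys [user-id] :as segment}]
  (let [buffered (-> (swap! join-space update user-id (fnil conj []) segment)
                     (get user-id))]
    (if (= 2 (count buffered))                ; both sides seen for this key
      (do (swap! join-space dissoc user-id)   ; clear the buffer
          (apply merge buffered))             ; emit the joined segment
      [])))                                   ; otherwise emit nothing yet
```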

michaeldrogalis20:02:22

We don't have a great story for joins on huge batch data sets; there's still some manual footwork you'd have to do. But for streaming joins it's pretty standard to either retain messages in memory or use stable storage, depending on what properties your application needs.

aaelony20:02:25

yes, that's great. It's also okay to assume that the smaller :a fits in memory, but the larger :b is quite big

michaeldrogalis20:02:15

@aaelony: One option you have is to preload data set A into memory via a lifecycle and simply do the workflow [[:big-input-data-set :join-task]]
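A sketch of that lifecycle idea; the file name, keys, and namespace are made up:

```clojure
(ns my.app.lifecycles
  (:require [clojure.java.io :as io]))

(defn load-problem-words []   ; stand-in for loading data set A
  (with-open [r (io/reader "problem-words.txt")]
    (set (line-seq r))))

;; Whatever before-task-start returns is merged into the event map,
;; so the join task can pick up :join/problem-words from there.
(def preload-calls
  {:lifecycle/before-task-start
   (fn [event lifecycle]
     {:join/problem-words (load-problem-words)})})

(def lifecycles
  [{:lifecycle/task :join-task            ; hypothetical task name
    :lifecycle/calls :my.app.lifecycles/preload-calls}])
```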

aaelony20:02:43

cool, I will read up on that

michaeldrogalis20:02:03

@aaelony: Certainly. Happy to answer any questions along the way.

aaelony20:02:41

Thanks as well for the quick response :simple_smile:

aaelony20:02:36

btw, slack auto-correct is super aggressive... it literally changes what I have written