Drew Verlee 00:04:51

How would I go about getting a sense for how Onyx compares to Hadoop in terms of performance? I realize that might be a horribly naive and open-ended question. However, my company's requirements are loosely defined at this point. We're building a Lambda architecture (Storm + Hadoop) and I'm worried we haven't considered alternatives. So I'm hoping to gather some information to help raise our awareness, and possibly help the team pivot towards a different solution. I suppose as a basic estimate I could see us having to perform jobs against 2TB-4TB of data within 8 months. So how would Onyx compare against Hadoop MapReduce on 3TB? I'm open to the idea that this question is the wrong one to be asking :simple_smile:.


Performance is highly dependent on a lot of factors; that said, Onyx is comparable to Storm in performance. Onyx's approach is to treat batch processing as a special case of stream processing, eliminating the need to develop/maintain both Hadoop and Storm. This allows you to reuse components of your batch job in your streaming job, which makes the kinds of problems you want to solve with the Lambda architecture (batch + continuous updates from a streaming engine) easier to reason about.
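As a rough sketch of what that reuse looks like in Onyx (the task and function names here are made up for illustration, and the exact keys may vary by version): the workflow and the transformation function stay the same across jobs, and only the input/output tasks change.

```clojure
;; A pure transformation function, usable by any job.
(defn enrich-event [segment]
  (assoc segment :processed? true))

;; The same workflow structure serves both the batch and
;; the streaming job; only the :read-events input task differs.
(def workflow
  [[:read-events :enrich]
   [:enrich :write-results]])

;; Streaming job: point :read-events at an unbounded source
;; (e.g. a Kafka topic). Batch job: point it at a bounded one
;; (e.g. a SQL table) and reuse :enrich unchanged.
```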


I believe the LinkedIn folks call this the “Kappa Architecture”, where you forgo a batch processing engine in favor of a sufficiently powerful stream processing engine.


There's pretty much no reason to build a Lambda Architecture anymore. I just took out 5 slides from my Clojure/west talk that more or less railed on it. So I guess here is as good a place as any to comment on that.


Until recently, there wasn't a practical way to unify the underlying engines that powered both batch and stream processing. There wasn't an engine that could do both, and there wasn't an API. So the Lambda Architecture was a good workaround. But it's a design that requires the user to get a ton of things right, including a complicated merge function, deduplication, etc. Plus there are two systems to not only build, but also maintain. It comes with a ton of baggage, and I hardly see anyone do it well.


Then windowing happened. Windowing is just the right interface to represent both batch and streaming problems, with minimal changes to switch back and forth. We finally got one API, but still having 2 engines sucked.
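To make the "one API" point concrete, here is a hedged sketch of a window definition, modeled on Onyx's windowing API (the :sum-events task name is hypothetical; check the cheat sheet for the exact keys in your version):

```clojure
;; A fixed one-hour window, aggregating over event time.
(def windows
  [{:window/id          :hourly-sums
    :window/task        :sum-events      ; hypothetical task name
    :window/type        :fixed
    :window/aggregation :onyx.windowing.aggregation/sum
    :window/window-key  :event-time
    :window/range       [1 :hour]}])

;; The same window definition applies whether the job's input
;; is a bounded (batch) or unbounded (streaming) source.
```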


About a year ago, the team that builds Flink wrote a paper on a technique called Asynchronous Barrier Snapshotting, which is what we're implementing now. The engine is significantly more capable than batch-only engines because it keeps data in constant motion. It's much more performant than micro-batching, too.


That lands us with 1 API and 1 engine. Hence, the Lambda Architecture has seen its best days.

Drew Verlee 00:04:49

Thanks a lot @michaeldrogalis and @gardnervickers. All my research has reached similar conclusions with regard to batching and streaming. Next week I'm going to be pitching Onyx to the team, and I'm looking to do so with as much information as I can. I'm worried there will be vague arguments about scalability, maturity, and performance that I won't be able to address. I feel confident I can explain how Onyx is easier to manage, deploy, and reason about, and cover windows, triggers, watermarks, event time vs. processing time, stream vs. batch, etc.

Drew Verlee 00:04:41

It might come down to "do the fing research" :simple_smile:


If you have any more questions feel free to ask

Drew Verlee 00:04:00

I suppose what I'm not clear about is whether the Hadoop ecosystem has performance characteristics that make it advantageous at certain sizes. From what little I can gather, Flink should be able to outperform Spark, which should be able to outperform MapReduce. That's a lot of shoulds based on very little research. Is that correct? Assuming it is: my understanding is that Onyx is 'easily' deployed on EC2 instances (though presumably it could run elsewhere?). Why not have Onyx leverage Hadoop's ecosystem like Flink does? Possibly I just asked why a fish can't fly.


Onyx is really just a library; if you can run Java and talk to ZooKeeper, you can deploy it. This makes it very easy to use different DevOps strategies (Docker, Ansible, etc.) to deploy Onyx.
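A minimal sketch of what "just a library" means in practice, assuming a reachable ZooKeeper ensemble (addresses and the tenancy key are placeholders; in older releases the key was :onyx/id rather than :onyx/tenancy-id, so verify against your version's docs):

```clojure
(require '[onyx.api])

;; Peer config pointing at an existing ZooKeeper ensemble.
(def peer-config
  {:zookeeper/address "zk1.internal:2181"     ; placeholder address
   :onyx/tenancy-id "my-cluster"
   :onyx.peer/job-scheduler :onyx.job-scheduler/balanced
   :onyx.messaging/impl :aeron
   :onyx.messaging/bind-addr "localhost"})

;; Start a peer group and some virtual peers inside this JVM.
(def peer-group (onyx.api/start-peer-group peer-config))
(def v-peers    (onyx.api/start-peers 10 peer-group))

;; From here, "deployment" is whatever runs this JVM process:
;; a Docker container, an Ansible-managed service, etc.
```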


I can't offer any insight on Hadoop performance in relation to Flink/Spark.

Drew Verlee 00:04:07

OK, cool. It seems apparent to me that I need to understand our needs better before I can start asking better questions.

Drew Verlee 02:04:55

@gardnervickers: I think a piece of the puzzle just fell into place. It seems reasonable that our team needs a place to store time-series data (OpenTSDB). But what we don't need is a separate mechanism (MapReduce on top of Hadoop) to perform computations. Put another way, it's reasonable that we could use Onyx for streaming and batch jobs, with data flowing in either from current events AND from 'historical' data stored in OpenTSDB. So if I had a topology that summed events, it could just as easily be applied to either. So Flink, Dataflow, and Onyx don't 'replace' the storage mechanism, but rather the tools built on top of it to run computations.
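A hedged sketch of that idea in catalog form (plugin names and keys below are illustrative, modeled on the onyx-kafka plugin; verify against its README): the summing task is shared, and only the input catalog entry differs between the live job and the historical job.

```clojure
;; Shared task: sums event values, regardless of input source.
(def sum-task
  {:onyx/name :sum-events
   :onyx/fn :my.app/sum-events       ; hypothetical function
   :onyx/type :function
   :onyx/batch-size 100})

;; Live job input: a Kafka topic of current events.
(def kafka-input
  {:onyx/name :read-events
   :onyx/plugin :onyx.plugin.kafka/read-messages  ; illustrative
   :onyx/type :input
   :kafka/topic "events"})           ; illustrative key

;; Historical job: same workflow, same :sum-events task,
;; with :read-events pointed at data exported from the
;; time-series store instead.
```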

Drew Verlee 02:04:13

Errr, none of this is really your concern :simple_smile:. As in, thanks for chatting with me about it.


Yeah, that's the proper way of thinking about it.


Flink/Onyx/Dataflow under the Lambda/Kappa architecture are more akin to a database's indexer than its storage.


Under the Lambda/Kappa architecture, you use something like Kafka to emulate the write-ahead log that a traditional SQL DB might have, then you use your stream processing framework to create different views on that write-ahead log.


And yes, the big win for the Kappa architecture is that by using a streaming platform, you can run the same kinds of queries on historical data as on new incoming data.


Windowing's big win is that it lets you talk about the stateful parts of both streaming and batch processing in the same way; that's what Mike was saying.

Drew Verlee 02:04:07

Right. OK, last question of the night. So do you think we're trading performance for simplicity by abandoning Hadoop (MapReduce, Spark, etc.) and pulling that data into Kafka and then Onyx? For what it's worth, I think simplicity matters much more in our case. I don't think our customers are going to notice a couple seconds' difference. If this question is better deferred to another resource, I'm open to suggestions.


If the decision was up to me I would ditch Hadoop entirely.


especially considering a relatively small dataset


I would choose a streaming framework. I'm obviously biased towards Onyx's approach, but that's because I believe it's the correct way to deal with distributed workflows.

Drew Verlee 02:04:49

> relatively small dataset

I assume this refers to my estimate of 3TB. Your comment seems to suggest you think we could drop OpenTSDB as well. How would that work? It still makes sense to me to have storage for the time-series data. I would probably just use Postgres until we had a use case for something else. For some context, we're basically building a monitoring tool. ^#&*, I'm asking way too much here. Regardless, I feel like I'm a lot closer to shedding some light on the topic for the rest of my team. I really appreciate your input on the matter. Hopefully I can repay the favor at some point.

Drew Verlee 02:04:58

Your GitHub page needs a "buy the Onyx team a beer" button. Or in my case, a case of beer.


Haha, if you're at Clojure/West that could be arranged 😉


Without knowing more about your specific use case, I don't think I could make that suggestion. I do recommend that you don't over-provision by adding in a bunch of specialty databases. Keeping it simple pays off in spades (as you know, coming from Clojure).


It's not mentioned in the onyx.plugin.sql documentation, but I assume :sql/id needs to be unique? Or is it just using the id to partition the segments for read-rows?


The latter. I think it should work without it being unique
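For reference, a hedged sketch of what a catalog entry for the SQL input plugin looks like (keys modeled on the onyx-sql plugin; the plugin var name has varied across versions, so verify against its README). The point being discussed is that :sql/id names the column used to range-partition the table into segments of rows, which is why it doesn't strictly need to be unique:

```clojure
(def read-rows-entry
  {:onyx/name :read-rows
   :onyx/plugin :onyx.plugin.sql/partition-keys ; or read-rows, per version
   :onyx/type :input
   :onyx/medium :sql
   :sql/classname "com.mysql.jdbc.Driver"       ; placeholder driver
   :sql/subprotocol "mysql"
   :sql/subname "//127.0.0.1:3306/mydb"         ; placeholder DB
   :sql/user "user"
   :sql/password "password"
   :sql/table :events
   :sql/id :id                 ; column used to partition reads
   :sql/rows-per-segment 1000
   :onyx/batch-size 100})
```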


I'm just looking through the code, and it looks like it should work :simple_smile:. I first tried querying MS-SQL using a column of type decimal; that never returned. It didn't give me any error message or log statements in onyx.log, it just spammed the log with: Not enough virtual peers have warmed up to start the task yet, backing off and trying again...


Hmm, did you start enough peers for the number of tasks you're using in your job?


It hasn't been tested with decimal values, though it could be made to work, assuming it doesn't already (the above issue seems unrelated).


Yes, I'm testing with just 4 tasks and 10 peers.


they end up timing out: ConductorServiceTimeoutException: Timeout between service calls over 5000000000ns


@drankard: Ah, it's probably started, but it seems like it's running out of memory, because that usually happens when GCs occur and Aeron times out.


Yeah, it seems like it, but it's a bit weird when the only diff is the :sql/id. I now use an int(10) column that is not unique, instead of the primary key, which is a decimal(18).


But thanks anyhow :simple_smile:. I'm up and running :thumbsup:


Trying to do some capacity planning around Onyx, but I can't help but feel some of the business + technical requirements I have are going to need to change for practical reasons. Is there a good post somewhere I can read, or does anybody have a few minutes to discuss?


The crux of my issue is that I fear having user-defined workflows will be difficult to support in terms of capacity, except for those users that are directly paying for it (ex: paying for an AWS instance directly/indirectly as part of a plan). Simply put, my understanding is that based on the number of cores and obvious things like memory allocated to the JVM, I can only have so many jobs running vs. actual physical machine resources. Naturally this makes sense, but I'm wondering if perhaps there's a better way to construct workflows to keep the same jobs running across users, rather than having many more distinct jobs due to differences in workflow schema.


You can partition using the tenancy-id
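A hedged sketch of what partitioning by tenancy id looks like (in older Onyx releases the key was :onyx/id rather than :onyx/tenancy-id, so check your version): peers and jobs submitted under the same tenancy id form one logical cluster, so each tenant can be given its own slice of peers.

```clojure
;; Peers started under different tenancy ids never execute
;; each other's jobs, even against the same ZooKeeper ensemble.
(def tenant-a-config
  {:zookeeper/address "zk:2181"    ; placeholder address
   :onyx/tenancy-id "tenant-a"})

(def tenant-b-config
  (assoc tenant-a-config :onyx/tenancy-id "tenant-b"))
```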


Well let me explain the domain a bit and maybe that will also help to understand


I have a command-line syntax (or UI generated command line) that works a bit like unix pipes


In an ideal world, I'd like to dynamically build a catalog out of the command line, using Onyx for the bulk of the pipeline, and I'm willing to have an extra step on the beginning/end edges as necessary for practical reasons.


My fear is that this is not something I can support, because some of these commands may run as long as the user wants (i.e. forever).


There's plenty of overlap between users for simple command pipelines, i.e. a single command is going to be pretty much the same for each user, differing only in where the input is coming from, but that is solvable using Kafka partitioning, for example.


same with the output


Where it gets tricky is when many commands are "piped" together, since there is an explosion of combinations of catalogs.


Constructing the catalog itself in this way isn't too difficult, but actually running these jobs at scale is different.


I can potentially charge users for usage/volume, but it needs to be cost-effective for me, i.e. I don't expect to have multiple machines per user


rather I'd like something more elastic I can shrink or grow


I can get out of this somewhat by not using Onyx and using regular Clojure functions, but for obvious reasons that's a huge loss in functionality.


Some of the work can also be done on the client. For example, I send a stream back to the client and, as it arrives, run various functions in the pipeline client-side, but this only applies when a single user is playing with something. I have a "preview"-like functionality I prototyped that works like this and more or less simulates simple things on the client.


How are you going to get out of this by using regular Clojure functions? Letting users run arbitrary functions without constraints is flawed no matter what platform you're using; look at how AWS Lambda works.


They aren't running arbitrary functions; there's a fixed set of functions.


no different than say running a unix command


I mean you could just run C at the command line, sure


but speaking at a higher-level, you run what you have installed


and what you have permission to do


Yea sure I get ya.


so I can happily construct a command-line that formats the hard drive, but that doesn't mean the system will let me do it


so to be clear, it's not doing anything like pr-string or eval'ing directly


you can just think that the syntax/UI generates 1 or more onyx tasks or catalog entries otherwise per command


so in that sense it's a good fit with onyx in that it is very easy to construct a catalog and workflow that maps to the user actions


where I fear this won't work is the combinations


Simply because there aren't enough resources to make it cost-effective if I have to hog several machines for one user's job.


Realistically, not every user is running a job all the time, and most users probably aren't running any jobs, but it's still a situation where you can easily end up needing hundreds of machines, if I'm understanding Onyx's need for free CPU cores correctly.


and not to mention memory and IO obviously


You can use large batch timeouts and batch sizes to have the peers act somewhat more cooperatively.
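A sketch of the knobs being referred to, as per-task catalog keys (values here are illustrative, not recommendations): a larger :onyx/batch-timeout makes a peer poll less aggressively when idle, and a larger :onyx/batch-size amortizes per-cycle overhead when data does arrive.

```clojure
;; Illustrative task entry with cooperative-leaning settings.
{:onyx/name :transform
 :onyx/fn :my.app/transform    ; hypothetical function
 :onyx/type :function
 :onyx/batch-size 500          ; larger batches per processing cycle
 :onyx/batch-timeout 1000}     ; ms to wait to fill a batch before proceeding
```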


I have a similar situation with processing internal logic with some CQRS-like flows


@ymilky: There's very strong isolation between tasks and across jobs; there's nothing that lets you share a task between 2 jobs. I'll think on an application-level design; it's an interesting question.


what I do there isn't great but it works to some degree


They won’t really coordinate together though.


I simply dispatch at various points using multimethods, such that I may have, say, 6 workflows to support my CQRS process, but they are the same 6 for all users.


also I want to be clear that the commands I was talking about earlier are not CQRS commands


though I've thought maybe that might be an answer


for example I could divide a pipeline to emit CQRS commands, which emit events, which emit commands (starting the lifecycle again)


For a pipeline-like command line, it would mean that it would have to go through that lifecycle a few times.


Not really ideal for that kind of processing, but it could work and cut the number of running jobs down dramatically.


from there I could throw a lot of hardware at it and shard and/or cluster resources as appropriate


To give you a bit more detail, the main thing I'm doing with all this is rewriting data streams and applying 1-n functions that do something with the stream in some way. It could be reordering it, windowing it in some way, transforming it, filtering it, etc.


so commands can be run on a single data point or an entire stream, which can be very small or infinite


I'm not sure if any of this makes sense to all of you, but hopefully it doesn't sound too crazy


And yes, it works in various ways; I've just been looking to make things more dynamic by using Onyx and various ways of bringing a bit more power to processing and rewriting these streams than just throwing some stupid chart UI or table at users with only fixed possibilities.


@michaeldrogalis: As far as isolation goes, that's fine really. I just fear I can't have so many different types of jobs with limited hardware resources. Rather, I need to think of a way to consolidate the architecture into using a fixed number of jobs that can adapt to the input as necessary, without breaking the advantages of using Onyx or limiting the end-user functionality too much.


What would you do about lifecycles/windows/etc.? You can't really share the state there.


Well in general, I don't need much state sharing. But yes, I wouldn't be able to run different windowing behaviors in the same job.


Rather I'm thinking I would have a fixed pool of job types and the system could only allow processing that supported one of the possibilities.


The dynamic parts within the jobs would have to be some kind of internal dispatch like a multi-method to the correct function


obviously that's a huge down-side, but I'm not sure I see a way around it


It sounds like you agree that supporting ad-hoc jobs that can potentially run a long time is not feasible without unlimited money


I can curb this a bit, I suppose, by having some kind of resource pool that users are allocated, so I have at least a ceiling on the total number of resources.


at that point it would mean either a single user or perhaps all users couldn't run new jobs until the cap was raised or resources were freed because jobs finished


It doesn't sound cost-effective or feasible though for a web facing system. If it was an internal company system, that probably would work alright.


The closest system I can think of that works anything like this is something like Databricks.


I believe they solve the issue simply by having customers pay for/allocate clusters. It's an option, but I'd like to have some simpler options, even if they are slower/shared between users, to at least make them want to pay for the product before having to pay to spin up 5 instances to run their jobs.


Anyway, if you have any thoughts at all, let me know. It's been bugging me, and I think for now I'll have to stick to the idea that, for random users, it's going to have to be picking from a fixed set/pool of predefined jobs that they can submit input to (via Kafka, for example). For people that need custom jobs, I'll likely need to add some kind of pay-per-use model that I can actually make money on and that is still easy/convenient (i.e. you pay for the machines/compute time).


For the predefined jobs, the variation between users will have to come via the input itself, ex: user A wants a stream rewritten in monthly buckets while user B wants a stream rewritten in minute buckets. Those could use the same job, with the bucket size as an input param or something. Contrived example, but hopefully you get the point.
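A hedged sketch of that "one shared job, parameterized per user" idea, using Onyx's :onyx/params mechanism (the :my/bucket-size key and the function are made up for illustration): keys listed in :onyx/params are read from the catalog entry and prepended as arguments to the task function.

```clojure
;; The task function takes the bucket size as its first argument.
(defn bucket-events [bucket-size segment]
  (assoc segment :bucket (quot (:timestamp segment) bucket-size)))

;; Catalog entry: :onyx/params names which entry keys to pass in.
(def bucket-task
  {:onyx/name :bucket-events
   :onyx/fn :my.app/bucket-events    ; hypothetical function
   :onyx/type :function
   :onyx/batch-size 100
   :my/bucket-size 60000             ; e.g. one-minute buckets, in ms
   :onyx/params [:my/bucket-size]})

;; User A submits the job with a monthly :my/bucket-size,
;; user B with a minute-sized one; the job is otherwise identical.
```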