#onyx
2016-09-20
lucasbradstreet06:09:43

@zamaterian: yes, I was trying to say you resubmit the job with a new job id, but the same remaining data

lucasbradstreet06:09:55

@vladclj @ me later and we'll figure out your kafka issue

smw07:09:12

Any way to hack around the limitations on modification from bulk functions?

smw07:09:10

ie: I’d love to be able to do a lookup in a task that uses the elasticsearch multi-search api to ‘enrich’ segments going through the pipeline

lucasbradstreet07:09:10

@smw ah yes, this is something I've wanted us to do long term. You can currently do it with lifecycle functions e.g. after-batch, but it requires some knowledge of onyx internals. I think what would be nice is onyx/batch-fn which is basically onyx/fn but where you're passed the whole batch.

lucasbradstreet07:09:27

Then you can use batch-fn or fn but not both
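A hypothetical sketch of what the proposed batch-level hook could look like: the task fn receives the whole batch, so it can make one bulk request (e.g. an Elasticsearch multi-search) per batch instead of one per segment. The key name, fn, and `bulk-lookup!` helper below are all illustrative, not a confirmed Onyx API.

```clojure
(defn enrich-batch
  "Takes the whole batch of segments; returns one vector of result
   segments per input segment. `bulk-lookup!` is an assumed helper that
   performs a single multi-search round trip."
  [segments]
  (let [responses (bulk-lookup! segments)]
    (mapv (fn [segment response]
            [(assoc segment :enrichment response)])
          segments responses)))

;; Task map sketch -- the batch flag and onyx/fn would be mutually
;; exclusive with a plain per-segment fn, per the discussion above:
{:onyx/name :enrich
 :onyx/fn ::enrich-batch
 :onyx/batch-fn? true
 :onyx/type :function
 :onyx/batch-size 50}
```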

smw08:09:32

amen 🙂 performance impact is pretty huge for my use case. so sad when I re-read the bulk function docs again more carefully.

lucasbradstreet08:09:42

It'd be useful for use with Facebook Haxl-style libraries like https://github.com/funcool/urania too

smw08:09:20

Hrrrm. Yes.

smw08:09:56

That’s neat stuff.

lucasbradstreet08:09:42

@smw if I branch Onyx and try a quick, quick patch for you, to see if it works, would you add a test, add the documentation to the information model, etc to bring it home? It’s hard for us to add too many features right now because we’re focused on getting the Asynchronous Barrier Snapshotting work out, but it’s definitely a feature we want.

smw08:09:21

Hey, absolutely. I’m about to pass out tonight, but yeah. I’m on it tomorrow.

lucasbradstreet08:09:21

I don’t imagine it’d be a lot more work once I have the code bit figured out. I just really don’t have the time to do all the extra bits

lucasbradstreet08:09:33

OK, let me just try a quick patch and see where we’re at tomorrow

vladclj08:09:41

Is there any way to fire a trigger with :window/type :global on the last segment? How can I write my aggregations to the DB all at once, when all the data has been processed?

lucasbradstreet08:09:51

@vladclj it will fire when the job is completed. The event-type described here will be :job-completed http://www.onyxplatform.org/docs/cheat-sheet/latest/#state-event/:event-type
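A sketch of that pattern, with key names as recalled from the 0.9 cheat sheet linked above: a `:global` window whose sync fn inspects the state event and only persists when the job has completed. The window/trigger ids and `write-to-db!` are illustrative.

```clojure
(def windows
  [{:window/id :all-segments
    :window/task :aggregate
    :window/type :global
    :window/aggregation :onyx.windowing.aggregation/count}])

(def triggers
  [{:trigger/window-id :all-segments
    :trigger/refinement :accumulating
    :trigger/on :segment
    :trigger/threshold [1 :elements]
    :trigger/sync ::write-to-db!}])

(defn write-to-db!
  [event window trigger {:keys [event-type] :as state-event} state]
  ;; The sync fn fires along the way, but only persists once, when the
  ;; state event reports job completion.
  (when (= :job-completed event-type)
    ;; write `state` (the accumulated aggregate) to the database here
    ))
```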

lucasbradstreet08:09:45

@smw it’s almost done but all I’ll be able to do on it from here is review

smw13:09:34

Awesome! I’m on it.

Drew Verlee14:09:50

really awesome article on stream processing. https://medium.baqend.com/real-time-stream-processors-a-survey-and-decision-guidance-6d248f692056#.gzsm722 only one problem, they don’t mention onyx! In that article they suggest that with the kappa arch, processing time would be correlated to the amount of data you needed to replay. Given that, it seems a bad fit for keeping a large amount of never-ending data (IoT data). When I have thought about leveraging Onyx in the past, I had always considered it in the context of using it in the Kappa architecture (in front of kafka). Given that onyx would need to replay streams of data (as opposed to sending the function to the data, as is done in HDFS), would it also be a bad fit (in terms of throughput) for large unbounded data?

lucasbradstreet14:09:58

@drewverlee thanks for the link. I’ll read it soon, but the overall conclusions for Onyx should currently be similar to Storm, and will be closer to Flink in the future.

Drew Verlee14:09:36

@lucasbradstreet In order to get increased throughput, wouldn’t onyx need to run on top of some sort of distributed storage?

Drew Verlee14:09:48

drastically increased throughput*

lucasbradstreet14:09:43

Scaling Onyx + Kafka out together can get you a pretty insanely long way I think. If you want to avoid replaying whole streams you can always aggregate to new topics as you go, and then rely on the raw topics for the last X days worth of data

lucasbradstreet14:09:52

I might be missing the point of your question though

lucasbradstreet14:09:11

Spark Streaming is the only one in there with very different tradeoffs

lucasbradstreet14:09:04

It’s definitely possible for some use cases that Spark and Spark Streaming may be uncatchable in throughput, but they’ll also be high latency

lucasbradstreet14:09:21

So it depends on what you really need

lucasbradstreet14:09:09

Finally saw that chart so I know what you’re asking now. I love IBM InfoSphere Streams all the way off there in the corner by itself

lucasbradstreet15:09:54

@smw: hold off on implementing that validation / schema change I mention in the PR. We're still deciding on how it should be expressed in the task map. If you can test that it works for you, that would be great though.

smw15:09:39

I’m working on integrating it with my use case as soon as I get off this damn call 🙂

smw15:09:52

Also, I’m @smw on github.

michaeldrogalis15:09:54

We'll definitely add it as a feature, we're debating on naming and whether or not to deprecate or substitute another feature in favor of it.

Drew Verlee16:09:08

@lucasbradstreet To be clear, I was trying to rephrase this observation from the article: > As a consequence, the Kappa Architecture should only be considered an alternative to the Lambda Architecture in applications that do not require unbounded retention times or allow for efficient compaction (e.g. because it is reasonable to only keep the most recent value for each given key). In our use case, collecting timeseries data and using it to create graphs over time (potentially years’ worth of data), there would be a cost to using any solution that didn’t send the function to the data (the main efficiency in Hadoop MR). If that isn’t the case, I would really like to know what my misunderstanding is, as I’m trying to help my company pivot towards something easier to manage than Storm + MR.

lucasbradstreet16:09:28

Ah, I see what you mean. When you say sending the function to the data, are you referring to the way Spark can utilise data locality in HDFS?

Drew Verlee16:09:15

@lucasbradstreet i can’t speak with certainty about how spark works, but i assume that yes, spark can utilise data locality. Does flink not do the same if it runs on YARN?

lucasbradstreet16:09:59

Yes, it definitely can, I just wanted to check that that was what you were referring to

lucasbradstreet16:09:04

I mean Spark definitely can

lucasbradstreet16:09:25

I’m unsure about Flink, I haven’t done as much research into how HDFS works with Flink because we don’t even have a HDFS plugin yet.

michaeldrogalis16:09:51

For what it's worth (a bit tangential to the conversation), data locality has become a whole lot less important in the last few years. Most data centers have network speeds that are fast enough to dwarf the cost of talking to a remote disk as if it were local.

michaeldrogalis16:09:10

Not saying you always have that benefit, but you can get that kind of set up more often than not.

michaeldrogalis16:09:18

I'd be a little surprised if Flink did more than a small amount of optimization when running on YARN for locality, but I'd have to look. My assumption was that YARN was primarily used for scheduling, and Flink's processing engine remained mostly the same.

Drew Verlee16:09:32

@michaeldrogalis I suppose, without having to understand all the ins and outs (as well as you do), would you say onyx would be an appropriate tool for processing timeseries data from IoT devices where we never want to throw away any data? My primary concern is that if we want to create new functionality over a key, we will have to reprocess the entire history. I assume this would be slower than what you could get out of HDFS, but I suppose I need to understand the difference better.

michaeldrogalis16:09:20

I think your question is more about application design than tool choice.

lucasbradstreet16:09:34

Yeah, application design and architecture more than Onyx vs Spark to me

lucasbradstreet16:09:52

You’d have no problem interacting with a timeseries DB from Onyx for example

lucasbradstreet16:09:51

That said, depending on the way it’s used, you may find other products support different plugins than us, and that sort of thing

Drew Verlee16:09:00

@lucasbradstreet is Onyx able to pull data from a DB in parallel?

michaeldrogalis16:09:42

Depends on the storage type and the plugin. If it's able to support some kind of partitioning scheme, then yes.

michaeldrogalis17:09:10

For example, onyx-sql does parallel reads by partitioning a table by a column on the row.
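A sketch of what such an onyx-sql input task looks like, with key names from my recollection of the plugin README (check the onyx-sql docs before relying on them); connection details are placeholders. The table is ranged over an id column so peers can read partitions in parallel.

```clojure
{:onyx/name :partition-keys
 :onyx/plugin :onyx.plugin.sql/partition-keys
 :onyx/type :input
 :onyx/medium :sql
 :sql/classname "com.mysql.jdbc.Driver"
 :sql/subprotocol "mysql"
 :sql/subname "//127.0.0.1:3306/my_db"
 :sql/user "user"
 :sql/password "password"
 :sql/table :people
 :sql/id-column :id          ;; the column the table is partitioned by
 :sql/rows-per-segment 500   ;; rows emitted per key range
 :onyx/batch-size 20}
```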

lucasbradstreet17:09:54

Yeah, another feature discussed above will allow you to batch your requests to DBs too (you could always do this in your inputs and outputs, but now you’ll be able to do it in your function tasks if you want)

lucasbradstreet17:09:55

Partitioning is key, as @michaeldrogalis says though

lucasbradstreet17:09:34

You can also improve locality in your workflow by using :onyx/group-by-key or by-fn to ensure that certain peers receive the segments that they have pre-fetched data for. You don’t need to be using windowing/aggregations to use that feature
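A sketch of that grouping feature on a plain function task: segments sharing a key are routed to the same peer, keeping that peer's pre-fetched cache warm. Task name, key, and fn are illustrative; as I recall the task map schema, grouped function tasks also need a flux policy and peer bounds.

```clojure
{:onyx/name :enrich
 :onyx/fn :my.app/enrich
 :onyx/type :function
 :onyx/group-by-key :user-id     ;; or :onyx/group-by-fn for custom routing
 :onyx/flux-policy :continue     ;; how grouping behaves when peers join/leave
 :onyx/min-peers 2
 :onyx/max-peers 4
 :onyx/batch-size 20}
```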

lucasbradstreet17:09:58

I think you’ll mostly get some of those features for free with Spark

lucasbradstreet19:09:33

RELEASE: onyx-kafka 0.9.10.1 has been released with some important bug fixes and performance improvements https://github.com/onyx-platform/onyx-kafka/blob/0.9.x/CHANGES.MD

smw22:09:39

any chance it might make sense to run some sort of lightweight java app server (wildfly?) to allow easy submit of new peer code without having to restart all your peers to deploy new functions?

smw22:09:43

would then let you easily spin up a cluster with swarm/mesos/k8s/aws ami

smw22:09:00

eh, I dunno, adds a bunch of security questions, I guess

smw22:09:24

Just hate that I’m writing one-off deployment infrastructure

gardnervickers22:09:11

Could you elaborate on what you mean by writing one off deployment infrastructure?

smw22:09:30

ie: I have a cluster of machines

smw22:09:42

in my case, they’re running mesos, but let’s just say I have 10 machines

smw22:09:07

how do you get from code-in-editor -> new code running on 10 machines

smw22:09:47

in my case, right now, I’m pushing to a private docker registry, submitting new json to marathon

smw22:09:53

(effectively)

smw22:09:13

but everything about that process is bespoke

smw22:09:19

would be nice if I could call (submit-job) with a jar file (or even better, have it automatically make one), a list of hosts, and an auth token or something

smw22:09:31

and have it deploy with my new code?

smw22:09:53

I think onyx basically tried to stay out of that piece, so that it works everywhere, right?

smw22:09:28

but I’m thinking you could avoid the initial ‘how do I get this running on all machines’ and still potentially cover ‘how do I send new code'

michaeldrogalis22:09:36

Ideally someone would rise up with some reusable deployment infrastructure that others could copy and use, or switch away from to other deployment methodologies later.

smw22:09:37

I asked about this earlier, we discussed possibly updating via repl connection

michaeldrogalis22:09:02

I dont have any experience with Wildfly -- it seems like it'd be an interesting experiment, and it doesnt need to land in Onyx core.

smw22:09:04

yep. I’m just thinking of that… and realizing that there already exists something in the java universe that handles most of this

smw22:09:10

sure 🙂

smw22:09:15

mostly brainstorming here

michaeldrogalis22:09:20

How exactly does it work?

michaeldrogalis22:09:30

Sure, yeah I think there's not enough conversation about this topic.

smw22:09:33

well, big java app servers often just have a deploy/ directory

smw22:09:40

and you drop a .war or an .ear in there

smw22:09:51

and it deploys a new version, 0-downtime

smw22:09:08

sometimes there’s a remote protocol for such things

smw22:09:27

I haven’t been paying attention much because I haven’t been writing enterprise java for a while

michaeldrogalis22:09:30

Certainly seems interesting. How do you get around the long uberjar time though if you want to go from editor -> deployed to cluster as your development environment?

smw22:09:44

but they were smooth at what they did 8 years ago or so

smw22:09:58

and I imagine they still do that sort of thing, and have become ‘lighter weight'

smw22:09:36

this is clojure integration with wildfly, that might be an interesting start

smw22:09:49

‘how do you get around the long uberjar time’ is tough

michaeldrogalis22:09:11

If all you want is more out-of-the-box deployment techniques, that seems like it'd be enough ^

smw22:09:35

but… as step one… right now I have to uberjar and then run deploy stuff which relaunches everything

michaeldrogalis22:09:58

We'd written a bunch of Ansible scripts, Docker compose files, K8s scripts, but for whatever reason people end up writing their own deployment stack every time

smw22:09:15

oh. maybe I just haven’t looked around enough 🙂

smw22:09:31

but also, I have mesos clusters (and ansible stuff to deploy mesos clusters on openstack) lying around

smw22:09:40

so … when you have a hammer...

smw22:09:34

I’ll play with it a bit

smw22:09:56

I don’t understand well the behavior if you were to kill a task and restart it during the processing of a job

smw22:09:00

but I assume that must work ok

gardnervickers22:09:04

I think that roping infrastructure into your development workflow is a slippery slope and often indicative of a problem somewhere else in your process.

michaeldrogalis22:09:16

It's tolerant to a fault, yes.

smw22:09:05

@gardnervickers I hear what you’re saying, and my ‘development’ workflow right now is mostly repl based, using (with-dev-env ...) like in the onyx-template

smw22:09:40

however… this project is long-running and ongoing, and I want metrics and stats that show me how this stuff runs against a real cluster

gardnervickers22:09:05

I do think we could offer a better way to do Docker-based deploys without too many changes, by essentially locking each jar to a job, treating tenancy-id as job-id and letting the underlying container scheduler handle resource allocation per job(container) type.

gardnervickers22:09:57

You could even home-roll this now by just having every peer try to submit a job idempotently on startup and locking the job scheduler to greedy
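A sketch of that home-rolled approach: lock the peer config's job scheduler to greedy and have every peer process attempt the submit at boot. Addresses and the `tenancy-id`/`job` vars are placeholders; under the greedy scheduler the first submitted job takes all peers, so duplicate submits of the same jar are effectively inert.

```clojure
(require '[onyx.api])

(def peer-config
  {:zookeeper/address "127.0.0.1:2188"
   :onyx/id tenancy-id                                ;; one tenancy per job/jar
   :onyx.peer/job-scheduler :onyx.job-scheduler/greedy
   :onyx.messaging/impl :aeron})

;; Every peer runs this on startup; the cluster converges on a single
;; running copy of the job.
(onyx.api/submit-job peer-config job)
```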

smw22:09:06

mmm, interesting

smw22:09:24

I don’t know what ‘locking the job scheduler to greedy’ means

smw22:09:27

but I’ll start reading more

michaeldrogalis22:09:56

Stuff like the above is why I've decidedly kept out of the deployment business with Onyx. K8s and Mesos were barely getting off the ground when I started the project.

smw22:09:06

Yeah, totally understand that.

michaeldrogalis22:09:32

It's pretty cool that we can adapt to new stuff though without a design change.

smw22:09:47

Just thinking there might be room for some optional add-ins that take care of pieces of it

smw22:09:08

I think there are lots of ways to get ‘a server’ running on machines

smw22:09:28

but solving the ‘how to get new code for a job onto the servers’ is more onyx-specific

Travis22:09:44

Willing to help on the mesos front since that is our target as well

smw22:09:21

I’m gonna screw with the app server workflow to see if it makes any sense

smw22:09:25

I’ll let you know what I find out.

michaeldrogalis22:09:08

Sounds like a plan.

gardnervickers22:09:29

As for "repl for your cluster" type functionality, while super interesting in itself I'm not convinced it would be as useful as it would appear at first glance. With app servers, nothing much changes except you're trading away the 20-30 seconds of JVM boot time in docker. You still need to kill your old peers, start your new peers, and start your job again. Docker's immutability "model" is sacrificed here as well. Not to mention that by stepping out of the orchestration Mesos/K8s provides you here, now you're going to have to write some ansible or something to ssh these newly built jars to each of your JVM application servers.

gardnervickers22:09:14

There is a lot of room for improvement here, no doubt about it. I think we'll gain the most in this arena by eliminating the multi-tenancy and elevating it up to Mesos/K8s as an orchestration framework concern. I also think we could do a lot better by tying checkpointing to the idempotent job identifier, so we could emulate the checkpointing features of Kafka Consumer Groups for all our plugins (being tied to a simple name/id)

gardnervickers23:09:21

As a user it would give me a lot more confidence to be able to say "start from row X, offset Y, or wherever the last job that shared my name last left off"

smw23:09:02

I imagine what would make it cool is if it provided an (authenticated, token code perhaps?) rest api where you could push a jar

aaelony23:09:45

I'm requiring onyx.plugin.kafka 0.9.10.1 and getting Unhandled clojure.lang.Compiler$CompilerException ... Caused by java.lang.IllegalAccessError kw->fn does not exist

aaelony23:09:06

should I back off the bleeding edge?

smw23:09:13

I saw that one!

smw23:09:27

I think it’s because you don’t have a new enough version of onyx?

smw23:09:30

They have to match?

Travis23:09:40

That's probably true

aaelony23:09:41

I thought I matched all of them...

aaelony23:09:43

[aero "1.0.0"]  ;; "1.0.0-beta2"]
                 [org.clojure/clojure "1.8.0"]
                 [org.clojure/tools.cli "0.3.5"]
                 [org.onyxplatform/onyx "0.9.10"] ;; "0.9.9"
                 [org.onyxplatform/lib-onyx "0.9.0.1"]

                 [joplin.core "0.3.6"]
                 [joplin.jdbc "0.3.6"]
                 [mysql/mysql-connector-java "5.1.38"]

                 [org.clojure/java.jdbc "0.4.2"]

                 [org.onyxplatform/onyx "0.9.10"] ;; "0.9.9"
                 ;; [org.onyxplatform/onyx-twitter "0.9.0.1"]
                 [org.onyxplatform/onyx-seq "0.9.9.0"]
                 [org.onyxplatform/onyx-metrics "0.9.10.0"] ;; "0.9.8.0"
                 [org.onyxplatform/onyx-kafka "0.9.10.1"] ;;  "0.9.9.0"
                 [org.onyxplatform/onyx-amazon-s3 "0.9.9.0"]

smw23:09:44

mine was because onyx.metrics didn’t match onyx version

michaeldrogalis23:09:46

The first three version components need to match across the project: for w.x.y.z, the w, x, and y parts need to match.
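Applied to the dependency list pasted above, that rule looks like the following (versions taken from this conversation; the trailing fourth component is each plugin's own patch level and may differ):

```clojure
[[org.onyxplatform/onyx "0.9.10"]
 [org.onyxplatform/onyx-kafka "0.9.10.1"]    ;; 0.9.10 prefix matches
 [org.onyxplatform/onyx-metrics "0.9.10.0"]] ;; 0.9.10 prefix matches
```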

smw23:09:46

seq and s3?

aaelony23:09:15

I think there was a reason why I didn't change them... I'll put it back up to snuff

smw23:09:43

Hey, I’m a newbie, but that’s my guess. Function moved to different namespace

aaelony23:09:01

hmmm. working now. I had everything onyx upgraded and it was still giving that kw->fn error. Then I upgraded clojure to 1.9.0-alpha12 and the error went away...

smw23:09:24

Is it intentional that the uberjar for onyx-template doesn’t define :main?

gardnervickers23:09:08

Yeah, you're meant to specify a main

gardnervickers23:09:52

I think it would be ok to default to your app core namespace but I would rather people discover they need to boot both the media driver and their peer separately
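For reference, specifying the main yourself is a small project.clj addition; the namespace names here are illustrative, and as noted above the Aeron media driver still needs to be booted as its own process.

```clojure
(defproject my-app "0.1.0-SNAPSHOT"
  ;; :main makes `java -jar` work on the uberjar without naming a class
  :main ^:skip-aot my-app.core
  :profiles {:uberjar {:aot [my-app.core]}})
```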

smw23:09:40

Thanks 🙂