This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2016-09-20
Channels
- # bangalore-clj (3)
- # beginners (30)
- # boot (117)
- # braid-chat (6)
- # carry (9)
- # cider (6)
- # clara (11)
- # cljs-dev (28)
- # cljsrn (12)
- # clojars (2)
- # clojure (114)
- # clojure-austin (2)
- # clojure-dev (1)
- # clojure-dusseldorf (1)
- # clojure-greece (47)
- # clojure-italy (5)
- # clojure-russia (79)
- # clojure-spec (121)
- # clojure-uk (133)
- # clojurescript (92)
- # community-development (67)
- # copenhagen-clojurians (1)
- # core-async (25)
- # cursive (67)
- # datascript (1)
- # datomic (34)
- # devcards (24)
- # emacs (8)
- # funcool (71)
- # juxt (1)
- # keechma (2)
- # lein-figwheel (6)
- # luminus (8)
- # mount (17)
- # om (135)
- # om-next (13)
- # onyx (147)
- # pedestal (11)
- # planck (7)
- # re-frame (42)
- # reagent (86)
- # rum (11)
- # specter (6)
- # testing (6)
- # untangled (1)
- # vim (6)
- # yada (24)
@zamaterian: yes, I was trying to say you resubmit the job with a new job id, but the same remaining data
@vladclj @ me later and we'll figure out your kafka issue
ie: I’d love to be able to do a lookup in a task that uses the elasticsearch multi-search api to ‘enrich’ segments going through the pipeline
@smw ah yes, this is something I've wanted us to do long term. You can currently do it with lifecycle functions e.g. after-batch, but it requires some knowledge of onyx internals. I think what would be nice is onyx/batch-fn which is basically onyx/fn but where you're passed the whole batch.
Then you can use batch-fn or fn but not both
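A minimal sketch of the difference, in plain Clojure (`:onyx/batch-fn` was only proposed above, not released, so treat the batch-level function as hypothetical; the enrichment itself is a stand-in):

```clojure
;; Today's :onyx/fn is called once per segment:
(defn enrich-segment [segment]
  (assoc segment :enriched? true))

;; The proposed :onyx/batch-fn would be called once with the whole
;; batch, so a single bulk/multi-search request could serve all of
;; the segments at once:
(defn enrich-batch [segments]
  ;; hypothetical: issue one bulk lookup here instead of N
  (mapv #(assoc % :enriched? true) segments))
```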
amen 🙂 performance impact is pretty huge for my use case. so sad when I re-read the bulk function docs again more carefully.
It'd be useful for use with Facebook haxl libraries like https://github.com/funcool/urania too
@smw if I branch Onyx and try a quick, quick patch for you, to see if it works, would you add a test, add the documentation to the information model, etc to bring it home? It’s hard for us to add too many features right now because we’re focused on getting the Asynchronous Barrier Snapshotting work out, but it’s definitely a feature we want.
I don’t imagine it’d be a lot more work once I have the code bit figured out. I just really don’t have the time to do all the extra bits
OK, let me just try a quick patch and see where we’re at tomorrow
Is there any way to fire a trigger with :window/type :global on the last segment? How can I write my aggregations to the DB at once, when all data has been processed?
@vladclj it will fire when the job is completed. The event-type described here will be :job-completed
http://www.onyxplatform.org/docs/cheat-sheet/latest/#state-event/:event-type
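A sketch of what that can look like in a trigger's sync function (key names as I recall them from the 0.9.x cheat sheet, so double-check them there; the window id and the DB write are placeholders):

```clojure
;; The sync fn receives the state-event; only flush the aggregation
;; to the DB once the job has completed.
(defn write-to-db! [event window trigger {:keys [event-type]} state]
  (when (= :job-completed event-type)
    ;; write `state` (the fully accumulated global window) to your DB
    (println "final aggregation:" state)))

;; Trigger entry (sketch):
{:trigger/window-id :collect-all
 :trigger/refinement :onyx.refinements/accumulating
 :trigger/on :onyx.triggers/segment
 :trigger/threshold [1 :elements]
 :trigger/sync ::write-to-db!}
```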
@smw it’s almost done but all I’ll be able to do on it from here is review
really awesome article on stream processing. https://medium.baqend.com/real-time-stream-processors-a-survey-and-decision-guidance-6d248f692056#.gzsm722 only one problem, they don't mention onyx! In that article they suggest that with the kappa architecture, processing time would be correlated to the amount of data you needed to replay. Given that, it seems a bad fit for keeping a large amount of never-ending data (IoT data). When I have thought about leveraging Onyx in the past, I had always considered it in the context of using it in the Kappa architecture (in front of Kafka). Given that Onyx would need to replay streams of data (as opposed to sending the function to the data, as is done in HDFS), would it also be a bad fit (in terms of throughput) for large unbounded data?
@drewverlee thanks for the link. I’ll read it soon, but the overall conclusions for Onyx should currently be similar to Storm, and will be closer to Flink in the future.
@lucasbradstreet In order to get increased throughput wouldn’t onyx need to run ontop of some sort of distributed storage?
drastically increased throughput*
Scaling Onyx + Kafka out together can get you a pretty insanely long way I think. If you want to avoid replaying whole streams you can always aggregate to new topics as you go, and then rely on the raw topics for the last X days worth of data
I might be missing the point of your question though
Spark Streaming is the only one in there with very different tradeoffs
It’s definitely possible for some use cases that Spark and Spark Streaming may be uncatchable in throughput, but they’ll also be high latency
So it depends on what you really need
Finally saw that chart so I know what you’re asking now. I love IBM InfoSphere Streaming all the way off there in the corner by itself
@smw: hold off on implementing that validation / schema change I mention in the PR. We're still deciding on how it should be expressed in the task map. If you can test that it works for you, that would be great though.
We'll definitely add it as a feature, we're debating on naming and whether or not to deprecate or substitute another feature in favor of it.
@lucasbradstreet To be clear, I was trying to rephrase this observation from the article: > As a consequence, the Kappa Architecture should only be considered an alternative to the Lambda Architecture in applications that do not require unbounded retention times or allow for efficient compaction (e.g. because it is reasonable to only keep the most recent value for each given key). In our use case, collecting timeseries data and using it to create graphs over time (potentially years' worth of data), there would be a cost to using any solution that didn't send the function to the data (the main efficiency in Hadoop MR). If that isn't the case, I would really like to know what my misunderstanding is, as I'm trying to help my company pivot towards something easier to manage than Storm + MR.
Ah, I see what you mean. When you say sending the function to the data, are you referring to the way Spark can utilise data locality in HDFS?
@lucasbradstreet i can’t speak with certainty about how spark works, but i assume that yes, spark can utilise data locality. Does flink not do the same if it runs on YARN?
Yes, it definitely can, I just wanted to check that that was what you were referring to
I mean Spark definitely can
I’m unsure about Flink, I haven’t done as much research into how HDFS works with Flink because we don’t even have a HDFS plugin yet.
For what it's worth (a bit tangential to the conversation), data locality has become a whole lot less important in the last few years. Most data centers have network speeds that are fast enough to dwarf the cost of talking to a remote disk as if it were local.
Not saying you always have that benefit, but you can get that kind of setup more often than not.
I'd be a little surprised if Flink did more than a small amount of optimization when running on YARN for locality, but I'd have to look. My assumption was that YARN was primarily used for scheduling, and Flink's processing engine remained mostly the same.
@michaeldrogalis I suppose, without having to understand all the ins and outs (as well as you do), would you say Onyx would be an appropriate tool for processing timeseries data from IoT devices where we never want to throw away any data? My primary concern is that if we want to create new functionality over a key, we will have to reprocess the entire history. I assume this would be slower than what you could get out of HDFS, but I suppose I need to understand the difference better.
I think your question is more about application design than tool choice.
Yeah, application design and architecture more than Onyx vs Spark to me
You’d have no problem interacting with a timeseries DB from Onyx for example
That said, depending on the way it’s used, you may find other products support different plugins than us, and that sort of thing
@lucasbradstreet is Onyx able to pull data from a DB in parallel?
Depends on the storage type and the plugin. If it's able to support some kind of partitioning scheme, then yes.
For example, onyx-sql does parallel reads by partitioning a table on one of its columns.
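The idea behind that kind of partitioned read, illustrated in plain Clojure (this is not the onyx-sql API itself; check the plugin README for the real task-map keys):

```clojure
;; Split an id range into partitions that independent peers could
;; each read in parallel with a bounded WHERE clause.
(defn id-partitions [min-id max-id rows-per-partition]
  (for [start (range min-id (inc max-id) rows-per-partition)]
    {:low  start
     :high (min max-id (+ start (dec rows-per-partition)))}))

(id-partitions 1 10 4)
;; => ({:low 1 :high 4} {:low 5 :high 8} {:low 9 :high 10})
```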
Yeah, another feature discussed above will allow you to batch your requests to DBs too (you could always do this in your inputs and outputs, but now you’ll be able to do it in your function tasks if you want)
Partitioning is key, as @michaeldrogalis says though
You can also improve locality in your workflow by using :onyx/group-by-key
or by-fn to ensure that certain peers receive the segments that they have pre-fetched data for. You don’t need to be using windowing/aggregations to use that feature
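A task-map sketch for that (the function name, key, and peer counts are placeholders; per the information model, grouping tasks also need a flux policy and fixed peer counts):

```clojure
{:onyx/name :enrich
 :onyx/fn :my.app/enrich          ;; placeholder function
 :onyx/type :function
 :onyx/group-by-key :user-id      ;; same :user-id always routes to the same peer
 :onyx/flux-policy :kill
 :onyx/min-peers 4
 :onyx/max-peers 4
 :onyx/batch-size 20}
```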
I think mostly you’ll get some of those features for free with Spark
RELEASE: onyx-kafka 0.9.10.1 has been released with some important bug fixes and performance improvements https://github.com/onyx-platform/onyx-kafka/blob/0.9.x/CHANGES.MD
any chance it might make sense to run some sort of lightweight java app server (wildfly?) to allow easy submit of new peer code without having to restart all your peers to deploy new functions?
Could you elaborate on what you mean by writing one off deployment infrastructure?
in my case, right now, I’m pushing to a private docker registry, submitting new json to marathon
would be nice if I could call (submit-job) with a jar file (or even better, have it automatically make one), a list of hosts, and an auth token or something
I think onyx basically tried to stay out of that piece, so that it works everywhere, right?
@smw Correct, yes.
but I’m thinking you could avoid the initial ‘how do I get this running on all machines’ and still potentially cover ‘how do I send new code'
Ideally someone would rise up with some reusable deployment infrastructure that others could copy, and it would allow other people to use that, or switch to other deployment methodologies later.
Right, yeah.
I dont have any experience with Wildfly -- it seems like it'd be an interesting experiment, and it doesnt need to land in Onyx core.
yep. I’m just thinking of that… and realizing that there already exists something in the java universe that handles most of this
How exactly does it work?
Sure, yeah I think there's not enough conversation about this topic.
I haven’t been paying attention much because I haven’t been writing enterprise java for a while
Certainly seems interesting. How do you get around the long uberjar time though if you want to go from editor -> deployed to cluster as your development environment?
Heh, amen
If all you want is more out-of-the-box deployment techniques, that seems like it'd be enough ^
but… as step one… right now I have to uberjar and then run deploy stuff
which relaunches everything
We'd written a bunch of Ansible scripts, Docker compose files, K8s scripts, but for whatever reason people end up writing their own deployment stack every time
but also, I have mesos clusters (and ansible stuff to deploy mesos clusters on openstack) lying around
Haha, yup
I don’t understand well the behavior if you were to kill a task and restart it during the processing of a job
I think that roping infrastructure into your development workflow is a slippery slope and often indicative of a problem somewhere else in your process.
It's tolerant to a fault, yes.
@gardnervickers I hear what you’re saying, and my ‘development’ workflow right now is mostly repl based and using (with-dev-env
like in the onyx-template
however… this project is long-running and ongoing, and I want metrics and stats that show me how this stuff runs against a real cluster
I do think we could offer a better way to do Docker-based deploys without too many changes, by essentially locking each jar to a job, treating tenancy-id as job-id and letting the underlying container scheduler handle resource allocation per job(container) type.
You could even home-roll this now by just having every peer try to submit a job idempotently on startup and locking the job scheduler to greedy
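A peer-config fragment along those lines (0.9.x key names from memory, so verify against the peer-config docs; `:onyx/id` is the tenancy id here, and the values are placeholders):

```clojure
{:onyx/id "my-job-v1"            ;; one tenancy per job/jar, doubling as the job id
 :onyx.peer/job-scheduler :onyx.job-scheduler/greedy  ;; the one job takes all peers
 :zookeeper/address "zk1:2181"}
```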
Stuff like the above is why I've decidedly kept out of the deployment business with Onyx. K8s and Mesos were barely getting off the ground when I started the project.
It's pretty cool that we can adapt to new stuff though without a design change.
Just thinking there might be room for some optional add-ins that take care of pieces of it
Agreed
Sounds like a plan.
As for "repl for your cluster" type functionality, while super interesting in itself, I'm not convinced it would be as useful as it would appear at first glance. With app servers, nothing much changes except you're trading away the 20-30 seconds of JVM boot time in Docker. You still need to kill your old peers, start your new peers, and start your job again. Docker's immutability "model" is sacrificed here as well. Not to mention that by stepping out of the orchestration Mesos/K8s provides you here, now you're going to have to write some Ansible or something to ssh these newly built jars to each of your JVM application servers.
There is a lot of room for improvement here, no doubt about it. I think we'll gain the most in this arena by eliminating the multi-tenancy and elevating it up to Mesos/K8s as an orchestration framework concern. I also think we could do a lot better by tying checkpointing to the idempotent job identifier, so we could emulate the checkpointing features of Kafka Consumer Groups for all our plugins (being tied to a simple name/id)
As a user it would give me a lot more confidence to be able to say "start from row X, offset Y, or wherever the last job that shared my name last left off"
Agree with that @gardnervickers
I imagine what would make it cool is if it provided an (authenticated, token code perhaps?) rest api where you could push a jar
I'm requiring onyx.plugin.kafka 0.9.10.1 and getting Unhandled clojure.lang.Compiler$CompilerException ... Caused by java.lang.IllegalAccessError kw->fn does not exist
[aero "1.0.0"] ;; "1.0.0-beta2"]
[org.clojure/clojure "1.8.0"]
[org.clojure/tools.cli "0.3.5"]
[org.onyxplatform/onyx "0.9.10"] ;; "0.9.9"
[org.onyxplatform/lib-onyx "0.9.0.1"]
[joplin.core "0.3.6"]
[joplin.jdbc "0.3.6"]
[mysql/mysql-connector-java "5.1.38"]
[org.clojure/java.jdbc "0.4.2"]
[org.onyxplatform/onyx "0.9.10"] ;; "0.9.9"
;; [org.onyxplatform/onyx-twitter "0.9.0.1"]
[org.onyxplatform/onyx-seq "0.9.9.0"]
[org.onyxplatform/onyx-metrics "0.9.10.0"] ;; "0.9.8.0"
[org.onyxplatform/onyx-kafka "0.9.10.1"] ;; "0.9.9.0"
[org.onyxplatform/onyx-amazon-s3 "0.9.9.0"]
The first three version segments need to match across the project: for w.x.y.z, the w, x, and y need to match
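A tiny sanity check for that rule (a hypothetical helper, not part of Onyx):

```clojure
(require '[clojure.string :as str])

;; True when two onyx-platform artifact versions agree on the first
;; three segments (the w.x.y of w.x.y.z).
(defn versions-compatible? [v1 v2]
  (= (take 3 (str/split v1 #"\."))
     (take 3 (str/split v2 #"\."))))
```

For example, `(versions-compatible? "0.9.10" "0.9.10.1")` is true, while `(versions-compatible? "0.9.10" "0.9.9.0")` is false.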
hmmm. working now. I had everything onyx upgraded and it was still giving that kw->fn error. Then I upgraded clojure to 1.9.0-alpha12 and the error went away...
https://github.com/onyx-platform/onyx-template/blob/0.9.x/src/leiningen/new/onyx_app/project.clj#L17
Yea, you're meant to specify a main
I think it would be ok to default to your app core namespace but I would rather people discover they need to boot both the media driver and their peer separately