This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2016-02-19
Not currently, but I just refactored state management to make this easier for us to implement. I'll make sure it's in 0.9.0
You're thinking about when you need to deploy upgrades to your application?
I’m probably not understanding something. My use case is aggregating/transforming metrics. I’d like to use fixed-window aggregation for each data stream, but I’m not sure how I would add a new window to the task without losing state. Instead, should I create tens of thousands of tasks, one for each data series? These task(s) run indefinitely. If I want to make a change to the task, how do I not lose state? Is there a way to access the aggregation state and write it to the db?
There’s no way to modify a running job - you have to kill a job, and submit another job. Unfortunately we currently have no good way to resume the state from the previous job. You could have a trigger that fires when the task is stopped, but it would be annoying to restore it back to the window states that were present before the job was stopped. I am about to add a feature to allow you to resume the state from your previous job, which I believe should solve your problem.
What about adding new series? Should each data series being aggregated be its own task, or should there only be one task with thousands of windows?
See https://github.com/onyx-platform/onyx/issues/319. Unfortunately it took a while to get to a state where I could implement this easily. Now that it’s refactored it should be easy enough.
That’s really up to you. It’ll be less expensive for one task to have lots of windows
You may want to split it up a bit if you’re going to have thousands of windows though.
Ultimately it’s something I’d like to support, but you might bump up against some constraints, as we haven’t profiled/benchmarked with thousands of windows yet
What kind of volume are you looking at?
currently ~300 metrics (10 GB/day) per site, potentially up to 1000 sites. I’m aggregating Modbus meter data.
You should be OK initially. I’d be happy to help fix any bottlenecks you hit as you scale up
I have improved the performance of Windows in the latest refactor, but need to do a bench to see where we're at performance wise
thank you. So when a new series is detected, the existing task would be cancelled and a new task added with an additional window? This new task would inherit the aggregation state from the previous task?
Ohhh, I think you can do this without stopping the task at all
You just need to group-by in your task
When you use group-by it will maintain separate windows for that key, which sounds like what you want
You still need the feature I mentioned before, but only when you need to modify the job
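A rough sketch of what that might look like as Onyx catalog and window data, following the 0.8-era information model. The task name, `:my.app/process-metric`, the `:series-id` group key, and the window range are invented for illustration, not taken from this conversation:

```clojure
;; Hypothetical catalog entry: group segments by an assumed :series-id
;; key so each data series gets its own window state.
(def catalog-entry
  {:onyx/name :aggregate-metrics
   :onyx/fn :my.app/process-metric   ; hypothetical task function
   :onyx/type :function
   :onyx/group-by-key :series-id     ; route segments per series
   :onyx/flux-policy :kill
   :onyx/min-peers 2
   :onyx/batch-size 20})

;; Hypothetical fixed window attached to that task.
(def window
  {:window/id :per-series-fixed
   :window/task :aggregate-metrics
   :window/type :fixed
   :window/aggregation :onyx.windowing.aggregation/conj
   :window/window-key :event-time
   :window/range [1 :hour]})
```

Since these are plain data maps, they can be built and inspected at the REPL before ever submitting a job.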
Right, but it sounds like that will be solved. Thanks for stepping through this with me. I’m new to generalized stream processing systems. I can envision how I would solve this with Kafka, but I don’t want to reinvent the wheel
Detecting new series should be automatic given the series will be on a new tag
No problem
Let me know if you hit any issues
Hi all.. first of all thanks for the awesome work… I’m planning to use Onyx to build a recommendation system. We currently have one using Scala but I’m much happier working in Clojure. I’ve started playing with Onyx by following all the different challenges in the learn-onyx github project. It’s pretty cool, though I’ve struggled to understand a few things related to :onyx.core/batch.
Yeah, the information model doc should be updated.
I think if you look at the message key on each leaf you’ll be able to see the segment
(message should really get renamed to :segment itself)
I couldn’t find anything else on the guide so I went to the repository to find out what this Leaf type was
Sorry about that. Our docs should get improved there
I’m not a huge expert on Clojure (more a polyglot golang/scala/c#/java/js developer than a Clojurist), but I would love to help you, as I definitely think Onyx is a great project..
Awesome. We're always looking for more help
@lucasbradstreet: some teething issues running the onyx-log-subscriber-demo. I’m running the zookeeper docker image and have changed the zookeeper connection string in list-cluster-status. However, when I eval list-cluster-status I get a stream of exceptions saying that the job scheduler couldn’t be discovered:
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /onyx/dev-tenancy/job-scheduler/scheduler
code: -101
path: "/onyx/dev-tenancy/job-scheduler/scheduler"
I’ve checked the ip and port values and rebooted the image a couple of times, but the behaviour is still the same
Lemme give it a go. I haven’t tried it yet
same result for me
I suspect the container doesn’t have the data in it. We’ll have to ask @michaeldrogalis
I don’t mind waiting. In the meantime I’ll think about how to represent the data visually
Does anyone have a good pattern for (let [x (something)] (side-effect) x)
I guess I should just reduce the size of my function and go with that
I mean, go with the pattern I wrote above. My function is getting a bit hairy which is why I don’t like it.
That said, I think that pattern looks a bit gross
I essentially want (try (finally x)) without it being called when an exception is thrown
I want to return x
It’s just that it gets kinda fuzzy what’s being returned when things get a bit nested
I think I just need to clean up the code
Sometimes I’ll use doto to do that, but it actually needs to be operating on the return
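A made-up illustration of the two patterns being discussed, in plain Clojure: bind the value, run the side effect, return the value; and `doto` for the case where the side effect operates on the value itself:

```clojure
;; The (let [x (something)] (side-effect) x) pattern: the names and
;; values here are invented for illustration.
(defn compute-and-log []
  (let [x (* 6 7)]             ; the value we want to return
    (println "computed:" x)    ; side effect
    x))                        ; return x, not println's nil

;; doto covers the case where the side effect acts on the value
;; itself: it threads the object through the calls and returns it.
(def labels
  (doto (java.util.ArrayList.)
    (.add "a")
    (.add "b")))
```

The `doto` form returns the `ArrayList` itself, which is exactly the "operating on the return" case mentioned above.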
Well, when you clean it up, point me at it, and I'll be a second set of eyes to see if it's understandable.
Cool, thanks
@lsnape: Ah, I'm a derp. The ZooKeeper image I'm using had an implicit volume for the ZK data directory. Sorry, I'll upload a new image tonight.
@michaeldrogalis: no problem at all.
Maybe this will be answered when I can stream the example, but is there a way to get the name of the task given the UUID?
@lsnape: It's not stored in the replica because it would inflate the amount of memory used on each machine. You can read all the task information right out of ZooKeeper though: /onyx/<tenancy-id>/tasks/<task-id>. We're going to make a REST server in the next week or so that gives you friendly query capability for things like that.
I'm not on my computer, so it's hard for me to link, but we could point to what the task lifecycle does for now
https://github.com/onyx-platform/onyx/blob/0.8.x/src/onyx/peer/task_lifecycle.clj#L442
You can get a "log" component by keying into the env that is returned by the log subscriber.
That belongs on a t-shirt
New learnings from the other day, that someone else also discovered on the clojure mailing list: dissoc’ing on a record turns it into a map
I’ve always assoc’d nil onto the records, so I was mostly safe, but it’s good to confirm that 😮
Huh, I didn't know that either.
The reason is that records have a known set of keys, so dissoc’ing one of them breaks the contract
Very good way to destroy performance though
Yeah, it totally makes sense. That's a gotcha for sure though
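A quick REPL-style check of the gotcha above, using a throwaway record type:

```clojure
;; A record has a declared set of basis keys.
(defrecord Point [x y])

(def p (->Point 1 2))

;; assoc'ing nil onto a declared field keeps the record type:
(record? (assoc p :y nil))   ; => true

;; dissoc'ing a declared field demotes the record to a plain map:
(record? (dissoc p :y))      ; => false
```

This is also why `assoc`'ing nil is the safer habit when you want record semantics (and their field-access performance) preserved.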
@leonfs: Sorry about that stumbling point in learn-onyx. Someone hit that during a live workshop, and I never got back to fix it up. Let me know if you have any questions, I'll try to get to fixing it tonight.
@michaeldrogalis: no worries… Thank you and Lucas for the great work you are doing.
@leonfs: I don't - but then again I don't know what matrix factorization is. Can you explain a bit?
That sounds familiar.
I'm assuming that matrix factorization is a CPU intensive task?
@leonfs: I'm assuming there's a catch, right? Is there a particular primitive that the platform needs to expose to be able to do that kind of computation?
I guess my question was oriented to know if there is a similar library like MLlib for Spark on Onyx..
Ah. There's not been much in the way of machine learning yet. We need to support iterative computation to be a strong competitor there.
Not yet. I’m hoping mikera of core.matrix will give it a whack one day. I used to work with him here in Singapore
@lucasbradstreet: just wanted to confirm what you said last night: group-by maintains separate windows for each key, meaning that no one peer needs to hold the windows for all of the keys in memory (assuming multiple peers).
Hi @joshg, yes, a particular peer will maintain all the windows for a particular key
And these windows will be independent for each key
So, using group-by is probably not feasible for thousands of data series if conj is used for aggregation
Cool.. I’ll keep learning Onyx while exploring how to implement those kind of algorithms.. I’ll keep you informed of the findings..
Yeah, personally I’d only really use conj if I’m going to periodically flush somewhere
Otherwise I’d want to do some further aggregation
We’re going to support transformers/lookup, so that for something like conj you could do stuff like grab a key out of the segment
For now you could write your own aggregation to reduce the amount of data you’re storing
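For example, a constant-space summary in place of conj. This is a plain-Clojure sketch of the reducing step, not Onyx's aggregation API; the `:value` field and the state shape are invented:

```clojure
;; Instead of conj'ing every segment into window state, keep a small
;; fixed-size summary per window.
(def empty-state
  {:count 0
   :sum   0.0
   :min   Double/POSITIVE_INFINITY
   :max   Double/NEGATIVE_INFINITY})

(defn step
  "Fold one segment into the running summary."
  [state {:keys [value]}]
  (-> state
      (update :count inc)
      (update :sum + value)
      (update :min min value)
      (update :max max value)))

(reduce step empty-state [{:value 3.0} {:value 1.0} {:value 2.0}])
;; => {:count 3, :sum 6.0, :min 1.0, :max 2.0}
```

The state stays the same size no matter how many segments arrive, which is the point being made about reducing the amount of data you store.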
@joshg: Refinement modes are really the thing to use when your windows are approaching large sizes.
Say I want to compute stats (histogram, the 99.9th percentile, etc) for the past hour's values for each series. I could discard after that hour, but it seems like I couldn’t scale the number of series beyond what a single peer could hold in memory unless I could spread the windows across peers
Right. In that case you'd just need more memory - and hence more peers.
You can spread the keys over peers though?
Each key having a number of windows
I assume when you say series that corresponds to a group key
Oh, I misunderstood. I’m not worried about the windows for a single key fitting in memory, I was worried that the windows for all of the keys had to fit on a single peer
Oh, it doesn't work like that. Keys are evenly distributed over all the peers on a given task.
group-key makes sure that windows for a particular key are maintained separately, as well as making sure segments are routed to the correct peer
More peers == more memory == more key spreading
I appreciate your patience and I apologize for seemingly asking the same thing again. Excellent, that’s what I thought.
No worries at all
Thanks for the clarification, that’s really neat. I’ve been trying to sell Onyx at my day job as well, for digesting notifications, and being able to scale the number of windows is critical.
Sessions are really handy, yep.
@joshg: the only real consideration there is that we currently have no way to repartition the keys over more peers, so you may want to initially over-provision the number of peers on the windowing task. If you get to the point where you outgrow them we’d love to be sponsored to work on repartitioning.
Just a heads up that it’s technically possible, but not a priority at the moment
@lucasbradstreet: would the task re-deploy support you mentioned yesterday solve that issue?
Not initially, but it’s touching very similar areas
It's fundamentally different from Storm/Spark/Flink etc. I have faith that if we get enough momentum behind it, wonderful things will happen.
Gotta run, back tonight!