Drew Verlee 00:07:37

@gardnervickers: Is creating your own aggregation uncommon? I would assume it would be normal to want to perform a calculation like standard deviation on a set of segments. I suppose I'm confused about why some aggregations are natively supported while others aren't. I'm worried I'm missing some larger piece of the puzzle.


@drewverlee: It depends on what you need. It's perfectly reasonable to do a conj aggregation and calculate the std. deviation in your trigger. But if you want incremental state updates and to store only deltas (rather than the whole segment) in the state backend, then Onyx exposes the aggregation API for you.

Drew Verlee 01:07:06

@gardnervickers: ok, interesting … I'll stew on that for a bit and hopefully come back with a better question. Thanks again! Thanks to Lucas too (don't want to ping him if he is offline)


It's a granularity thing; the conj aggregation just gives you a bag of aggregate data for your window. What you do with that is up to you.

(def average
  {:aggregation/init average-aggregation-fn-init ;; Initialize the aggregation state
   :aggregation/create-state-update average-aggregation-fn ;; calculate an average for the window
   :aggregation/apply-state-update set-value-aggregation-apply-log ;; write the results above to shared log
   :aggregation/super-aggregation-fn average-super-aggregation}) ;; collapse discrete averages into one larger average


So for std. deviation:

:aggregation/init - set your std. deviation init value, something like {:std-deviation 0 :sample-size 0}
:aggregation/create-state-update - calculate a std. deviation for your window, outputting {:std-deviation <X> :sample-size <X>}
:aggregation/apply-state-update - reuse set-value-aggregation-apply-log, writing the results above to the shared log
:aggregation/super-aggregation-fn - create one std. deviation from a bunch of {:std-deviation <X> :sample-size <X>} entries; I think you square the std deviations and divide by their sample sizes, then add?
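One caveat on that last step: two standard deviations alone generally can't be recombined exactly unless the group means are tracked as well. A minimal sketch of a correct recombine, not part of the Onyx API — `merge-stats`, `std-dev`, and the `{:n … :mean … :m2 …}` state shape are all hypothetical — using Chan's parallel variance algorithm:

```clojure
;; Sketch: combine two running-variance states. :m2 is the sum of
;; squared deviations from the mean; a correct merge needs the mean
;; and :m2, not just the std deviation itself.
(defn merge-stats
  [{na :n ma :mean m2a :m2} {nb :n mb :mean m2b :m2}]
  (let [n     (+ na nb)
        delta (- mb ma)]
    {:n    n
     :mean (+ ma (* delta (/ nb n)))
     :m2   (+ m2a m2b (/ (* delta delta na nb) n))}))

(defn std-dev [{:keys [n m2]}]
  (Math/sqrt (/ m2 n)))
```

A super-aggregation could then reduce over window states with `merge-stats` and call `std-dev` only when the final number is needed.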


Or if you wanted, you could just use an onyx.windowing.aggregation/conj aggregation for your segments [{:n 1} {:n 5} {:n 3} …], and in your trigger you would be passed the state [{:n 1} {:n 5} {:n 3} …]. Then you can just (std-deviation (map :n state))
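The `std-deviation` function here is a hypothetical helper, not something Onyx ships; a plain population standard deviation would do:

```clojure
;; Hypothetical helper for use inside a trigger: the population
;; standard deviation of a collection of numbers.
(defn std-deviation [xs]
  (let [n        (count xs)
        mean     (/ (reduce + xs) n)
        variance (/ (reduce + (map (fn [x]
                                     (let [d (- x mean)]
                                       (* d d)))
                                   xs))
                    n)]
    (Math/sqrt variance)))

;; e.g. (std-deviation [1 5 3])
```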


The former avoids keeping all those segments [{:n 1} {:n 5} {:n 3} …] around in each window when you really only want the std deviation (one number). The latter means you don't have to write the former, at the cost of having to store each segment in durable storage for a final recombine.


@drewverlee: Writing a custom aggregation taps into our exactly-once state update mechanism.


Try pretty hard not to use conj if you can. Personally I'd kinda like to lose it, because it is seductively easy to use and thus quite easy to create performance issues with.


He means delete it, and yeah. We're planning to drop it.


Has anyone tested the dashboard recently? I have a job running, but the dashboard does not show any jobs running; the event viewer shows that a job was submitted.


I think this might be a bug; can you submit an issue?


I sure can


Is there a proper way to actually stop a job? For instance, if I kill the job container and the peer container through Marathon, when that peer restarts, the job automatically restarts (not the container, but the job running on the peer). I ended up with lots of duplicate jobs in ZooKeeper because I kept restarting the submit-job container.


Yeah, onyx.api has a kill-job function
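A minimal sketch of the call, assuming `peer-config` and the `job` map are defined elsewhere (the exact shape of `submit-job`'s return map may vary across Onyx versions):

```clojure
;; Sketch: submit a job, then later kill it permanently by id.
;; Assumes peer-config and job are defined elsewhere.
(require '[onyx.api])

(let [{:keys [job-id]} (onyx.api/submit-job peer-config job)]
  ;; ... later, to stop the job for good:
  (onyx.api/kill-job peer-config job-id))
```

Killing via the API removes the need to restart containers just to stop a job, which is what was producing the duplicate jobs in ZooKeeper above.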


ok, just one of those little things that makes sense but I didn’t realize it until after it happened, lol


If you're just testing, I often just switch the tenancy-id and nuke the old one in ZooKeeper


kind of what I did


once I recognized the issue


also what is your recommendation for the number of physical peers?


vs virtual


I ended up scaling mine back to one physical peer for the time being because I was having a hard time jumping between all the nodes to find where the action was occurring


Like anything, I recommend getting your jobs running first, then worrying about performance. The general rule of thumb for optimal performance is 1 core per vpeer, but that is heavily dependent on your job.


yeah, that makes sense


Monitor your resource usage and go from there
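Following the 1-core-per-vpeer rule of thumb, starting peers might look like this sketch (assumes `peer-config` is defined elsewhere):

```clojure
;; Sketch: start one virtual peer per available core on this machine.
;; Assumes peer-config is defined elsewhere.
(require '[onyx.api])

(def n-vpeers (.availableProcessors (Runtime/getRuntime)))

(let [peer-group (onyx.api/start-peer-group peer-config)]
  (onyx.api/start-peers n-vpeers peer-group))
```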


The ongoing process of ingesting data from a websocket connection and sending it through a processing pipeline would be considered one job, right? There isn't a separate job for each message received?


On the monitoring topic, are there any examples of using Riemann?


@codonnell: in Onyx parlance, each message could also be called a "segment"


@camechis: not yet, but if you're willing to help write something up I would love to lend a hand getting it set up.


@gardnervickers: Thanks, that makes sense.


ok, I think we almost have our Riemann/Influx/Grafana env set up; I just wasn't sure how to actually send the events to Riemann based on the docs


but this could be because I haven't used Riemann yet, lol


Have you looked at onyx metrics?


actually I think I did take a peek a while back


@gardnervickers: Do you recommend kicking a job off through Marathon? Just wondering, because I would think that if for some reason the job container died, Marathon would relaunch it, creating a second instance of the job.


It seems roundabout to pass the channel in as an argument to build-lifecycles and store it in an atom living in the lifecycles namespace rather than just using the channel itself.


Onyx works on keyword paths to things that it resolves at runtime. This allows it to be data-driven. Code constructs (functions/objects/etc.) can't be reliably encoded into a data format, so Onyx requires that you define those separately and declaratively wire code constructs up with data.
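For example, a catalog entry names its function by fully-qualified keyword rather than passing the function value itself (the task name and namespace here are hypothetical):

```clojure
;; Hypothetical catalog entry: :onyx/fn is a keyword path that Onyx
;; resolves to the var my.app.tasks/process-segment at runtime, which
;; is what lets the whole job description be plain data.
{:onyx/name :process-segment
 :onyx/fn   :my.app.tasks/process-segment
 :onyx/type :function
 :onyx/batch-size 10}
```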


That said, the template has a good example of using task bundles to build a job. This is a higher-level feature to hide some of the verbosity when writing out these jobs.


@gardnervickers: Thanks. I'll take a closer look at the core-async plugin and how those examples are wired up.