2017-04-29
what's more common: people having a single "big" job in onyx, or many small jobs? i.e. in a CQRS setting, do you have a single job that calculates all aggregates, or one job per aggregate?
(i'm erring towards many small jobs, but am unsure about the performance difference and/or ease of use)
what is the preferred way to do aggregates? i get the idea there are multiple ways to achieve the same goal. let’s say that i want to count the number of events that go through my stream: i can use either an aggregate function, or a global window function which keeps track of a counter. what is the preferred way to do this? i thought that aggregate functions would be the way to go, but it appears as if these only emit their state after the job is done, and are not compatible with streaming jobs. is this correct?
When do you want the event count to be emitted downstream or sync’d to some external system?
Every x segments? Every y minutes?
What makes you say that?
Ahh, you compose grouping with windows, and windows with triggers.
Grouping routes, windows collect, triggers emit
This might help
Now, about your question on aggregate functions: are you talking about these aggregation functions? http://www.onyxplatform.org/docs/user-guide/latest/#_aggregation
https://github.com/onyx-platform/onyx-examples/blob/0.10.x/aggregation/src/aggregation/core.clj
For your example of counting events in your system: a global window with the :onyx.windowing.aggregation/count aggregation set, and no grouping fn, will maintain a count of all the events passing into the aggregate. You can then trigger periodically to either sync to an external service/db or emit downstream.
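For reference, a minimal sketch of such a window/trigger pair (assuming Onyx 0.10-style catalog maps; the task name :process-events, the five-second period, and the sync fn ::write-count! are illustrative, not from the thread):

(def windows
  [{:window/id          :count-events
    :window/task        :process-events                        ;; hypothetical task
    :window/type        :global
    :window/aggregation :onyx.windowing.aggregation/count}])

(def triggers
  [{:trigger/window-id  :count-events
    :trigger/id         :sync-count
    :trigger/on         :onyx.triggers/timer
    :trigger/period     [5 :seconds]
    :trigger/refinement :onyx.refinements/accumulating
    :trigger/sync       ::write-count!}])

;; sync fn: receives the current window state each time the trigger fires
(defn write-count! [event window trigger state-event extent-state]
  (println "events so far:" extent-state))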
If you wanted to get a count of events by their type, you’d then use a grouping fn to group by an attribute on the segment.
Then you’d have something like {:type/login 10 :type/logout 2 …}
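A sketch of that grouped variant (assuming the segments carry a :type key; the window map stays the same, only the task's catalog entry changes, and grouped tasks also need a flux policy and peer bounds):

{:onyx/name         :process-events
 :onyx/fn           :clojure.core/identity
 :onyx/type         :function
 :onyx/group-by-key :type   ;; per-:type window state, e.g. {:type/login 10 :type/logout 2}
 :onyx/flux-policy  :continue
 :onyx/min-peers    1
 :onyx/batch-size   20}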
Lastly, you asked about when to implement your own aggregation function. While the provided aggregation functions are general enough to do quite a bit with, you’ll often want to write your own that uses domain knowledge to better manage your aggregates.
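A sketch of what a custom aggregation looks like, following the shape of Onyx's built-in ones (the distinct-user counting logic is invented for illustration):

(defn distinct-users-init [window] #{})

;; produce a log entry describing how the state should change
(defn distinct-users-create-update [window state segment]
  [:add-user (:user-id segment)])

;; apply a log entry to the state
(defn distinct-users-apply-update [window state [op user-id]]
  (case op
    :add-user (conj state user-id)))

(def distinct-users
  {:aggregation/init                distinct-users-init
   :aggregation/create-state-update distinct-users-create-update
   :aggregation/apply-state-update  distinct-users-apply-update})

;; referenced from a window map as :window/aggregation ::distinct-users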
thanks for the info @gardnervickers!
@lmergen Sounds like you’re all sorted out, but I’m here if you have more questions about windowing.
too late @michaeldrogalis 😉
Heh, no complaints here. Glad you’re set!
just verifying whether i’m reading this correctly: if i’m doing a huge aggregate (e.g. grouping by session id, and having millions of sessions), and only want to emit or sync the keys that have actually changed, i want an accumulating trigger, right?
@lmergen Refinement modes are pertinent to whether data should continue to be stored in the aggregate after sync or emit are called.
Accumulating keeps it around, discarding flushes it.
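In trigger-map terms (same illustrative trigger as above), the refinement mode is just the :trigger/refinement key:

{:trigger/window-id  :count-events
 :trigger/id         :sync-count
 :trigger/on         :onyx.triggers/timer
 :trigger/period     [5 :seconds]
 ;; :onyx.refinements/accumulating keeps window state after each firing;
 ;; :onyx.refinements/discarding flushes it, so the next firing only sees
 ;; what arrived since the last one
 :trigger/refinement :onyx.refinements/discarding
 :trigger/sync       ::write-count!}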
so, if i have the ability to reduce keys in my aggregate store, it should be a discarding trigger
Indeed.
You could use boxes with very large amounts of memory and accumulating triggers.
But most of the time you’re going to want to flush it elsewhere. Those state stores aren’t meant to be permanent.
Well - how far in the past do you need to know? 🙂
Unique within a 1 month window of time? Or over all of time?
If you’re using Onyx’s window state as the basis for decisions, it does. If you need all information over all of time, you need some hefty memory on your cluster.
Windows have their name because they only permit you to see a slice of time, since presumably seeing all information for all time is too much to fit in memory.
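A sketch of such a time-sliced window (:event-time is an assumed timestamp key on the segments, and the month is approximated as thirty days; the task name is illustrative):

{:window/id          :sessions-last-month
 :window/task        :process-events                    ;; hypothetical task
 :window/type        :fixed
 :window/aggregation :onyx.windowing.aggregation/count
 :window/window-key  :event-time                        ;; assumed key on segments
 :window/range       [30 :days]}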
Application engineering time!
well i think “unique visitors” is just a very hard problem to begin with, especially when you have (common) business requirements like, “i want to dynamically change the timeframe when i query”
so then you’re back to the drawing board, and you come to the conclusion you need a discarding trigger anyway 🙂
@lmergen Yeah, I’d agree with that assessment. You can make the problem much more manageable by using :onyx/group-by-key to force all user ID events of the same value to go to the same peer
Then you can use many more boxes with smaller amounts of memory to handle it.
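In task-map terms, that scale-out looks roughly like this (task name and values are illustrative):

{:onyx/name         :sessionize
 :onyx/fn           :clojure.core/identity
 :onyx/type         :function
 :onyx/group-by-key :user-id   ;; the same user id always lands on the same peer
 :onyx/flux-policy  :recover   ;; how grouping rebalances if a peer is lost
 :onyx/min-peers    8          ;; spread state across more, smaller boxes
 :onyx/max-peers    8
 :onyx/batch-size   20}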
But I don’t think you get away without discarding data at some point
Its window state is kept in memory, but also durably checkpointed out to S3.
If the cluster goes down, it will roll back to its last consistent, fully ack’ed state.
Yeah, none of the window features are lossy by nature. If peers go down, it’ll recover correctly.
Resume points give you coordinates to carry through state from S3 between jobs. So, sort of.
resume points allow you to interact with the internal checkpointing that onyx does. that’s perhaps a better way of wording it?
Yes, correct.
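A sketch of that interaction, using the 0.10 onyx.api calls for resume points (peer-config, the tenancy id, and the job ids/map here are illustrative placeholders):

(require '[onyx.api])

;; coordinates of the old job's latest durable checkpoint
(def coords
  (onyx.api/job-snapshot-coordinates peer-config "my-tenancy" old-job-id))

;; attach a resume point so the new job starts from that state
(def new-job
  (assoc job-data :resume-point (onyx.api/build-resume-point job-data coords)))

(onyx.api/submit-job peer-config new-job)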
Anytime