2017-04-29
what's more common: people having a single "big" job in onyx, or many small jobs? i.e. in a CQRS setting, do you have a single job that calculates all aggregates, or one job per aggregate?
(i'm erring towards many small jobs, but am unsure about the performance difference and/or ease of use)
what is the preferred way to do aggregates? i get the idea there are multiple ways to achieve the same goal. let’s say that i want to count the number of events that go through my stream: i can use either an aggregate function, or a global window function which keeps track of a counter. what is the preferred way to do this? i thought that aggregate functions would be the way to go, but it appears as if these only emit their state after the job is done, and are not compatible with streaming jobs. is this correct?
When do you want the event count to be emitted downstream or sync’d to some external system?
Every x segments? Every y minutes?
What makes you say that?
Ahh, you compose grouping with windows, and windows with triggers.
Grouping routes, windows collect, triggers emit
This might help
Now, about your question on aggregate functions: are you talking about these aggregation functions? http://www.onyxplatform.org/docs/user-guide/latest/#_aggregation
https://github.com/onyx-platform/onyx-examples/blob/0.10.x/aggregation/src/aggregation/core.clj
For your example of counting events in your system: a global window with the :onyx.windowing.aggregation/count aggregation set, and no grouping fn, will maintain a count of all the events passing into the aggregate. You can then trigger periodically to either sync to an external service/db or emit downstream.
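For reference, a minimal sketch of such a window/trigger pair (assuming Onyx 0.10-style catalog maps; the task name :process-events, the five-second period, and the sync fn ::write-count! are illustrative, not from the thread):

(def windows
  [{:window/id          :count-events
    :window/task        :process-events                        ;; hypothetical task
    :window/type        :global
    :window/aggregation :onyx.windowing.aggregation/count}])

(def triggers
  [{:trigger/window-id  :count-events
    :trigger/id         :sync-count
    :trigger/on         :onyx.triggers/timer
    :trigger/period     [5 :seconds]
    :trigger/refinement :onyx.refinements/accumulating
    :trigger/sync       ::write-count!}])

;; sync fn: receives the current window state each time the trigger fires
(defn write-count! [event window trigger state-event extent-state]
  (println "events so far:" extent-state))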
If you wanted to get a count of events by their type, you’d then use a grouping fn to group by an attribute on the segment.
Then you’d have something like {:type/login 10 :type/logout 2 …}
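A sketch of that grouped variant (assuming the segments carry a :type key; the window map stays the same, only the task's catalog entry changes, and grouped tasks also need a flux policy and peer bounds):

{:onyx/name         :process-events
 :onyx/fn           :clojure.core/identity
 :onyx/type         :function
 :onyx/group-by-key :type   ;; per-:type window state, e.g. {:type/login 10 :type/logout 2}
 :onyx/flux-policy  :continue
 :onyx/min-peers    1
 :onyx/batch-size   20}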
Lastly, you asked about when to implement your own aggregation function. While the provided aggregation functions are general enough to do quite a bit with, you’ll often want to write your own that uses domain knowledge to better manage your aggregates.
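A sketch of what a custom aggregation looks like, following the shape of Onyx's built-in ones (the distinct-user counting logic is invented for illustration):

(defn distinct-users-init [window] #{})

;; produce a log entry describing how the state should change
(defn distinct-users-create-update [window state segment]
  [:add-user (:user-id segment)])

;; apply a log entry to the state
(defn distinct-users-apply-update [window state [op user-id]]
  (case op
    :add-user (conj state user-id)))

(def distinct-users
  {:aggregation/init                distinct-users-init
   :aggregation/create-state-update distinct-users-create-update
   :aggregation/apply-state-update  distinct-users-apply-update})

;; referenced from a window map as :window/aggregation ::distinct-users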
thanks for the info @gardnervickers!
@lmergen Sounds like you’re all sorted out, but I’m here if you have more questions about windowing.
too late @michaeldrogalis 😉
Heh, no complaints here. Glad you’re set!
just verifying whether i’m reading this correctly: if i’m doing a huge aggregate (e.g. grouping by session id, and having millions of sessions), and only want to emit or sync the keys that have actually changed, i want an accumulating trigger, right?
@lmergen Refinement modes are pertinent to whether data should continue to be stored in the aggregate after sync or emit are called.
Accumulating keeps it around, discarding flushes it.
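In trigger-map terms (same illustrative trigger as above), the refinement mode is just the :trigger/refinement key:

{:trigger/window-id  :count-events
 :trigger/id         :sync-count
 :trigger/on         :onyx.triggers/timer
 :trigger/period     [5 :seconds]
 ;; :onyx.refinements/accumulating keeps window state after each firing;
 ;; :onyx.refinements/discarding flushes it, so the next firing only sees
 ;; what arrived since the last one
 :trigger/refinement :onyx.refinements/discarding
 :trigger/sync       ::write-count!}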
so, if i have the ability to reduce keys in my aggregate store, it should be a discarding trigger
Indeed.
You could use boxes with very large amounts of memory and accumulating triggers.
But most of the time you’re going to want to flush it elsewhere. Those state stores aren’t meant to be permanent.
Well - how far in the past do you need to know? 🙂
Unique within a 1 month window of time? Or over all of time?
If you’re using Onyx’s window state as the basis for decisions, it does. If you need all information over all of time, you need some hefty memory on your cluster.
Windows have their name because they only permit you to see a slice of time, since presumably seeing all information for all time is too much to fit in memory.
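A sketch of such a time-sliced window (:event-time is an assumed timestamp key on the segments, and the month is approximated as thirty days; the task name is illustrative):

{:window/id          :sessions-last-month
 :window/task        :process-events                    ;; hypothetical task
 :window/type        :fixed
 :window/aggregation :onyx.windowing.aggregation/count
 :window/window-key  :event-time                        ;; assumed key on segments
 :window/range       [30 :days]}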
Application engineering time!
well i think “unique visitors” is just a very hard problem to begin with, especially when you have (common) business requirements like, “i want to dynamically change the timeframe when i query”
so then you’re back to the drawing board, and you come to the conclusion you need a discarding trigger anyway 🙂
@lmergen Yeah, I’d agree with that assessment. You can make the problem much more manageable by using :onyx/group-by-key to force all user ID events of the same value to go to the same peer
Then you can use many more boxes with smaller amounts of memory to handle it.
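In task-map terms, that scale-out looks roughly like this (task name and values are illustrative):

{:onyx/name         :sessionize
 :onyx/fn           :clojure.core/identity
 :onyx/type         :function
 :onyx/group-by-key :user-id   ;; the same user id always lands on the same peer
 :onyx/flux-policy  :recover   ;; how grouping rebalances if a peer is lost
 :onyx/min-peers    8          ;; spread state across more, smaller boxes
 :onyx/max-peers    8
 :onyx/batch-size   20}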
But I don’t think you get away without discarding data at some point
Its window state is kept in memory, but also durably checkpointed out to S3.
If the cluster goes down, it will roll back to its last consistent, fully ack’ed state.
Yeah, none of the window features are lossy by nature. If peers go down, it’ll recover correctly.
Resume points give you coordinates to carry through state from S3 between jobs. So, sort of.
resume points allow you to interact with the internal checkpointing that onyx does. that’s perhaps a better way of wording it?
Yes, correct.
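A sketch of that interaction, using the 0.10 onyx.api calls for resume points (peer-config, the tenancy id, and the job ids/map here are illustrative placeholders):

(require '[onyx.api])

;; coordinates of the old job's latest durable checkpoint
(def coords
  (onyx.api/job-snapshot-coordinates peer-config "my-tenancy" old-job-id))

;; attach a resume point so the new job starts from that state
(def new-job
  (assoc job-data :resume-point (onyx.api/build-resume-point job-data coords)))

(onyx.api/submit-job peer-config new-job)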
Anytime