#onyx
2018-03-13
dbernal14:03:22

what's the best way to aggregate a list of segments using group-by-key without knowing the size of the list? Basically I have an incoming set of segments that are related through their ids and I need to have them grouped. I'm having a hard time wrapping my head around using state as a way to group segments

michaeldrogalis15:03:57

@dbernal Windowing to the rescue! Why do you think you need to know the size of the collection?

dbernal15:03:50

ok, that's what I was starting to look into. I was mostly looking at the group-by-key test code and seeing how it was merging maps. I thought I needed to know the size of the collection so that the task would know when to emit the full aggregated collection. For a batch job, would I use the global window type and use the state map to handle the grouping of segments?

michaeldrogalis16:03:28

@dbernal Correct, yeah. You definitely want windows and triggers there.

michaeldrogalis16:03:55

group-by-key will segment data across windows, but the windows themselves will maintain the state, and triggers will emit aggregations
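A minimal sketch of how the pieces described above might fit together for the batch case, assuming segments carry an `:id` key. The `:global` window type, the `:onyx.windowing.aggregation/conj` aggregation, and the `:onyx.triggers/segment` trigger are Onyx built-ins; the task name, window id, trigger id, threshold, peer settings, and the `:my.app/write-groups` function name are hypothetical, and exact keys may differ between Onyx versions.

```clojure
;; Catalog entry: route segments with the same :id to the same peer.
;; :group-segments is a hypothetical task name.
(def group-task
  {:onyx/name :group-segments
   :onyx/fn :clojure.core/identity
   :onyx/type :function
   :onyx/group-by-key :id
   :onyx/flux-policy :kill   ; grouping tasks need a flux policy (value assumed)
   :onyx/min-peers 1         ; and a minimum peer count (value assumed)
   :onyx/batch-size 20})

;; A single global window on that task. With group-by-key set, the window
;; state is kept per key, so the collection size never has to be known.
(def windows
  [{:window/id :collect-by-id
    :window/task :group-segments
    :window/type :global
    :window/aggregation :onyx.windowing.aggregation/conj}])

;; A trigger on that window. :my.app/write-groups is a hypothetical
;; function that receives the accumulated state when the trigger fires.
(def triggers
  [{:trigger/window-id :collect-by-id
    :trigger/id :flush-groups
    :trigger/on :onyx.triggers/segment
    :trigger/threshold [1000 :elements]
    :trigger/sync :my.app/write-groups}])
```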

michaeldrogalis16:03:09

You can also access window contents without triggering via HTTP

dbernal16:03:49

@michaeldrogalis is there a particular way to trigger at the end of the batch? How would I be able to trigger after all segments have been processed by the task?

michaeldrogalis16:03:02

@dbernal What input storage are you reading out of?

michaeldrogalis16:03:50

@dbernal The job will complete after all segments have been read out of a SQL table, and you'll receive a trigger event for completion. You can flush them then

dbernal16:03:44

@michaeldrogalis is there a particular trigger type I would use?

michaeldrogalis16:03:33

It shouldn't particularly matter which one if all you're looking to do is flush at the end. Just make the trigger function a no-op for non-matching event types.

lucasbradstreet16:03:29

Almost all of the triggers count job-completed as a reason to trigger, e.g. https://github.com/onyx-platform/onyx/blob/0.12.x/src/onyx/triggers.cljc#L84
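A rough sketch of the "no-op on non-matching event types" idea. It assumes the five-argument shape Onyx passes to trigger functions and that the state event exposes `:event-type` and `:group-key`; the function name and the shape of the returned map are hypothetical.

```clojure
;; Hypothetical trigger function: does nothing unless the trigger fired
;; because the job completed, in which case it returns the grouped state.
(defn flush-on-completion
  [event window trigger state-event extent-state]
  (when (= :job-completed (:event-type state-event))
    {:group (:group-key state-event)   ; assumed key on the state event
     :segments extent-state}))
```

Because `when` returns `nil` for every other event type, intermediate trigger firings produce nothing and only the completion event flushes the group.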

dbernal16:03:07

ok cool. Got it! Thanks for all the help @michaeldrogalis @lucasbradstreet

dbernal18:03:43

is it possible to flush out aggregated segments to a downstream task?

lucasbradstreet18:03:17

generally you will want to use :onyx/type :reduce on the task that does the sending downstream, because you typically don’t want to pass down the segments that are being reduced
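A sketch of what that suggestion could look like in a job definition, under some assumptions: the task name, window id, trigger id, threshold, peer settings, and the `:my.app/flush-on-completion` function are hypothetical, and `:trigger/emit` is assumed to be the trigger key that sends the function's return value to downstream tasks. Whether a `:reduce` task also needs an `:onyx/fn` may depend on the Onyx version.

```clojure
;; Catalog entry marked as :reduce so the incoming segments themselves
;; are not passed downstream; only emitted aggregates are.
(def reduce-task
  {:onyx/name :group-segments
   :onyx/type :reduce
   :onyx/group-by-key :id
   :onyx/flux-policy :kill   ; value assumed
   :onyx/min-peers 1         ; value assumed
   :onyx/batch-size 20})

;; Trigger on a window attached to that task. The value returned by the
;; emit function is what downstream tasks would receive.
(def emit-triggers
  [{:trigger/window-id :collect-by-id
    :trigger/id :emit-groups
    :trigger/on :onyx.triggers/segment
    :trigger/threshold [1000 :elements]
    :trigger/emit :my.app/flush-on-completion}])
```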