#onyx
2016-04-06
leonfs09:04:07

Hi guys.. question about a challenge in the learn-onyx project

leonfs09:04:13

on challenge 6.0

leonfs09:04:20

this is what it says..

leonfs09:04:21

;; Fire the trigger every 10 segments we see. This corresponds to the number
;; of items in our toy data set - which gives us the predictability of
;; knowing that it will fire 3 times.
:trigger/threshold [10 :elements]

lucasbradstreet09:04:28

Hi leonfs. Go for it

leonfs09:04:32

but on the test data there are only 9 elements..

lucasbradstreet09:04:30

Hmm yes, I can see how that is confusing. What’s probably happening is that it’s firing on task completion as well

lucasbradstreet09:04:06

the :onyx.triggers/segment trigger will also fire on job / task completion so that you can flush your windows even if you haven’t hit the threshold
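A minimal sketch of the full trigger entry under discussion, assuming 0.9-series key names; the window id :collect-segments is a placeholder, and ::deliver-promise! is the sync fn shown below:

{:trigger/window-id :collect-segments
 :trigger/refinement :onyx.refinements/accumulating
 :trigger/on :onyx.triggers/segment
 :trigger/threshold [10 :elements]
 :trigger/sync ::deliver-promise!}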

lucasbradstreet10:04:04

That appears to be the case. We need to fix up the wording / case to be less confusing

lucasbradstreet10:04:06

;; Atom that accumulates the state delivered for each window extent.
(def fired-window-state (atom {}))

(defn deliver-promise! [event window trigger {:keys [window-id lower-bound upper-bound event-type]} state]
  (println "Event type was " event-type)
  ;; The extent's bounds arrive as millisecond timestamps; wrap them as dates.
  (let [lower (java.util.Date. lower-bound)
        upper (java.util.Date. upper-bound)]
    (swap! fired-window-state assoc [lower upper] (into #{} state))))

lucasbradstreet10:04:00

I pulled out the event type key and it seems to be triggered for "Event type was :job-completed"

leonfs10:04:19

what exactly counts as task completion? the input channel closing?

lucasbradstreet10:04:42

The :done being read on all input tasks, and the segments being acked through the whole workflow

lucasbradstreet10:04:48

i.e. all data has been processed
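A minimal sketch of how that looks with the core.async input plugin used in learn-onyx; in-chan and input-segments here are placeholders:

(require '[clojure.core.async :as async])

;; Write the toy data, then the :done sentinel. Reading :done on every
;; input task (with all segments acked) is what completes the job.
(doseq [segment input-segments]
  (async/>!! in-chan segment))
(async/>!! in-chan :done)
(async/close! in-chan)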

leonfs10:04:32

but in a production environment, let’s say with input data coming from Kafka, you can’t effectively know what “all” the input is, right?

lucasbradstreet10:04:40

Yes, that’s right. For those jobs there’d be no way to “complete” the job, only kill a long running job

leonfs10:04:17

when I set the threshold to 9 the debug output is:

leonfs10:04:18

Event type was :new-segment
Event type was :new-segment
Event type was :new-segment
Event type was :job-completed
Event type was :job-completed
Event type was :job-completed

leonfs10:04:29

makes sense..

leonfs10:04:12

one for each window.. once when the threshold kicks in and once on job completion..

leonfs10:04:53

how can I see which extent the delivery came from? I tried to use window-id but I got nil

lucasbradstreet10:04:05

The extent maps directly to lower-bound and upper-bound

lucasbradstreet10:04:29

I think generally you only care about what the lower and upper bounds of the windows are, not their extents
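For reference, a sketch of the kind of fixed window these extents come from; each extent is one :window/range bucket, and lower-bound/upper-bound in the sync fn are that bucket's boundaries in milliseconds (the key values here are placeholders):

{:window/id :collect-segments
 :window/task :identity
 :window/type :fixed
 :window/aggregation :onyx.windowing.aggregation/conj
 :window/window-key :event-time
 :window/range [1 :hour]}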

leonfs10:04:51

cool.. then I would add a GitHub issue to change the window-id key to window, as window-id is not part of the state event

leonfs10:04:33

or remove it as there is already a binding for window

lucasbradstreet10:04:07

Please do - that’s a vestige of our old version. Now that we include the window you can just look it up there

leonfs10:04:22

cool.. thanks for the help lucas..

lucasbradstreet10:04:28

You’re welcome

leonfs10:04:15

the other thing I’m trying to understand is..

leonfs10:04:16

;; When set to true, if any "extent", an instance of a window, fires, all
;; extents will fire. This is desirable behavior for segment-based triggers,
;; so we turn it on.
:trigger/fire-all-extents? true

leonfs10:04:39

why would it be a desirable behavior?

lucasbradstreet10:04:03

Without it you will only see the window extent associated with your event fire, e.g. the last segment may end up in the 10:00-10:05 bucket, which would be the only extent that fires when that trigger is hit.

lucasbradstreet10:04:36

Since we might want to flush all of our window buckets when 10 segments come through, we need that option turned on so that all of our buckets will have the trigger sync fn called on them

lucasbradstreet10:04:47

not just the one matching the last event
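Concretely, a sketch of the same trigger entry with the flag set; without it only the extent owning the threshold-crossing segment is synced, with it every bucket's sync fn runs:

{:trigger/window-id :collect-segments
 :trigger/on :onyx.triggers/segment
 :trigger/threshold [10 :elements]
 :trigger/fire-all-extents? true
 :trigger/sync ::deliver-promise!}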

leonfs10:04:36

thanks Lucas.. I’ve been playing with the threshold and the input vector to test what you just mentioned

leonfs10:04:42

tell me if I’m wrong..

leonfs10:04:06

but for the given input, if the threshold is set to 2 elements and fire-all-extents to false.. shouldn’t (println "Event type was" event-type "on" (.format (java.text.SimpleDateFormat. "h") (java.util.Date. upper-bound))) print 4 window buckets instead of 3…

leonfs10:04:12

according to what you mentioned

leonfs10:04:42

I always get 3 (window buckets)

leonfs10:04:56

these are the first 2 elements of the input

leonfs10:04:57

{:event-id 1 :event-time #inst "2015-11-20T03:15:00.000-00:00" :page-visited "/"}
{:event-id 2 :event-time #inst "2015-11-20T04:45:00.000-00:00" :page-visited "/login"}

lucasbradstreet11:04:10

On first look, that does appear very odd.

lucasbradstreet11:04:28

I’m just about to have dinner, I’ll give it a better look when I’ve finished

leonfs11:04:49

thanks Lucas..

leonfs11:04:00

I’ll go for Lunch.. talk to you in a bit..

leonfs16:04:22

Another question here guys.. Usually in the windowing examples the window points to a task that references an identity function. If, for example, the function modified the segment, would the window receive the transformed segment?

leonfs16:04:19

what would the downstream task on the workflow actually receive?

gardnervickers16:04:32

So windowing cannot currently emit window state downstream. Downstream would just receive the segment transformed by your task function - in the case of the windowing examples, transformed by identity

gardnervickers16:04:59

The only way to get data out of your aggregations/windows is through the trigger :sync

gardnervickers16:04:51

@leonfs Changing this behavior to allow multiple serial aggregations is a focus right now

leonfs16:04:37

so the only way to achieve multiple serial aggregations NOW would be to push the window state to another input source from :sync, right?

gardnervickers16:04:53

something durable preferably
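A minimal sketch of that pattern; send-to-queue! is a hypothetical stand-in for a real durable producer call (SQS, Kafka, etc.):

(defn send-to-queue! [bounds state]
  ;; hypothetical: replace with your durable producer (SQS, Kafka, ...)
  (println "would publish" bounds (vec state)))

(defn durable-sync! [event window trigger {:keys [lower-bound upper-bound]} state]
  ;; Push each fired extent's state somewhere durable so a second job
  ;; can read it back as an input source.
  (send-to-queue! [lower-bound upper-bound] state))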

leonfs16:04:20

do you have a link to a Roadmap for the next versions?

gardnervickers16:04:26

Some folks are using Amazon SQS for it right now and that’s working well, Kafka would be another good choice (and is similar to how Spark works)

gardnervickers16:04:17

That particular issue is being tracked here https://github.com/onyx-platform/onyx/issues/552 . Development is moving swiftly on it so I’m not sure that keeping the issue up to date has been a priority.

gardnervickers16:04:59

We do not currently have a roadmap but the need for one has been brought up.

leonfs16:04:49

cool.. thanks gardner.. I will take a look at the issue.. it’s a fascinating field.. Hopefully once I’m up to date with the inner workings of Onyx I will be able to help/contribute to the project..

gardnervickers16:04:11

Awesome, seems like you’re moving quickly