#onyx
2016-06-03
mike_ananev12:06:18

Hello, onyx team! Is there any example how to join two datasets using Onyx?

michaeldrogalis15:06:49

@mike1452: Hi! We don't have an example handy for that since there are a lot of variations on that problem. Onyx is currently optimized for streaming joins by using the windowing functionality. State is recorded durably using an incremental snapshot approach, similar to Samza. We're coming out with a swappable module soon that does value-based recording, which is better for batch joins - similar to Spark and Flink.

michaeldrogalis15:06:44

Interestingly, the incremental snapshot approach is way, way harder to get right than the value recording approach, so we're in good shape to knock that one out quick and support both.

joshg19:06:44

In general, is it better practice with onyx to have many smaller, finite tasks or fewer longer-running tasks?

gardnervickers19:06:46

The length you’re running tasks for is entirely up to your job

joshg19:06:22

I’m looking at a messaging application. Each message has windows and may expand into hundreds of thousands of recipients, each of which needs template rendering, metrics, etc. Trying to decide whether each message is a task or to have one task handling all messages (which gives us less flexibility).

gardnervickers19:06:59

Not sure what you mean by each message is a task

joshg19:06:37

sorry, a job, not a task

gardnervickers19:06:51

So tasks are the individual functions in a “Job”. In the “Job” [[:in :increment] [:increment :out]], the tasks are :in, :increment, and :out
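
For readers following along, the job/task distinction above might be written out roughly like this. It is only a sketch: the workflow vector is the one from the message, while the catalog entry, the `my.app/increment-segment` function name, and the batch size are hypothetical values chosen for illustration.

```clojure
;; A workflow is plain data: a vector of [from-task to-task] edges.
;; Together these edges describe one job with three tasks.
(def workflow
  [[:in :increment]
   [:increment :out]])

;; Each task named in the workflow gets a catalog entry.
;; Hypothetical entry for the :increment task; the function
;; my.app/increment-segment is an assumed name, not from the chat.
(def increment-task
  {:onyx/name :increment
   :onyx/fn :my.app/increment-segment
   :onyx/type :function
   :onyx/batch-size 20})
```

So one job here contains three tasks, and a stream of segments flows through all of them for as long as the job runs.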

joshg19:06:07

my apologies, I got the terminology mixed up

gardnervickers19:06:05

No worries! But yea you don’t want to be running thousands of jobs.

joshg19:06:21

Thanks, that’s what I thought, but wanted to confirm.

gardnervickers19:06:15

There are certainly ways to keep the flexibility you’re after without making numerous different jobs.

joshg19:06:17

To clarify, the question was whether to run one job per message (which expands into thousands of emails, push notifications, etc.) or one job, period, which deals with an infinite stream of messages.

gardnervickers19:06:35

Yes, you want the second in every case

joshg19:06:46

gotcha, that makes sense