#onyx
2016-11-25
jasonbell09:11:57

I’m presenting an Introduction to Stream Processing with Kafka and Onyx at ClojureX next week.

yonatanel10:11:07

@jasonbell Will it be recorded?

jasonbell10:11:25

I think so, not 100% sure.

akiel13:11:40

What is the best way to just discard segments in output? In my workflow, I’m only interested in aggregations which I sync with a trigger. Currently I have an output function.

lucasbradstreet13:11:59

You can return an empty vector from your output onyx/fn, or you can use flow conditions to avoid them being sent down at all.
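
A minimal sketch of both options, in Clojure. The task names `:aggregate` and `:out`, the `my.job` namespace in the predicate keyword, and both function names are hypothetical. The flow-condition keys and the four-argument predicate shape are assumed from the Onyx flow conditions documentation and should be checked against the version in use.

```clojure
;; Option 1: an :onyx/fn that emits nothing. Returning an empty vector
;; means no segments are produced for the incoming segment.
(defn discard-segment [segment]
  [])

;; Option 2: a flow-condition predicate that never matches, so no segment
;; is routed along the :aggregate -> :out edge.
(defn never-forward? [event old-segment new-segment all-new-segments]
  false)

;; Assumes never-forward? lives in a namespace called my.job.
(def flow-conditions
  [{:flow/from :aggregate
    :flow/to [:out]
    :flow/predicate :my.job/never-forward?}])
```

With option 2 the segments never reach the output task, which is probably the cheaper choice when only the trigger output matters.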

akiel13:11:59

And the segments will be ack’ed if they go into the output function? I ask because my job does not finish in some situations.

akiel15:11:44

My problem was the default :onyx.messaging/inbound-buffer-size of 20,000. I need more like 1 million, otherwise the job just stalls. Would be good to add this to the FAQ.
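
As a rough sketch, the setting goes into the peer config map that the peers are started with. The base map below is a placeholder (a real peer config needs more keys than shown), and the value of 1,000,000 is simply the figure mentioned above, not a recommendation.

```clojure
;; Placeholder base peer config; other required keys are left out here.
(def base-peer-config
  {:zookeeper/address "127.0.0.1:2181"})

(def peer-config
  (assoc base-peer-config
         ;; The default reported above is 20,000; raised for a very large fan-out.
         :onyx.messaging/inbound-buffer-size 1000000))
```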

michaeldrogalis17:11:09

@akiel Can you explain how you went about crafting your job? Usually that knob doesn’t need to be touched.

akiel17:11:36

My workflow is a bit unusual. I start with a seq input of 7 segments. The first task extracts about 100 segments from each input segment. The second task is 1:1 on segments. Now the third task blows up each segment up to 1 million times. After that I group by a function and aggregate over a global window. A trigger writes out the aggregates.
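
The shape described above can be sketched as an Onyx workflow vector, together with a back-of-the-envelope upper bound on how many segments reach the aggregating task. All task names are hypothetical, and the fan-out numbers are the approximate figures from the message.

```clojure
;; Hypothetical task names; each pair is a [from to] edge.
(def workflow
  [[:in :extract]
   [:extract :transform]
   [:transform :expand]
   [:expand :aggregate]
   [:aggregate :out]])

;; Segments emitted per incoming segment (for :in, the number of input segments).
(def fan-out
  {:in 7              ; segments read from the input seq
   :extract 100       ; roughly 100 segments per input segment
   :transform 1       ; 1:1
   :expand 1000000})  ; upper bound per incoming segment

;; Upper bound on segments arriving at :aggregate:
;; 7 * 100 * 1 * 1000000 = 700000000
(def max-segments-into-aggregate
  (reduce * (vals fan-out)))
```

That upper bound is several orders of magnitude larger than a 20,000-element inbound buffer, which fits the stall reported with the default setting.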

michaeldrogalis17:11:10

@akiel You’re using almost all the features. 🙂 Cool.

michaeldrogalis17:11:30

Yeah, very large fan outs are rare, but it makes sense for your situation there. What domain are you working in?

akiel17:11:48

A medical study.

akiel17:11:22

It’s an ETL job.

michaeldrogalis17:11:45

Nice, that’s pretty cool.

akiel17:11:57

@michaeldrogalis My window aggregation is idempotent, so I don’t need exactly-once aggregation updates. Can I configure my job so that I don’t need BookKeeper? I already use :onyx/deduplicate? false and have no :onyx/uniqueness-key set.

michaeldrogalis18:11:37

@akiel It can tolerate a replay?

akiel18:11:03

Yes I think so.

akiel18:11:44

The replay is always from the input - right?

michaeldrogalis18:11:02

BK can’t be turned off when using aggregations. Most aggregations can’t gain exactly once semantics without the extra help.

michaeldrogalis18:11:24

How do you prevent aggregating a duplicate segment?

akiel18:11:58

My aggregation is done with assoc-in, so it’s really idempotent.

akiel18:11:50

I need at-least-once.

michaeldrogalis18:11:57

Got’cha, that would do it. Is it feasible for you to write a lifecycle that maintains an atom to write your state into? You can periodically put that on storage.

michaeldrogalis18:11:26

Since assoc-in is friendly to replays, you can actually get away without aggregations-proper in Onyx altogether and not use BK.
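
A small plain-Clojure sketch of why this works: applying the same `assoc-in` twice leaves the state unchanged, so a replayed segment does no harm. The segment shape (`:path`, `:value`), the example keys, and the function name are hypothetical, and the Onyx lifecycle wiring that would own the atom and flush it to storage is left out.

```clojure
;; State that a lifecycle could hold and periodically write to storage.
(def state (atom {}))

;; Hypothetical segment shape: :path says where the value goes, :value is the datum.
(defn record-segment! [state {:keys [path value]}]
  (swap! state assoc-in path value))

(def segment {:path [:subject-1 :visit-2 :weight] :value 72.5})

;; First delivery of the segment.
(record-segment! state segment)
(def after-first @state)

;; Simulate a replay delivering the same segment again.
(record-segment! state segment)
(def after-replay @state)

;; The replay leaves the state unchanged.
(= after-first after-replay)
```

This only holds because each segment fully determines both its path and its value; an aggregation such as a running sum or count would not tolerate the replay.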

akiel18:11:59

I had the atom solution before, but I hit problems and switched to the window aggregation. In the end, my problems were likely solved by the inbound-buffer-size config, so I may go back to the atom solution.

michaeldrogalis18:11:07

What were the issues that you came across?

akiel18:11:29

The thing with the large fan-out and the too-small inbound-buffer-size.

michaeldrogalis18:11:30

Ah, right’o. Seems like you’re on the right track then.