onyx 2016-04-05 | Slack Archive

rasom07:04:50

Hi! I’m looking at onyx-redis, and I found :onyx/plugin :onyx.plugin.redis/reader in the README but didn’t find onyx.plugin.redis/reader itself in the code. Is that OK? Because it looks like it doesn’t work...

gardnervickers10:04:54

@rasom: Apologies, that should not exist

gardnervickers10:04:03

I’ll fix that, thanks for catching it

rasom10:04:54

So, do you mean that onyx-redis will not support any kind of input tasks from redis?

gardnervickers10:04:30

We removed the redis reader because it did not allow for the safe checkpointing that other readers support

gardnervickers10:04:46

You can still use onyx-seq or the core-async plugin to read from redis and inject segments that way

gardnervickers10:04:23

We did not want folks to get confused on the safety guarantees offered by the redis plugin.

rasom10:04:33

ok, i see, thanks

gardnervickers10:04:26

If you need to read segments from redis I can help out with that a little later. I’d love to include a section on that in the redis docs for now, until we can get around to making the input safe.

gardnervickers10:04:52

Stepping out right now though

rasom10:04:13

https://github.com/onyx-platform/onyx-kafka here is another problem with the README.md: [org.onyxplatform/onyx-kafka "0.9.0.0-alpha11"] doesn’t exist on Clojars

gardnervickers11:04:51

@rasom thank you for the help, I went through the rest of the plugins for the 0.9.x branches and they should be good. Please let me know if you see any other discrepancies, seems like there was a problem with the release scripts this time around.

rasom11:04:11

ok, thanks

jasonbell13:04:02

Apologies for the big exception dump, but I’m having some trouble with a Kafka input stream with fixed windows.

jasonbell13:04:35

I’m assuming it’s not finding Zookeeper though it’s running locally.

org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /onyx/a654f33d-261f-47f6-906d-f58f4a6f6c79/ledgers/available/readonly
    code: -101
    path: "/onyx/a654f33d-261f-47f6-906d-f58f4a6f6c79/ledgers/available/readonly"

gardnervickers13:04:15

Unfortunately this is a problem with BookKeeper, if you restart a job quickly BK will not ignore/overwrite the previous ZooKeeper node.

jasonbell13:04:44

Is there a preferred workaround?

gardnervickers13:04:21

The only real workaround is to wait for the ephemeral node to expire

jasonbell13:04:34

ok, thanks for the quick response, appreciate it

gardnervickers13:04:40

Unfortunately that’s not a super great workaround 😕

gardnervickers13:04:14

I usually develop against an in-memory zookeeper instance and just teardown/setup when I’m iterating

gardnervickers13:04:38

see the kafka tests https://github.com/onyx-platform/onyx-kafka/blob/0.9.x/test/onyx/plugin/input_test.clj

jasonbell13:04:06

Thanks @gardnervickers - it’s all helpful. I’ll take a look. Thanks for your time.

gardnervickers13:04:09

The problem is on our radar and being worked on

gardnervickers13:04:21

Thanks!

leonfs16:04:08

@michaeldrogalis: The dataflow model supports a refinement mode called: “Accumulating & retracting”, is this something Onyx supports?

michaeldrogalis16:04:40

@leonfs: Not yet. We have an open ticket for that one. The implementation is complicated.

michaeldrogalis16:04:16

I know the DataFlow model had it in the paper since the beginning, but when did Retraction support land in Google DataFlow itself? It wasn't there for quite a bit.

leonfs16:04:38

@michaeldrogalis: I truly don’t know. I’ve been reading the paper and I struggle to understand the motive for retractions. So I thought maybe it’s better explained in the Onyx documentation, only to find out later that it has not been implemented.

gardnervickers16:04:20

You would use retractions if you wanted to answer the question, “What’s the highest temperature read by the sensors over 10 minute windows?”

gardnervickers16:04:35

Where data can arrive arbitrarily late

michaeldrogalis16:04:05

@leonfs: Think of it like this. Say you have an event stream ingesting vote counts from an election. You get a message that says 10 votes for Bernie Sanders, so you add that to your running total. Then later in the day, the person who sent the message realizes they screwed up, so they send a new message that says Bernie gets 12 votes. You need to retract the previous assertion of "10 votes" from your count, reapply an assertion of "12 votes", and then (here's the kicker) retract and reassert all downstream state updates that depend on the counter change.

michaeldrogalis16:04:27

It basically covers scenarios where upstream producers either have an "oops", "nevermind, I changed my mind", or "I learned something new about what I previously told you" case.
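
A minimal Python sketch of the vote-count scenario described above, added for illustration only. It is not Onyx code: the VoteTally class, the message ids, and "candidate-b" are hypothetical names, and the "downstream state" is reduced to a single derived value (the current leader) so the retract/reassert cascade stays visible.

```python
from collections import defaultdict


class VoteTally:
    """Running totals plus one piece of downstream state (the leader)."""

    def __init__(self):
        self.asserted = {}              # message id -> (candidate, votes)
        self.totals = defaultdict(int)  # candidate -> running total
        self.leader = None              # downstream state derived from totals
        self.downstream = []            # retractions/assertions sent downstream

    def _refresh_leader(self):
        # Recompute the derived value; if it changed, retract the old
        # assertion before asserting the new one.
        nonzero = {c: v for c, v in self.totals.items() if v > 0}
        new_leader = max(nonzero, key=nonzero.get) if nonzero else None
        if new_leader != self.leader:
            if self.leader is not None:
                self.downstream.append(("retract", self.leader))
            if new_leader is not None:
                self.downstream.append(("assert", new_leader))
            self.leader = new_leader

    def report(self, message_id, candidate, votes):
        # A repeated message id is a correction: retract the earlier
        # assertion from the total, then apply the new one.
        if message_id in self.asserted:
            old_candidate, old_votes = self.asserted[message_id]
            self.totals[old_candidate] -= old_votes
        self.asserted[message_id] = (candidate, votes)
        self.totals[candidate] += votes
        self._refresh_leader()


tally = VoteTally()
tally.report("msg-1", "Sanders", 10)
tally.report("msg-2", "candidate-b", 11)
tally.report("msg-1", "Sanders", 12)  # the "oops, it was 12" correction
```

After the correction, the total for Sanders is 12 rather than 22, and the downstream consumer sees the old leader retracted before the new one is asserted.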

leonfs16:04:59

Cool.. thanks for the examples it is clearer now..

michaeldrogalis16:04:29

No prob.

leonfs16:04:59

Do you have a conceptual idea on how you could implement it?

michaeldrogalis16:04:01

I haven't done any design work for it yet. I think @gardnervickers may have had a think about it though?

gardnervickers16:04:25

@leonfs: I wrote this up a while ago to check my understanding, not sure how useful it will be https://gist.github.com/gardnervickers/cdc48ce5762ac5819384

leonfs16:04:26

@gardnervickers: great.. I will definitely have a look at your notes..

gardnervickers16:04:15

But retractions are tricky. Technically Onyx does support them currently if you specify a custom refinement, but that’s just because Onyx does not yet support serial aggregations.

gardnervickers16:04:58

retractions are not really interesting until you can have multiple serial aggregations though

gardnervickers16:04:18

https://github.com/onyx-platform/onyx/blob/0.9.x/src/onyx/refinements.clj

gardnervickers16:04:31

This is a poor solution though because whatever you come up with would have to be application specific. If you’re making a vote counter, your retraction would be a segment that’s -1 instead of +1. If you’re making a temperature data ranking system, a retraction would be dissoc’ing the temperature value.
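
To illustrate why such retractions end up application specific, here is a rough Python sketch of the two cases mentioned above. The segment shapes ("delta", "retract", "sensor") and the function names are assumptions made up for this example, not Onyx's refinement API.

```python
from functools import reduce


def apply_vote(state, segment):
    # Vote counter: a retraction is just a segment carrying -1 instead of +1.
    new_state = dict(state)
    candidate = segment["candidate"]
    new_state[candidate] = new_state.get(candidate, 0) + segment["delta"]
    return new_state


def apply_temperature(state, segment):
    # Temperature ranking: a retraction removes the sensor's value entirely
    # (the equivalent of dissoc on a Clojure map).
    new_state = dict(state)
    if segment.get("retract"):
        new_state.pop(segment["sensor"], None)
    else:
        new_state[segment["sensor"]] = segment["temperature"]
    return new_state


votes = reduce(apply_vote, [
    {"candidate": "a", "delta": +1},
    {"candidate": "a", "delta": +1},
    {"candidate": "a", "delta": -1},  # retraction of one vote
], {})

temperatures = reduce(apply_temperature, [
    {"sensor": "s1", "temperature": 21.5},
    {"sensor": "s2", "temperature": 30.0},
    {"sensor": "s2", "retract": True},  # retraction of s2's reading
], {})
```

The two update functions share nothing beyond their shape, which is the point being made: each aggregation needs its own notion of what "undo" means.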

leonfs16:04:00

yeap, the paper talks about persisting the emitted value

leonfs16:04:23

on the next trigger, before submitting the new value, emit the persisted old value

gardnervickers16:04:29

Exactly

gardnervickers16:04:38

they rewind essentially

gardnervickers16:04:32

You would draw a line in time and say “any data before T, I don’t care about, it’s too old; any data after T gets re-incorporated into the window it’s supposed to be in”. This would put the burden on Onyx to maintain all after-T segments at the first aggregation

gardnervickers16:04:25

I believe T is the watermark in their paper
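
The idea discussed above (persist the last emitted value, emit it as a retraction before the new value, and ignore data older than T) can be sketched roughly as follows, using the sensor temperature question as the aggregation. This is only an illustration under stated assumptions: the RetractingMax class is hypothetical, a trigger is assumed to fire on every accepted reading, and the cutoff T is passed in by the caller rather than derived from a real watermark.

```python
WINDOW_SECONDS = 600  # 10-minute fixed windows


class RetractingMax:
    """Max temperature per window, with a retraction before each update."""

    def __init__(self):
        self.readings = {}      # window start -> temperatures seen so far
        self.last_emitted = {}  # window start -> last value sent downstream

    def add(self, event_time, temperature, cutoff):
        """Accept one reading and return the outputs it triggers."""
        if event_time < cutoff:
            return []  # older than T: too late, ignored
        window = event_time - event_time % WINDOW_SECONDS
        self.readings.setdefault(window, []).append(temperature)
        new_max = max(self.readings[window])
        old_max = self.last_emitted.get(window)
        if old_max == new_max:
            return []  # nothing changed downstream
        outputs = []
        if old_max is not None:
            # Emit the persisted old value as a retraction first.
            outputs.append(("retract", window, old_max))
        outputs.append(("emit", window, new_max))
        self.last_emitted[window] = new_max
        return outputs


agg = RetractingMax()
first = agg.add(30, 20.5, cutoff=0)     # on-time reading in window 0
second = agg.add(700, 18.0, cutoff=0)   # reading in window 600
late = agg.add(100, 25.0, cutoff=0)     # late reading for window 0
too_old = agg.add(50, 99.0, cutoff=60)  # before T, dropped
```

Note that the sketch has to keep every reading for every window newer than T, which is the storage burden mentioned above.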

ymilky18:04:44

I've also never found many cases where I needed retractions, but that's primarily because for a long time I've always operated on the principle of using immutable logs. So in the case of votes, fixing the mistake itself is an event in the stream, and thus the fix has its own t associated with it, along with events that contain the state to decrement the votes. Mistakes and changes in my experience are always a part of processing any data stream, so it's just best to treat them as pure data - i.e. roll forward the corrections, changes, whatever. A system that can't do that is at minimum at risk when you upgrade to say a new data format or conversion formula anyway. It's useful to know when that stuff happened rather than mutating anything directly as the mistakes are often useful information. No special mechanism is generally needed if it's just a matter of running something that streams the same values and applying some function that takes data (ex: deltas). The only gotcha I've found is avoiding getting yourself into situations where you cannot replay a stream in the same manner (ex: not idempotent, bad side effects, etc), but that's usually a sign of bad design.

gardnervickers18:04:47

Retraction support of the type expressed in the data flow paper is necessitated by unbounded streams that cannot guarantee event-time ordering.

gardnervickers18:04:09

Even with a log you still need to use retractions for certain types of queries

ymilky21:04:24

I'm not saying they aren't needed, just they aren't needed often

ymilky21:04:12

well unless all your streams are as you describe of course, but in practice I haven't seen it much unless there's some awful design or someone overcomplicating things

ymilky21:04:41

if you can make it work, it certainly is elegant, but not easy to solve

gardnervickers21:04:59

Got some coffee and thought about this some more, I think we’re talking about two different things. You need the ability to assert retractions for certain types of queries no matter your design. With CQRS/ES (immutable log processing) you’re still asserting your retractions, just instead of being persisted across the network it’s sitting in Kafka. If you distribute your processing you’ll hit the same complications that Onyx is going to hit.

ymilky22:04:59

I think we are talking about different things. I'm lost with regards to the CQRS/ES + Kafka. With CQRS/ES, I'm not sure how Kafka fits in to the issue of retractions. I do use Kafka in CQRS/ES, but as a command/event queue. Actual events are written to an event store. If say 10 fake votes were cast that shouldn't count, those 10 presumably will have already been processed. Once you emit an event in CQRS, that's it, you cannot ever reject/retract it, that should be done by the command handler or before the command is issued. Once the command is accepted, in most implementations that means an event was/will be emitted. If the event is sitting in Kafka, it needs to be guaranteed to be written. Kafka is only there to preserve order and to resist crashes before an event is written to the store. You could theoretically reject the event before it is written to the event store point by dropping it on the floor if you knew those 10 votes were fake and what events they correspond to, but that's in my experience a bad idea. Instead, you'd simply issue a new command to remove 10 fake events, which would emit events that subtract those events and probably also emit fake vote events so you know what happened. That way you see there was "fraud" potentially and take further action if necessary. The hard part and where formalized stuff comes in is exactly as implied, order, but if your data is already out of order, writing it to Kafka has no impact unless you plan on somehow windowing and emitting the results again in order.

gardnervickers22:04:18

> Instead, you'd simply issue a new command to remove 10 fake events, which would emit events that subtract those events and probably also emit fake vote events so you know what happened.

gardnervickers22:04:25

You issued a retraction 😉

ymilky22:04:32

yes

ymilky22:04:54

writing more, but I'm differentiating between a stream processor doing complex retraction at a low level and math-like retraction

ymilky22:04:14

that's my point, not that "no retractions" should ever happen. You need and must have basic retractions

ymilky22:04:43

Where I'm saying it goes wrong is where people add in tons of overhead because of that in-between time of the retraction data vs. the current data

ymilky22:04:01

that is, they want to either drop the data in the pipeline on the floor or have some kind of complicated rewind for example

ymilky22:04:18

I've had experience with this at places like eBay/PayPal, banks, financial systems, ads, airplane data, and government services. Most of these domains are classics for retractions. The funny thing is that 99% of the time when someone said hey we need the system to do xyz complicated thing, you could do it by just using some simple math/state to set things right. I'll give the example of ads because for whatever reason ad people seem obsessed with doing things like this. Is the cost of 10 seconds of missed ad-revenue worth the ad system going 50% slower? Of course not, you make more money even if it's crap for a few seconds than you do trying to make something perfect.

ymilky22:04:14

We saw this all the time in financial systems as well. The simple case almost always wins out for business benefit even if on a technical level it makes people afraid something is wrong. When you add low-level complex logic and things in there to fix otherwise easily fixable things, you end up adding in other problems, usually the bad ones you don't know about

ymilky22:04:05

The only system I've ever worked on that needed something super crazy for sure was related to fighter jet data because accuracy was everything

ymilky23:04:59

another example is a colleague of mine who was at eBay/PayPal, where they had such a complex pipeline that all the technical magic they were working on really pointed to their pipeline sucking and needing to be simplified/broken up, again eliminating a lot of complex retraction-related lower-level logic

ymilky23:04:42

in that case the pipeline would take too long and get into weird states, so the solution was more about making it run faster with more discrete roles, no more restarts ever

gardnervickers23:04:31

Two things. 1. Take the scenario where you’re counting votes over 10 minute spans, splitting those votes by district, and getting an average income. You realize 5 of those votes are fake. How do you fix the state of the system?

gardnervickers23:04:17

Each task would have to know how to emit its own retraction; that’s a ton of custom logic multiplied by every Onyx user

gardnervickers23:04:36

2. It is true that for domain problems, your approach is simpler, but so is running everything on a giant EC2 instance. As a framework Onyx needs to be able to handle this by keeping track of what segments generated what downstream state.
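
One way to picture "keeping track of what segments generated what downstream state" for the votes-by-district scenario above is the rough Python sketch below. It is an assumption-laden illustration, not a description of how Onyx does or will implement this: the DistrictIncome class, the vote ids, and the income figures are all invented for the example.

```python
from collections import defaultdict

WINDOW_SECONDS = 600  # 10-minute spans


class DistrictIncome:
    """Average income per (window, district), with per-vote lineage."""

    def __init__(self):
        self.members = defaultdict(dict)  # (window, district) -> {vote id: income}
        self.index = {}                   # vote id -> (window, district)

    def add(self, vote_id, event_time, district, income):
        key = (event_time // WINDOW_SECONDS, district)
        self.members[key][vote_id] = income
        self.index[vote_id] = key  # remember which state this vote fed

    def retract(self, vote_ids):
        """Remove votes and return the aggregates that must be re-emitted."""
        touched = set()
        for vote_id in vote_ids:
            key = self.index.pop(vote_id, None)
            if key is not None:
                del self.members[key][vote_id]
                touched.add(key)
        return touched

    def average(self, key):
        incomes = self.members[key].values()
        return sum(incomes) / len(incomes) if incomes else None


state = DistrictIncome()
state.add("v1", 10, "north", 30000)
state.add("v2", 20, "north", 50000)
state.add("v3", 30, "north", 1000000)  # later found to be fake
before = state.average((0, "north"))
touched = state.retract(["v3"])
after = state.average((0, "north"))
```

Because the index records which aggregate each vote contributed to, retracting a fake vote only requires naming its id; the task itself needs no custom "undo" logic, at the cost of storing every contributing segment.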
