This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2017-11-09
Hmmm, ok. Maybe I've been thinking about it wrong. Should I, in general, always assume the possibility of duplicated state getting emitted? Or is this a special case?
Well, as a separate point, we can’t guarantee exactly-once side effects since that’s impossible, so in that respect you should assume the possibility of sync being called more than once on the same state.
Ok. In that case it’s certainly possible for it to emit more than once if a node fails and it has to recover.
I would also assume that it may be called more than once when the job is completed, because you generally need that final seal to flush anything new that didn’t get triggered by the number of elements
You could turn that off but you may not get the final state written
Ok, good to know. I still have a lot to learn about what's going on behind the scenes. Thanks for your help.
No worries
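The advice above - assume sync can run more than once on the same state - is usually handled by making the write idempotent. A minimal plain-Clojure sketch of that idea (the `store`/`sync-state!` names are hypothetical, not Onyx API; keying by epoch is one common scheme):

```clojure
;; Hypothetical sketch: key each state write by its epoch, so a repeated
;; sync for the same epoch overwrites instead of duplicating.
(def store (atom {}))                       ; stand-in for an external store

(defn sync-state!
  "Write `state` under `epoch`. Calling this twice with the same epoch
  is harmless: the second write simply replaces the first."
  [epoch state]
  (swap! store assoc epoch state))

;; A retried sync after a failure produces the same stored result:
(sync-state! 7 {:count 42})
(sync-state! 7 {:count 42})                 ; duplicate call, same outcome
```

With this shape, a node failure that replays the same epoch's sync leaves the store unchanged rather than emitting a second copy.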
I'm just afraid this stuff is a little bit too deep inside for me to comprehend what exactly is wrong 🙂
Hi, sorry I didn’t get a chance to look at it yet. Lemme fire it up.
It stops writing in that test?
I mean calling write-batch
yeah. So the test is that for one of the requests I want to go further than retry-params allows, which should generate an exception
OK, did the job get killed?
I guess if the job got killed it would have finished with feedback-exception!
also, I figured out that Thread/sleep seems to work properly - when a few requests try to back off, they don't mess with each other, which is what we need
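The backoff behaviour being described - each request sleeping on its own thread without delaying the others - can be sketched in plain Clojure. `with-backoff` and its options are hypothetical helpers for illustration, not part of Onyx:

```clojure
;; Hypothetical per-request backoff: Thread/sleep only blocks the thread
;; making the call, so concurrent requests back off independently.
(defn with-backoff
  "Call `f`; on exception, retry up to `max-retries` times, sleeping
  (* base-ms 2^attempt) ms between attempts. Rethrows when exhausted."
  [f {:keys [max-retries base-ms] :or {max-retries 3 base-ms 10}}]
  (loop [attempt 0]
    (let [outcome (try {:ok (f)}
                       (catch Exception e
                         (if (< attempt max-retries)
                           {:retry e}
                           (throw e))))]
      (if (contains? outcome :ok)
        (:ok outcome)
        (do (Thread/sleep (long (* base-ms (Math/pow 2 attempt))))
            (recur (inc attempt)))))))

;; A request that fails twice before succeeding:
(def calls (atom 0))
(def result
  (with-backoff #(if (< (swap! calls inc) 3)
                   (throw (ex-info "transient failure" {}))
                   :done)
    {:max-retries 5 :base-ms 1}))
result ;; => :done
```

Running several of these inside separate futures shows the same property: one request sleeping through its backoff never blocks the others.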
hmm, shouldn’t have to. What do you want to happen when the exception happens?
which would make it go to a handle-exception?
Just checking
yeah, if I call ack-fn, the job finishes, but the exception is never propagated... I wonder if there is a better place for propagation than throwing it in write-batch?
oh wow, but I get this in my onyx.log:
clojure.lang.ExceptionInfo: Handling uncaught exception thrown inside task lifecycle :lifecycle/write-batch. Killing the job. -> Exception type: clojure.lang.ExceptionInfo. Exception message: HTTP request failed!
something weird is going on
for sure
I think I know what’s happening.
OK, so we set up all of our async calls in write-batch, and progress to the next state because we returned true
At some point we receive a barrier, and we have already made all the write-batch calls so all we have to do is wait for the sync
so onyx is going to just keep calling sync until it returns true
so we’re stuck in a situation where we never return true, and because we only check for an exception in write-batch it never gets thrown
so what's the right way to deal with that? do not return true in write-batch until all async requests are finished?
You could do that, or you could just check for failures in sync and completed too
do I need to track somehow that I throw an exception for correct batch or something?
Onyx appears to be propagating the exceptions properly, it just never got the chance to throw one.
Not really since it’ll cause it to rewind to the last barrier
so everything since the last barrier will be retried anyway
This is probably the best you’ve currently got http://www.onyxplatform.org/docs/user-guide/0.12.x/#_asynchronous_barrier_snapshotting
Not really since the best you can get is that the peer will be rebooted
so everything will be re-initialized anyway
Actually you should probably reset it
Because maybe you get some old exceptions from the previous recovery
and you end up throwing an exception even though none of the messages since the recovery should have caused it.
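The pattern worked out above - record async failures, rethrow them from every lifecycle check (not just write-batch), and clear the recorded failure on recovery so a stale exception from before the rewind isn't rethrown - can be sketched with a plain atom. The function names here are illustrative, not the actual Onyx plugin protocol:

```clojure
;; Illustrative sketch (not the real Onyx plugin protocol): async write
;; callbacks record their failure; every lifecycle check rethrows it.
(def async-exception (atom nil))

(defn record-failure!
  "Called from an async callback when a request fails."
  [e]
  (reset! async-exception e))

(defn check-failure!
  "Throw any exception captured by an async callback. Call this from
  write-batch, synced?, and completed? so a failure that arrives after
  write-batch already returned true still kills the job."
  []
  (when-let [e @async-exception]
    (throw e)))

(defn on-recover!
  "On recovery the peer rewinds to the last barrier, so clear any
  exception left over from before the rewind."
  []
  (reset! async-exception nil))

;; An async callback fails; a later synced?-style check would surface it,
;; but after recovery the stale failure is cleared:
(record-failure! (ex-info "HTTP request failed!" {}))
(on-recover!)
(check-failure!) ;; => nil, stale failure was cleared
```

Without the `on-recover!` reset, a failure recorded before the rewind could kill the job even though every message since the recovery succeeded.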
by the way, it seems I was getting the same result with the propagated exception if I called ack-fn
so I wonder what's the better way - call ack-fn to cause write-batch, or throw the exception in synced? and completed??
okay, I've pushed it, and will clean it up and turn it into a single commit a little bit later today
@lucasbradstreet do you think it's necessary to implement a task for batch output? I've never used that and can't think of use cases.
Has anyone seen this exception: Lost and regained image with the same session-id and different correlation-id? And does anyone have a hint about what goes wrong?
Here are the settings for 0.12 https://github.com/onyx-platform/onyx/blob/0.12.x/changes.md
@eelke https://clojurians-log.clojureverse.org/onyx/2017-10-20.html might be useful to quote @lucasbradstreet > Essentially what happened is the aeron messaging subscriber dropped off, but then rejoined again at a later point in the stream, which is unsafe because you could lose messages. I believed it would never happen in practice, so added that check as kind of an assert. In reality, I think there are conditions like GCs where the subscriber could be booted but be rejoined with the same session-id.
Any tips for dealing with concurrency in logs? I'm on Onyx 0.10, trying to read a peer log.
@souenzzo concurrency in what way?
There are 2 virtual peers (not sure if vpeers is the right term); both are on function A, one with data1, the other with data2.
Function A is defined as (comp C B)
My log:
start B with data 1
start B with data 2
start C with data 2
start C with data 1
...
I wanna get logs just for data1, or just for data2.
Would recommend a log aggregator. Onyx doesn't do vpeer level logging isolation.
So, you mean log messages written to onyx.log?
I'm using clojure.tools.logging... I will search about "log aggregator" and check out the onyx.log docs
If you want to be able to correlate logs to a particular peer, you could inject :onyx.core/log-prefix from the event map into one of your fn arguments. log-prefix contains all of the info about a peer/slot/task/etc
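That injection might look roughly like the following lifecycle sketch. The lifecycle shape follows Onyx's before-task-start convention, but treat the exact keys (`:onyx.core/params`, `:onyx.core/log-prefix`) as assumptions to verify against your Onyx version's docs; `inject-log-prefix` and `my-task-fn` are hypothetical names:

```clojure
;; Hedged sketch: copy the peer's log prefix from the event map into the
;; task's params so each fn call can tag its own log lines.
(defn inject-log-prefix
  "before-task-start hook: the returned map is merged into the event map,
  appending the peer's log prefix to the task's params."
  [event lifecycle]
  {:onyx.core/params (conj (vec (:onyx.core/params event))
                           (:onyx.core/log-prefix event))})

(def log-prefix-calls
  {:lifecycle/before-task-start inject-log-prefix})

;; The task fn then receives the prefix as an extra leading argument and
;; can prepend it to its own log output:
(defn my-task-fn [log-prefix segment]
  (println log-prefix "processing" segment)
  segment)
```

With the peer/slot/task info in every line, filtering the log for just one vpeer's data1 or data2 becomes a plain grep.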
Hey @lucasbradstreet -- It looks like onyx.plugin.null/in-calls is missing in 0.12.x. I see it in 0.11.x, though, and we were using it in version 0.10.x.
Ah. Those calls don’t really do anything so I took them out. We actually have some functionality coming that makes the null plugin no longer necessary.
Basically we’ve added an :onyx/type :reduce that allows you to place a :reduce task as a leaf node without adding a plugin to it.
It’s not documented yet, so we aren’t really showing it anywhere.
Yeah, they weren’t doing anything anyway, which is why I removed them.
Sorry about that.