#onyx
2015-11-11
spangler00:11:22

Actually this happens even if I have a single process running

spangler00:11:44

The tasks are just repeating over and over again

spangler00:11:51

at high speed

michaeldrogalis01:11:11

@spangler: See the docs: https://github.com/onyx-platform/onyx/blob/0.8.x/doc/user-guide/architecture-low-level-design.md Your messages are failing to complete, and are being replayed for fault tolerance.

michaeldrogalis01:11:27

Your segments are failing to be fully processed.

spangler01:11:19

So ... I see the tasks finish

spangler01:11:39

How could they fail, when the tasks themselves finish?

michaeldrogalis01:11:08

You're seeing "Stopping Task LifeCycle for blahblah" and all your data is coming back to you?

spangler01:11:11

Yes, I see all the "Stopping Task LifeCycle ..." but the :done never fires

spangler01:11:17

then they repeat over and over

michaeldrogalis01:11:59

Tasks will restart themselves if they encounter an exception. Maybe it's throwing an exception on shutdown?

michaeldrogalis01:11:17

I noticed in your catalog you set :force-reset? to true, which would cause a full reload of your Kafka stream, so that's not surprising

spangler01:11:34

Wait, so why would you ever set :force-reset? to true then?

spangler01:11:14

When does it cause a full reload?

spangler01:11:28

From the beginning of that topic?

spangler01:11:36

because I create the topic for that job

spangler01:11:32

Would it reread the entire set of messages for that topic?

spangler01:11:38

That would explain it I guess....

michaeldrogalis01:11:32

It's a helpful option if you want to reprocess the entire message stream, if, say, you had a bug in your code that caused a running aggregate to be incorrect, and you need to see the entire stream from the beginning to correct it.
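
For reference, a minimal sketch of a Kafka input catalog entry with the reset options; the key names are assumed from the onyx-kafka plugin and every value here is illustrative, not taken from the job in this conversation:

{:onyx/name :read-messages
 :onyx/plugin :onyx.plugin.kafka/read-messages
 :onyx/type :input
 :onyx/medium :kafka
 :onyx/batch-size 100
 :kafka/topic "my-topic"
 :kafka/group-id "my-consumer-group"
 :kafka/zookeeper "127.0.0.1:2181"
 :kafka/offset-reset :smallest
 ;; true ignores the saved checkpoint and replays from :kafka/offset-reset
 :kafka/force-reset? true
 :kafka/deserializer-fn :my.app.serde/deserialize-message}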

spangler01:11:29

Okay, but would it be triggered during the job?

spangler01:11:38

Like I say, the topic is created with the job

spangler01:11:10

So that's not it then

michaeldrogalis01:11:03

I suspect your job is failing to shut down cleanly, Onyx is resurrecting it, and it's starting from the beginning again, because that's what the configuration is set to do.

spangler01:11:31

Where do I detect it failing to shutdown cleanly? How can I ensure it does?

spangler01:11:42

I am not getting exceptions anywhere

spangler01:11:02

At a certain point it just starts performing the same tasks at high speed

michaeldrogalis01:11:21

I'd try stripping down your job piece by piece and finding the part that's causing the problem.

michaeldrogalis01:11:29

Ah -- one thing... Hold on a sec.

michaeldrogalis01:11:21

We gave you that SNAPSHOT release yesterday to fix the :data key problem, but we knew that under some particular conditions, it swallowed exceptions. I fixed the problem entirely yesterday and pushed up a new SNAPSHOT.

michaeldrogalis01:11:39

Move to the latest, you'll probably see your exception there. Sorry - I forgot you moved to the edge.
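
If it helps, picking up the newer snapshot is just a dependency bump in project.clj; the version string below is illustrative, not the actual release:

:dependencies [[org.onyxplatform/onyx "0.8.2-SNAPSHOT"]]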

spangler02:11:36

Okay, here we go: FATAL [onyx.messaging.aeron.publication-manager] - Aeron write from buffer error: java.lang.IllegalArgumentException: Encoded message exceeds maxMessageLength of 2097152, length=5716465

spangler02:11:51

How do I set the Aeron write buffer size?

spangler02:11:29

And thank you for not swallowing exceptions

michaeldrogalis02:11:58

Yeah, sorry about that. That was a bug in Timbre we found a while back

michaeldrogalis02:11:10

I haven't seen that error before. You shouldn't have to configure Aeron like that.

michaeldrogalis02:11:25

How big of a message are you sending?

spangler02:11:31

It says there

michaeldrogalis02:11:42

I've not tested Onyx with messages that large before. I'd look at Aeron and see what you need to configure to allow you to do that.

michaeldrogalis02:11:00

They're very particular about what they offer by default to achieve such high performance.

yusup09:11:34

Hi, is there deadlock detection within Onyx?

yusup09:11:16

I submitted two jobs. The first one is working correctly. Onyx and Kafka share the same ZooKeeper.

yusup09:11:18

After submitting the second job, I got a ZooKeeper bad version error and everything hangs.

lucasbradstreet09:11:05

No deadlock detection. That's weird that you get a zookeeper error. Are you using the same group-Id for zookeeper checkpointing?

lucasbradstreet09:11:24

You should probably use a different group-Id between the two jobs
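
A sketch of that suggestion, showing only the differing key; the key name is assumed from the onyx-kafka plugin and the values are hypothetical:

;; job 1's Kafka input entry
{:onyx/name :read-events-a
 :kafka/group-id "job-a-checkpoints"}

;; job 2's Kafka input entry
{:onyx/name :read-events-b
 :kafka/group-id "job-b-checkpoints"}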

yusup09:11:01

The first one uses Kafka. The second one does not.

lucasbradstreet09:11:55

Do you have enough peers for both jobs?

lucasbradstreet09:11:24

Is the zookeeper error showing up in the ZK logs or are they in the onyx logs?

yusup09:11:51

My VPN failed

lucasbradstreet09:11:45

Using ZK over high latency / unreliable connections can suck

yusup09:11:41

Do messages have to be cleaned up?

yusup09:11:54

If I want to reset my cluster

yusup10:11:04

Other than cleaning up

yusup10:11:15

zookeeper

lucasbradstreet10:11:55

Not usually. You should be able to rejoin on the same id and the jobs should be auto killed. Let us know if you hit any cases where they aren't

spangler21:11:56

Okay @michaeldrogalis @lucasbradstreet I have identified this error:

FATAL [onyx.messaging.aeron.publication-manager] - Aeron write from buffer error:  java.lang.IllegalArgumentException: Encoded message exceeds maxMessageLength of 2097152, length=2112492

spangler21:11:04

So I need to raise that buffer size

spangler21:11:14

But also, once it goes off, Onyx fails from that point on

spangler21:11:24

Which does not seem very fault tolerant!

spangler21:11:35

Can we catch this exception rather than failing entirely?

lucasbradstreet21:11:24

It should try to reset the publication after that point

lucasbradstreet21:11:38

Create an issue and I’ll look into whether a single big message can take it down

spangler21:11:56

Hmm... would that be why my tasks just spin forever from that point on?

spangler21:11:06

What does "reset the publication" mean?

lucasbradstreet21:11:27

What’s probably happening is that message never makes it through

lucasbradstreet21:11:35

Then after the pending-timeout it retries

lucasbradstreet21:11:42

So the job never finishes
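
For context, that retry window is configured on the input task's catalog entry; a sketch, with key names assumed from the Onyx information model and values purely illustrative:

{:onyx/name :read-messages
 :onyx/type :input
 :onyx/batch-size 100
 ;; segments not fully acked within this many ms are retried from the input
 :onyx/pending-timeout 60000
 ;; cap on how many segments may be in flight awaiting acks
 :onyx/max-pending 10000}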

spangler21:11:49

That seems like undesirable behavior

lucasbradstreet21:11:13

Well, what’s the alternative? Complete the job even though one of the messages never made it through?

lucasbradstreet21:11:22

The bet is that the error is transitory

spangler21:11:28

Yeah, stop the job

spangler21:11:35

Or at least let the user catch it?

lucasbradstreet21:11:36

Yeah, we do that for most things

spangler21:11:37

and handle it?

lucasbradstreet21:11:48

The problem is that the error is in the messaging layer

spangler21:11:49

As it is, they spin forever and I need to kill everything

lucasbradstreet21:11:01

It’s kinda hard to give you hooks there

spangler21:11:19

Does that prevent us from catching it?

spangler21:11:28

I need to be able to handle that exception, but it is closed to me

lucasbradstreet21:11:37

I think these are the settings you’re looking for

lucasbradstreet21:11:37

"-Daeron.mtu.length=16384" "-Daeron.socket.so_sndbuf=2097152" "-Daeron.socket.so_rcvbuf=2097152" "-Daeron.rcv.buffer.length=16384" "-Daeron.rcv.initial.window.length=2097152"

lucasbradstreet21:11:47

I don’t think you need to handle that exception; you need to reconfigure Aeron

lucasbradstreet21:11:54

so you can send big messages

spangler21:11:10

It seems like a significant error state that, once triggered, is unrecoverable

spangler21:11:14

I will up the size

lucasbradstreet21:11:24

Create an issue for this anyway. I’ll definitely have a think about it

lucasbradstreet21:11:27

I can see what you’re saying

spangler21:11:29

but there has to be a better way

lucasbradstreet21:11:36

I think the only thing we could really do is kill the job

lucasbradstreet21:11:50

Aeron will never be able to send the message

spangler21:11:28

That would be good behavior, then the app would not freak out and need killing

spangler21:11:05

and I could fix it and redeploy without customers being totally distressed ; )

michaeldrogalis22:11:39

The thing to do here is to configure Aeron to handle messages as large as you can possibly send for your application. Onyx rebooting the job is desirable behavior. It's trying to get the work done at any cost. It has no idea that it's going to process the same message again - it cannot make that assumption. We have :onyx/restart-pred-fn to conditionally kill the job when there's a task-level error, but we can't do much if there's a networking link error. We presume it's transient and will be resolved.
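
A rough sketch of how that hook is wired into a catalog entry; the task and function names are hypothetical, and the predicate contract is paraphrased from this conversation:

{:onyx/name :process-segment
 :onyx/fn :my.app.core/process-segment
 :onyx/type :function
 :onyx/batch-size 20
 ;; fully qualified keyword resolving to a fn of the thrown exception;
 ;; returning true restarts the task, false lets the job be killed
 :onyx/restart-pred-fn :my.app.restart-policy/restart?}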

michaeldrogalis22:11:25

If the job restarting is a problem even after you've adjusted the size, you're going to need to design in a defense against a rebooted job. Can't do much else atm, and it feels out of scope to try to go there.

spangler22:11:07

I could design defensively, if I was able to tell from inside my code that something went wrong. The problem is the exception occurs outside of the scope of the task, so I can't handle it or respond to it

michaeldrogalis22:11:05

It's a hard thing to handle even if you could, because there's no good way to ask for an object's byte count in the JVM

spangler22:11:42

Is there some way to detect that a particular task is throwing that messaging exception (one we know is not transient) and not retry that task?

spangler22:11:01

Or provide configuration for that?

michaeldrogalis22:11:33

The closest thing I can think to do is catch that specific exception type and do a string comparison for that one particular error. Feels gross, but might be the right move
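
A sketch of that "gross but workable" predicate; the exception class and message fragment come from the error above, while the namespace and decision logic are assumptions:

(ns my.app.restart-policy)

(defn restart? [e]
  ;; never retry the Aeron "message too large" failure; it is not transient
  (not (and (instance? IllegalArgumentException e)
            (re-find #"exceeds maxMessageLength" (str (.getMessage e))))))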

spangler22:11:29

Are there other IllegalArgumentExceptions that you do want to retry?

spangler22:11:40

It seems like an IAE will always be a permanent failure

spangler22:11:12

That is not a network issue, it is a code issue

spangler22:11:40

ie: not transient

michaeldrogalis22:11:38

It's transient if your tasks are non-deterministic, e.g. if they're doing a database lookup and sending back results.

michaeldrogalis22:11:56

I'd say for now reconfigure Aeron to a very high limit; I'll think on it in the meantime

spangler22:11:05

Okay, thanks for checking it out

michaeldrogalis22:11:14

And yes, there are IAE errors we retry on purpose. Np

spangler23:11:34

Okay, here is a new one I haven't seen before:

clojure.core.async/thread-call/fn                async.clj:  434
   onyx.peer.task-lifecycle/launch-aux-threads!/fn       task_lifecycle.clj:  449
                onyx.peer.pipeline-extensions/fn/G  pipeline_extensions.clj:    4
                   clojure.core/-cache-protocol-fn         core_deftype.clj:  554
java.lang.IllegalArgumentException: No implementation of method: :ack-segment of protocol: #'onyx.peer.pipeline-extensions/PipelineInput found for class: onyx.peer.function.Function

spangler23:11:38

Any idea what that is about?