#onyx
2017-10-20
ben.mumford12:10:45

we are seeing the following error quite often:
Oct 13 17:04:16 stgonyx01 java[17679]: Exception in thread "main" clojure.lang.ExceptionInfo: Lost and regained image with the same session-id and different correlation-id. {:correlation-id 694, :original
Oct 13 17:04:16 stgonyx01 java[17679]: at clojure.core$ex_info.invokeStatic(core.clj:4617)
Oct 13 17:04:16 stgonyx01 java[17679]: at clojure.core$ex_info.invoke(core.clj:4617)
Oct 13 17:04:16 stgonyx01 java[17679]: at onyx.messaging.aeron.subscriber$check_correlation_id_alignment.invokeStatic(subscriber.clj:77)
Oct 13 17:04:16 stgonyx01 java[17679]: at onyx.messaging.aeron.subscriber$check_correlation_id_alignment.invoke(subscriber.clj:75)
Oct 13 17:04:16 stgonyx01 java[17679]: at onyx.messaging.aeron.subscriber.Subscriber.onFragment(subscriber.clj:262)
Oct 13 17:04:16 stgonyx01 java[17679]: at io.aeron.ControlledFragmentAssembler.onFragment(ControlledFragmentAssembler.java:141)
Oct 13 17:04:16 stgonyx01 java[17679]: at io.aeron.Image.controlledPoll(Image.java:332)
Oct 13 17:04:16 stgonyx01 java[17679]: at io.aeron.Subscription.controlledPoll(Subscription.java:236)
Oct 13 17:04:16 stgonyx01 java[17679]: at onyx.messaging.aeron.subscriber.Subscriber.poll_BANG_(subscriber.clj:204)
Oct 13 17:04:16 stgonyx01 java[17679]: at onyx.messaging.aeron.messenger.AeronMessenger.poll(messenger.clj:152)
Oct 13 17:04:16 stgonyx01 java[17679]: at onyx.peer.read_batch$read_function_batch.invokeStatic(read_batch.clj:18)
Oct 13 17:04:16 stgonyx01 java[17679]: at onyx.peer.read_batch$read_function_batch.invoke(read_batch.clj:17)
Oct 13 17:04:16 stgonyx01 java[17679]: at onyx.peer.task_lifecycle$wrap_lifecycle_metrics$fn__25591.invoke(task_lifecycle.clj:900)
Oct 13 17:04:16 stgonyx01 java[17679]: at onyx.peer.task_lifecycle.TaskStateMachine.exec(task_lifecycle.clj:873)
Oct 13 17:04:16 stgonyx01 java[17679]: at onyx.peer.task_lifecycle$iteration.invokeStatic(task_lifecycle.clj:458)
Oct 13 17:04:16 stgonyx01 java[17679]: at onyx.peer.task_lifecycle$iteration.invoke(task_lifecycle.clj:455)
Oct 13 17:04:16 stgonyx01 java[17679]: at onyx.peer.task_lifecycle$run_task_lifecycle_BANG_.invokeStatic(task_lifecycle.clj:476)
Oct 13 17:04:16 stgonyx01 java[17679]: at onyx.peer.task_lifecycle$run_task_lifecycle_BANG_.invoke(task_lifecycle.clj:466)
Oct 13 17:04:16 stgonyx01 java[17679]: at onyx.peer.task_lifecycle$start_task_lifecycle_BANG_$fn__25611.invoke(task_lifecycle.clj:953)
Oct 13 17:04:16 stgonyx01 java[17679]: at clojure.core.async$thread_call$fn__9579.invoke(async.clj:442)
Oct 13 17:04:16 stgonyx01 java[17679]: at clojure.lang.AFn.run(AFn.java:22)
Oct 13 17:04:16 stgonyx01 java[17679]: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
Oct 13 17:04:16 stgonyx01 java[17679]: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
Oct 13 17:04:16 stgonyx01 java[17679]: at java.lang.Thread.run(Thread.java:748)

ben.mumford12:10:03

any idea what this means, and how to mitigate it?

ben.mumford14:10:18

i'm using the following dependencies: [org.onyxplatform/onyx "0.11.0"] [org.onyxplatform/lib-onyx "0.11.0.0"] [org.onyxplatform/onyx-kafka "0.11.0.0"]

lucasbradstreet16:10:51

@ben.mumford did it sort itself out? Essentially what happened is the aeron messaging subscriber dropped off, but then rejoined again at a later point in the stream, which is unsafe because you could lose messages. I believed it would never happen in practice, so added that check as kind of an assert. In reality, I think there are conditions like GCs where the subscriber could be booted but be rejoined with the same session-id.

lucasbradstreet16:10:28

Generally what will happen is that the peer will reboot and it should rejoin correctly, but if you are seeing these circumstances frequently then you probably have some larger problem (generally long GCs causing aeron timeouts)

ben.mumford16:10:38

tbh we've been having all sorts of problems with aeron

ben.mumford16:10:03

the most reliable we have it is when we run the embedded aeron

ben.mumford16:10:11

is there a reason it isn't recommended in production?

lucasbradstreet16:10:07

Funnily enough that is recommended because you want to reduce the chance of your user code causing GCs and that causing timeouts in aeron.

ben.mumford16:10:16

ah so embedded aeron is ok?

ben.mumford16:10:25

then that settles it 🙂

ben.mumford16:10:31

why does the documentation say otherwise?

lucasbradstreet16:10:45

No, I mean it’s recommended to run outside the main JVM for that reason
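
A minimal sketch of the peer-config side of that recommendation, assuming the usual Onyx messaging keys. With `:onyx.messaging.aeron/embedded-driver?` set to false the peer does not start a media driver inside its own JVM and expects one to be running already as a separate process on the same host. The port and bind address are placeholder values, not taken from this conversation.

```clojure
;; Sketch only: merge these entries into your existing peer-config.
(def peer-config
  {:onyx.messaging/impl :aeron
   ;; false = do not start a media driver inside the peer JVM;
   ;; an external driver process must already be running on this host.
   :onyx.messaging.aeron/embedded-driver? false
   ;; Placeholder values, adjust for your environment.
   :onyx.messaging/peer-port 40200
   :onyx.messaging/bind-addr "localhost"})
```

The point of the split is that a long GC pause in user code then stalls only the peer JVM, while the driver keeps servicing its connections.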

lucasbradstreet16:10:02

It’s odd that you’re having the opposite experience unless your aeron is underprovisioned

ben.mumford16:10:48

in what way underprovisioned?

lucasbradstreet16:10:53

Mostly if you’re using something like mesos or k8s and aren’t giving it enough memory (JVM or container) and CPU slices. Not that it needs a lot.

lucasbradstreet16:10:28

If you could take flight recordings of both the onyx app JVM and the aeron JVM, that would help. Are you seeing these issues locally?
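
One way to capture such recordings on the peer JVM, sketched as a Leiningen `project.clj` fragment. This assumes Oracle JDK 8, where Java Flight Recorder sits behind `-XX:+UnlockCommercialFeatures`; on JDK 11 and later only the `-XX:StartFlightRecording` option should be needed. The project name, duration and file name are hypothetical.

```clojure
;; project.clj fragment (hypothetical project name and file path)
(defproject your-app "0.1.0-SNAPSHOT"
  :jvm-opts ["-XX:+UnlockCommercialFeatures"   ; Oracle JDK 8 only
             "-XX:+FlightRecorder"             ; Oracle JDK 8 only
             ;; Record the first two minutes after startup into a file.
             "-XX:StartFlightRecording=duration=120s,filename=/tmp/onyx-peer.jfr"])
```

The same flags can be placed on the `java` command line of the media driver process to get the second recording.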

ben.mumford16:10:57

we were running like this (as described on the aeron github readme): java -cp aeron.jar -Daeron.dir=/dev/shm/aeron io.aeron.driver.MediaDriver
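
One thing that may be worth checking with that command: `-Daeron.dir` moves the driver's directory away from Aeron's default location (on Linux, `/dev/shm/aeron-<username>`), and a client that is not given the same setting looks in the default place and will not find the driver. A sketch of passing the same property to the peer JVM through Leiningen follows. It assumes the Aeron client inside the Onyx peer honours the standard `aeron.dir` system property, which is not confirmed in this conversation; the project name is hypothetical.

```clojure
;; project.clj fragment (hypothetical project name)
(defproject your-app "0.1.0-SNAPSHOT"
  ;; Must match the -Daeron.dir value given to the external media driver,
  ;; and both processes need read/write access to that directory.
  :jvm-opts ["-Daeron.dir=/dev/shm/aeron"])
```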

lucasbradstreet16:10:31

The same version as Onyx 0.11.0 uses?

lucasbradstreet16:10:07

0.11.0 runs on 1.4.1
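
If the standalone driver is built from its own small project, pinning Aeron to that version might look like the sketch below. The `io.aeron/aeron-all` coordinate and the project name are assumptions; launching the driver from the same project as the peers would pick up whatever Aeron version Onyx pulls in transitively.

```clojure
;; project.clj fragment for a separate media-driver project (hypothetical name)
(defproject your-app-media-driver "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.8.0"]
                 ;; Keep this in step with the Aeron version your Onyx release uses.
                 [io.aeron/aeron-all "1.4.1"]])
```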

ben.mumford16:10:22

we'd run aeron, try to start the peers and then nothing in the logs and nothing in the dashboard

ben.mumford16:10:37

when we use the embedded one we see all the uuids for all the peers

ben.mumford16:10:50

aeron is driving me round the bend

ben.mumford16:10:14

i saw in another message you wrote about a -server flag?

ben.mumford16:10:17

does that still apply

ben.mumford16:10:28

i couldn't find anything in any documentation for it

lucasbradstreet16:10:17

Oh, ok, so you can’t run with non-embedded Aeron at all?

lucasbradstreet16:10:43

-server is just a flag that sets a few JVM options that tend to be good defaults for running server apps in prod.
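
When the peers are launched through Leiningen, the flag is usually supplied as in the sketch below. Leiningen's default JVM options favour startup time by limiting tiered compilation, so `^:replace` is used to drop them. On a 64-bit HotSpot JVM the server VM is typically already the default, in which case the main effect of this fragment is removing Leiningen's defaults. The project name is hypothetical.

```clojure
;; project.clj fragment (hypothetical project name)
(defproject your-app "0.1.0-SNAPSHOT"
  ;; ^:replace discards Leiningen's default :jvm-opts instead of appending to them.
  :jvm-opts ^:replace ["-server"])
```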

ben.mumford16:10:03

it runs using the sample code in the onyx documentation but is flakey as anything

lucasbradstreet16:10:06

To wrap up. You’re seeing these session issues with an embedded aeron, and when you run it out of process it doesn’t work at all.

ben.mumford16:10:20

and running the distribution aeron as described above, nothing works at all

ben.mumford16:10:27

so we're back to embedded for now

lucasbradstreet16:10:42

Which sample code? The sample external aeron?

ben.mumford16:10:05

(ns your-app.aeron-media-driver
  (:require [clojure.core.async :refer [chan <!!]])
  (:import [io.aeron Aeron$Context]
           [io.aeron.driver MediaDriver MediaDriver$Context ThreadingMode]))

(defn -main [& args]
  (let [ctx (doto (MediaDriver$Context.))
        media-driver (MediaDriver/launch ctx)]
    (println "Launched the Media Driver. Blocking forever...")
    (<!! (chan))))

lucasbradstreet16:10:23

Right, so you run that out of process, and have the session problems / general flakiness?

ben.mumford16:10:38

sorry for not being clear

lucasbradstreet16:10:50

No worries, just trying to make sure I understand the whole picture.

lucasbradstreet16:10:58

Can you reproduce these issues locally?

ben.mumford16:10:18

erm, not sure tbh i can try

lucasbradstreet16:10:31

sure, we have options if not, it’ll just be easier locally.

ben.mumford16:10:45

when i'm developing locally i'm using the trusty embedded aeron 🙂

ben.mumford16:10:18

we could try the -server option i guess, see what that does

lmergen17:10:34

in onyx' current design, is it correct that a full OS thread is dedicated to each virtual peer ? https://github.com/onyx-platform/onyx/blob/0.11.x/src/onyx/peer/task_lifecycle.clj#L1048

lmergen17:10:25

i'm trying to figure out some design constraints where you might want to run many virtual peers

lmergen17:10:33

(say, hundreds per server)

souenzzo18:10:04

Can I have a function called whenever an onyx peer loses its zookeeper connection?

lucasbradstreet18:10:48

@lmergen that’s correct, but we have discussed thread sharing and it wouldn’t be that hard to use an executor to do so as all our code is non blocking

lucasbradstreet18:10:08

@lmergen we’ve been waiting for the right time (or someone to contribute it). It’s absolutely possible.
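
A rough sketch of the thread-sharing idea, not Onyx's implementation: because each peer iteration is non-blocking, many peers can be stepped in turn by a small fixed pool instead of one OS thread each. Everything here (`make-peer`, `iterate-peer!`, `run-shared!`) is a hypothetical stand-in; a real peer iteration would poll, process a batch, and return.

```clojure
(import '[java.util.concurrent Executors ExecutorService Future])

(defn make-peer
  "Hypothetical stand-in for a virtual peer; it only counts its iterations."
  [id]
  {:id id :iterations (atom 0)})

(defn iterate-peer!
  "One non-blocking unit of work for a peer."
  [peer]
  (swap! (:iterations peer) inc))

(defn run-shared!
  "Runs `rounds` iterations of every peer on a pool of `n-threads` threads.
  Each worker owns a fixed slice of the peers, so a given peer is only ever
  stepped by one thread."
  [peers n-threads rounds]
  (let [^ExecutorService pool (Executors/newFixedThreadPool (int n-threads))
        slice-size (max 1 (long (Math/ceil (/ (count peers) (double n-threads)))))
        tasks (vec (for [slice (partition-all slice-size peers)]
                     (fn []
                       (dotimes [_ rounds]
                         (doseq [p slice] (iterate-peer! p))))))]
    (try
      ;; invokeAll blocks until every task has finished; .get rethrows failures.
      (doseq [^Future fut (.invokeAll pool tasks)]
        (.get fut))
      (finally
        (.shutdown pool)))))

;; Usage: 200 hypothetical peers sharing 4 threads, 10 iterations each.
(def peers (mapv make-peer (range 200)))
(run-shared! peers 4 10)
```

Giving each worker a fixed slice keeps per-peer state single-threaded, which avoids locking. The trade-off is that one slow iteration delays every other peer in the same slice.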

lmergen18:10:22

ok good to know.

lucasbradstreet18:10:28

@souenzzo could you explain what you mean a little further?

souenzzo18:10:49

I'm getting 21:55:26.034 ERROR org.apache.curator.ConnectionState - Connection timed out for connection string from

...
onyx.log.zookeeper$clean_up_broken_connections
...
It doesn't stop my peer, but it's a huge problem. I want to send a notification when it occurs

lucasbradstreet18:10:02

Ah, so you’re interested in monitoring? We recommend setting up prometheus, pointing it at onyx-peer-http-query and then using alerts like the ones we set up here https://github.com/onyx-platform/onyx-monitoring
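
A sketch of the peer-config entries that turn on the onyx-peer-http-query server so that prometheus has something to scrape. The key names are given as an assumption based on that library's README and should be checked against the version in use; the IP and port are placeholders.

```clojure
;; Sketch only: merge these entries into your existing peer-config.
;; Key names are assumed from the onyx-peer-http-query README.
(def query-server-config
  {:onyx.query/server? true
   ;; Placeholder address and port for the HTTP endpoint prometheus scrapes.
   :onyx.query.server/ip "0.0.0.0"
   :onyx.query.server/port 8080})
```

Alert rules, such as the ones in the onyx-monitoring repository, are then defined on the prometheus side against the scraped metrics.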