This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2017-10-20
we are seeing the following error quite often:
Oct 13 17:04:16 stgonyx01 java[17679]: Exception in thread "main" clojure.lang.ExceptionInfo: Lost and regained image with the same session-id and different correlation-id. {:correlation-id 694, :original
Oct 13 17:04:16 stgonyx01 java[17679]: at clojure.core$ex_info.invokeStatic(core.clj:4617)
Oct 13 17:04:16 stgonyx01 java[17679]: at clojure.core$ex_info.invoke(core.clj:4617)
Oct 13 17:04:16 stgonyx01 java[17679]: at onyx.messaging.aeron.subscriber$check_correlation_id_alignment.invokeStatic(subscriber.clj:77)
Oct 13 17:04:16 stgonyx01 java[17679]: at onyx.messaging.aeron.subscriber$check_correlation_id_alignment.invoke(subscriber.clj:75)
Oct 13 17:04:16 stgonyx01 java[17679]: at onyx.messaging.aeron.subscriber.Subscriber.onFragment(subscriber.clj:262)
Oct 13 17:04:16 stgonyx01 java[17679]: at io.aeron.ControlledFragmentAssembler.onFragment(ControlledFragmentAssembler.java:141)
Oct 13 17:04:16 stgonyx01 java[17679]: at io.aeron.Image.controlledPoll(Image.java:332)
Oct 13 17:04:16 stgonyx01 java[17679]: at io.aeron.Subscription.controlledPoll(Subscription.java:236)
Oct 13 17:04:16 stgonyx01 java[17679]: at onyx.messaging.aeron.subscriber.Subscriber.poll_BANG_(subscriber.clj:204)
Oct 13 17:04:16 stgonyx01 java[17679]: at onyx.messaging.aeron.messenger.AeronMessenger.poll(messenger.clj:152)
Oct 13 17:04:16 stgonyx01 java[17679]: at onyx.peer.read_batch$read_function_batch.invokeStatic(read_batch.clj:18)
Oct 13 17:04:16 stgonyx01 java[17679]: at onyx.peer.read_batch$read_function_batch.invoke(read_batch.clj:17)
Oct 13 17:04:16 stgonyx01 java[17679]: at onyx.peer.task_lifecycle$wrap_lifecycle_metrics$fn__25591.invoke(task_lifecycle.clj:900)
Oct 13 17:04:16 stgonyx01 java[17679]: at onyx.peer.task_lifecycle.TaskStateMachine.exec(task_lifecycle.clj:873)
Oct 13 17:04:16 stgonyx01 java[17679]: at onyx.peer.task_lifecycle$iteration.invokeStatic(task_lifecycle.clj:458)
Oct 13 17:04:16 stgonyx01 java[17679]: at onyx.peer.task_lifecycle$iteration.invoke(task_lifecycle.clj:455)
Oct 13 17:04:16 stgonyx01 java[17679]: at onyx.peer.task_lifecycle$run_task_lifecycle_BANG_.invokeStatic(task_lifecycle.clj:476)
Oct 13 17:04:16 stgonyx01 java[17679]: at onyx.peer.task_lifecycle$run_task_lifecycle_BANG_.invoke(task_lifecycle.clj:466)
Oct 13 17:04:16 stgonyx01 java[17679]: at onyx.peer.task_lifecycle$start_task_lifecycle_BANG_$fn__25611.invoke(task_lifecycle.clj:953)
Oct 13 17:04:16 stgonyx01 java[17679]: at clojure.core.async$thread_call$fn__9579.invoke(async.clj:442)
Oct 13 17:04:16 stgonyx01 java[17679]: at clojure.lang.AFn.run(AFn.java:22)
Oct 13 17:04:16 stgonyx01 java[17679]: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
Oct 13 17:04:16 stgonyx01 java[17679]: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
Oct 13 17:04:16 stgonyx01 java[17679]: at java.lang.Thread.run(Thread.java:748)
any idea what this means, and how to mitigate it?
i'm using the following dependencies: [org.onyxplatform/onyx "0.11.0"] [org.onyxplatform/lib-onyx "0.11.0.0"] [org.onyxplatform/onyx-kafka "0.11.0.0"]
@ben.mumford did it sort itself out? Essentially what happened is the aeron messaging subscriber dropped off, but then rejoined again at a later point in the stream, which is unsafe because you could lose messages. I believed it would never happen in practice, so added that check as kind of an assert. In reality, I think there are conditions like GCs where the subscriber could be booted but be rejoined with the same session-id.
Generally what will happen is that the peer will reboot and it should rejoin correctly, but if you are seeing these circumstances frequently then you probably have some larger problem (generally long GCs causing aeron timeouts)
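To make the check described above concrete, here is a minimal sketch of that kind of assertion. It is not Onyx's actual `check_correlation_id_alignment` code: the function name `check-alignment`, the `known-images` map, and the `:previous-correlation-id` key are all hypothetical, chosen only to illustrate "same session-id, different correlation-id means the image was lost and regained".

```clojure
;; Hypothetical sketch, not the Onyx implementation.
;; known-images maps session-id -> the correlation-id first seen for it.
(defn check-alignment
  "Returns known-images updated with this image, or throws if the same
  session-id reappears with a different correlation-id."
  [known-images session-id correlation-id]
  (let [previous (get known-images session-id)]
    (when (and (some? previous) (not= previous correlation-id))
      ;; The subscriber dropped off and rejoined mid-stream, so messages
      ;; may have been missed. Fail loudly instead of continuing.
      (throw (ex-info "Lost and regained image with the same session-id and different correlation-id."
                      {:correlation-id correlation-id
                       :previous-correlation-id previous})))
    (assoc known-images session-id correlation-id)))
```

In this sketch the peer would be expected to catch the exception at a higher level and reboot, which matches the recovery behaviour described above.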
tbh we've been having all sorts of problems with aeron
the most reliable we have it is when we run the embedded aeron
is there a reason it isn't recommended in production?
Funnily enough that is recommended because you want to reduce the chance of your user code causing GCs and that causing timeouts in aeron.
ah so embedded aeron is ok?
then that settles it 🙂
why does the documentation say otherwise?
No, I mean it’s recommended to run outside the main JVM for that reason
It’s odd that you’re having the opposite experience unless your aeron is underprovisioned
in what way underprovisioned?
Mostly if you’re using something like mesos or k8s and aren’t giving it enough memory (JVM or container) and CPU slices. Not that it needs a lot.
If you could take flight recordings of both the onyx app JVM and the aeron JVM, that would help. Are you seeing these issues locally?
we were running like this (as described on the aeron github readme): java -cp aeron.jar -Daeron.dir=/dev/shm/aeron io.aeron.driver.MediaDriver
The same version as Onyx 0.11.0 uses?
0.11.0 runs on 1.4.1
v1.4.1
we'd run aeron, try to start the peers and then nothing in the logs and nothing in the dashboard
when we use the embedded one we see all the uuids for all the peers
aeron is driving me round the bend
i saw in another message you wrote about a -server flag?
does that still apply
i couldn't find anything in any documentation for it
Oh, ok, so you can’t run with non-embedded Aeron at all?
-server
is just a flag that sets a few JVM options that tend to be good defaults for running server apps in prod.
it runs using the sample code in the onyx documentation but is flakey as anything
To wrap up. You’re seeing these session issues with an embedded aeron, and when you run it out of process it doesn’t work at all.
and running the distribution aeron as described above, nothing works at all
so we're back to embedded for now
Which sample code? The sample external aeron?
(ns your-app.aeron-media-driver
  (:require [clojure.core.async :refer [chan <!!]])
  (:import [io.aeron Aeron$Context]
           [io.aeron.driver MediaDriver MediaDriver$Context ThreadingMode]))

(defn -main [& args]
  (let [ctx (doto (MediaDriver$Context.))
        media-driver (MediaDriver/launch ctx)]
    (println "Launched the Media Driver. Blocking forever...")
    (<!! (chan))))
Right, so you run that out of process, and have the session problems / general flakiness?
sorry for not being clear
No worries, just trying to make sure I understand the whole picture.
Can you reproduce these issues locally?
erm, not sure tbh i can try
sure, we have options if not, it’ll just be easier locally.
when i'm developing locally i'm using the trusty embedded aeron 🙂
we could try the -server option i guess, see what that does
in onyx' current design, is it correct that a full OS thread is dedicated to each virtual peer ? https://github.com/onyx-platform/onyx/blob/0.11.x/src/onyx/peer/task_lifecycle.clj#L1048
i'm trying to figure out some design constraints where you might want to run many virtual peers
@lmergen that’s correct, but we have discussed thread sharing and it wouldn’t be that hard to use an executor to do so as all our code is non blocking
@lmergen we’ve been waiting for the right time (or someone to contribute it). It’s absolutely possible.
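A rough sketch of the thread-sharing idea discussed above, assuming each virtual peer can be modelled as a non-blocking step function. This is not how Onyx is implemented today (it dedicates a thread per virtual peer, as noted); `make-peer` and `run-shared` are hypothetical names, and only the `java.util.concurrent` calls are real APIs.

```clojure
(import '[java.util.concurrent Executors ExecutorService Future])

;; Hypothetical stand-in for a virtual peer: a non-blocking step function
;; that does one small unit of work and returns immediately.
(defn make-peer [counter]
  (fn [] (swap! counter inc)))

(defn run-shared
  "Runs `rounds` scheduling rounds. In each round every step function is
  executed once on a shared pool of `n-threads` threads, rather than on
  one dedicated OS thread per peer."
  [steps n-threads rounds]
  (let [^ExecutorService pool (Executors/newFixedThreadPool n-threads)
        tasks (vec steps)]          ; Clojure fns implement Callable
    (try
      (dotimes [_ rounds]
        ;; invokeAll blocks until every task in this round has finished.
        (doseq [^Future f (.invokeAll pool ^java.util.Collection tasks)]
          (.get f)))                ; rethrows any exception from a step
      (finally
        (.shutdown pool)))))
```

For example, `(run-shared (mapv make-peer counters) 2 5)` would drive any number of peers on two threads. This only works because each step returns quickly; a blocking step would stall every peer sharing that thread.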
@souenzzo could you explain what you mean a little further?
I'm getting 21:55:26.034 ERROR org.apache.curator.ConnectionState - Connection timed out for connection string
from
...
onyx.log.zookeeper$clean_up_broken_connections
...
It doesn't stop my peer, but it's a huge problem. I want to send a notification when it occurs
Ah, so you’re interested in monitoring? We recommend setting up prometheus, pointing it at onyx-peer-http-query and then using alerts like what we set up here https://github.com/onyx-platform/onyx-monitoring