This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
@spangler: See the docs: https://github.com/onyx-platform/onyx/blob/0.8.x/doc/user-guide/architecture-low-level-design.md Your messages are failing to complete, and are being replayed for fault tolerance.
You're seeing "Stopping Task LifeCycle for blahblah" and all your data is coming back to you?
Tasks will restart themselves if they encounter an exception. Maybe it's throwing an exception on shutdown?
I noticed in your catalog you set
true, which would cause a full reload of your Kafka stream, so that's not surprising
It's a helpful option if you want to reprocess the entire message stream: say you had a bug in your code that caused a running aggregate to be incorrect, and you need to replay the entire stream from the beginning to correct it.
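For context, the kind of catalog entry being discussed might look roughly like this. This is a sketch from memory of the onyx-kafka plugin around 0.8; the key names (`:kafka/force-reset?`, `:kafka/offset-reset`) are assumptions and should be checked against the plugin version you're on:

```clojure
;; Hypothetical onyx-kafka input catalog entry (key names assumed).
;; Setting the reset flag to true with :smallest forces consumption
;; from the beginning of the topic, replaying the whole stream.
{:onyx/name :read-messages
 :onyx/plugin :onyx.plugin.kafka/read-messages
 :onyx/type :input
 :onyx/medium :kafka
 :kafka/topic "my-topic"
 :kafka/offset-reset :smallest
 :kafka/force-reset? true
 :onyx/batch-size 100
 :onyx/doc "Reads segments from a Kafka topic"}
```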
I suspect your job is failing to shut down cleanly, Onyx is resurrecting it, and it's starting from the beginning again, because that's what the configuration is set to do.
I'd try stripping down your job piece by piece and finding the part that's causing the problem.
We gave you that SNAPSHOT release yesterday to fix the
:data key problem, but we knew that under some particular conditions, it swallowed exceptions. I fixed the problem entirely yesterday and pushed up a new SNAPSHOT.
Move to the latest and you'll probably see your exception there. Sorry - I forgot you moved to the edge.
Okay, here we go:
FATAL [onyx.messaging.aeron.publication-manager] - Aeron write from buffer error: java.lang.IllegalArgumentException: Encoded message exceeds maxMessageLength of 2097152, length=5716465
I haven't seen that error before. You shouldn't have to configure Aeron like that.
I've not tested Onyx with messages that large before. I'd look at Aeron and see what you need to configure to allow you to do that.
They're very particular about what they offer by default to achieve such high performance.
I submitted two jobs. The first one is working correctly. Onyx and Kafka share the same ZooKeeper.
After submitting the second job, I got a ZooKeeper bad-version error and everything hangs.
No deadlock detection. It's weird that you're getting a ZooKeeper error. Are you using the same group id for ZooKeeper checkpointing?
Is the ZooKeeper error showing up in the ZK logs, or in the Onyx logs?
Not usually. You should be able to rejoin on the same id, and the jobs should be auto-killed. Let us know if you hit any cases where they aren't.
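For reference, the shared id lives in the peer config; a minimal sketch, with key names as I remember them from the Onyx 0.8 era (so double-check against your version):

```clojure
;; Minimal sketch of an Onyx 0.8-era peer config. The :onyx/id must
;; match across every peer (and the env config) that should join the
;; same cluster; rejoining with the same id picks the same cluster
;; state back up out of ZooKeeper.
{:onyx/id "my-cluster-id"             ; shared cluster/"group" id
 :zookeeper/address "127.0.0.1:2181"  ; here, the same ZK ensemble Kafka uses
 :onyx.messaging/impl :aeron
 :onyx.messaging/peer-port 40200
 :onyx.messaging/bind-addr "localhost"}
```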
FATAL [onyx.messaging.aeron.publication-manager] - Aeron write from buffer error: java.lang.IllegalArgumentException: Encoded message exceeds maxMessageLength of 2097152, length=2112492
Create an issue and I’ll look into whether a single big message can take it down
Well, what’s the alternative? Complete the job even though one of the messages never made it through?
"-Daeron.mtu.length=16384" "-Daeron.socket.so_sndbuf=2097152" "-Daeron.socket.so_rcvbuf=2097152" "-Daeron.rcv.buffer.length=16384" "-Daeron.rcv.initial.window.length=2097152"
I don’t think you need to handle that exception; you need to reconfigure Aeron.
The thing to do here is to configure Aeron to handle messages as large as you can possibly send for your application. Onyx rebooting the job is desirable behavior. It's trying to get the work done at any cost. It has no idea that it's going to process the same message again - it cannot make that assumption. We have
:onyx/restart-pred-fn to conditionally kill the job when there's a task-level error, but we can't do much if there's a networking link error. We presume it's transient and will be resolved.
If the job restarting is a problem even after you've adjusted the size, you're going to need to design in a defense against a rebooted job. Can't do much else atm, and it feels out of scope to try to go there.
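If it helps: as I understand Aeron's framing, the max message length is derived from the term buffer length (roughly termBufferLength / 8; note 2097152 is exactly the 16 MB default term / 8), so the term buffer is the knob to turn. A sketch, not verified against your Aeron version:

```shell
# Raise Aeron's term buffer from the 16 MB default to 64 MB, which
# (at termBufferLength / 8) should lift maxMessageLength from 2 MB
# to 8 MB -- enough headroom for the 5.7 MB message in the error above.
JAVA_OPTS="$JAVA_OPTS -Daeron.term.buffer.length=$((64 * 1024 * 1024))"

# Resulting max message length in bytes, by the termLength / 8 rule:
echo $((64 * 1024 * 1024 / 8))
```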
I could design defensively, if I were able to tell from inside my code that something went wrong. The problem is the exception occurs outside of the scope of the task, so I can't handle it or respond to it.
It's a hard thing to handle even if you could, because there's no good way to ask for the size of an object's byte count in the JVM.
Is there some way to detect that a particular task is throwing that messaging exception (one we know is not transient) and not retry that task?
The closest thing I can think to do is catch that specific exception type and do a string comparison for that one particular error. Feels gross, but might be the right move.
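That "gross but might be right" approach could look something like this, as a sketch against the :onyx/restart-pred-fn hook mentioned above. The predicate's exact signature (and whether it receives the exception directly) is an assumption from memory of Onyx 0.8 and should be checked against your version:

```clojure
(ns my.app.restart-pred)

;; Hypothetical restart predicate: restart on anything presumed
;; transient, but return false (letting the job be killed) when we
;; see Aeron's non-transient "exceeds maxMessageLength" error.
(defn restart? [^Exception e]
  (let [msg (or (.getMessage e) "")]
    (not (.contains msg "exceeds maxMessageLength"))))

;; Then referenced from the task's catalog entry, e.g.:
;; {:onyx/name :my-task
;;  ...
;;  :onyx/restart-pred-fn :my.app.restart-pred/restart?}
```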
It's transient if your tasks are non-deterministic, e.g. if they're doing a database lookup and sending back results.
I'd say for now reconfigure Aeron to a very high limit; I'll think on it in the meantime.
Okay, here is a new one I haven't seen before:
clojure.core.async/thread-call/fn                         async.clj:  434
onyx.peer.task-lifecycle/launch-aux-threads!/fn  task_lifecycle.clj:  449
onyx.peer.pipeline-extensions/fn/G          pipeline_extensions.clj:    4
clojure.core/-cache-protocol-fn                    core_deftype.clj:  554
java.lang.IllegalArgumentException: No implementation of method: :ack-segment of protocol: #'onyx.peer.pipeline-extensions/PipelineInput found for class: onyx.peer.function.Function