2015-11-11
@spangler: See the docs: https://github.com/onyx-platform/onyx/blob/0.8.x/doc/user-guide/architecture-low-level-design.md Your messages are failing to complete, and are being replayed for fault tolerance.
Your segments are failing to be fully processed.
You're seeing "Stopping Task LifeCycle for blahblah" and all your data is coming back to you?
Tasks will restart themselves if they encounter an exception. Maybe it's throwing an exception on shutdown?
I noticed in your catalog you set :force-reset? to true, which would cause a full reload of your Kafka stream, so that's not surprising.
It's a helpful option when you want to reprocess the entire message stream: say you had a bug in your code that caused a running aggregate to be incorrect, and you need to see the entire stream from the beginning to correct it.
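A minimal sketch of the kind of catalog entry being discussed, assuming the onyx-kafka plugin keys of that era; the topic, group id, and :my.app/deserialize-message are placeholders, not taken from this conversation:
;; Hypothetical Kafka input entry; key names assume the onyx-kafka plugin.
{:onyx/name :read-messages
 :onyx/plugin :onyx.plugin.kafka/read-messages
 :onyx/type :input
 :onyx/medium :kafka
 :kafka/topic "events"
 :kafka/group-id "my-consumer-group"
 :kafka/zookeeper "127.0.0.1:2181"
 :kafka/offset-reset :smallest
 :kafka/force-reset? true   ;; true => re-read the topic from the beginning on every start
 :kafka/deserializer-fn :my.app/deserialize-message
 :onyx/batch-size 100}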
I suspect your job is failing to shut down cleanly, Onyx is resurrecting it, and it's starting from the beginning again, because that's what the configuration is set to do.
I'd try stripping down your job piece by piece and finding the part that's causing the problem.
Ah -- one thing... Hold on a sec.
We gave you that SNAPSHOT release yesterday to fix the :data key problem, but we knew that under some particular conditions it swallowed exceptions. I fixed the problem entirely yesterday and pushed up a new SNAPSHOT.
Move to the latest and you'll probably see your exception there. Sorry, I forgot you moved to the edge.
Okay, here we go: FATAL [onyx.messaging.aeron.publication-manager] - Aeron write from buffer error: java.lang.IllegalArgumentException: Encoded message exceeds maxMessageLength of 2097152, length=5716465
Yeah, sorry about that. That was a bug in Timbre we found a while back
I haven't seen that error before. You shouldn't have to configure Aeron like that.
How big of a message are you sending?
I've not tested Onyx with messages that large before. I'd look at Aeron and see what you need to configure to allow you to do that.
They're very particular about what they offer by default to achieve such high performance.
I submitted two jobs. The first one is working correctly. Onyx and Kafka share the same ZooKeeper.
After submitting the second job, I got a ZooKeeper bad version error and everything hangs.
No deadlock detection. It's weird that you're getting a ZooKeeper error. Are you using the same group-id for ZooKeeper checkpointing?
You should probably use a different group-id between the two jobs.
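A sketch of that suggestion, assuming the plugin's :kafka/group-id key; kafka-input-entry is a hypothetical base catalog map:
;; Give each job its own consumer group so their checkpoint state
;; in ZooKeeper does not collide.
(def job-a-input (assoc kafka-input-entry :kafka/group-id "job-a-consumer"))
(def job-b-input (assoc kafka-input-entry :kafka/group-id "job-b-consumer"))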
Ah, hmm
Do you have enough peers for both jobs?
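For context, a job only runs once there is at least one virtual peer per task, so two jobs need enough peers between them. A sketch, with the start-peers signature assumed from the 0.8.x API:
;; Rough lower bound: one virtual peer per catalog entry, summed over both jobs.
(def n-peers
  (+ (count (:catalog job-a))
     (count (:catalog job-b))))

(def peer-group (onyx.api/start-peer-group peer-config))
(def v-peers (onyx.api/start-peers n-peers peer-group))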
Is the ZooKeeper error showing up in the ZK logs, or is it in the Onyx logs?
Using ZK over high latency / unreliable connections can suck
Not usually. You should be able to rejoin on the same id and the jobs should be auto killed. Let us know if you hit any cases where they aren't
Okay @michaeldrogalis @lucasbradstreet I have identified this error:
FATAL [onyx.messaging.aeron.publication-manager] - Aeron write from buffer error: java.lang.IllegalArgumentException: Encoded message exceeds maxMessageLength of 2097152, length=2112492
It should try to reset the publication after that point
Create an issue and I’ll look into whether a single big message can take it down
What’s probably happening is that message never makes it through
Then after the pending-timeout it retries
same result etc
So the job never finishes
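The retry loop described here is driven by the input task's pending timeout. A sketch of where that knob lives, with :onyx/pending-timeout assumed from the 0.8.x catalog options:
;; Un-acked segments are replayed after this many milliseconds.
{:onyx/name :read-messages
 :onyx/type :input
 :onyx/pending-timeout 60000
 :onyx/batch-size 100}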
Well, what’s the alternative? Complete the job even though one of the messages never made it through?
The bet is that the error is transitory
Yeah, we do that for most things
The problem is that the error is in the messaging layer
It’s kinda hard to give you hooks there
I think these are the settings you’re looking for
"-Daeron.mtu.length=16384" "-Daeron.socket.so_sndbuf=2097152" "-Daeron.socket.so_rcvbuf=2097152" "-Daeron.rcv.buffer.length=16384" "-Daeron.rcv.initial.window.length=2097152"
I don’t think you need to handle that exception, you need to reconfigure aeron
so you can send big messages
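One way to pass those properties is as JVM options on the peer process. The Leiningen snippet below is a sketch that simply mirrors the flags quoted above, plus aeron.term.buffer.length, on the assumption that Aeron's maximum message length is derived from the term buffer size (the 2097152 limit is consistent with the 16 MB default):
;; project.clj sketch for the peer JVM; tune values for your segment sizes.
:jvm-opts ["-Daeron.mtu.length=16384"
           "-Daeron.socket.so_sndbuf=2097152"
           "-Daeron.socket.so_rcvbuf=2097152"
           "-Daeron.rcv.buffer.length=16384"
           "-Daeron.rcv.initial.window.length=2097152"
           ;; raises the per-message cap if segments can exceed ~2 MB
           "-Daeron.term.buffer.length=67108864"]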
Create an issue for this anyway. I’ll definitely have a think about it
I can see what you’re saying
Cool, thank you @lucasbradstreet
I think the only thing we could really do is kill the job
Aeron will never be able to send the message
The thing to do here is to configure Aeron to handle messages as large as you can possibly send for your application. Onyx rebooting the job is desirable behavior. It's trying to get the work done at any cost. It has no idea that it's going to process the same message again - it cannot make that assumption. We have :onyx/restart-pred-fn to conditionally kill the job when there's a task-level error, but we can't do much if there's a networking link error. We presume it's transient and will be resolved.
If the job restarting is a problem even after you've adjusted the size, you're going to need to design in a defense against a rebooted job. Can't do much else atm, and it feels out of scope to try to go there.
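For reference, a sketch of how :onyx/restart-pred-fn is wired up, assuming the 0.8.x convention that the keyword resolves to a one-argument predicate over the thrown exception (true restarts the task, false lets the job be killed); the namespace, task names, and the policy itself are placeholders:
(ns my.app.restart)

;; Return true to restart the task, false to let the job die.
(defn restartable? [e]
  ;; illustrative policy: restart on anything except an assertion failure
  (not (instance? AssertionError e)))

;; Referenced from the catalog entry for the task:
{:onyx/name :process-messages
 :onyx/fn :my.app.core/process
 :onyx/type :function
 :onyx/restart-pred-fn :my.app.restart/restartable?
 :onyx/batch-size 100}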
I could design defensively if I were able to tell from inside my code that something went wrong. The problem is the exception occurs outside the scope of the task, so I can't handle it or respond to it.
It's a hard thing to handle even if you could, because there's no good way to ask for an object's byte count in the JVM.
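One hedged workaround: since segments are serialized before they hit the wire anyway, you can measure the serialized form instead of asking for an object size. Nippy and the 2 MB threshold here are assumptions for illustration, and the count only approximates what Aeron actually frames:
(require '[taoensso.nippy :as nippy])

(def max-segment-bytes (* 2 1024 1024))

;; Rough guard: serialize the segment and check its byte length before emitting it.
(defn small-enough? [segment]
  (< (alength ^bytes (nippy/freeze segment)) max-segment-bytes))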
Is there some way to detect that a particular task is throwing that messaging exception (one we know is not transient) and not retry that task?
The closest thing I can think to do is catch that specific exception type and do a string comparison for that one, particular error. Feels gross, but might be the right move
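The kind of check being described might look like the predicate below (a sketch; as noted, it would have to live where the exception is actually visible, i.e. in the messaging layer rather than in user task code):
;; Identify the Aeron length error by type plus message text.
(defn fatal-message-size-error? [e]
  (and (instance? IllegalArgumentException e)
       (some-> (.getMessage ^Throwable e)
               (.contains "exceeds maxMessageLength"))))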
It's transient if your tasks are non-deterministic, e.g. if they're doing a database lookup and sending back results.
I'd say for now reconfigure Aeron to a very high limit; I'll think on it in the meantime.
And yes, there are IAE errors we retry on purpose. Np
Okay, here is a new one I haven't seen before:
clojure.core.async/thread-call/fn async.clj: 434
onyx.peer.task-lifecycle/launch-aux-threads!/fn task_lifecycle.clj: 449
onyx.peer.pipeline-extensions/fn/G pipeline_extensions.clj: 4
clojure.core/-cache-protocol-fn core_deftype.clj: 554
java.lang.IllegalArgumentException: No implementation of method: :ack-segment of protocol: #'onyx.peer.pipeline-extensions/PipelineInput found for class: onyx.peer.function.Function