2017-10-30
Channels
- # aws (5)
- # aws-lambda (2)
- # beginners (29)
- # boot (5)
- # cider (3)
- # cljs-dev (3)
- # cljsjs (2)
- # clojure (112)
- # clojure-austin (1)
- # clojure-brasil (2)
- # clojure-italy (9)
- # clojure-nl (2)
- # clojure-russia (5)
- # clojure-spec (49)
- # clojure-uk (41)
- # clojurescript (157)
- # core-logic (5)
- # crypto (1)
- # cursive (12)
- # data-science (38)
- # datomic (31)
- # emacs (3)
- # events (2)
- # garden (3)
- # graphql (10)
- # immutant (4)
- # jobs (3)
- # juxt (5)
- # klipse (1)
- # luminus (3)
- # off-topic (40)
- # om (1)
- # onyx (39)
- # other-languages (7)
- # protorepl (3)
- # re-frame (40)
- # reagent (60)
- # ring (8)
- # ring-swagger (14)
- # shadow-cljs (159)
- # spacemacs (1)
- # specter (6)
- # uncomplicate (3)
- # yada (2)
@camechis I failed to consider that -XX:+UseCGroupMemoryLimitForHeap still takes into account the JVM default heap ratio, 1/4 of available memory. You’ll likely want to run with -XX:MaxRAMFraction=1 to give the JVM access to all of the container memory. Note that you’ll want to increase the fraction if you need a lot of off-heap space for threads or class-loading.
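For reference, a minimal sketch of how those flags might be passed to the peer JVM, here via a Leiningen :jvm-opts vector (the build tool is an assumption; the same flags can go straight on the java command line). -XX:+UseCGroupMemoryLimitForHeap is experimental on Java 8u131+, so it also needs -XX:+UnlockExperimentalVMOptions:

```clojure
;; Sketch: project.clj excerpt (assumed Leiningen setup) for a containerized peer.
:jvm-opts ["-XX:+UnlockExperimentalVMOptions"
           "-XX:+UseCGroupMemoryLimitForHeap" ; size the heap from the cgroup memory limit
           "-XX:MaxRAMFraction=1"]            ; heap = RAM/1; use 2 to leave off-heap headroom
```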
Good to know, I took that out of the picture and got the heap up to 3 GB, but it did not help
@eelke If I understand you correctly - can you use :onyx/group-by-key with no windows and just write directly to the bucket, assuming each segment knows which bucket it ought to go into?
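A minimal sketch of what that could look like in the catalog; the task name, plugin, and :bucket key are placeholders, and grouping needs fixed peer counts plus a flux policy:

```clojure
;; Hypothetical output task: route segments to peers by their :bucket key, so a
;; given bucket is always written by the same virtual peer, without any windows.
{:onyx/name :write-to-bucket            ; placeholder task name
 :onyx/type :output
 :onyx/plugin :my.plugins/bucket-output ; placeholder output plugin
 :onyx/medium :bucket
 :onyx/group-by-key :bucket             ; each segment carries its destination bucket
 :onyx/min-peers 2                      ; grouping requires fixed peer counts...
 :onyx/max-peers 2
 :onyx/flux-policy :kill                ; ...and a flux policy
 :onyx/batch-size 20}
```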
Hey guys, still working on the media driver issue. I expanded our cluster and got the peers running pretty much on a dedicated node, with no success. I then shrunk our job down to in -> format-data -> out (in is Kafka, out is a no-op identity call) and received this exception
17-10-30 15:40:29 onyx-peer-3973721470-x1mrl WARN [onyx.peer.task-lifecycle:147] -
java.lang.Thread.run Thread.java: 748
java.util.concurrent.ThreadPoolExecutor$Worker.run ThreadPoolExecutor.java: 624
java.util.concurrent.ThreadPoolExecutor.runWorker ThreadPoolExecutor.java: 1149
...
clojure.core.async/thread-call/fn async.clj: 439
onyx.peer.task-lifecycle/start-task-lifecycle!/fn task_lifecycle.clj: 1048
onyx.peer.task-lifecycle/run-task-lifecycle! task_lifecycle.clj: 501
onyx.peer.task-lifecycle/iteration task_lifecycle.clj: 483
onyx.peer.task-lifecycle.TaskStateMachine/exec task_lifecycle.clj: 961
onyx.peer.task-lifecycle/wrap-lifecycle-metrics/fn task_lifecycle.clj: 988
onyx.peer.read-batch/read-function-batch read_batch.clj: 19
onyx.messaging.aeron.messenger.AeronMessenger/poll messenger.clj: 152
onyx.messaging.aeron.subscriber.Subscriber/poll! subscriber.clj: 207
io.aeron.Subscription.controlledPoll Subscription.java: 238
io.aeron.Image.controlledPoll Image.java: 332
io.aeron.ControlledFragmentAssembler.onFragment ControlledFragmentAssembler.java: 128
io.aeron.BufferBuilder.append BufferBuilder.java: 167
org.agrona.concurrent.UnsafeBuffer.putBytes UnsafeBuffer.java: 944
org.agrona.concurrent.UnsafeBuffer.boundsCheck UnsafeBuffer.java: 1312
org.agrona.concurrent.UnsafeBuffer.boundsCheck0 UnsafeBuffer.java: 1306
java.lang.IndexOutOfBoundsException: index=7392 length=262004189 capacity=16777216
clojure.lang.ExceptionInfo: Handling uncaught exception thrown inside task lifecycle :lifecycle/read-batch. Killing the job. -> Exception type: java.lang.IndexOutOfBoundsException. Exception message: index=7392 length=262004189 capacity=16777216
job-id: #uuid "ae6abd45-6f2b-5691-316a-6e2ec8b3bf46"
metadata: {:job-id #uuid "ae6abd45-6f2b-5691-316a-6e2ec8b3bf46", :job-hash "b414c0e7f323e156801acc195c15d14025376484dc589e0e79867ebca97"}
peer-id: #uuid "1310c0bd-0668-cd4d-8b65-7966b65a696d"
task-name: :out
@camechis nice work getting it working better. I’ll have to look into that one, but it looks like a buffer overflow, though I’m not sure how it would happen in that spot.
well, I ran it again and we are back to the timeout issue, lol. That one may be a fluke or a symptom of the same problem. I am bumping the timeout now
are you still tracking whether the media driver gets killed?
K good
Do you have onyx peer http metrics set up? Are you tracking heartbeat metrics?
I have it on and have Prometheus scraping it, but I haven’t analyzed it yet to make anything meaningful
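For anyone following along, a sketch of the peer-config keys involved, assuming the metrics/health endpoint comes from onyx-peer-http-query (IP and port below are placeholders):

```clojure
;; Hypothetical peer-config excerpt: expose the HTTP query/metrics endpoint
;; so Prometheus can scrape peer health and heartbeat metrics.
{:onyx.query/server? true
 :onyx.query.server/ip "0.0.0.0"  ; bind address (placeholder)
 :onyx.query.server/port 8080}    ; scrape target port (placeholder)
```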
Hmm, I really don’t know how that out of bounds exception could happen
Hmm, with respect to that buffer error, are you using any term buffer size jvm flags in your onyx peers?
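For context, Aeron's term buffer length is normally overridden with a system property rather than an Onyx setting; a sketch, with the 4 MB value purely as an example:

```clojure
;; Hypothetical :jvm-opts entry for the peers (and media driver): override
;; Aeron's term buffer length. The maximum message size Aeron will accept
;; scales with the term length.
:jvm-opts ["-Daeron.term.buffer.length=4194304"] ; 4 MB, example value only
```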
I changed the subscriber timeout but still received the MediaDriver "within (ns): 10000000000" error. Are those the same numbers?
Different number. The one you changed is for when peers get timed out by their neighbours, as I wanted to see if you would get other timeouts (e.g. Kafka) instead
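A sketch distinguishing the two timeouts being discussed; the peer-config keys below are assumed to be the neighbour timeout in question, while the 10000000000 ns figure matches Aeron's separate client-to-driver timeout:

```clojure
;; Assumed Onyx peer-config keys: how long peers wait on heartbeats from their
;; neighbours before timing them out (the timeout changed above).
{:onyx.peer/subscriber-liveness-timeout-ms 20000   ; illustrative value
 :onyx.peer/publisher-liveness-timeout-ms  20000}  ; illustrative value

;; The MediaDriver "(ns): 10000000000" message corresponds to Aeron's own
;; driver keepalive timeout (10 s default), a system property set on both
;; the peers and the media driver, e.g. "-Daeron.driver.timeout=10000" (ms).
```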
I’ve asked the Aeron gitter room about that buffer exception, as it’s all in the Aeron code, and maybe it’ll give a hint at what to look at
@camechis what version of java are you using?
nevermind, I have your flight recorder dump
@camechis did you get an Aeron timeout when you hit that bounds error?
@camechis I had a thought. Is it possible the onyx pods are getting scheduled to the same node?
i haven’t checked every time, but I have verified that several times. We are taking everything out of the picture now by deploying a basic job with core.async in -> inc -> core.async out, to rule out Kafka on the input task causing something bad
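A sketch of that stripped-down job using the standard onyx core.async plugin; task names and batch sizes are placeholders, and the channel-injecting lifecycles are omitted:

```clojure
;; Hypothetical minimal workflow + catalog for isolating the messaging layer:
;; core.async in -> inc -> core.async out, with Kafka removed entirely.
(def workflow
  [[:in :inc]
   [:inc :out]])

(def catalog
  [{:onyx/name :in
    :onyx/plugin :onyx.plugin.core-async/input
    :onyx/type :input
    :onyx/medium :core.async
    :onyx/max-peers 1
    :onyx/batch-size 10}

   {:onyx/name :inc
    :onyx/fn :my.app/inc-segment ; placeholder fn that increments a value in each segment
    :onyx/type :function
    :onyx/batch-size 10}

   {:onyx/name :out
    :onyx/plugin :onyx.plugin.core-async/output
    :onyx/type :output
    :onyx/medium :core.async
    :onyx/max-peers 1
    :onyx/batch-size 10}])
```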
Ok. Definitely double check that each time as it would make sense
#error {
:cause "Lost and regained image with the same session-id and different correlation-id."
:data {:correlation-id 82, :original-exception :clojure.lang.ExceptionInfo}
:via
[{:type clojure.lang.ExceptionInfo
:message "Lost and regained image with the same session-id and different correlation-id."
:data {:correlation-id 82, :original-exception :clojure.lang.ExceptionInfo}
:at [clojure.core$ex_info invokeStatic "core.clj" 4617]}]
:trace
[[clojure.core$ex_info invokeStatic "core.clj" 4617]
[clojure.core$ex_info invoke "core.clj" 4617]
[onyx.messaging.aeron.subscriber$check_correlation_id_alignment invokeStatic "subscriber.clj" 77]
[onyx.messaging.aeron.subscriber$check_correlation_id_alignment invoke "subscriber.clj" 75]
[onyx.messaging.aeron.subscriber.Subscriber onFragment "subscriber.clj" 297]
[io.aeron.ControlledFragmentAssembler onFragment "ControlledFragmentAssembler.java" 121]
[io.aeron.Image controlledPoll "Image.java" 332]
[io.aeron.Subscription controlledPoll "Subscription.java" 238]
[onyx.messaging.aeron.subscriber.Subscriber poll_BANG_ "subscriber.clj" 207]
[onyx.messaging.aeron.messenger.AeronMessenger poll "messenger.clj" 152]
[onyx.peer.task_lifecycle$input_poll_barriers invokeStatic "task_lifecycle.clj" 165]
[onyx.peer.task_lifecycle$input_poll_barriers invoke "task_lifecycle.clj" 164]
[onyx.peer.task_lifecycle.TaskStateMachine exec "task_lifecycle.clj" 961]
[onyx.peer.task_lifecycle$iteration invokeStatic "task_lifecycle.clj" 483]
[onyx.peer.task_lifecycle$iteration invoke "task_lifecycle.clj" 480]
[onyx.peer.task_lifecycle$run_task_lifecycle_BANG_ invokeStatic "task_lifecycle.clj" 501]
[onyx.peer.task_lifecycle$run_task_lifecycle_BANG_ invoke "task_lifecycle.clj" 491]
[onyx.peer.task_lifecycle$start_task_lifecycle_BANG_$fn__20749 invoke "task_lifecycle.clj" 1048]
[clojure.core.async$thread_call$fn__4627 invoke "async.clj" 439]
[clojure.lang.AFn run "AFn.java" 22]
[java.util.concurrent.ThreadPoolExecutor runWorker "ThreadPoolExecutor.java" 1149]
[java.util.concurrent.ThreadPoolExecutor$Worker run "ThreadPoolExecutor.java" 624]
[java.lang.Thread run "Thread.java" 748]]}
I think this is related to clients getting bounced, and rejoining, after being timed out.
@lucasbradstreet Do you think it ever connects at all?
seems like it does
Especially based on that bounds exception