#onyx
2017-10-30
gardnervickers12:10:33

@camechis I failed to consider that -XX:+UseCGroupMemoryLimitForHeap still takes into account the JVM default heap ratio, 1/4 of available memory. You’ll likely want to run with -XX:MaxRAMFraction=1 to give the JVM access to all container memory. Note that you’ll want to increase the fraction if you need a lot of off-heap space for threads or class-loading.
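
(For reference, a minimal sketch of the flag combination being discussed. The command line and values below are illustrative assumptions, not taken from this deployment; note that -XX:+UseCGroupMemoryLimitForHeap also requires -XX:+UnlockExperimentalVMOptions on Java 8u131+.)

java -XX:+UnlockExperimentalVMOptions \
     -XX:+UseCGroupMemoryLimitForHeap \
     -XX:MaxRAMFraction=1 \
     ...

(Raising -XX:MaxRAMFraction to 2 would instead cap the heap at half of the container limit, leaving headroom for off-heap use.)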

Travis12:10:40

Good to know. I took that out of the picture and got the heap up to 3 gig, but it did not help.

michaeldrogalis15:10:07

@eelke If I understand you correctly - can you use :onyx/group-by-key with no windows and just write directly to the bucket, provided that each segment knows which bucket it ought to go into?
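
(A minimal sketch of the kind of catalog entry being suggested. The :bucket key, plugin, medium, and peer counts are hypothetical placeholders, not taken from this job; it also assumes the usual rule that a grouped task declares a flux policy and a minimum peer count.)

{:onyx/name :out
 :onyx/plugin :my.plugin/write-to-bucket   ;; hypothetical output plugin
 :onyx/type :output
 :onyx/medium :bucket-store                ;; hypothetical medium
 :onyx/group-by-key :bucket                ;; each segment carries its destination under :bucket
 :onyx/flux-policy :kill
 :onyx/min-peers 2
 :onyx/batch-size 20}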

Travis15:10:38

Hey guys, still working on the media driver issue. I expanded our cluster and got the peers running pretty much on a dedicated node, with no success. I then shrunk our job down to in -> format-data -> out (in is Kafka, out is a no-op identity call) and received this exception
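
(For readers following along, the stripped-down job described here is roughly the shape below; the task names come from the message, everything else is a placeholder. The exception itself follows in the next message.)

;; Kafka in -> format -> no-op out
(def workflow
  [[:in :format-data]
   [:format-data :out]])

;; the "noop identity call" on the output side
(defn noop-out [segment]
  segment)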

Travis15:10:54

17-10-30 15:40:29 onyx-peer-3973721470-x1mrl WARN [onyx.peer.task-lifecycle:147] -
                              java.lang.Thread.run                       Thread.java:  748
java.util.concurrent.ThreadPoolExecutor$Worker.run           ThreadPoolExecutor.java:  624
 java.util.concurrent.ThreadPoolExecutor.runWorker           ThreadPoolExecutor.java: 1149
                                               ...
                 clojure.core.async/thread-call/fn                         async.clj:  439
 onyx.peer.task-lifecycle/start-task-lifecycle!/fn                task_lifecycle.clj: 1048
      onyx.peer.task-lifecycle/run-task-lifecycle!                task_lifecycle.clj:  501
                onyx.peer.task-lifecycle/iteration                task_lifecycle.clj:  483
    onyx.peer.task-lifecycle.TaskStateMachine/exec                task_lifecycle.clj:  961
onyx.peer.task-lifecycle/wrap-lifecycle-metrics/fn                task_lifecycle.clj:  988
          onyx.peer.read-batch/read-function-batch                    read_batch.clj:   19
onyx.messaging.aeron.messenger.AeronMessenger/poll                     messenger.clj:  152
  onyx.messaging.aeron.subscriber.Subscriber/poll!                    subscriber.clj:  207
              io.aeron.Subscription.controlledPoll                 Subscription.java:  238
                     io.aeron.Image.controlledPoll                        Image.java:  332
   io.aeron.ControlledFragmentAssembler.onFragment  ControlledFragmentAssembler.java:  128
                     io.aeron.BufferBuilder.append                BufferBuilder.java:  167
       org.agrona.concurrent.UnsafeBuffer.putBytes                 UnsafeBuffer.java:  944
    org.agrona.concurrent.UnsafeBuffer.boundsCheck                 UnsafeBuffer.java: 1312
   org.agrona.concurrent.UnsafeBuffer.boundsCheck0                 UnsafeBuffer.java: 1306
java.lang.IndexOutOfBoundsException: index=7392 length=262004189 capacity=16777216
         clojure.lang.ExceptionInfo: Handling uncaught exception thrown inside task lifecycle :lifecycle/read-batch. Killing the job. -> Exception type: java.lang.IndexOutOfBoundsException. Exception message: index=7392 length=262004189 capacity=16777216
       job-id: #uuid "ae6abd45-6f2b-5691-316a-6e2ec8b3bf46"
     metadata: {:job-id #uuid "ae6abd45-6f2b-5691-316a-6e2ec8b3bf46", :job-hash "b414c0e7f323e156801acc195c15d14025376484dc589e0e79867ebca97"}
      peer-id: #uuid "1310c0bd-0668-cd4d-8b65-7966b65a696d"
    task-name: :out

lucasbradstreet16:10:30

@camechis nice work getting it working better. I’ll have to look into that one, but it looks like a buffer overflow, though I’m not sure how it would happen in that spot.

Travis16:10:16

Well, I ran it again and we are back to the timeout issue, lol. That one may be a fluke or a symptom of the same problem. I am bumping the timeout now.

lucasbradstreet16:10:45

are you still tracking whether the media driver gets killed?

Travis16:10:57

it stays up

Travis16:10:03

Ok, so I doubled the timeout and it did not succeed.

lucasbradstreet16:10:52

Do you have Onyx peer HTTP metrics set up? Are you tracking heartbeat metrics?

Travis16:10:33

I have it on and have Prometheus scraping it, but I haven’t analyzed it yet to make anything meaningful.

lucasbradstreet16:10:13

Hmm, I really don’t know how that out of bounds exception could happen

Travis16:10:34

Wondering if it’s struggling to talk to or write to the shm.

lucasbradstreet16:10:06

Hmm, with respect to that buffer error, are you using any term buffer size JVM flags in your Onyx peers?
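
(For context: the capacity=16777216 in the earlier trace matches Aeron’s default 16 MB term buffer. The term buffer length is set via a JVM system property on the process running the media driver; the value below is illustrative only, and must be a power of two.)

-Daeron.term.buffer.length=67108864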

Travis16:10:25

I changed the subscriber timeout but still received the "MediaDriver within (ns):10000000000" error. Are those the same numbers?

lucasbradstreet16:10:25

Different number. The one you changed is for when peers get timed out by their neighbours, as I wanted to see if you would get other timeouts (e.g. Kafka) instead.
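
(A hedged sketch of the two knobs being contrasted here; the exact key and property names are assumptions on my part, not confirmed in this thread.)

;; assumed Onyx peer-config keys for the peer-to-peer liveness timeout:
{:onyx.peer/subscriber-liveness-timeout-ms 20000
 :onyx.peer/publisher-liveness-timeout-ms 20000}

;; the "MediaDriver within (ns):10000000000" message looks like Aeron's
;; client-to-driver timeout (10 s), assumed to be governed by the
;; aeron.driver.timeout system property (milliseconds), e.g.
;; -Daeron.driver.timeout=10000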

lucasbradstreet16:10:54

I’ve asked the Aeron gitter room about that buffer exception, as it’s all in the Aeron code; maybe it’ll give a hint about what to look at.

lucasbradstreet17:10:37

@camechis what version of java are you using?

lucasbradstreet17:10:27

Never mind, I have your flight recorder dump.

lucasbradstreet17:10:11

@camechis did you get an Aeron timeout when you hit that bounds error?

Travis18:10:25

No, so far that seems to be a one-time deal.

lucasbradstreet18:10:09

@camechis I had a thought. Is it possible the onyx pods are getting scheduled to the same node?

Travis18:10:28

I haven’t checked every time, but I have verified that several times. We are taking everything out of the picture now by deploying a basic job with core.async in -> inc -> core.async out, to rule out Kafka on the input task causing something bad.
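
(A sketch of that kind of loopback job, using the stock onyx core.async plugin; the task names, function, and batch sizes are illustrative, not taken from the actual deployment.)

(def workflow
  [[:in :inc]
   [:inc :out]])

(def catalog
  [{:onyx/name :in
    :onyx/plugin :onyx.plugin.core-async/input
    :onyx/type :input
    :onyx/medium :core.async
    :onyx/max-peers 1
    :onyx/batch-size 10}

   {:onyx/name :inc
    :onyx/fn ::my-inc            ;; hypothetical function that increments a value in the segment
    :onyx/type :function
    :onyx/batch-size 10}

   {:onyx/name :out
    :onyx/plugin :onyx.plugin.core-async/output
    :onyx/type :output
    :onyx/medium :core.async
    :onyx/max-peers 1
    :onyx/batch-size 10}])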

lucasbradstreet18:10:58

Ok. Definitely double-check that each time, as it would make sense.

Travis18:10:32

so I got a new Exception for you

Travis18:10:36

#error {
 :cause "Lost and regained image with the same session-id and different correlation-id."
 :data {:correlation-id 82, :original-exception :clojure.lang.ExceptionInfo}
 :via
 [{:type clojure.lang.ExceptionInfo
   :message "Lost and regained image with the same session-id and different correlation-id."
   :data {:correlation-id 82, :original-exception :clojure.lang.ExceptionInfo}
   :at [clojure.core$ex_info invokeStatic "core.clj" 4617]}]
 :trace
 [[clojure.core$ex_info invokeStatic "core.clj" 4617]
  [clojure.core$ex_info invoke "core.clj" 4617]
  [onyx.messaging.aeron.subscriber$check_correlation_id_alignment invokeStatic "subscriber.clj" 77]
  [onyx.messaging.aeron.subscriber$check_correlation_id_alignment invoke "subscriber.clj" 75]
  [onyx.messaging.aeron.subscriber.Subscriber onFragment "subscriber.clj" 297]
  [io.aeron.ControlledFragmentAssembler onFragment "ControlledFragmentAssembler.java" 121]
  [io.aeron.Image controlledPoll "Image.java" 332]
  [io.aeron.Subscription controlledPoll "Subscription.java" 238]
  [onyx.messaging.aeron.subscriber.Subscriber poll_BANG_ "subscriber.clj" 207]
  [onyx.messaging.aeron.messenger.AeronMessenger poll "messenger.clj" 152]
  [onyx.peer.task_lifecycle$input_poll_barriers invokeStatic "task_lifecycle.clj" 165]
  [onyx.peer.task_lifecycle$input_poll_barriers invoke "task_lifecycle.clj" 164]
  [onyx.peer.task_lifecycle.TaskStateMachine exec "task_lifecycle.clj" 961]
  [onyx.peer.task_lifecycle$iteration invokeStatic "task_lifecycle.clj" 483]
  [onyx.peer.task_lifecycle$iteration invoke "task_lifecycle.clj" 480]
  [onyx.peer.task_lifecycle$run_task_lifecycle_BANG_ invokeStatic "task_lifecycle.clj" 501]
  [onyx.peer.task_lifecycle$run_task_lifecycle_BANG_ invoke "task_lifecycle.clj" 491]
  [onyx.peer.task_lifecycle$start_task_lifecycle_BANG_$fn__20749 invoke "task_lifecycle.clj" 1048]
  [clojure.core.async$thread_call$fn__4627 invoke "async.clj" 439]
  [clojure.lang.AFn run "AFn.java" 22]
  [java.util.concurrent.ThreadPoolExecutor runWorker "ThreadPoolExecutor.java" 1149]
  [java.util.concurrent.ThreadPoolExecutor$Worker run "ThreadPoolExecutor.java" 624]
  [java.lang.Thread run "Thread.java" 748]]}

Travis18:10:51

different nodes by the way

lucasbradstreet18:10:07

I think this is related to clients getting bounced, and rejoining, after being timed out.

Travis19:10:28

@lucasbradstreet Do you think it ever connects at all?

lucasbradstreet19:10:29

seems like it does

lucasbradstreet19:10:37

Especially based on that bounds exception