@camechis I failed to consider that -XX:+UseCGroupMemoryLimitForHeap still takes into account the JVM default heap ratio, 1/4 of available memory. You’ll likely want to run with -XX:MaxRAMFraction=1 to give the JVM access to all container memory. Note that you’ll want to increase the fraction if you need a lot of off-heap space for threads or class-loading.
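
To make the ratio concrete, here is a minimal sketch of the arithmetic. It assumes the default max heap is roughly the container memory limit divided by MaxRAMFraction; the JVM applies other caps as well, so treat it as an approximation. The class name `HeapBudget` and the 4 GiB limit are hypothetical, picked only for illustration.

```java
// Sketch only: approximates how -XX:MaxRAMFraction=N divides the
// memory limit seen by the JVM. Not the JVM's actual sizing code.
final class HeapBudget {
    private static final long MIB = 1024L * 1024;

    /** Approximate default max heap: memory limit divided by the fraction. */
    static long maxHeapBytes(long containerLimitBytes, int maxRamFraction) {
        if (maxRamFraction < 1) {
            throw new IllegalArgumentException("MaxRAMFraction must be >= 1");
        }
        return containerLimitBytes / maxRamFraction;
    }

    public static void main(String[] args) {
        long limit = 4096 * MIB; // hypothetical 4 GiB container limit

        // Default ratio (1/4), then 1/2, then the whole container.
        for (int fraction : new int[] {4, 2, 1}) {
            System.out.println("MaxRAMFraction=" + fraction + " -> "
                    + maxHeapBytes(limit, fraction) / MIB + " MiB heap");
        }

        // What this JVM actually chose, for comparison with the estimate.
        System.out.println("Runtime max heap: "
                + Runtime.getRuntime().maxMemory() / MIB + " MiB");
    }
}
```

Printing `Runtime.getRuntime().maxMemory()` from inside the container is the more reliable check, since it reports what the JVM actually picked rather than an estimate.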


Good to know. I took that out of the picture and got the heap up to 3 GB, but it did not help.


@eelke If I understand you correctly - can you use :onyx/group-by-key with no windows and just directly write to the bucket - permitting that each segment knows which bucket it ought to go into?


Hey guys, still working on the media driver issue. I expanded our cluster and got the peers running pretty much on a dedicated node, with no success. I then shrunk our job down to in -> format-data -> out (in is Kafka, out is a no-op identity call) and received this exception:


java.lang.IndexOutOfBoundsException: index=7392 length=262004189 capacity=16777216
         clojure.lang.ExceptionInfo: Handling uncaught exception thrown inside task lifecycle :lifecycle/read-batch. Killing the job. -> Exception type: java.lang.IndexOutOfBoundsException. Exception message: index=7392 length=262004189 capacity=16777216
       job-id: #uuid "ae6abd45-6f2b-5691-316a-6e2ec8b3bf46"
     metadata: {:job-id #uuid "ae6abd45-6f2b-5691-316a-6e2ec8b3bf46", :job-hash "b414c0e7f323e156801acc195c15d14025376484dc589e0e79867ebca97"}
      peer-id: #uuid "1310c0bd-0668-cd4d-8b65-7966b65a696d"
    task-name: :out


@camechis nice work getting it working better. I’ll have to look into that one, but it looks like a buffer overflow, though I’m not sure how it would happen in that spot.
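
For reading the numbers in that message: 16777216 is exactly 16 MiB, and the requested length is far larger than that. Below is a minimal sketch of the kind of bounds check that yields a message in this shape. It is an assumption about the general pattern, not Aeron's or Agrona's actual code, and `BoundsCheckSketch` is a hypothetical name.

```java
// Sketch only: a generic buffer bounds check, illustrating why
// index=7392 length=262004189 fails against capacity=16777216.
final class BoundsCheckSketch {

    /** Throws if the range [index, index + length) does not fit in the buffer. */
    static void boundsCheck(long index, long length, long capacity) {
        if (index < 0 || length < 0 || index + length > capacity) {
            throw new IndexOutOfBoundsException(
                    "index=" + index + " length=" + length + " capacity=" + capacity);
        }
    }

    public static void main(String[] args) {
        long capacity = 16L * 1024 * 1024; // 16777216 bytes, i.e. 16 MiB

        // A small read near the same offset fits without complaint.
        boundsCheck(7392, 1024, capacity);

        // The values from the stack trace: the length alone exceeds the capacity.
        try {
            boundsCheck(7392, 262004189L, capacity);
        } catch (IndexOutOfBoundsException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The index itself is well inside the buffer, so it is the length value that looks suspect here.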


Well, I ran it again and we are back to the timeout issue, lol. That one may be a fluke or a symptom of the same problem. I am bumping the timeout now.


are you still tracking whether the media driver gets killed?


it stays up


OK, so I doubled the timeout and it did not succeed.


Do you have onyx peer http metrics setup? Are you tracking heartbeat metrics?


I have it on and have Prometheus scraping it, but I haven’t analyzed it yet to make anything meaningful of it.


Hmm, I really don’t know how that out of bounds exception could happen


Wondering if it’s struggling to talk or write to the shm.


Hmm, with respect to that buffer error, are you using any term buffer size jvm flags in your onyx peers?


I changed the subscriber timeout but still received the "MediaDriver within (ns):10000000000" message. Are those the same numbers?


Different number. The one you changed is for when peers get timed out by their neighbours, as I wanted to see if you would get other timeouts (e.g. Kafka) instead.
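
For scale, the figure in that message is given in nanoseconds. A small sketch of the conversion, which may help when comparing it against a timeout configured in milliseconds or seconds; `TimeoutUnits` is a hypothetical helper name.

```java
import java.util.concurrent.TimeUnit;

// Sketch only: converts the nanosecond figure from the message
// into units that are easier to compare against other timeouts.
final class TimeoutUnits {

    static long nanosToMillis(long nanos) {
        return TimeUnit.NANOSECONDS.toMillis(nanos);
    }

    static long nanosToSeconds(long nanos) {
        return TimeUnit.NANOSECONDS.toSeconds(nanos);
    }

    public static void main(String[] args) {
        long reportedNanos = 10_000_000_000L; // value from the message

        System.out.println(reportedNanos + " ns = "
                + nanosToMillis(reportedNanos) + " ms = "
                + nanosToSeconds(reportedNanos) + " s");
    }
}
```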


I’ve asked the Aeron gitter room about that buffer exception, as it’s all in the Aeron code, and maybe it’ll give a hint at what to look at


@camechis what version of java are you using?


Never mind, I have your flight recorder dump.


@camechis did you get an Aeron timeout when you hit that bounds error?


no, that seemed so far to be a one time deal


@camechis I had a thought. Is it possible the onyx pods are getting scheduled to the same node?


I haven’t checked every time, but I have verified that several times. We are taking everything out of the picture now by deploying a basic job with core.async in -> inc -> core.async out, to rule out Kafka on the input task causing something bad.


Ok. Definitely double check that each time as it would make sense


so I got a new Exception for you


#error {
 :cause "Lost and regained image with the same session-id and different correlation-id."
 :data {:correlation-id 82, :original-exception :clojure.lang.ExceptionInfo}
 [{:type clojure.lang.ExceptionInfo
   :message "Lost and regained image with the same session-id and different correlation-id."
   :data {:correlation-id 82, :original-exception :clojure.lang.ExceptionInfo}
   :at [clojure.core$ex_info invokeStatic "core.clj" 4617]}]


Different nodes, by the way.


I think this is related to clients getting bounced, and rejoining, after being timed out.


@lucasbradstreet Do you think it ever connects at all?


seems like it does


Especially based on that bounds exception