#onyx
2016-08-06
Travis00:08:51

@michaeldrogalis: I got past the issue by running the containers in host mode and binding to $PORT0, which is the ephemeral port. All nodes started working; however, I then very shortly saw these two things

Travis00:08:58

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGBUS (0x7) at pc=0x00007f11a216d97e, pid=96, tid=0x00007f11841b9700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_92-b14) (build 1.8.0_92-b14)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.92-b14 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0xa8d97e]  Unsafe_SetInt+0x4e
#
# Core dump written. Default location: /var/run/s6/services/media_driver/core or core.96
#
# An error report file with more information is saved as:
# /var/run/s6/services/media_driver/hs_err_pid96.log
#
# If you would like to submit a bug report, please visit:
#   
#

Travis00:08:22

along with these warnings on each node

16-Aug-06 00:11:41  WARN [onyx.messaging.aeron] - 
                                       java.lang.Thread.run           Thread.java: 745
         uk.co.real_logic.agrona.concurrent.AgentRunner.run      AgentRunner.java: 105
              uk.co.real_logic.aeron.ClientConductor.doWork  ClientConductor.java: 113
              uk.co.real_logic.aeron.ClientConductor.doWork  ClientConductor.java: 293
     uk.co.real_logic.aeron.ClientConductor.onCheckTimeouts  ClientConductor.java: 346
uk.co.real_logic.aeron.ClientConductor.checkDriverHeartbeat  ClientConductor.java: 275
uk.co.real_logic.aeron.exceptions.DriverTimeoutException: Driver has been inactive for over 10000ms

16-Aug-06 00:11:41  WARN [onyx.messaging.aeron] - 
                                       java.lang.Thread.run           Thread.java: 745
         uk.co.real_logic.agrona.concurrent.AgentRunner.run      AgentRunner.java: 105
              uk.co.real_logic.aeron.ClientConductor.doWork  ClientConductor.java: 113
              uk.co.real_logic.aeron.ClientConductor.doWork  ClientConductor.java: 293
     uk.co.real_logic.aeron.ClientConductor.onCheckTimeouts  ClientConductor.java: 346
uk.co.real_logic.aeron.ClientConductor.checkDriverHeartbeat  ClientConductor.java: 275
uk.co.real_logic.aeron.exceptions.DriverTimeoutException: Driver has been inactive for over 10000ms

Travis00:08:04

shm-size is currently set to 512mb, guess i need to make it bigger

gardnervickers00:08:59

How are you setting it in Mesos?

gardnervickers00:08:51

You're likely not setting it correctly; 512mb is plenty

Travis00:08:35

so in my marathon definition I am setting it like this

"parameters": [
        {
          "key": "shm-size",
          "value": "512mb"
        }
      ]
under the container section

Travis00:08:00

i assumed this was the way to do it since I left that out before and it failed immediately until i put that in

gardnervickers00:08:04

That's not really a reliable test; I would check the docs wrt setting shm size. In Kubernetes we sidestep this by mounting a memory volume at /dev/shm

Travis00:08:56

will look into it more.

Travis00:08:05

not sure how to mount a memory volume
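For reference, a rough sketch of the Kubernetes approach gardnervickers describes: an `emptyDir` volume with `medium: Memory` mounted at /dev/shm. It is written as a small Python script that emits a JSON pod manifest (kubectl accepts JSON as well as YAML). The pod name and image below are hypothetical placeholders, not values from this conversation, and this only covers Kubernetes, not Marathon.

```python
import json


def pod_with_memory_shm(name, image):
    """Build a minimal pod manifest that mounts a memory-backed volume at /dev/shm."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Mount the tmpfs-backed volume over the container's /dev/shm.
                    "volumeMounts": [{"name": "shm", "mountPath": "/dev/shm"}],
                }
            ],
            # emptyDir with medium "Memory" is backed by tmpfs (RAM) on the node.
            "volumes": [{"name": "shm", "emptyDir": {"medium": "Memory"}}],
        },
    }


# Hypothetical name and image, for illustration only.
manifest = pod_with_memory_shm("onyx-peer", "example/onyx-peer:latest")
print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and applied with `kubectl apply -f`; in a real deployment the same `volumes` and `volumeMounts` entries would sit inside whatever pod template is already in use.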

Travis16:08:20

@gardnervickers: I was trying to determine whether my shm_size setting was actually working. So I execed into the running docker container on the node where the peer started up and ran the mount command to see what shm size the container was using. Here is what I got:

shm on /dev/shm type tmpfs (rw,seclabel,nosuid,nodev,noexec,relatime,size=524288k)
so it looks like it really is set to 512mb but I still got the aeron seg fault.
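As a sanity check on that reading, here is a minimal Python sketch that pulls the `size=` option out of a mount line and converts it to MiB, plus a helper that reports how much space is currently free on a mounted filesystem. The function names are made up for illustration. The free-space check is included on the assumption (not confirmed anywhere in this thread) that a full /dev/shm, rather than a wrong size setting, could be what triggers the SIGBUS.

```python
import os
import re


def shm_size_mib(mount_line):
    """Return the tmpfs size in MiB from a `mount` output line, or None if absent."""
    match = re.search(r"size=(\d+)k\b", mount_line)
    if match is None:
        return None
    # The `k` suffix in mount output means KiB, so divide by 1024 for MiB.
    return int(match.group(1)) / 1024


def free_mib(path="/dev/shm"):
    """Return the space still available to unprivileged users at `path`, in MiB."""
    stats = os.statvfs(path)
    return stats.f_bavail * stats.f_frsize / (1024 * 1024)


line = "shm on /dev/shm type tmpfs (rw,seclabel,nosuid,nodev,noexec,relatime,size=524288k)"
print(shm_size_mib(line))  # 512.0
```

Calling `free_mib()` inside the container while the media driver and peers are running would show whether the 512 MiB is actually close to being used up.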

robert-stuttaford19:08:59

@mccraigmccraig, @michaeldrogalis: hah! we launched in September 🙂

mccraigmccraig19:08:23

@robert-stuttaford: damn, but #2 is good too - the only even prime :)