#onyx
2015-09-26
lucasbradstreet00:09:29

You may need to configure a few things correctly, in particular the external addr

lucasbradstreet00:09:36

Which is what other peers try to connect to, rather than the address that it's bound to

lucasbradstreet00:09:50

Otherwise it's probably about opening up the right zk ports

lucasbradstreet00:09:05

Err udp ports for aeron (not ZK)

chrisn20:09:57

those are described in the peer config I would imagine.

chrisn20:09:16

The port range [40200 40400] in the default peer config.

chrisn21:09:53

hmm, still only getting a single machine of peers booting. The other machine states that it has enough peers to start task and then hangs.

chrisn21:09:01

or rather nothing happens.

chrisn21:09:42

Could you perhaps describe the rough way that peers find each other? They publish their addresses into zk, right?

lucasbradstreet21:09:03

How are you doing zookeeper? Is anything written to the log? Running the dashboard might help see what's going on

chrisn21:09:58

We are running zk using docker on a separate ec2 instance.

lucasbradstreet21:09:08

K, all the onyx virtual peers connect to zookeeper and write to a shared log, as well as some paths which are watched by each other

lucasbradstreet21:09:37

If the peers are taking a really long time to start up (or hanging) it's probably a problem connecting to zk

chrisn21:09:54

no, here is an example of the output:

chrisn21:09:04

15-Sep-26 21:15:04 pi-worker-6 INFO [onyx.peer.task-lifecycle] - [afc501d3-466b-4d4f-8962-dce4d8792a45] Enough peers are active, starting the task
15-Sep-26 21:15:04 pi-worker-6 INFO [onyx.peer.task-lifecycle] - [0c4cf456-4642-4c94-8a92-5feb78d54809] Enough peers are active, starting the task
15-Sep-26 21:15:04 pi-worker-6 INFO [onyx.peer.task-lifecycle] - [5a4cdb90-ff94-4e0a-a526-030883130bae] Enough peers are active, starting the task

chrisn21:09:08

They are talking to zk.

lucasbradstreet21:09:13

Ok that looks like ZK is fine

chrisn21:09:24

The peers that are part of a single machine work. It's just that multiple machines don't work.

chrisn21:09:40

Oddly enough they act like they are part of different clusters.

chrisn21:09:46

But there is no way they are.

chrisn21:09:02

Meaning one machine has 10 peers and will start the job, the other will stop at the above message.

lucasbradstreet21:09:07

That's very weird. Oh. Are they all using the same onyx/id?

lucasbradstreet21:09:20

That's very weird behaviour.

chrisn21:09:41

I could be doing something odd when starting the peers:

chrisn21:09:55

(def PEER_CONFIG {:zookeeper/address (get-config :zookeeper-url)
                  :onyx.peer/job-scheduler :onyx.job-scheduler/greedy
                  :onyx.messaging/impl :aeron
                  :onyx.messaging/peer-port-range [40200 40400]
                  :onyx.messaging/bind-addr "localhost"
                  :onyx.log/config {}})


(defrecord OnyxDevEnv [n-peers onyx-id]
  component/Lifecycle

  (start [component]
    (println "Starting Onyx development environment")
    (let [peer-config (assoc PEER_CONFIG :onyx/id onyx-id)
          peer-group (onyx.api/start-peer-group peer-config)
          peers (onyx.api/start-peers n-peers peer-group)]
      (assoc component :peer-group peer-group
                       :peers peers
                       :onyx-id onyx-id)))

  (stop [component]
    (println "Stopping Onyx development environment")
    (doseq [v-peer (:peers component)]
      (onyx.api/shutdown-peer v-peer))
    (onyx.api/shutdown-peer-group (:peer-group component))
    (assoc component :peer-group nil :peers nil)))

chrisn21:09:11

That happens on every machine.

chrisn21:09:20

But the groups are acting independent; like separate clusters.

lucasbradstreet21:09:26

Bind addr localhost is the issue

chrisn21:09:36

I thought it might be.

lucasbradstreet21:09:38

You need to bind to the interface

chrisn21:09:56

Hmm, like "eth0" ?

lucasbradstreet21:09:07

The ip. I know it's a bit of a pain

chrisn21:09:16

Oh wow. That is going to be a bit tough.

chrisn21:09:24

Given these are docker machines.

chrisn21:09:36

on ec2 instances and that is basically a random situation.

chrisn21:09:43

I can figure it out, one sec.

lucasbradstreet21:09:06

You probably need to set external-addr too then

chrisn21:09:16

I am using: --net host

lucasbradstreet21:09:26

Ah that's ok then

chrisn21:09:26

So docker is just using the host network and not setting up its own.
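
To make the bind-addr / external-addr distinction above concrete, here is a rough sketch of how the two keys might sit in the peer config. The IP addresses and the ZooKeeper address are placeholders, and `:onyx.messaging/external-addr` is assumed here to be the key behind "external-addr" in the discussion; check it against the peer config documentation for the Onyx version in use.

```clojure
;; Sketch only: all addresses below are made-up placeholders.
(def peer-config
  {:zookeeper/address "zk.example.internal:2181" ; placeholder
   :onyx.peer/job-scheduler :onyx.job-scheduler/greedy
   :onyx.messaging/impl :aeron
   :onyx.messaging/peer-port-range [40200 40400]
   ;; Address of the local interface to bind to, instead of "localhost".
   :onyx.messaging/bind-addr "10.0.1.15"
   ;; Assumed key: the address other peers should connect to, only needed
   ;; when it differs from bind-addr (e.g. NAT or a Docker bridge network).
   :onyx.messaging/external-addr "203.0.113.10"
   :onyx.log/config {}})
```

With `--net host` the container shares the host's interfaces, so the external address can probably be left out and only the bind address set.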

lucasbradstreet21:09:44

One sec I'll show you what we do

lucasbradstreet21:09:13

There's an address you can curl or Clojure slurp on ec2

lucasbradstreet21:09:57

Search for bind-addr in this file

lucasbradstreet21:09:04

(I'm on my phone)

chrisn21:09:04

I literally don't know what you are doing...hitting a web address that tells you the ip?

lucasbradstreet21:09:30

It's a special thing that AWS has set up

lucasbradstreet21:09:43

I don't know how it works internally

chrisn21:09:50

ok, now that is damn helpful.

lucasbradstreet21:09:51

But yes it'll tell you your ip
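
The address being described is most likely the EC2 instance metadata endpoint. A minimal sketch of looking up the private IP with `slurp` and putting it into the peer config might look like the following. The helper names `ec2-private-ip` and `with-bind-addr` are hypothetical, and the plain GET assumes the instance still permits token-less (IMDSv1) metadata requests; instances that require session tokens will reject it.

```clojure
(require '[clojure.string :as str])

;; Link-local EC2 instance metadata endpoint for the private IPv4 address.
(def local-ipv4-url "http://169.254.169.254/latest/meta-data/local-ipv4")

(defn ec2-private-ip
  "Returns this instance's private IPv4 address as a string.
  `fetch` is injectable so the function can be exercised off EC2."
  ([] (ec2-private-ip slurp))
  ([fetch] (str/trim (fetch local-ipv4-url))))

(defn with-bind-addr
  "Hypothetical helper: returns `peer-config` with the bind address set to `ip`."
  [peer-config ip]
  (assoc peer-config :onyx.messaging/bind-addr ip))
```

On an EC2 machine, something like `(with-bind-addr PEER_CONFIG (ec2-private-ip))` would then replace the hard-coded "localhost" before the peer group is started.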

chrisn21:09:53

one sec...

chrisn21:09:06

I had no idea but man that could simplify a lot of things.

chrisn21:09:28

aeron error, no space left on device

chrisn21:09:49

We need to export the aeron temp dir to the host machine.

lucasbradstreet21:09:16

Oh. There's a trick to running it on docker

lucasbradstreet21:09:13

Under Linux I think it puts it all in shm anyway

chrisn22:09:01

yep, wow. For docker there is a fix coming --shm-size but it hasn't made it into a release yet.

chrisn22:09:15

I don't understand the mount command nor privileged mode yet.

lucasbradstreet22:09:41

I haven't tried it myself so I'm unfortunately not much help there.

chrisn22:09:25

it appeared to stop the crashing but we are now back at the original problem where peers on different machines aren't communicating with each other.

lucasbradstreet22:09:25

Ports definitely opened for udp?

lucasbradstreet22:09:27

That's about all that's left I think

chrisn22:09:38

aeron is udp? I have reached my frustration level at the moment so I think I will leave this for a bit. Thanks to your help I do believe we have got quite a bit further. I will have to ask people about udp vs. tcp but I thought for now the security policy was pretty open in the security group we are using for our aws machines.

lucasbradstreet22:09:16

Fair enough. Yeah aeron basically implements much of what tcp gives you on top of udp
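
One way to check the UDP question raised above, independent of Onyx and Aeron, is to send a single datagram between the two machines on a port from the peer port range. The sketch below uses plain `java.net` sockets; the function names are hypothetical, and a successful round trip only shows that UDP traffic on that one port gets through, not that Aeron itself is configured correctly.

```clojure
(import '(java.net DatagramSocket DatagramPacket InetAddress
                   SocketTimeoutException))

(defn udp-listen-once
  "Waits for one datagram on `port` for up to `timeout-ms` milliseconds.
  Returns the payload as a string, or nil if nothing arrived in time."
  [port timeout-ms]
  (with-open [sock (DatagramSocket. (int port))]
    (.setSoTimeout sock (int timeout-ms))
    (let [buf    (byte-array 512)
          packet (DatagramPacket. buf (alength buf))]
      (try
        (.receive sock packet)
        (String. (.getData packet) 0 (.getLength packet) "UTF-8")
        (catch SocketTimeoutException _ nil)))))

(defn udp-send
  "Sends `msg` as a single UDP datagram to `host`:`port`."
  [host port msg]
  (with-open [sock (DatagramSocket.)]
    (let [payload (.getBytes ^String msg "UTF-8")]
      (.send sock (DatagramPacket. payload (alength payload)
                                   (InetAddress/getByName host) (int port))))))
```

Run `(udp-listen-once 40200 30000)` on one machine while no peers are using that port, then `(udp-send "<other machine's ip>" 40200 "ping")` on the other. A nil result on the listening side suggests the security group or a firewall is dropping UDP on that port.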