#onyx
2016-04-29
jeroenvandijk09:04:57

What is the recommended way to set the number of peers for the docker-compose cluster? (https://github.com/onyx-platform/onyx-template/blob/0.9.x/src/leiningen/new/onyx_app/script/run_peers.sh#L8)

lucasbradstreet09:04:47

This is done via a docker environment variable, through NPEERS, which is supplied in https://github.com/onyx-platform/onyx-template/blob/0.9.x/src/leiningen/new/onyx_app/docker-compose.yml
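For reference, a minimal sketch of the relevant fragment of the template’s docker-compose.yml (the exact service layout may differ; NPEERS is the knob being discussed):

peer:
  # hypothetical service definition; only the NPEERS variable matters here
  environment:
    NPEERS: 3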

jeroenvandijk09:04:15

Ah thanks, I missed it completely :simple_smile:

jeroenvandijk09:04:25

What is the best way to find the underlying issue when you see “Not enough virtual peers have warmed up to start the task yet, backing off and trying again…” continuously?

jeroenvandijk09:04:10

I think I have enough peers to complete the task, and I have upped ZooKeeper’s allowed client connections to 1000. Not sure what else it could be.

jeroenvandijk09:04:49

Ah, I think it is a hidden exception:

peer_1      |                                                                    ...
peer_1      |                                 org.apache.zookeeper.ZooKeeper.getData           ZooKeeper.java: 1184
peer_1      |                                 org.apache.zookeeper.ZooKeeper.getData           ZooKeeper.java: 1155
peer_1      |                            org.apache.zookeeper.KeeperException.create     KeeperException.java:   51
peer_1      |                            org.apache.zookeeper.KeeperException.create     KeeperException.java:  111
peer_1      | org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /brokers/topics/onyx-data
peer_1      |     code: -101
peer_1      |     path: "/brokers/topics/onyx-data"
peer_1      |                           clojure.lang.ExceptionInfo: Caught exception inside task lifecycle. Rebooting the task. -> Exception type: org.apache.zookeeper.KeeperException$NoNodeException. Exception message: KeeperErrorCode = NoNode for /brokers/topics/onyx-data

jeroenvandijk09:04:15

Ok, fixing the exception helped. I guess the “not warmed up” message is a bit misleading (for me).

lucasbradstreet09:04:29

Yeah, I think there’s an interaction there. One of the peers fails while starting, which means that peer hasn’t warmed up and the job is killed. Maybe we could change “warmed” to “started” or something like that

lucasbradstreet09:04:40

or maybe “signalled ready”

jeroenvandijk09:04:59

Yeah, it would be nice if something told me that something went wrong during startup, instead of the generic everything-is-good-just-wait message :simple_smile:

jeroenvandijk10:04:39

Another thing: I made a mistake in the configuration of the Kafka deserialization function, so there was no output. I assume there was an error somewhere, given this line: https://github.com/onyx-platform/onyx-kafka/blob/4e9a9da8804677b447645d872c437aa7d6619692/src/onyx/tasks/kafka.clj#L11 Where can this error be found?

jeroenvandijk10:04:09

btw, I was wrong yesterday about the incompatibility between Kafka clients (0.8.1 vs 0.8.2). It seems that an old client can write to a newer cluster and be read by a new client just fine. (Still need to test whether a newer client can read from an old cluster, though.)

jeroenvandijk10:04:37

I think I ran into this issue while trying to consume Kafka messages from my OS X host: http://stackoverflow.com/questions/28664456/kafka-unable-to-connect-to-zookeeper#answer-36841719 Not sure how to solve this. I’m running everything inside the docker-compose cluster now.

lucasbradstreet10:04:43

You should find that exception in the logs, or you can read the exception that killed the job via the job id. I'll point you to the function to do that in a second

lucasbradstreet10:04:39

For the Kafka connection issue, I assume Kafka is outside of your docker compose setup?

lucasbradstreet10:04:12

@jeroenvandijk: you can use this function, which is part of onyx.test-helper, to wait until a job is completed and, if it’s killed instead, read back the exception that caused it to be killed.
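A minimal sketch of that flow, assuming the helper in question is feedback-exception! (check the onyx.test-helper source for the exact name and signature):

(require '[onyx.test-helper :as helper])

;; Block until the job identified by `job-id` completes; if the job was
;; killed instead, rethrow the exception that killed it so the real cause
;; surfaces. `peer-config` and `job-id` come from your own job submission.
(helper/feedback-exception! peer-config job-id)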

lucasbradstreet10:04:23

The dashboard also works to view the exceptions

jeroenvandijk10:04:40

@lucasbradstreet: regarding the Kafka connection issue, I was trying to consume, from a Clojure process on my OS X host, the Kafka stream running inside the boot2docker cluster. Note that producing did work, and connecting to ZooKeeper did too; just consuming requires something different (similar to telnet’s requirements, I suppose, as that didn’t work from the host machine either).

jeroenvandijk10:04:36

Regarding the exception: I think the deserializer function was eating the exception via the try/catch and emitting {:error "something"}. I was wondering where I would find these “error” messages.
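For context, the pattern being discussed looks roughly like this (a paraphrased sketch, not onyx-kafka’s exact code):

(defn wrap-deserializer
  "Hypothetical wrapper: deserialization failures become error segments
  instead of thrown exceptions, so the job keeps running and the failure
  only shows up as data in the stream."
  [deserializer-fn]
  (fn [bs]
    (try
      (deserializer-fn bs)
      (catch Exception e
        {:error (.getMessage e)}))))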

lucasbradstreet10:04:59

Oh right, I didn’t notice that it didn’t rethrow

lucasbradstreet10:04:16

I think the idea there is that the deserializer shouldn’t necessarily take down the job, but you will have to use flow conditions to catch segments that have the error key in them and pass them to an error-handling task.
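A minimal sketch of that approach, with hypothetical task names (:in, :process, :handle-error) and predicate keywords:

;; Route segments carrying an :error key to the error-handling task;
;; everything else flows on to normal processing.
(def flow-conditions
  [{:flow/from :in
    :flow/to [:handle-error]
    :flow/short-circuit? true
    :flow/predicate :my.app/error-segment?}
   {:flow/from :in
    :flow/to [:process]
    :flow/predicate :my.app/ok-segment?}])

(defn error-segment? [event old-segment segment all-new]
  (contains? segment :error))

(defn ok-segment? [event old-segment segment all-new]
  (not (contains? segment :error)))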

jeroenvandijk10:04:38

Ah yeah, I see. I guess I have to be more careful :simple_smile:

lucasbradstreet10:04:29

With the Kafka connection issue, I think you may need to expose the ports under the Kafka and ZooKeeper settings in docker-compose.yml, and then use your boot2docker machine’s IP in the local consumer.
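Roughly what that looks like in docker-compose.yml (service names and images follow the wurstmeister setup that appears later in this conversation; adjust to your own file):

zookeeper:
  image: wurstmeister/zookeeper
  ports:
    - "2181:2181"
kafka:
  image: wurstmeister/kafka
  ports:
    - "9092:9092"
  environment:
    KAFKA_ADVERTISED_HOST_NAME: ${DOCKER_HOSTNAME}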

jeroenvandijk10:04:45

Yeah, I think that’s what I did. I can actually produce to Kafka from the host machine, and I can read ZooKeeper from the host machine. Just consuming the Kafka stream gives a ZooKeeper error. I think this requires something more than just making sure the default ports (2181 and 9092) have been properly forwarded.

lucasbradstreet10:04:22

Ah, I think what is happening is that Kafka is advertising itself to ZooKeeper with an IP that you can’t connect to, since it’s internal.

jeroenvandijk10:04:55

Maybe, but note that telnet doesn’t work from the host either, and that’s just ZooKeeper.

lucasbradstreet10:04:59

I think that requires setting advertised.host.name in your Kafka server.
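In a plain (non-docker) Kafka install that is a one-line change in config/server.properties; the IP below is just the common boot2docker default, so substitute your VM’s actual address:

# Make Kafka advertise an address that the host machine can actually reach
advertised.host.name=192.168.99.100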

lucasbradstreet10:04:11

ah that is interesting

jeroenvandijk10:04:31

I had to change the advertised hostname already for producing messages.

lucasbradstreet10:04:49

So you can’t telnet to your boot2docker IP at port 9092?

jeroenvandijk10:04:49

From my docker-compose.yml:

KAFKA_ADVERTISED_HOST_NAME: ${DOCKER_HOSTNAME}

lucasbradstreet10:04:47

When you run docker ps, is it forwarding 9092 to 9092?

jeroenvandijk10:04:23

Oh man, I think I screwed up. telnet works. I copy-pasted the wrong port.

jeroenvandijk10:04:32

Then I don’t know what goes wrong.

jeroenvandijk10:04:43

but consuming doesn’t work from the host

jeroenvandijk10:04:00

816b2e35bcb7 wurstmeister/kafka "start-kafka.sh" About an hour ago Up About an hour 0.0.0.0:9092->9092/tcp adgojietlb

lucasbradstreet10:04:04

I assume telnetting to 2181 from the host works?

jeroenvandijk10:04:34

yeah both 2181 and 9092

jeroenvandijk10:04:12

It is not a major issue, as I can rebuild my job jar and run it inside the cluster, but it would be more convenient.

lucasbradstreet10:04:38

Any idea what DOCKER_HOSTNAME resolves to? Could you try manually setting it to the boot2docker IP and see if that works?

jeroenvandijk10:04:41

Ah sorry, that’s my creation:

DOCKER_HOSTNAME=$(echo $DOCKER_HOST|cut -d ':' -f 2|sed "s/\/\///g") docker-compose up
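For example, with a typical boot2docker DOCKER_HOST value, that pipeline extracts the VM’s IP:

$ echo $DOCKER_HOST
tcp://192.168.99.100:2376
$ echo $DOCKER_HOST | cut -d ':' -f 2 | sed "s/\/\///g"
192.168.99.100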

jeroenvandijk10:04:37

I double-checked; it’s the same.

lucasbradstreet10:04:44

Yeah, that looks right

lucasbradstreet10:04:39

It seems to be a ZooKeeper connection error anyway. Hmm.

lucasbradstreet10:04:06

The fact that you can telnet to it but your Clojure process on your host can’t connect to it is weird.

jeroenvandijk10:04:59

Yeah I’ll leave it for a while and try again later. Maybe I’m just trying too many things at a time

joshg13:04:06

Sounds like an advertised host name issue.

joshg13:04:40

Wireshark has Kafka protocol support. I've found it useful when diagnosing Kafka connection issues.

michaeldrogalis14:04:52

I was fiddling with advertised.host.name yesterday, too. That can be frustrating to get right.

michaeldrogalis16:04:27

+1500 lines for the new static job analysis patch. Once again proving that providing good error messages is really, really hard.

lucasbradstreet17:04:08

Also proving that line count isn’t always the best judge of code quality

bridget19:04:09

I will finally have some time next week, so I'm going to get the experimental other-language support stuff moved under onyx-platform.

bridget19:04:50

I just had a few things to clean up, and it'll be fine for general consumption

bridget19:04:31

The cleanup is already partially done anyway. Just need to (remember where I was and) finish up and push the changes.

michaeldrogalis20:04:32

Upgrade to [org.onyxplatform/onyx "0.9.5-20160429.201738-5"] to try out the new static analyzer. Official release will be out mid next week.
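In Leiningen terms that is just the dependency vector in project.clj (assuming your project can resolve the snapshot coordinate):

;; project.clj
:dependencies [[org.onyxplatform/onyx "0.9.5-20160429.201738-5"]]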