
@lowl4tency: what do you mean by cloudwatch metrics to see if ZK is “working”?


we used cloudwatch extensively, so I am familiar with it

Kira Sotnikov14:07:00

erichmond: so, for example, I want something that checks my ZooKeeper application, and if it fails the instance is terminated


I ask for a definition of working, because there could be a number of reasons why ZK wouldn’t be working (crashed process, network interruptions, the os layer getting disrupted, etc)

Kira Sotnikov14:07:16

I did a simple solution for it: added a loop to the start script

Kira Sotnikov14:07:39

If the app is down I need a new instance


so, you want something that will notify you if the process on the box fails for some reason?

Kira Sotnikov14:07:02

Not notify, do some actions with instances

Kira Sotnikov14:07:08

Terminate old and start new


yeah, if you are interested in restarting the app on a single instance, if ZK fails, then a script that runs on your instance itself is probably the best solution


so you are not subject to network conditions
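A minimal sketch of such an on-instance check, assuming ZooKeeper listens on the default client port 2181 and answers the four-letter command ruok with imok (on newer ZooKeeper versions that command may need to be enabled in the four-letter-word whitelist). The names zk_is_ok, watch, and on_failure are hypothetical; on_failure stands in for whatever action you choose, such as restarting the service or terminating the instance.

```python
import socket
import time


def zk_is_ok(host="127.0.0.1", port=2181, timeout=2.0):
    """Return True if the server answers 'ruok' with 'imok'."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"ruok")
            reply = b""
            # Read until we have the 4-byte answer or the peer closes.
            while len(reply) < 4:
                data = sock.recv(4 - len(reply))
                if not data:
                    break
                reply += data
            return reply == b"imok"
    except OSError:
        # Refused connection, timeout, reset, etc. all count as "not ok".
        return False


def watch(check, on_failure, max_failures=3, interval=10.0, sleep=time.sleep):
    """Call check() repeatedly; after max_failures consecutive
    failures, call on_failure() once and return."""
    failures = 0
    while True:
        if check():
            failures = 0
        else:
            failures += 1
            if failures >= max_failures:
                on_failure()
                return
        sleep(interval)
```

Requiring several consecutive failures avoids acting on a single slow response. Usage would look like watch(zk_is_ok, on_failure=my_action), where my_action is your own function.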


At Indaba we do two things: we have a script that runs on each instance itself to reboot the instance


but we also have a service that listens to cloudwatch messages, and if we get a message that a box is unreachable, that service spins up a new instance in response via ansible


reboot service*
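A rough sketch of the listening side, assuming the CloudWatch alarm is delivered through SNS and that the alarm JSON carries NewStateValue and a Trigger.Dimensions list with lowercase name/value keys. Those field names are my understanding of the notification format and should be checked against a real message. The function name instance_to_replace is hypothetical, and the step that actually launches the replacement (ansible in the setup described above) is left out.

```python
import json


def instance_to_replace(alarm_json):
    """Return the InstanceId from a CloudWatch alarm message if the
    alarm has entered the ALARM state, otherwise None.

    alarm_json is the alarm body as a JSON string (for SNS deliveries,
    the contents of the notification's "Message" field).
    """
    alarm = json.loads(alarm_json)
    if alarm.get("NewStateValue") != "ALARM":
        return None
    # Field names below are assumed, not confirmed by the discussion.
    for dim in alarm.get("Trigger", {}).get("Dimensions", []):
        if dim.get("name") == "InstanceId":
            return dim.get("value")
    return None


# Example with a hand-written message (the instance id is made up).
sample = json.dumps({
    "AlarmName": "zk-unreachable",
    "NewStateValue": "ALARM",
    "Trigger": {
        "MetricName": "StatusCheckFailed",
        "Dimensions": [{"name": "InstanceId", "value": "i-0123456789abcdef0"}],
    },
})
print(instance_to_replace(sample))
```

Keeping the parsing separate from the provisioning step makes it easy to test the decision logic without touching AWS.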

Kira Sotnikov14:07:07

Is CloudWatch able to check port access from outside?


well, both route53 and the ELBs have a “health check” option that you can set up, where amazon will ping the instance for you


you can set up cloudwatch to send an email / SNS / whatever if that health check fails

Kira Sotnikov14:07:25

I don’t have an ELB for the ZooKeeper instances

Kira Sotnikov14:07:40

If I had an ELB it wouldn’t be a problem

Kira Sotnikov14:07:48

It’s simple to configure


yeah, I am not sure if you can configure a health check at the ec2 level, but we have found ELBs useful


and they are quite cheap


the only downside is they take time to “ramp up”

Kira Sotnikov14:07:33

18 usd per month

Kira Sotnikov14:07:59

I prefer to reduce costs as much as possible

Kira Sotnikov14:07:18

erichmond: do you use cloudwatch with a 3rd-party monitoring tool?


cheap and super effective


they have built-in CW support


so just give them an access key


and they’ll pull all the stats you want


we’re actually moving to a unified log architecture, so we stopped using CW, but we used it with them for years


@lowl4tency: Perhaps a bit more advanced, I usually run my stuff on top of Mesos and Marathon. Marathon is like a global upstart for your cluster. It can relocate crashed Docker containers to healthy machines.


Might only be worth it when the stakes are higher though.

Kira Sotnikov15:07:06

Looks like overhead for my case

Kira Sotnikov15:07:35

michaeldrogalis: but thank you for the advice


One thing I have been curious about: :onyx/medium :kafka. I searched the codebase and this key (:onyx/medium) isn't used anywhere, or I missed it. Yet the schema complains if it is not there. What is its purpose?


@chrisn: Convention, and reserved for future use.


Easier to loosen the schema later than tighten it.


resolution to our discussion last week.