#onyx
2015-07-29
erichmond14:07:44

@lowl4tency: what do you mean by CloudWatch metrics to see if ZK is “working”?

erichmond14:07:02

we used cloudwatch extensively, so I am familiar with it

lowl4tency14:07:00

erichmond: so, for example, I want something that checks my ZooKeeper application; if it fails, the instance is terminated

erichmond14:07:05

I ask for a definition of working, because there could be a number of reasons why ZK wouldn’t be working (crashed process, network interruptions, the os layer getting disrupted, etc)

lowl4tency14:07:16

I did a simple solution for it: add a loop to the start script

lowl4tency14:07:39

If the app is down I need a new instance

erichmond14:07:40

so, you want something that will notify you if the process on the box fails for some reason?

lowl4tency14:07:02

Not notify; take some action on the instances

lowl4tency14:07:08

Terminate old and start new

erichmond14:07:17

yeah, if you are interested in restarting the app on a single instance, if ZK fails, then a script that runs on your instance itself is probably the best solution
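
A minimal sketch of such an on-instance script in Python. The restart path is an assumption (adjust for your install); the `ruok` four-letter-word probe and its `imok` reply are standard ZooKeeper behavior:

```python
import socket
import subprocess
import time

def probe_zk(host="localhost", port=2181, timeout=5.0):
    """Send ZooKeeper's `ruok` four-letter-word command; a healthy
    server answers `imok`. Returns the raw reply, or b"" on any
    connection failure."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"ruok")
            return sock.recv(16)
    except OSError:
        return b""

def zk_is_healthy(reply):
    """Interpret the probe reply: only `imok` counts as healthy."""
    return reply.strip() == b"imok"

def supervise(interval=10.0):
    """Supervision loop: restart ZooKeeper whenever the probe fails.
    The restart command is a placeholder for your actual init/start
    script."""
    while True:
        if not zk_is_healthy(probe_zk()):
            subprocess.run(["/opt/zookeeper/bin/zkServer.sh", "restart"])
        time.sleep(interval)
```

Running this under the OS supervisor (cron, systemd, upstart) keeps the check local, so, as noted above, it is not subject to network conditions.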

erichmond14:07:26

so you are not subject to network conditions

erichmond14:07:43

At Indaba we do two things: we have a script that runs on each instance itself to reboot the instance

erichmond14:07:17

but we also have a service that listens to cloudwatch messages, and if we get a message that a box is unreachable, that service spins up a new instance in response via ansible
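
The decision step of such a listening service could be sketched like this. The payload fields follow the CloudWatch alarm JSON as delivered over SNS (verify against your own messages), and `replace_instance` is a deliberately unimplemented placeholder for the terminate-and-reprovision step:

```python
import json

def replacement_needed(sns_message):
    """Parse a CloudWatch alarm delivered via SNS and return the
    InstanceId that should be replaced, or None if no action is
    needed. Field names follow the CloudWatch alarm JSON payload."""
    alarm = json.loads(sns_message)
    if alarm.get("NewStateValue") != "ALARM":
        return None
    for dim in alarm.get("Trigger", {}).get("Dimensions", []):
        if dim.get("name") == "InstanceId":
            return dim.get("value")
    return None

def replace_instance(instance_id):
    """Placeholder: terminate the unreachable instance and provision a
    replacement (e.g. boto3 terminate_instances followed by an ansible
    playbook run, as described above)."""
    raise NotImplementedError
```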

erichmond14:07:03

reboot service*

lowl4tency14:07:07

CloudWatch is able to check port access from outside?

erichmond14:07:47

well, both Route53 and the ELBs have a “health check” option that you can set up, where Amazon will ping the instance for you
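
For the ELB route, a hedged sketch of attaching a TCP health check to a classic load balancer via boto3's `configure_health_check` (the load-balancer name and thresholds are illustrative; the caller would create the client with `boto3.client("elb")`):

```python
def zk_health_check(port=2181, interval=30, timeout=5):
    """Build the HealthCheck structure for a classic ELB fronting
    ZooKeeper: a plain TCP connect check against the client port.
    Interval/threshold values here are illustrative defaults."""
    return {
        "Target": f"TCP:{port}",
        "Interval": interval,
        "Timeout": timeout,
        "UnhealthyThreshold": 2,
        "HealthyThreshold": 3,
    }

def apply_health_check(elb_client, lb_name):
    """Attach the check to an existing load balancer. Requires real
    AWS credentials and a classic-ELB boto3 client at call time."""
    elb_client.configure_health_check(
        LoadBalancerName=lb_name, HealthCheck=zk_health_check())
```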

erichmond14:07:05

you can set up CloudWatch to send an email / SNS / whatever if that health check fails
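
And a sketch of the alarm side, assuming the classic-ELB `UnHealthyHostCount` metric and an existing SNS topic (the names are illustrative; you would pass the result to `boto3.client("cloudwatch").put_metric_alarm(**kwargs)`):

```python
def unhealthy_host_alarm(lb_name, sns_topic_arn):
    """Keyword arguments for cloudwatch.put_metric_alarm: notify the
    given SNS topic whenever the ELB reports any unhealthy host for a
    full one-minute period. Metric and namespace follow the AWS/ELB
    (classic) conventions."""
    return {
        "AlarmName": f"{lb_name}-unhealthy-hosts",
        "Namespace": "AWS/ELB",
        "MetricName": "UnHealthyHostCount",
        "Dimensions": [{"Name": "LoadBalancerName", "Value": lb_name}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }
```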

lowl4tency14:07:25

I haven’t got an ELB for the ZooKeeper instances

lowl4tency14:07:40

If I’ve got an ELB it’s not a problem

lowl4tency14:07:48

It’s simple to configure

erichmond14:07:58

yeah, I am not sure if you can configure a health check at the EC2 level, but we have found ELBs useful

erichmond14:07:02

and they are quite cheap

erichmond14:07:13

the only downside is they take time to “ramp up”

lowl4tency14:07:33

18 USD per month

lowl4tency14:07:59

I prefer to reduce costs as much as possible

lowl4tency14:07:18

erichmond: do you use CloudWatch with a third-party monitoring tool?

erichmond14:07:37

cheap and super effective

erichmond14:07:46

they have built-in CW support

erichmond14:07:55

so just give them an access key

erichmond14:07:59

and they’ll pull all the stats you want

erichmond14:07:31

we’re actually moving to a unified log architecture, so we stopped using CW, but we used it with them for years

michaeldrogalis15:07:41

@lowl4tency: Perhaps a bit more advanced: I usually run my stuff on top of Mesos and Marathon. Marathon is like a global Upstart for your cluster. It can relocate crashed Docker containers to healthy machines.
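
For reference, a minimal Marathon app definition along these lines, expressed as the Python dict you would POST as JSON to Marathon's `/v2/apps` endpoint (the image, app id, and resource numbers are assumptions):

```python
def zk_marathon_app(image="zookeeper:3.4"):
    """A minimal Marathon app definition that runs ZooKeeper in Docker
    and lets Marathon restart or relocate it when the TCP health check
    fails. POST this (as JSON) to Marathon's /v2/apps endpoint."""
    return {
        "id": "/zookeeper",
        "instances": 1,
        "cpus": 0.5,
        "mem": 512,
        "container": {
            "type": "DOCKER",
            "docker": {
                "image": image,
                "network": "BRIDGE",
                "portMappings": [{"containerPort": 2181, "hostPort": 0}],
            },
        },
        "healthChecks": [{
            "protocol": "TCP",
            "portIndex": 0,
            "gracePeriodSeconds": 60,
            "maxConsecutiveFailures": 3,
        }],
    }
```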

michaeldrogalis15:07:01

Might only be worth it when the stakes are higher though.

lowl4tency15:07:06

Looks like too much overhead for my case

lowl4tency15:07:35

michaeldrogalis: but thank you for the advice

chrisn17:07:33

One thing I have been curious about: :onyx/medium :kafka. I searched the codebase and this key (:onyx/medium) isn't used anywhere, or I missed it. Yet the schema complains if it is not there. What is its purpose?

michaeldrogalis17:07:53

@chrisn: Convention, and reserved for future use.

michaeldrogalis17:07:31

Easier to loosen the schema later than tighten it.

chrisn21:07:50

resolution to our discussion last week.