Martynas Maciulevičius11:06:56

Hey. What would one use to do service peer discovery on AWS? I found that I can't use multicast. What should I do to find out what IPs are "alive" in my network?

Martynas Maciulevičius13:06:43

I'll probably need to do polling. I don't want to use their service discovery via Route 53 and the other proprietary things because I want it real-time. They also assume that I'll make requests to their services, which are still one-to-one TCP requests. So it's the same thing, except now I couple my system to theirs. And on top of that it has TTL parameters, which means additional lag before I find out that my service may have already been killed. Correct me if I'm wrong.


What sort of service discovery? Are you building a clustered system or do you just need to route requests to different HTTP backends? (because the latter can be done with an ALB and multiple target groups)


For the first one, I think something like Consul would be a better choice

Martynas Maciulevičius14:06:20

I thought about not using additional services. Because I wanted a lightweight way of finding the hosts. I wanted the hosts to find each other to run a consensus protocol between them. And I see that Consul already says "You can choose to create 3, 5, or 7 servers" which means they will run Raft and negotiate their own states.
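
For reference, the "3, 5, or 7 servers" recommendation follows from majority-quorum arithmetic, which applies to Raft-style consensus in general. A small sketch (the function names are made up for illustration):

```python
def quorum(servers: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    return servers // 2 + 1


def tolerated_failures(servers: int) -> int:
    """How many servers can fail while a majority is still reachable."""
    return servers - quorum(servers)


for n in (3, 4, 5, 7):
    print(f"{n} servers: quorum {quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

An even server count adds no failure tolerance over the odd count below it (4 servers tolerate one failure, the same as 3), which is why odd sizes are the usual choice.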


Maybe you can make it work with jGroups - I built a proof of concept and it worked pretty well


no extra infra to deploy, the clustering is built into your services

Martynas Maciulevičius18:06:20

I think I've read about JGroups today. They have way too many features, but what they also say is:

Transport protocols: UDP (IP Multicast) or TCP
So it's already going to be based on IP multicast, which I can't use on AWS. And if I use TCP then I would probably be better off without JGroups. What kind of POC did you have? Was it on AWS? Does that mean you were able to use IP multicast? I want to have hosts that are located in different availability zones. I can also use TCP, but TCP is one-to-one, which means I can't rely on AWS's DHCP.


I used Google Cloud at that time and ran into issues with GCP's networking, so I tried autodiscovery plugins and these didn't work either. There's an AWS auto-discovery plugin for jGroups however:

Martynas Maciulevičius18:06:14

It says it uses EC2 node tag names. What if I use Fargate with containers inside (ECS)? Also, is it supposed to be used as a transport layer for the original JGroups? I hope so :thinking_face:


ah yeah, Fargate might not work


but then again - you can use ECS API to get services, task IPs and all of that and build your own service discovery
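
A minimal sketch of that ECS API idea, assuming Python with boto3 and tasks running in `awsvpc` network mode (the mode Fargate uses). The cluster and service names are placeholders, and the parsing assumes each task's network interface shows up under `attachments` with a `privateIPv4Address` detail:

```python
from typing import Any, Dict, List


def private_ips(describe_tasks_response: Dict[str, Any]) -> List[str]:
    """Collect private IPv4 addresses of running tasks from a describe_tasks response."""
    ips = []
    for task in describe_tasks_response.get("tasks", []):
        if task.get("lastStatus") != "RUNNING":
            continue  # skip tasks that are still starting or already stopping
        for attachment in task.get("attachments", []):
            for detail in attachment.get("details", []):
                if detail.get("name") == "privateIPv4Address":
                    ips.append(detail["value"])
    return ips


def discover_peers(cluster: str, service: str) -> List[str]:
    """Ask ECS which tasks back a service and return their private IPs."""
    import boto3  # imported lazily so the parser above has no AWS dependency

    ecs = boto3.client("ecs")
    task_arns = ecs.list_tasks(cluster=cluster, serviceName=service)["taskArns"]
    if not task_arns:
        return []
    # describe_tasks accepts at most 100 task ARNs per call
    response = ecs.describe_tasks(cluster=cluster, tasks=task_arns[:100])
    return private_ips(response)
```

This is still polling: each node would call something like `discover_peers("my-cluster", "my-service")` on a timer, and the task role needs permission for `ecs:ListTasks` and `ecs:DescribeTasks`.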


never used any of them, we engineered ourselves out of having a service discovery/clustering

Martynas Maciulevičius18:06:12

My other idea is to simply allocate a very small CIDR range and have the hosts ping every address in it. I don't need more IPs than twice the size of my deployment at any one time :thinking_face: Maybe even node-count+1 IPs would be enough, but it depends on how the deployment works :thinking_face:

Martynas Maciulevičius19:06:49

Also, if my subnet were that small I might not even need to do any pinging, as I could simply hardcode the IPs and call it a day :thinking_face:

Martynas Maciulevičius19:06:26

And what JGroups-aws did give me an idea about was having a TCP_ALL security parameter. That could be handy.


IP allocation is random from what I can see when we do rolling deployments of our tasks

Martynas Maciulevičius19:06:51

I know that IP allocation is random and yes, I want to do rolling deployments. I don't know yet whether it's possible, but if I could force the DHCP server of ECS to return an IP from a predefined list then it should be fine to simply ping all of the IPs in that range. So if the range were 10 IPs and I were running 3-5 hosts then it should be good.
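
A rough sketch of the probe-the-whole-range idea in Python, using only the standard library. It makes a few assumptions: a TCP connect stands in for an ICMP ping (a successful connect also shows the service is listening), the port is whatever the nodes listen on, and the security group allows that port between the hosts. AWS reserves the first four addresses and the last address of every subnet, and the smallest subnet it allows is a /28, so the smallest possible range has 11 usable addresses:

```python
import ipaddress
import socket
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List


def candidate_ips(cidr: str) -> List[str]:
    """List the addresses in a subnet that AWS could hand out to a task."""
    hosts = list(ipaddress.ip_network(cidr).hosts())
    # hosts() already drops the network and broadcast addresses; AWS also
    # reserves the next three addresses (router, DNS, future use).
    return [str(ip) for ip in hosts[3:]]


def is_alive(ip: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to ip:port succeeds within the timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


def alive_peers(ips: Iterable[str], port: int, timeout: float = 0.5) -> List[str]:
    """Probe all candidates in parallel and return the ones that answered."""
    ips = list(ips)
    if not ips:
        return []
    with ThreadPoolExecutor(max_workers=len(ips)) as pool:
        results = list(pool.map(lambda ip: is_alive(ip, port, timeout), ips))
    return [ip for ip, ok in zip(ips, results) if ok]
```

Each host could then call something like `alive_peers(candidate_ips("10.0.0.0/28"), port)` on a timer, with its own CIDR and port. With around ten candidates probed in parallel, one sweep takes roughly one timeout.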


ah yeah, with these numbers of nodes that could def work

Martynas Maciulevičius19:06:02

I don't know whether it's more or less hacky than that global store. But yes, the solution with a global registry would probably be easier to scale. Then again, I could create many small subnets and it could still be "scalable" that way...? Not sure.