Amazon AWS is the clear leader in global cloud computing, and networking is one of the areas where that lead shows most. Nowhere is it more visible than in the Virtual Private Cloud (VPC) system. VPC has been around a while, but it is now mandatory for new accounts, and this style of networking is the future of larger cloud systems such as OpenStack, with other clouds having built, or currently building, simplified versions.
However, Virtual Private Clouds are fairly complex and present some unusual problems that are not obvious until you try to build out real systems on one, especially when you want to integrate it with other key AWS services such as OpsWorks, ELBs, EIPs, and so on. The system works well once you understand all the ways you can get it wrong (and it has a decent setup wizard you should use), but it also has at least one glaring hole we hope Amazon fills soon. To gain a better understanding of this technology, we must first examine what a VPC actually is and how it came into use.
On the old AWS network, now called “Classic,” all VMs were the same from the network’s point of view. Each got a random private IP and a random public IP at boot, both of which changed if the instance crashed or was stopped, making it a challenge to interconnect things like web and database servers. Since every host had a very public IP, firewalling via AWS Security Groups and on-host iptables was essential. All hosts could see and talk to each other privately, and you could attach a few static public IPs to things like web servers via Elastic IPs (EIPs). That was it.
A VPC is very different: you are essentially building a datacenter-style set of public and private VLANs, interconnected by routers, with an optional gateway to the Internet or to your own private network via VPN. This is great because only the truly public hosts get public IPs, while the truly private servers stay in isolated VLANs. In general, all IPs remain static even across stops and crashes (unless you use OpsWorks or similar). You can have many separate VLANs or subnets for different systems and services, with firewalls or load balancers in between, and each VM can have multiple NICs and virtual IPs, all of them static.
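As a concrete sketch, the basic building blocks can be created with the AWS CLI roughly like this (the CIDR ranges, availability zone, and resource IDs are illustrative placeholders, not values from any real setup):

```shell
# Create the VPC itself (the address ranges here are examples only).
aws ec2 create-vpc --cidr-block 10.0.0.0/16

# A "public" subnet that will route to the Internet Gateway, and a
# "private" subnet that will route through the NAT instance.
aws ec2 create-subnet --vpc-id vpc-xxxxxxxx --cidr-block 10.0.0.0/24 \
    --availability-zone us-east-1a
aws ec2 create-subnet --vpc-id vpc-xxxxxxxx --cidr-block 10.0.1.0/24 \
    --availability-zone us-east-1a

# The Internet Gateway gives the VPC its optional path to the Internet.
aws ec2 create-internet-gateway
aws ec2 attach-internet-gateway --internet-gateway-id igw-xxxxxxxx \
    --vpc-id vpc-xxxxxxxx
```

The wizard issues the equivalent of these calls for you, along with the routing, ACL, and Security Group plumbing described below.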
To set one up, you should ideally use the wizard, though we don’t, because we are hard-core and want to know how it works so we can fix it if it breaks later. Basically, the wizard sets up a simple public/private system with a single public and a single private VLAN/subnet, along with the related Internet Gateway (IGW), internal routing, open ACLs, and Security Groups.
For a highly-available system, you manually double most of this by creating a public and a private VLAN/subnet in each availability zone (routing is automatic, fortunately) and splitting all hosts except the NAT instance across zones. It all works well, but it becomes easier to make mistakes as the number of subnets and components grows.
Critically, the wizard sets up the most important part of the VPC: the NAT instance. This is the SINGLE VM that provides two critical functions for the private VLANs/subnets: outbound and inbound NAT. It is a normal AWS VM with a special iptables masquerade rule for outbound NAT. With this and everything else set up correctly, private VMs can reach the Internet.
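On a stock Linux NAT instance, that outbound masquerade boils down to something like the following (the private CIDR and interface name are assumptions; Amazon’s NAT AMI ships with an equivalent rule baked in):

```shell
# Enable IP forwarding so the instance will route packets at all.
echo 1 > /proc/sys/net/ipv4/ip_forward

# Masquerade traffic arriving from the private subnet (example range)
# out the public-facing interface.
iptables -t nat -A POSTROUTING -s 10.0.1.0/24 -o eth0 -j MASQUERADE
```

Remember that the instance’s source/destination check must also be disabled (`aws ec2 modify-instance-attribute --no-source-dest-check ...`), or the VPC will silently drop the forwarded packets.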
However, to reach the private VMs from outside AWS, for example for ssh or monitoring, this NAT instance has to be modified with inbound NAT rules. This is similar to what you would do on a Cisco ASA or other firewall in a datacenter-based system, and it can be a real pain to manage; it is also a single point of failure.
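The inbound side is plain iptables DNAT. For instance, to reach ssh on a private host via a spare port on the NAT instance (the port and private IP below are hypothetical examples):

```shell
# Map port 2222 on the NAT instance to ssh (22) on a private host.
iptables -t nat -A PREROUTING -i eth0 -p tcp --dport 2222 \
    -j DNAT --to-destination 10.0.1.10:22

# Make sure the forwarded traffic is allowed through as well.
iptables -A FORWARD -p tcp -d 10.0.1.10 --dport 22 -j ACCEPT
```

The NAT instance’s Security Group must also allow the chosen port (2222 here), or the packets never reach iptables at all.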
We feel this NAT instance should be replaced ASAP with a service or appliance whose NAT rules you can configure via the console or API; otherwise you really need to know iptables and how it all works to make the system usable. Making this NAT instance truly highly available is also a pretty painful process.
Note that the setup wizard creates a badly named default route table that can be exceptionally confusing. In fact, the whole concept of route tables is confusing: they are per-VLAN/subnet routing rules applied in an invisible L3 router, and the default routes, while critical to the system working, are easy to overlook when you start working with VPCs.
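You can at least see what you are dealing with by listing a VPC’s route tables and spotting which one is the “Main” (default) table; a sketch, with a placeholder VPC ID:

```shell
# List each route table in the VPC and whether it is the Main table,
# i.e. the one subnets fall back to when not explicitly associated.
aws ec2 describe-route-tables \
    --filters Name=vpc-id,Values=vpc-xxxxxxxx \
    --query 'RouteTables[].{Id:RouteTableId,Main:Associations[0].Main}'
```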
While VPCs are a really great system to use, the main issue is that they are complex in small, easy-to-miss ways. There are a million ways to get this wrong, since outside the setup wizard there are very few checks on anything, making it easy to break the complex interplay of routing, security groups, NATs, ACLs, gateways, and more. Much of this is because the routing is not what you would expect: it is really a core-router, inter-VLAN gateway model you need to understand, and the docs are not great even though the underlying rules and concepts are simple.
For example, the docs talk a lot about public and private VLANs/subnets but never once explain what actually makes a subnet public or private, nor the addressing rules VMs need to follow on those subnets. There are only a half-dozen golden rules for VPCs, but they don’t seem to be written down anywhere; we had to discover them on our own, or by reading between the lines.
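The first golden rule, as far as we can tell, is simply this: a subnet is “public” if its route table’s default route points at the Internet Gateway, and “private” if it points at the NAT instance. In CLI terms (all IDs are placeholders):

```shell
# Public subnet: default route to the Internet Gateway.
aws ec2 create-route --route-table-id rtb-aaaaaaaa \
    --destination-cidr-block 0.0.0.0/0 --gateway-id igw-xxxxxxxx

# Private subnet: default route through the NAT instance.
aws ec2 create-route --route-table-id rtb-bbbbbbbb \
    --destination-cidr-block 0.0.0.0/0 --instance-id i-nnnnnnnn
```

Hosts on the public subnet additionally need a public or Elastic IP to be reachable; hosts on the private subnet should have neither.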
Furthermore, the critical must-have inbound NAT rules are 100% ignored in every document we can find, which seems absurd to us, since without them you cannot ssh to, nor monitor, any of your private hosts from the outside. You also need them for any non-HTTP service, or for anything that can’t sit behind an ELB. Making this NAT highly available is a real challenge, especially in ways that don’t make the system less stable or available.
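One common HA pattern, sketched below, is a standby NAT instance that monitors the active one and steals the private default route on failure. The IPs and route table ID are hypothetical, and a production version needs far more care around flapping and split-brain than this shows:

```shell
#!/bin/sh
# Run periodically on the standby NAT instance: if the active NAT
# (example IP) stops answering pings, point the private route table
# at ourselves via the EC2 API.
ACTIVE_NAT=10.0.0.5
ROUTE_TABLE=rtb-bbbbbbbb

if ! ping -c 3 -W 2 "$ACTIVE_NAT" > /dev/null 2>&1; then
    MY_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
    aws ec2 replace-route --route-table-id "$ROUTE_TABLE" \
        --destination-cidr-block 0.0.0.0/0 --instance-id "$MY_ID"
fi
```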
Overall, after fully understanding and setting up many VPCs, we really like the whole system, though we’d like to see more cross-checks and a managed NAT service. It is the future of cloud networking: a very powerful system that scales up quite nicely, and although it can be overly complex at times, it is far easier to manage than a physical network of similar size.
Contact us today for help thinking about, building, and managing VPC networks on AWS.