This blog post is the first in a 4-part series that aims to thoroughly explain how Kubernetes Ingress works:
- Kubernetes Networking and Services 101 (this article)
- Ingress 101: What is Kubernetes Ingress? Why does it exist?
- Ingress 102: Kubernetes Ingress Implementation Options (coming soon)
- Ingress 103: Productionalizing Kubernetes Ingress (coming soon)
Kubernetes Networking, Services, and Ingress
If you read till the end of this series you’ll gain a deep understanding of the following diagram, Kubernetes Networking, the 7 service types, Kubernetes Ingress, and a few other fundamental concepts.
Understanding Basic Concepts
I’ve seen a project with 2 websites and 2 API Groups: Externally accessible APIs and Internally accessible APIs. The External APIs were to function as publicly reachable backend points of entry for the websites and act as potentially reusable building blocks for future projects. Internal APIs were named such to make it obvious that it’d be dangerous to externally expose them as they were meant to house application middleware logic and backend database logic.
The architecture evolved to a point where both the Internal and External APIs were externally exposed using Kubernetes Ingress, and firewall rules were implemented to limit access to both sets of APIs. The External APIs were put behind an API Gateway as a means of bolting on authentication functionality, and the Internal APIs were firewalled to prevent them from being externally exposed.
It’s a valid solution, but I’d like to point out that the Internal APIs should have only been internally reachable using a ClusterIP service, as this would have been both more secure and less complex. Exposing the Internal APIs over Ingress came about out of ignorance of the basics of Kubernetes: Ingress was the only known way to interact with things in the cluster.
The point of the story is that trying to implement Kubernetes by relying on how-to guides can cause you to learn the tool equivalent of a hammer and then see every problem as a nail. If you take the time to deeply learn the fundamentals that advanced concepts are built upon, you’ll be able to come up with multiple solutions to problems that don’t have a perfect how-to guide readily available, and evaluate which solution is best for a given situation. Understanding the how and why of basic concepts improves your ability to do quick, solid evaluations of different tooling solutions, which is critical since no one has time to learn every tool in depth.
Generic Networking in a Kubernetes Context
I find abstract concepts are easier to understand and follow when you can build on basic facts, tie in prior knowledge, and parallel abstract concepts with concrete examples. In this section I’ll use those techniques to help explain the following concepts:
- Routers that do PAT (Port Address Translation) form a network boundary where it’s easy to talk in 1 direction, but hard to talk in the other direction.
- Your Home Router does the job of several conceptual devices combined into a single unit, and so does a Kubernetes Node. Kubernetes Nodes act like routers, meaning they can do PAT to form a network boundary.
- A single Kubernetes Cluster often belongs to a topology involving 3 levels of Network Boundaries. (Kubernetes Inner Cluster Network, LAN Network, and Internet.)
- It’s common for a single Kubernetes Cluster to have access to 3 levels of DNS (Inner Cluster DNS, LAN DNS, Public Internet DNS)
Routers that do PAT (Port Address Translation, a type of Network Address Translation (NAT)) form a network boundary where it’s easy to talk in 1 direction, and hard to talk in the other direction.
Routers connect networks.
Switches create networks/allow multiple computers on the same network to talk to each other.
Below is a picture of the back of a home router, which is acting as a Router by connecting the Internet to the LAN, and acting as a Switch by connecting the 4 computers on the LAN to form a network where they can freely talk to each other.
PAT allows 2 things to happen:
- The 4 computers with Private IP addresses get to share a single Public IP Address.
- Computers on the Internet (in front of the Router) can’t start a conversation with computers on the Local Area Network (behind the Router), but they are allowed to reply back. Computers on the LAN can start conversations with computers on the Internet. (This behavior creates a network boundary.)
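The one-way behavior of PAT can be sketched in a few lines of Python. This is a toy model rather than how a real router is implemented: the class name `PatRouter`, the IP addresses, and the starting port number are all made up for illustration.

```python
class PatRouter:
    """Toy model of Port Address Translation (hypothetical, for illustration)."""

    def __init__(self, public_ip):
        self.public_ip = public_ip
        self.next_port = 40000   # arbitrary first port used for translations
        self.table = {}          # public port -> (private IP, private port)

    def outbound(self, private_ip, private_port):
        """A LAN computer starts a conversation: remember it and rewrite the source."""
        public_port = self.next_port
        self.next_port += 1
        self.table[public_port] = (private_ip, private_port)
        return (self.public_ip, public_port)

    def inbound(self, public_port):
        """A packet arrives from the Internet: deliver it only if a mapping exists."""
        return self.table.get(public_port)   # None means the packet is dropped


router = PatRouter("203.0.113.10")

# Two LAN computers share the single public IP, each with its own public port.
src_a = router.outbound("192.168.1.20", 51000)
src_b = router.outbound("192.168.1.21", 51000)
print(src_a, src_b)

# A reply to an existing conversation is mapped back to the LAN computer.
print(router.inbound(src_a[1]))

# An unsolicited connection attempt has no mapping, so it goes nowhere.
print(router.inbound(8080))
```

The translation table only gains entries when a computer behind the router talks first, which is the property that creates the network boundary.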
Your Home Router is doing the job of several conceptual devices combined into a single unit. The pictures of the back of a home router make it clear that it’s a Router and a Switch. Home routers often are also DNS/DHCP servers, Wireless Access Points, and sometimes even modems rolled into a single unit. In a similar fashion, Kubernetes Nodes aren’t just computers: they act like virtual routers and use PAT to form a network boundary, and they also act like virtual switches and create another network:
A single Kubernetes Cluster often belongs to a topology involving 3 levels of Network Boundaries.
Default network configuration settings make it so computers on the left side can’t start conversations with computers on the right side, but they can reply to conversations started by computers on the right side. Computers on the right side are free to start a conversation with any computers on the left side.
So by default:
- A computer on the internet can’t start a conversation with a database server on the LAN or a frontend pod in the cluster.
- A Kubernetes Pod can talk to a database server on the LAN, or on the internet.
- A management laptop on the LAN could talk to a database server on the internet, but can’t start a conversation with a frontend pod in the cluster.
This offers a secure default traffic flow to start with, and allowing traffic to flow against the secure default flow requires configuration.
It’s common for a single Kubernetes Cluster to have access to 3 levels of DNS (Public Internet DNS, LAN DNS, Inner cluster DNS)
A pod will have access to all 3 levels of DNS. It can connect to:
- An nginx Kubernetes service running in the default namespace of the cluster:
PodBash# curl nginx.default.svc.cluster.local
Note: If you use the following commands:
PodBash# cat /etc/resolv.conf
search default.svc.cluster.local svc.cluster.local cluster.local
You’ll realize that pods in the default namespace can shorten the above to:
PodBash# curl nginx
A pod in the ingress namespace could shorten the above to:
PodBash# curl nginx.default
- A SQL database running on the LAN:
PodBash# ping mysql.company.lan
- S3 storage on the Internet.
A Management Laptop won’t be able to resolve any inner cluster DNS names; it’ll only have access to names defined in LAN DNS and Public Internet DNS.
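The name shortening that pods get from the search line in /etc/resolv.conf can be sketched as follows. This is a simplified model, not the real resolver: it assumes the name has fewer dots than the resolver’s `ndots` setting, so every search domain is tried before the name itself. The function name `candidate_names` is hypothetical, and the search line for the ingress namespace is assumed to follow the same pattern as the one shown for the default namespace.

```python
def candidate_names(name, search_domains):
    """Return the names a resolver would try, in order (simplified model)."""
    if name.endswith("."):
        # A trailing dot means the name is already fully qualified.
        return [name.rstrip(".")]
    return [f"{name}.{domain}" for domain in search_domains] + [name]


# Search line of a pod in the default namespace (from the output above).
default_search = "default.svc.cluster.local svc.cluster.local cluster.local".split()

# Assumed search line of a pod in the ingress namespace.
ingress_search = "ingress.svc.cluster.local svc.cluster.local cluster.local".split()

# "curl nginx" from the default namespace: the first candidate is the full name.
print(candidate_names("nginx", default_search)[0])

# "curl nginx.default" from the ingress namespace: the second candidate matches.
print(candidate_names("nginx.default", ingress_search)[1])
```

A Management Laptop has none of these search domains and no access to the inner cluster DNS server, which is why the same short names mean nothing outside the cluster.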
Kubernetes Inner Cluster Network
The Inner Cluster Network is actually a combination of 2 Networks:
A Kubernetes Service Network and a Pod Network.
The Kubernetes Service Network:
- Has at least 3 implementations: kube-proxy in iptables mode (the default), kube-proxy in IPVS mode, and Cilium CNI’s eBPF implementation.
- kube-proxy’s iptables implementation of a service network: Programs the IP and routing information of every Kubernetes service in the cluster as iptables rules in the kernel of each node. (Because a ClusterIP service’s IP only exists as a rule in the iptables of the nodes in the cluster, a laptop can’t ping a service regardless of the CNI used.)
- IPVS implementation of a service network: Also stores the IP and routing information of every Kubernetes service in the cluster, but it stores it in an in-kernel hash table for greater scalability. It also adds new inner cluster load balancing options.
- Cilium’s implementation of a service network is based on eBPF (extended Berkeley Packet Filter) technology to achieve greater scalability, and it adds new functionality. The stable release at the time of writing, version 1.6, allows kube-proxy and iptables to be completely replaced.
- The service network has Virtual IPs in a reserved CIDR range, but there’s no network virtualization going on; it only exists from the perspective of the nodes that make up the cluster. A side effect of this is that, as long as you’re not using any advanced features like federated clusters and multi cluster mesh networking, if you have 2 Kubernetes Clusters on the same LAN they can both use 10.0.0.0/16 as their service network. (For non network engineers, you usually can’t reuse IP space on the same LAN.)
- A key concept to realize is that when you create a service it doesn’t just exist in etcd, it exists on every node in the cluster. Inner cluster services are therefore highly available. That being said, if the etcd cluster loses quorum (a majority of its members are no longer healthy) and can no longer accept writes, Kubernetes services will experience degradation in the form of not auto updating.
The Pod Network:
- Has too many CNI (Container Network Interface) implementations to list:
- Pods can end up on the same network as the host nodes, in which case the Inner Cluster Network boundary isn’t as clearly defined as described in the picture above. Example: a Management PC on the same LAN as the nodes could curl a pod IP directly. (This is the case with Azure Container Network Interface, I think it’s also the case with AWS EKS’s VPC CNI, and it’s similar but not quite the same with Calico CNI.) An unfortunate side effect is that a Kubernetes Cluster will use up a lot of private IP space, which can be problematic at big companies that wish to peer a cluster’s LAN with their on premises network using a VPN, Azure Express Route, AWS Direct Connect, or GCP’s Cloud Interconnect. It can also needlessly limit your ability to scale. Example: if you host a cluster in a /22, which supports ~1024 IPs, and configure each Node to support 100 pods, then you can only scale to 10 host nodes.
- Pods can also end up on their own isolated Virtualized Overlay Network. Virtualized Overlay Networks are complicated and come in many flavors, but the short version is that they use network packet encapsulation to create a Virtual Layer 2 Switch based Network on top of a Layer 3 Network of Routers connecting multiple Layer 2 Networks. To paraphrase that in simpler terms: each Kubernetes Node creates a network, and all these separate networks join together into a cluster of networks. Then, to abstract away the complexity, network virtualization makes them appear to be a single network.
- In my opinion, it’s usually best practice to use a CNI like Canal or Cilium that leverages a Virtualized Overlay Network, for 2 big reasons:
- You end up with a better network boundary as pictured in the diagram above, which increases security. Example: if you’re on AKS, then directly curling a pod’s IP means computers on the LAN could bypass the ingress controller and the HTTPS it offers. I also recall Azure being more permissive to traffic sniffing compared to AWS. When using a CNI that creates an Overlay Network, the pods are behind a virtual NAT router, which means a computer on the LAN can’t curl a pod directly and must go through the Ingress Controller.
- Virtualized Overlay Networks save and reuse private IP space. You can spin up a cluster in a /25, which allows you to easily scale to 100 nodes that can each run 100 pods. Not only that, but in most cases 2 clusters could reuse the same Subnet for their pod IP space, because it would be a virtualized stub network.
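The IP space arithmetic behind the /22 and /25 examples above can be checked with Python’s standard `ipaddress` module. This is a back-of-the-envelope sketch under stated assumptions: in the flat case each node is assumed to reserve one IP for itself plus one per potential pod, and addresses that a cloud provider reserves in each subnet are ignored. The function names and the 10.1.0.0 addresses are made up for illustration.

```python
import ipaddress


def max_nodes_flat(subnet, pods_per_node):
    """Nodes that fit when every pod takes an IP from the nodes' own subnet."""
    total = ipaddress.ip_network(subnet).num_addresses
    # Assumption: one IP for the node itself plus one per potential pod.
    return total // (pods_per_node + 1)


def max_nodes_overlay(subnet):
    """Nodes that fit when pods live on an overlay and only nodes use subnet IPs."""
    return ipaddress.ip_network(subnet).num_addresses


# Pods share the LAN with the nodes: a /22 (1024 addresses), 100 pods per node.
print(max_nodes_flat("10.1.0.0/22", 100))

# Pods on an overlay network: a /25 (128 addresses) is enough for 100+ nodes.
print(max_nodes_overlay("10.1.0.0/25"))
```

Under these assumptions the /22 tops out at 10 nodes, while the much smaller /25 still leaves room for more than 100 nodes because pod IPs no longer come out of the LAN’s address space.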
Quick Disclaimer About the Accuracy of Explanations
At this point you’ve come down the rabbit hole far enough that the following statement will make more sense than if I had frontloaded it. While I try to be as accurate as possible in my explanations, the mix and match, rapidly evolving nature of Kubernetes makes it impossible to give an explanation that’s 100% accurate to all implementations of Kubernetes. The big picture concepts will more or less be the same, but some nitty gritty details may vary. Please keep this in mind when you think about these explanations in the context of your environment. Also be aware that the explanations in this series of posts will assume an overlay network (like that of Canal or Cilium Container Network Interface) is used.
Linux is based on the Linux Kernel, which acts as a base upon which various tools are bolted on to create various Linux Distributions. Every flavor of Linux has its own quirks and things that are unique to that Linux Distro, like Alpine Linux’s apk add curl, Debian’s apt-get install curl, and RHEL’s yum install curl. Yet at the same time they all have similarities, like Bash and supported file system types.
In a similar vein, there are different Kubernetes distributions and even different flavors of Ingress Controllers. In fact, all of Kubernetes is built to be modular/customizable: the Kubernetes controller manager component of the masters, for example, comes in a vanilla flavor and a cloud provider specific flavor that knows how to interact with Cloud Provider APIs to provision things like Cloud Load Balancers. K3s, a Rancher Labs Kubernetes Distro, replaces etcd with an SQLite based implementation.
You may be wondering, as I used to wonder: How the hell can Kubernetes be stable when there’s a million different permutations? The short answer to that is API Contracts. The pluggable components that make up Kubernetes conform to standards, usually in the form of an API contract. As long as both modules satisfy the contract you can usually swap them out and only need a little integration testing of the module against the pieces it touches, instead of having to rely on end to end testing of every single permutation possible.
Kubernetes Service Types
All Kubernetes Service Types:
- Are automatically highly available because they exist on every node in the cluster.
- Will generate a predictable inner cluster DNS name that follows the convention:
<service name>.<namespace>.svc.cluster.local
There are 4 normal service types:
- ExternalName Service: DNS Redirector
- ClusterIP Service: Inner Cluster Layer 4 Load Balancer
- NodePort Service: ClusterIP Service functionality + uses kube-proxy and a port consistently exposed on every node to enable LAN traffic to Inner Cluster communication.
- LoadBalancer Service: NodePort Service functionality + provisions and configures a Cloud Provider LB that points to a port that’s consistently exposed on every node.
Each of the normal service types has a Static Inner Cluster IP that’s persisted in etcd, and kube-proxy/NodePort services can forward traffic directly to any of these service types.
A Kubernetes LoadBalancer Service encapsulates other service types, similar to how a Deployment object will create and encapsulate other nested object types.
After creating a deployment:
LaptopBash# kubectl create deployment nginx --image=nginx
You can run
LaptopBash# kubectl get deploy,rs,pod
…and find a Deployment, a ReplicaSet, and a Pod that all stem from the object that was just created.
This is because a deployment creates and manages replicasets, and a replicaset creates and manages pods.
Similarly, when you create a service of type LoadBalancer, the created service will have the same properties that a service of type NodePort has, and when a service of type NodePort is created it has the same properties as a service of type ClusterIP.
There are 3 Headless service types:
- Headless ExternalName Service
- Headless ClusterIP Service
- StatefulSet Headless Service (<servicename>-headless is a good convention)
StatefulSet Headless Services will additionally generate a per pod Inner-Cluster DNS name using the convention:
<statefulset name-#>.<service name>.<namespace>.svc.cluster.local
Headless services don’t get a Static Inner Cluster IP. A side effect of this is that kube-proxy/NodePort services can’t forward external traffic directly to Headless Services.
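The DNS naming conventions for services and for StatefulSet pods can be expressed as two small helper functions. This is only a sketch of the naming pattern: the function names are hypothetical, `cluster.local` is assumed to be the cluster domain, and the `mysql`, `mysql-headless`, and `db` names are made up for illustration.

```python
def service_dns(service, namespace, cluster_domain="cluster.local"):
    """Inner cluster DNS name of a service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def statefulset_pod_dns(statefulset, ordinal, service, namespace,
                        cluster_domain="cluster.local"):
    """Per pod DNS name generated through a StatefulSet's headless service."""
    return f"{statefulset}-{ordinal}.{service_dns(service, namespace, cluster_domain)}"


# The nginx service in the default namespace.
print(service_dns("nginx", "default"))

# Three pods of a hypothetical "mysql" StatefulSet in a "db" namespace.
for ordinal in range(3):
    print(statefulset_pod_dns("mysql", ordinal, "mysql-headless", "db"))
```

Because the ordinal is part of the name, each pod of the StatefulSet keeps a stable, individually addressable DNS name even though its IP can change.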
An ExternalName service is just a DNS redirector. It can redirect to any dns name: This could be a cluster level, LAN level, or Internet level DNS name.
One use case for ExternalName services is to work around the inability to externally expose individual pods in a StatefulSet. These can’t be directly externally exposed due to a limitation associated with headless services. A NodePort service can point to an ExternalName service, which can then point to the inner cluster DNS name of an individual pod of a StatefulSet.
A second use case is implementing an in cluster blue green hard cutover between 2 services of type ClusterIP. This could be good if you have a distributed monolith (several microservices whose versions need to be tightly coupled in order to work, due to a lack of API contracts) that requires several deployments to be updated at the same time, and you want to avoid doing rolling updates of components that are not backwards compatible while live production traffic could be coming in. (This also allows you to stage a production deployment and have a 2 second upgrade vs a 10+ minute upgrade window.)
A third use case is to offer a level of consistency between a lower environment and a higher environment. In the diagram below, a ClusterIP service in a Dev Environment could have the same service name as an ExternalName service in a Prod Environment. Consistency usually makes automation and configuration management much easier.
ClusterIP services create predictable static Inner Cluster DNS Names that make inner cluster communication easier, and they act like a highly available inner cluster load balancer. ClusterIP services:
- Create a predictable static inner cluster dns name + static inner cluster IP to act as a static frontend for reaching pods which have unpredictable dynamic IPs.
- Act like highly available inner cluster load balancers: You can create 3 nginx pods and an nginx service of type ClusterIP and use the service to load balance traffic between the backend pods.
It’s not just a load balancer, it’s a highly available load balancer. To let this sink in, let’s think about how this might work with computers. If you have 1 computer load balancing traffic to 3 backend computers, then you don’t have high availability: you have highly available backends, but your load balancer becomes a single point of failure. The classic fix is to run 2 load balancer computers (running HAProxy, for example) that share a Virtual IP Address through a failover tool such as Keepalived. Then 2 computers load balance traffic to 3 backend computers to achieve true high availability.
Now when a pod’s running it’s easy to use “kubectl get pods -o wide” to identify which computer the pod is running on. It’s my hope that going over classical load balancing has caused you to ask the question:
If a service, which is like a load balancer, needs to be highly available, then where does it exist in the cluster?
The short answer is that a service exists on every node in the cluster. A service is defined in etcd, and in most implementations of Kubernetes kube-proxy edits the iptables of every node to match the definition of the service. So essentially the only thing that will bring a service down is if etcd or the entire cluster goes down.
- Are declaratively managed using a configuration YAML that defines a desired end state using label selectors. Reconciliation loops then automagically discover, add, and remove backend pods based on labels. Auto discovery of pods is very important given that they have dynamic IP addresses, and every time a pod restarts it’ll get a new IP address.
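The label based auto discovery described above boils down to a matching rule: a pod is a backend of a service when its labels contain every key/value pair in the service’s selector. The following is a minimal sketch of one pass of that matching, not Kubernetes’ actual controller code. The function name, the dictionaries, and the pod IPs are all hypothetical.

```python
def select_backends(selector, pods):
    """Return the IPs of pods whose labels contain every selector key/value."""
    return sorted(
        pod["ip"]
        for pod in pods
        if all(pod["labels"].get(key) == value for key, value in selector.items())
    )


selector = {"app": "nginx"}   # the service's label selector
pods = [
    {"ip": "10.42.0.5", "labels": {"app": "nginx"}},
    {"ip": "10.42.1.7", "labels": {"app": "nginx", "tier": "web"}},
    {"ip": "10.42.2.9", "labels": {"app": "redis"}},
]
print(select_backends(selector, pods))

# A pod restarts and comes back with a new IP; the next pass picks it up
# without anyone editing the service.
pods[0] = {"ip": "10.42.0.23", "labels": {"app": "nginx"}}
print(select_backends(selector, pods))
```

Extra labels on a pod don’t prevent a match, which is why the second pod is selected even though it also carries a `tier` label.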
Makes LAN to Inner Cluster Communication Possible
NodePort services open a consistent port on every node in the cluster and map traffic that comes in on that port to a service inside the cluster. (kube-proxy is what’s responsible for redirecting traffic coming in on the Node Port to the service.) NodePort service is somewhat misnamed; NodesPort service would have been a better name, because when this service is created a port is, by default, randomly chosen from the range 30000-32767, and every node in the cluster starts listening on that port. If you deployed a RabbitMQ pod on your cluster, its Management Web GUI/Web API would be available to pods over <servicename>.<namespace>:15672. A management laptop on the same LAN could access the management web GUI via <NodeIP>:<randomly generated port>, which works but isn’t very convenient. Even if you updated LAN DNS to map rabbitmq.lan to the IP of every node, you’d still have to use a randomly generated port, e.g., http://rabbitmq.lan:<random port>
(Note NodePorts will be randomly chosen within the NodePort range by default, but if you want consistency/predictability it is possible to explicitly assign a Node Port to be used in the service yaml. Also pods can listen directly on ports like 80 and 443, this is an advanced scenario that doesn’t use NodePort Service, and will be covered in the 3rd article.)
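The two ways a Node Port gets picked, randomly from the default range or explicitly assigned, can be sketched like this. It is a simplified illustration of the allocation rule, not the API server’s real allocator, and the function name `allocate_node_port` is hypothetical.

```python
import random

NODE_PORT_RANGE = range(30000, 32768)   # default range 30000-32767, inclusive


def allocate_node_port(in_use, requested=None):
    """Honor an explicitly requested port if it is valid and free, else pick randomly."""
    if requested is not None:
        if requested not in NODE_PORT_RANGE:
            raise ValueError(f"port {requested} is outside the range 30000-32767")
        if requested in in_use:
            raise ValueError(f"port {requested} is already allocated")
        port = requested
    else:
        free_ports = [p for p in NODE_PORT_RANGE if p not in in_use]
        port = random.choice(free_ports)
    in_use.add(port)   # the same port is then opened on every node
    return port


in_use = set()
print(allocate_node_port(in_use, requested=31111))   # explicitly assigned
print(allocate_node_port(in_use))                    # randomly chosen
```

Asking for a port such as 80 fails in this model because it falls outside the Node Port range, which mirrors why a NodePort service alone can’t give you http://rabbitmq.lan without a port number.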
LoadBalancer services make Internet to Inner Cluster communication possible, and can also make LAN to Inner Cluster communication easier.
- LoadBalancer Services don’t work for all Kubernetes implementations, most notably bare metal Kubernetes implementations, like vanilla kubeadm. The LoadBalancer service type works for almost all Cloud Provider implementations of Kubernetes, and instances of kubeadm that are installed using cloud provider flags.
- The way the LoadBalancer service works on Cloud Provider Platforms is by having a cloud provider specific version of the kube controller manager that knows how to interface with Cloud Provider APIs to dynamically provision a Highly Available Cloud Load Balancer as a Service that exists external to the cluster but is dynamically configured and declaratively managed by the cluster.
- Scenario 1 Easier Internet to service inside of cluster communication: A Cloud LB is provisioned with a public IP, configured to load balance traffic between every Node in the Cluster, and is configured to remap port 80/443 to whatever Node Port got randomly generated. Internet DNS gets updated so that www.coolsite.com points to the public IP of this Cloud Load Balancer.
- Scenario 2 Easier LAN to service inside of cluster communication: A problem with NodePort service is that even if you configure LAN DNS to point to the IPs of nodes in the cluster, you’d end up with, say, http://rabbitmq.lan:<non-standard port>. A Cloud LB can be provisioned with a private IP within the CIDR of the Cluster’s LAN, configured to load balance traffic between every Node in the Cluster, and configured to remap a desired port to a Node Port. Which means you can end up with http://rabbitmq.lan which is equivalent to http://rabbitmq.lan:80
How Traffic Flows Into a Kubernetes Cluster using Service Type Load Balancer
When a user types in www.website1.com, Internet DNS maps their request to 22.214.171.124, and http:// implies port 80. The highly available cloud load balancer at the top balances traffic between the nodes and remaps traffic that came in on port 80 to port 31111, a Node Port that’s consistent on every node. Any node that receives traffic on that Node Port knows to forward it to website1’s service using kube-proxy. A similar flow occurs for website2.
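The hops in that flow can be traced with a few lookup tables. This is a toy walk-through of the path a request takes, not a model of how the cloud load balancer or kube-proxy are implemented. Port 80, Node Port 31111, and the website1 service come from the description above, while the node names and pod IPs are made up for illustration.

```python
import random

NODES = ["node-1", "node-2", "node-3"]                        # hypothetical node names
LB_PORT_MAP = {80: 31111}                                     # cloud LB: public port -> Node Port
NODE_PORT_TO_SERVICE = {31111: "website1"}                    # identical on every node
SERVICE_BACKENDS = {"website1": ["10.42.0.5", "10.42.1.7"]}   # hypothetical pod IPs


def route(public_port):
    """Follow one request from the cloud load balancer to a backend pod."""
    node = random.choice(NODES)                          # LB spreads traffic across nodes
    node_port = LB_PORT_MAP[public_port]                 # LB remaps port 80 to the Node Port
    service = NODE_PORT_TO_SERVICE[node_port]            # node forwards the Node Port to the service
    pod_ip = random.choice(SERVICE_BACKENDS[service])    # service load balances to a pod
    return node, node_port, service, pod_ip


print(route(80))
```

Whichever node the load balancer picks, the Node Port lookup gives the same answer, which is why it doesn’t matter where the backend pods happen to be running.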
What about that Ingress Controller concept I keep hearing about? Where does that fit in?
Recall that a Load Balancer Service builds on the functionality of a NodePort service, and a NodePort service builds on the functionality of a ClusterIP service.
Likewise the Ingress Controller concept builds on the LoadBalancer Service functionality. We’ll cover the Ingress concept in depth in the next article.