The motivation behind writing this article stems from a specific problem we encountered while developing Coroot, an open-source observability tool. Coroot leverages eBPF to build a Service Map that covers 100% of your system without requiring any changes to your application code.
To build a service map, we need to discover how containers in a cluster communicate with each other. Coroot's agent captures every outbound TCP connection of every container. However, when a container connects to another app through a Kubernetes Service, it becomes challenging to accurately determine the actual destination container behind such a connection.
In this article, we'll look under the hood at how Kubernetes load balancing works, using this seemingly simple task as an example.

Let's deploy an app (nginx) with 2 replicas and a service:
$ kubectl create deployment nginx --image=nginx --replicas=2
deployment.apps/nginx created

$ kubectl expose deployment nginx --port=80
service/nginx exposed

$ kubectl get pods -l app=nginx -o wide
NAME                     READY   STATUS    RESTARTS   AGE   IP           NODE   NOMINATED NODE   READINESS GATES
nginx-748c667d99-pdppx   1/1     Running   0          50s   10.42.0.12   lab    <none>           <none>
nginx-748c667d99-9h6gr   1/1     Running   0          50s   10.42.0.11   lab    <none>           <none>

$ kubectl get services -l app=nginx
NAME    TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
nginx   ClusterIP   10.43.209.15   <none>        80/TCP    17s

$ kubectl get endpoints -l app=nginx
NAME    ENDPOINTS                      AGE
nginx   10.42.0.11:80,10.42.0.12:80    114s
Now, let’s run another pod and connect to an nginx instance through the service:
$ kubectl run --rm client -it --image arunvelsriram/utils sh
$ telnet nginx 80
Trying 10.43.209.15... ← Service IP
Connected to nginx.default.svc.cluster.local.
Escape character is '^]'.
From the client pod's perspective, it connected to 10.43.209.15:80:
$ netstat -an |grep EST tcp 0 0 10.42.0.19:50856 10.43.209.15:80 ESTABLISHED
We know that the client is in fact connected to one of the nginx pods. But how did that happen, and how can we determine which specific nginx pod the client is connected to?
When the service was created, Kubernetes (specifically kube-proxy) set up iptables rules to distribute traffic randomly among the available nginx pods. These rules rewrite the destination IP address of packets sent to the service IP, replacing it with the IP address of one of the nginx pods.
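If you want to see these rules yourself, you can dump kube-proxy's NAT chains on the node. A minimal sketch for kube-proxy in iptables mode; the hash-suffixed chain names are placeholders and will differ in your cluster:

# KUBE-SERVICES matches packets destined for the ClusterIP...
iptables -t nat -S KUBE-SERVICES | grep 10.43.209.15
# ...and jumps to a per-service KUBE-SVC-* chain, which picks one of the
# per-endpoint KUBE-SEP-* chains at random (statistic match, random mode)
# and DNATs the packet to that pod's IP:port
iptables -t nat -S KUBE-SVC-<hash>
iptables -t nat -S KUBE-SEP-<hash>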
This load balancing approach relies on the conntrack table (part of the Linux network stack) to keep track of connection states. The conntrack table maintains information about established connections, including the source and destination IP addresses and ports.
Hence, we can identify the translated address of the connection in the conntrack table within the root network namespace on the node:
root@lab:~# conntrack -L |grep 50856
tcp 6 86392 ESTABLISHED src=10.42.0.19 dst=10.43.209.15 sport=50856 dport=80 src=10.42.0.12 dst=10.42.0.19 sport=80 dport=50856 [ASSURED] use=1
When a packet is transmitted from the server to the client, the Linux kernel performs a lookup in the conntrack table to identify the corresponding connection. This is the reason why the second IP:PORT pair in the table entry appears in reverse order.
As you can see, in this particular scenario, the connection was established to 10.42.0.12:80 (nginx-pod-2).
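As a side note, the conntrack CLI can also filter on that reply tuple directly, which is handy when you already know a pod's IP. A small sketch, assuming conntrack-tools is installed on the node:

# list connections whose reply (server-to-client) source address is the pod IP
conntrack -L --proto tcp --reply-src 10.42.0.12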
Now let’s see how the same scenario works with Istio Service Mesh.
A service mesh achieves better control over service-to-service communications by implementing a dedicated infrastructure layer that intercepts and manages network traffic between services. It does this by using sidecar proxies, such as Envoy, that are deployed alongside each application instance.
As Istio is already installed in the cluster, let's enable automatic sidecar injection for the default namespace:
$ kubectl label namespace default istio-injection=enabled --overwrite namespace/default labeled
Now let’s run a client pod and connect to the nginx service:
$ kubectl run --rm client -it --image arunvelsriram/utils sh
$ telnet nginx 80
Trying 10.43.209.15... ← Service IP
Connected to nginx.default.svc.cluster.local.
Escape character is '^]'.
From the client’s perspective, nothing has changed. It is connected to the service IP as before:
$ netstat -an|grep EST tcp 0 0 10.42.0.19:32840 10.43.209.15:80 ESTABLISHED
The iptables rules still exist in the root network namespace; however, there is no relevant record in the root conntrack table anymore. This is because the client's outbound packets are now intercepted directly within the pod's own network namespace.
When Istio injects a sidecar proxy into a pod, its component, pilot-agent, configures iptables to redirect all outbound traffic to the Envoy proxy within the same network namespace for further processing and control.
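You can inspect this redirect from the node by entering the pod's network namespace and dumping its NAT rules. A rough sketch (ISTIO_OUTPUT and ISTIO_REDIRECT are the chains created by Istio's standard iptables setup; <app_pid> is the PID of any process running in the pod):

# outbound TCP traffic is diverted via ISTIO_OUTPUT...
nsenter -t <app_pid> -n iptables -t nat -S ISTIO_OUTPUT
# ...and eventually REDIRECTed to Envoy's outbound listener on port 15001
nsenter -t <app_pid> -n iptables -t nat -S ISTIO_REDIRECT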
Knowing this, we can locate the relevant conntrack table entry within the network namespace of the client pod:
root@lab:~# nsenter -t <telnet_pid> -n conntrack -L | grep 32840
tcp 6 428676 ESTABLISHED src=10.42.0.19 dst=10.43.209.15 sport=32840 dport=80 src=127.0.0.1 dst=10.42.0.19 sport=15001 dport=32840 [ASSURED] use=1
Now we can see that the actual connection destination is 127.0.0.1:15001 (Envoy). Since Envoy establishes its own connections to the service endpoints, it's not feasible to trace the initial connection directly to its final destination.
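That said, you can still ask the sidecar itself which upstream endpoints it routes to. A sketch using istioctl (the exact output format depends on the Istio version; Envoy's admin API on localhost:15000 inside the sidecar exposes similar data):

# list the endpoints Envoy knows for the nginx service
istioctl proxy-config endpoints <client_pod> | grep nginx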
However, from Coroot’s perspective, this is not a problem. Its agent captures and traces all outbound connections, including those from sidecar proxies. As a result, there is no difference on service maps between connections proxied by Envoy and those that are not.
Cilium is recognized as one of the most powerful network plugins for Kubernetes. It not only provides basic network and security capabilities but also offers an eBPF-based alternative to Kubernetes' default iptables-based load balancing mechanism.
To set up a k3s cluster with Cilium, I used the following commands:
# curl -sfL https://get.k3s.io | \
    INSTALL_K3S_EXEC='--flannel-backend=none --disable-network-policy --disable traefik --disable-kube-proxy' \
    sh -

$ helm repo add cilium https://helm.cilium.io/
$ helm repo update
$ helm install cilium cilium/cilium \
    --set k8sServiceHost=<api_server_ip> \
    --set k8sServicePort=6443 \
    --set global.containerRuntime.integration="containerd" \
    --set global.containerRuntime.socketPath="/var/run/k3s/containerd/containerd.sock" \
    --set global.kubeProxyReplacement="strict" \
    --namespace kube-system
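Once the Cilium agent pods are running, it's worth verifying that the eBPF kube-proxy replacement is actually active. A sketch, where <cilium_pod> is one of the cilium agent pods in kube-system:

# cilium status reports whether the kube-proxy replacement is enabled
kubectl -n kube-system exec <cilium_pod> -- cilium status | grep KubeProxyReplacement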
Now, let's perform our experiment with the nginx service and telnet again:
$ kubectl run --rm client -it --image arunvelsriram/utils sh
$ telnet nginx 80
Trying 10.43.49.44... ← Service IP
Connected to nginx.default.svc.cluster.local.
Escape character is '^]'.
From the client’s perspective, nothing has changed. It is connected to the service IP as before:
$ netstat -an|grep EST tcp 0 0 10.0.0.65:42014 10.43.49.44:80 ESTABLISHED
As we have completely disabled kube-proxy, there is no entry related to this connection in the conntrack table.
Cilium intercepts all traffic directed towards service IPs and distributes it among the corresponding pods at the eBPF level. In order to perform reverse network address translation, Cilium maintains its own connection tracking table on top of eBPF maps. To access and examine this table, we can use the cilium CLI tool within the cilium pod:
$ kubectl exec -ti <cilium_pod> -n kube-system -- cilium bpf ct list global | grep 42014
TCP OUT 10.43.49.44:80 -> 10.0.0.65:42014 service expires=30289 RxPackets=0 RxBytes=7 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=0 TxBytes=0 TxFlagsSeen=0x13 LastTxReport=8689 Flags=0x0012 [ TxClosing SeenNonSyn ] RevNAT=6 SourceSecurityID=0 IfIndex=0
TCP OUT 10.0.0.65:42014 -> 10.0.0.80:80 expires=8699 RxPackets=3 RxBytes=206 RxFlagsSeen=0x13 LastRxReport=8689 TxPackets=3 TxBytes=206 TxFlagsSeen=0x13 LastTxReport=8689 Flags=0x0013 [ RxClosing TxClosing SeenNonSyn ] RevNAT=6 SourceSecurityID=17722 IfIndex=0
TCP IN 10.0.0.65:42014 -> 10.0.0.80:80 expires=8699 RxPackets=3 RxBytes=206 RxFlagsSeen=0x13 LastRxReport=8689 TxPackets=3 TxBytes=206 TxFlagsSeen=0x13 LastTxReport=8689 Flags=0x0013 [ RxClosing TxClosing SeenNonSyn ] RevNAT=0 SourceSecurityID=17722 IfIndex=0
As we can see, in this particular case, our client pod is connected to 10.0.0.80:80 (nginx-pod-2).
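The RevNAT=6 field in the entries above points to Cilium's reverse-NAT table, which it uses to translate replies back to the service address. A sketch of how to correlate it with the service and its backends, using the same cilium pod:

# frontends (service IPs) and the backends they load-balance to
kubectl exec -ti <cilium_pod> -n kube-system -- cilium bpf lb list
# reverse-NAT entries, keyed by the RevNAT ID seen in the conntrack output
kubectl exec -ti <cilium_pod> -n kube-system -- cilium bpf lb list --revnat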
Returning to Coroot’s agent, it automatically detects the presence of Cilium in the cluster and leverages its conntrack table to accurately determine the actual destination of every connection.
Kubernetes: My network topology can get pretty complicated!
Docker Swarm: Hold my beer!
Let's explore how Service-to-Service communications work in a Docker Swarm cluster, just for fun! Here is the stack definition:
version: "3.8"
services:
  nginx:
    image: nginx
    ports:
      - target: 80
        published: 80
        protocol: tcp
    deploy:
      mode: replicated
      replicas: 2
  client:
    image: arunvelsriram/utils
    command: ['sleep', '100500']
    deploy:
      mode: replicated
      replicas: 1
By applying this configuration, Docker creates a group of containers and establishes an overlay network that incorporates a load balancer.
$ docker stack deploy -c demo.yaml demo
Creating network demo_default
Creating service demo_nginx
Creating service demo_client
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
97bf67fc9925 arunvelsriram/utils:latest "sleep 100500" 10 hours ago Up 10 hours demo_client.1.62e6nufzfd4t1g8kde0r40r0h
f0bd5fc75c23 nginx:latest "/docker-entrypoint.…" 2 days ago Up 2 days 80/tcp demo_nginx.2.c2q5ukp66jawe1z0cuyvcfthr
ed101e5d06f7 nginx:latest "/docker-entrypoint.…" 2 days ago Up 2 days 80/tcp demo_nginx.1.uqz8vgp8b7lywoyi1f6io56w7
$ docker network inspect demo_default
[
{
"Name": "demo_default",
"Id": "trjfagvgjgu7iwbzp6ivuvcub",
"Containers": {
"97bf67fc9925ae55c28b16230defad8afe13e54af7bdc24f5b7e05eb8a8eef14": {
"Name": "demo_client.1.62e6nufzfd4t1g8kde0r40r0h",
"EndpointID": "c4af5eb9852e4f2d13086b4ea496bc52cc88e26d5f447f4bf6a7cd1126629d48",
"MacAddress": "02:42:0a:00:02:09",
"IPv4Address": "10.0.2.9/24",
"IPv6Address": ""
},
"ed101e5d06f7c8757187f23b84db415627860f346ea6c585a8e2455e102f024b": {
"Name": "demo_nginx.1.uqz8vgp8b7lywoyi1f6io56w7",
"EndpointID": "81bb68161d02c55f1315a1b3462805ca21f10892a00c0a494d39fa5ceba81e5b",
"MacAddress": "02:42:0a:00:02:03",
"IPv4Address": "10.0.2.3/24",
"IPv6Address": ""
},
"f0bd5fc75c2390dcda9efb9a36f099d668583260b05c48933eca5bd4f0a567b9": {
"Name": "demo_nginx.2.c2q5ukp66jawe1z0cuyvcfthr",
"EndpointID": "f5a50e587f467fc02fb22855e298867a0c86f89da4a78f3608ea0324df708190",
"MacAddress": "02:42:0a:00:02:04",
"IPv4Address": "10.0.2.4/24",
"IPv6Address": ""
},
"lb-demo_default": {
"Name": "demo_default-endpoint",
"EndpointID": "490a5b5bc526ee4aaca34c5a0d443da324d834b20611127769ee3bcff578e0a7",
"MacAddress": "02:42:0a:00:02:05",
"IPv4Address": "10.0.2.5/24",
"IPv6Address": ""
}
},
...
}
]
The load balancer (lb-demo_default) is a dedicated network namespace that has been configured to distribute traffic among containers using IPVS.
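You can inspect the IPVS configuration inside that namespace with ipvsadm. A sketch, assuming ipvsadm is installed on the host (the netns path is the same one used below):

# list IPVS virtual services and the real servers (nginx containers) behind them
nsenter --net=/run/docker/netns/lb_trjfagvgj ipvsadm -Ln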
Now let’s run telnet from the client container to the nginx service:
$ docker exec -ti demo_client.1.62e6nufzfd4t1g8kde0r40r0h telnet nginx 80
Trying 10.0.2.2... ← Service IP
Connected to nginx.
Escape character is '^]'.
# nsenter -t <telnet_pid> -n netstat -an |grep EST
tcp 0 0 10.0.2.9:51504 10.0.2.2:80 ESTABLISHED
The IP address 10.0.2.2 is assigned to the load balancer network namespace:
# nsenter --net=/run/docker/netns/lb_trjfagvgj ip a l
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
47: eth0@if48: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
link/ether 02:42:0a:00:02:05 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.0.2.5/24 brd 10.0.2.255 scope global eth0
valid_lft forever preferred_lft forever
inet 10.0.2.2/32 scope global eth0
valid_lft forever preferred_lft forever
inet 10.0.2.6/32 scope global eth0
valid_lft forever preferred_lft forever
As IPVS also maintains connection states in the conntrack table, we can locate the actual destination of our connection there:
# nsenter --net=/run/docker/netns/lb_trjfagvgj conntrack -L |grep 51504
tcp 6 431940 ESTABLISHED src=10.0.2.9 dst=10.0.2.2 sport=51504 dport=80 src=10.0.2.4 dst=10.0.2.5 sport=80 dport=51504 [ASSURED] use=1
Coroot’s agent operates in a similar way. It first identifies the appropriate overlay network for each container and then locates the load balancer network namespace. Afterwards, it performs a conntrack table lookup to determine the actual destination of a given connection.
As we have seen, there are multiple ways to organize Service-to-Service communications in Kubernetes. Each approach has its own benefits and drawbacks, but whichever one you choose, it is essential to understand how it works and to maintain observability of your system.
Coroot seamlessly integrates with all of these methods, providing you with a comprehensive map of your services within minutes after installation.
Follow the instructions on our Getting started page to try Coroot now. Not ready to get started with Coroot? Check out our live demo.
If you like Coroot, give us a ⭐ on GitHub or share your experience on G2.
Any questions or feedback? Reach out to us on Slack.