A deep dive into Service-to-Service communications in Kubernetes


When you deploy an application to a Kubernetes cluster, one of the essential steps is creating a Service. It enables other apps within the cluster or external clients to access this app through the network. A Service in Kubernetes is a straightforward abstraction, but like any abstraction, it adds complexity to the system and can make troubleshooting more challenging.

Motivation

The motivation behind writing this article stems from a specific problem we encountered while developing Coroot, an open source observability tool. Coroot leverages eBPF to build a Service Map that covers 100% of your system without the need to modify your application code.

To build a service map, we need to discover how containers in a cluster communicate with each other. Coroot’s agent captures every outbound TCP connection of every container. However, when a container connects to another app through a Kubernetes Service, it becomes challenging to accurately determine the actual destination container of such a connection.

In this article, we’ll look under the hood at how Kubernetes load balancing works, using this seemingly simple task as an example.

Built-in Kubernetes load balancing based on iptables

Let’s deploy an app (nginx) with 2 replicas and a service:

$ kubectl create deployment nginx --image=nginx --replicas=2
deployment.apps/nginx created

$ kubectl expose deployment nginx --port=80
service/nginx exposed

$ kubectl get pods -l app=nginx -o wide
NAME                     READY   STATUS    RESTARTS   AGE   IP           NODE   NOMINATED NODE   READINESS GATES
nginx-748c667d99-pdppx   1/1     Running   0          50s   10.42.0.12   lab    <none>           <none>
nginx-748c667d99-9h6gr   1/1     Running   0          50s   10.42.0.11   lab    <none>           <none>

$ kubectl get services -l app=nginx
NAME    TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
nginx   ClusterIP   10.43.209.15   <none>        80/TCP    17s

$ kubectl get endpoints -l app=nginx
NAME    ENDPOINTS                     AGE
nginx   10.42.0.11:80,10.42.0.12:80   114s

Now, let’s run another pod and connect to an nginx instance through the service:

$ kubectl run --rm client -it --image arunvelsriram/utils sh
$ telnet nginx 80
Trying 10.43.209.15... ← Service IP
Connected to nginx.default.svc.cluster.local.
Escape character is '^]'.

From the client pod’s perspective, it is connected to 10.43.209.15:80:

$ netstat -an |grep EST
tcp        0      0 10.42.0.19:50856        10.43.209.15:80         ESTABLISHED

We know that the client is, in fact, connected to one of our nginx pods. But how did this happen, and how can we determine which nginx pod the client is connected to?

When the service was created, Kubernetes (specifically kube-proxy) established iptables rules to distribute traffic randomly among the available nginx pods. These rules change the destination IP address of incoming traffic to the IP address of one of the nginx pods.
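
To see these rules on the node, we can dump the NAT table. The output below is abridged and the chain suffixes are placeholders (kube-proxy generates them per service and endpoint), but the structure is typical for a ClusterIP service: the KUBE-SVC chain randomly picks one of the KUBE-SEP (endpoint) chains, which then DNATs the packet to a pod IP:

root@lab:~# iptables-save -t nat | grep "default/nginx"
-A KUBE-SERVICES -d 10.43.209.15/32 -p tcp -m comment --comment "default/nginx cluster IP" -m tcp --dport 80 -j KUBE-SVC-XXXXXXXXXXXXXXXX
-A KUBE-SVC-XXXXXXXXXXXXXXXX -m comment --comment "default/nginx" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-YYYYYYYYYYYYYYYY
-A KUBE-SVC-XXXXXXXXXXXXXXXX -m comment --comment "default/nginx" -j KUBE-SEP-ZZZZZZZZZZZZZZZZ
-A KUBE-SEP-YYYYYYYYYYYYYYYY -p tcp -m comment --comment "default/nginx" -m tcp -j DNAT --to-destination 10.42.0.11:80
-A KUBE-SEP-ZZZZZZZZZZZZZZZZ -p tcp -m comment --comment "default/nginx" -m tcp -j DNAT --to-destination 10.42.0.12:80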

This load balancing approach relies on the conntrack table (a part of Linux network stack) to keep track of the connection states. The conntrack table maintains information about established connections, including the source and destination IP addresses and ports.

Hence, we can identify the translated address of the connection in the conntrack table within the root network namespace on the node:

root@lab:~# conntrack -L |grep 50856
tcp 6 86392 ESTABLISHED src=10.42.0.19 dst=10.43.209.15 sport=50856 dport=80 src=10.42.0.12 dst=10.42.0.19 sport=80 dport=50856 [ASSURED] use=1

When a packet is transmitted from the server to the client, the Linux kernel performs a lookup in the conntrack table to identify the corresponding connection. This is the reason why the second IP:PORT pair in the table entry appears in reverse order.

As you can see, in this particular scenario, the connection was established to 10.42.0.12:80 (nginx-pod-2).

Istio Service Mesh

Now let’s see how the same scenario works with Istio Service Mesh.

A service mesh achieves better control over service-to-service communications by implementing a dedicated infrastructure layer that intercepts and manages network traffic between services. It does this by using sidecar proxies, such as Envoy, that are deployed alongside each application instance.

As Istio is already installed in the cluster, let’s proceed by enabling automatic sidecar injection within our cluster:

$ kubectl label namespace default istio-injection=enabled --overwrite
namespace/default labeled

Now let’s run a client pod and connect to the nginx service:

$ kubectl run --rm client -it --image arunvelsriram/utils sh
$ telnet nginx 80
Trying 10.43.209.15... ← Service IP
Connected to nginx.default.svc.cluster.local.
Escape character is '^]'.

From the client’s perspective, nothing has changed. It is connected to the service IP as before:

$ netstat -an|grep EST
tcp        0      0 10.42.0.19:32840        10.43.209.15:80         ESTABLISHED

The iptables rules still exist in the root network namespace; however, there is no longer a relevant record in the root conntrack table. This is because the client’s outbound packets are now intercepted within the pod’s own network namespace.

When Istio injects a sidecar proxy into a pod, its pilot-agent component configures iptables rules that redirect all outbound traffic to the Envoy proxy within the same network namespace for further processing and control.
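
We can verify this by listing the NAT rules inside the client pod’s network namespace. The output below is abridged (the exact rule set depends on the Istio version), but the key part is the REDIRECT of outbound TCP traffic to Envoy’s port 15001:

root@lab:~# nsenter -t <telnet_pid> -n iptables -t nat -S
...
-A OUTPUT -p tcp -j ISTIO_OUTPUT
...
-A ISTIO_OUTPUT -j ISTIO_REDIRECT
-A ISTIO_REDIRECT -p tcp -j REDIRECT --to-ports 15001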

Knowing this, we can locate the relevant conntrack table entry within the network namespace of the client pod:

root@lab:~# nsenter -t <telnet_pid> -n conntrack -L |grep 32840
tcp 6 428676 ESTABLISHED src=10.42.0.19 dst=10.43.209.15 sport=32840 dport=80 src=127.0.0.1 dst=10.42.0.19 sport=15001 dport=32840 [ASSURED] use=1

Now we can see that the actual connection destination is 127.0.0.1:15001 (Envoy). Since Envoy establishes its own connections to the service endpoints, it’s not feasible to trace the initial connection directly to the final destination.
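
That said, if istioctl is installed, you can at least see the set of upstream endpoints Envoy balances this service across (illustrative output; it lists the candidate destinations, not the one chosen for our particular connection):

$ istioctl proxy-config endpoints client | grep 'outbound|80||nginx'
10.42.0.11:80      HEALTHY     OK     outbound|80||nginx.default.svc.cluster.local
10.42.0.12:80      HEALTHY     OK     outbound|80||nginx.default.svc.cluster.local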

However, from Coroot’s perspective, this is not a problem. Its agent captures and traces all outbound connections, including those from sidecar proxies. As a result, there is no difference on service maps between connections proxied by Envoy and those that are not.

Cilium as a kube-proxy/iptables replacement

Cilium is recognized as one of the most powerful network plugins for Kubernetes. It not only provides basic network and security capabilities but also offers an eBPF-based alternative to Kubernetes’ default iptables-based load balancing mechanism.

To set up a k3s cluster with Cilium, I used the following commands:

# curl -sfL https://get.k3s.io | \
  INSTALL_K3S_EXEC='--flannel-backend=none --disable-network-policy --disable traefik --disable-kube-proxy' \
  sh -

$ helm repo add cilium https://helm.cilium.io/

$ helm repo update

$ helm install cilium cilium/cilium \
  --set k8sServiceHost=<api_server_ip> \
--set k8sServicePort=6443 \
  --set global.containerRuntime.integration="containerd" \
  --set global.containerRuntime.socketPath="/var/run/k3s/containerd/containerd.sock" \
  --set global.kubeProxyReplacement="strict" \
  --namespace kube-system

Now, let’s perform our experiment with the nginx service and telnet again:

$ kubectl run --rm client -it --image arunvelsriram/utils sh
$ telnet nginx 80
Trying 10.43.49.44... ← Service IP
Connected to nginx.default.svc.cluster.local.
Escape character is '^]'.

From the client’s perspective, nothing has changed. It is connected to the service IP as before:

$ netstat -an|grep EST
tcp        0      0 10.0.0.65:42014         10.43.49.44:80          ESTABLISHED

As we have completely disabled kube-proxy, there is no entry related to this connection in the conntrack table.
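
You can confirm this on the node: the lookup that worked in the kube-proxy scenario now returns nothing (stderr is silenced because conntrack prints its summary line there):

root@lab:~# conntrack -L 2>/dev/null | grep 42014
root@lab:~#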

Cilium intercepts all traffic directed towards service IPs and distributes it among the corresponding pods at the eBPF level. In order to perform reverse network address translation, Cilium maintains its own connection tracking table on top of eBPF maps. To access and examine this table, we can use the cilium CLI tool within the cilium pod:

$ kubectl exec -ti <cilium_pod> -n kube-system -- cilium bpf ct list global|grep 42014
TCP OUT 10.43.49.44:80 -> 10.0.0.65:42014 service expires=30289 RxPackets=0 RxBytes=7 RxFlagsSeen=0x00 LastRxReport=0 TxPackets=0 TxBytes=0 TxFlagsSeen=0x13 LastTxReport=8689 Flags=0x0012 [ TxClosing SeenNonSyn ] RevNAT=6 SourceSecurityID=0 IfIndex=0
TCP OUT 10.0.0.65:42014 -> 10.0.0.80:80 expires=8699 RxPackets=3 RxBytes=206 RxFlagsSeen=0x13 LastRxReport=8689 TxPackets=3 TxBytes=206 TxFlagsSeen=0x13 LastTxReport=8689 Flags=0x0013 [ RxClosing TxClosing SeenNonSyn ] RevNAT=6 SourceSecurityID=17722 IfIndex=0
TCP IN 10.0.0.65:42014 -> 10.0.0.80:80 expires=8699 RxPackets=3 RxBytes=206 RxFlagsSeen=0x13 LastRxReport=8689 TxPackets=3 TxBytes=206 TxFlagsSeen=0x13 LastTxReport=8689 Flags=0x0013 [ RxClosing TxClosing SeenNonSyn ] RevNAT=0 SourceSecurityID=17722 IfIndex=0

As we can see, in this case our client pod is connected to 10.0.0.80:80 (nginx-pod-2).

Returning to Coroot: its agent automatically detects the presence of Cilium in the cluster and leverages Cilium’s conntrack table to accurately determine the actual destination of every connection.

Bonus track: Docker Swarm

Kubernetes: My network topology can get pretty complicated!
Docker Swarm: Hold my beer!

Let’s explore how Service-to-Service communications work in a Docker Swarm cluster just for fun!

version: "3.8"

services:
  nginx:
    image: nginx
    ports:
      - target: 80
        published: 80
        protocol: tcp
    deploy:
      mode: replicated
      replicas: 2

  client:
    image: arunvelsriram/utils
    command: ['sleep', '100500']
    deploy:
      mode: replicated
      replicas: 1

By applying this configuration, Docker creates a group of containers and establishes an overlay network that incorporates a load balancer.

$ docker stack deploy -c demo.yaml demo
Creating network demo_default
Creating service demo_nginx
Creating service demo_client

$ docker ps
CONTAINER ID   IMAGE                        COMMAND                  CREATED        STATUS        PORTS     NAMES
97bf67fc9925   arunvelsriram/utils:latest   "sleep 100500"           10 hours ago   Up 10 hours             demo_client.1.62e6nufzfd4t1g8kde0r40r0h
f0bd5fc75c23   nginx:latest                 "/docker-entrypoint.…"   2 days ago     Up 2 days     80/tcp    demo_nginx.2.c2q5ukp66jawe1z0cuyvcfthr
ed101e5d06f7   nginx:latest                 "/docker-entrypoint.…"   2 days ago     Up 2 days     80/tcp    demo_nginx.1.uqz8vgp8b7lywoyi1f6io56w7

$ docker network inspect demo_default
[
    {
        "Name": "demo_default",
        "Id": "trjfagvgjgu7iwbzp6ivuvcub",
        "Containers": {
            "97bf67fc9925ae55c28b16230defad8afe13e54af7bdc24f5b7e05eb8a8eef14": {
                "Name": "demo_client.1.62e6nufzfd4t1g8kde0r40r0h",
                "EndpointID": "c4af5eb9852e4f2d13086b4ea496bc52cc88e26d5f447f4bf6a7cd1126629d48",
                "MacAddress": "02:42:0a:00:02:09",
                "IPv4Address": "10.0.2.9/24",
                "IPv6Address": ""
            },
            "ed101e5d06f7c8757187f23b84db415627860f346ea6c585a8e2455e102f024b": {
                "Name": "demo_nginx.1.uqz8vgp8b7lywoyi1f6io56w7",
                "EndpointID": "81bb68161d02c55f1315a1b3462805ca21f10892a00c0a494d39fa5ceba81e5b",
                "MacAddress": "02:42:0a:00:02:03",
                "IPv4Address": "10.0.2.3/24",
                "IPv6Address": ""
            },
            "f0bd5fc75c2390dcda9efb9a36f099d668583260b05c48933eca5bd4f0a567b9": {
                "Name": "demo_nginx.2.c2q5ukp66jawe1z0cuyvcfthr",
                "EndpointID": "f5a50e587f467fc02fb22855e298867a0c86f89da4a78f3608ea0324df708190",
                "MacAddress": "02:42:0a:00:02:04",
                "IPv4Address": "10.0.2.4/24",
                "IPv6Address": ""
            },
            "lb-demo_default": {
                "Name": "demo_default-endpoint",
                "EndpointID": "490a5b5bc526ee4aaca34c5a0d443da324d834b20611127769ee3bcff578e0a7",
                "MacAddress": "02:42:0a:00:02:05",
                "IPv4Address": "10.0.2.5/24",
                "IPv6Address": ""
            }
        },
        ...
    }
]

The load balancer (lb-demo_default) is a dedicated network namespace that has been configured to distribute traffic among containers using IPVS.
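
If ipvsadm is installed on the host, you can inspect the IPVS configuration inside that namespace. Swarm matches traffic by firewall mark (FWM) rather than by virtual IP, so the output looks roughly like this (the mark value is illustrative):

# nsenter --net=/run/docker/netns/lb_trjfagvgj ipvsadm -Ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
FWM  257 rr
  -> 10.0.2.3:0                   Masq    1      0          0
  -> 10.0.2.4:0                   Masq    1      0          0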

Now let’s run telnet from the client container to the nginx service:

$ docker exec -ti demo_client.1.62e6nufzfd4t1g8kde0r40r0h telnet nginx 80
Trying 10.0.2.2... ← Service IP
Connected to nginx.
Escape character is '^]'.
# nsenter -t <telnet_pid> -n netstat -an |grep EST
tcp        0      0 10.0.2.9:51504          10.0.2.2:80            ESTABLISHED

The IP address 10.0.2.2 is assigned to the load balancer network namespace:

# nsenter --net=/run/docker/netns/lb_trjfagvgj ip a l
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
47: eth0@if48: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
    link/ether 02:42:0a:00:02:05 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.0.2.5/24 brd 10.0.2.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet 10.0.2.2/32 scope global eth0
       valid_lft forever preferred_lft forever
    inet 10.0.2.6/32 scope global eth0
       valid_lft forever preferred_lft forever

As IPVS also maintains the connection states in the conntrack table, we can locate the actual destination of our connection there:

# nsenter --net=/run/docker/netns/lb_trjfagvgj conntrack -L |grep 51504
tcp 6 431940 ESTABLISHED src=10.0.2.9 dst=10.0.2.2 sport=51504 dport=80 src=10.0.2.4 dst=10.0.2.5 sport=80 dport=51504 [ASSURED] use=1

In this case, telnet is connected to 10.0.2.4:80 (nginx-2).

Coroot’s agent operates in a similar way. It first identifies the appropriate overlay network for each container and then locates the load balancer network namespace. Afterwards, it performs a conntrack table lookup to determine the actual destination of a given connection.

Conclusion

As we have seen, there are multiple ways to organize Service-to-Service communications in Kubernetes. Each approach has its own benefits and drawbacks. Whichever one you choose, it is essential to understand how it works and to maintain observability of your system.

Coroot seamlessly integrates with all of these methods, providing you with a comprehensive map of your services within minutes after installation.

Follow the instructions on our Getting started page to try Coroot now. Not ready to get started with Coroot? Check out our live demo.

If you like Coroot, give us a ⭐ on GitHub or share your experience on G2.

Any questions or feedback? Reach out to us on Slack.
