Kubernetes operators for databases have become a real alternative to cloud DBaaS (Database-as-a-Service) offerings. Operators handle most "Day 2" database operations, such as high availability, backups, and disaster recovery. Moreover, running databases in Kubernetes can be more cost-effective than managed services and reduces vendor lock-in.
Despite the high level of automation, no operator can guarantee 100% availability or flawless performance of your database, so it is very important to have visibility into each database cluster. Coroot can help you with that. For Postgres clusters managed by the Postgres Operator, it not only highlights performance issues of a particular instance but also detects cluster-level issues. Thanks to its eBPF-based connection metrics, Coroot can track all communications between cluster components such as Patroni, Postgres, PgBouncer, and the Kubernetes API server.
To test Coroot, we reproduce real failure scenarios in our lab and record their telemetry. This allows us to play back any scenario second by second. In this post, I’ll use such recordings to show you how the Postgres operator and its underlying components handle various failures.
apiVersion: "acid.zalan.do/v1" kind: postgresql metadata: name: db-main namespace: default spec: teamId: db volume: size: 10Gi numberOfInstances: 3 users: coroot: [superuser] # TODO: grant pg_monitor to coroot; app: [] databases: app: app postgresql: version: "14" sidecars: - name: coroot-pg-agent image: ghcr.io/coroot/coroot-pg-agent:1.1.0 env: - name: LISTEN value: ":9091" - name: PG_SCRAPE_INTERVAL value: "1s" - name: PGPASSWORD valueFrom: {secretKeyRef: {name: coroot.db-main.credentials.postgresql.acid.zalan.do, key: password}} - name: DSN value: "host=127.0.0.1 port=5432 user=coroot password=$(PGPASSWORD) dbname=postgres connect_timeout=1 statement_timeout=10000" podAnnotations: prometheus.io/scrape: 'true' prometheus.io/port: '9091'
In Kubernetes terms, the cluster is a StatefulSet of 3 pods. Postgres within each pod is managed by Patroni. Patroni turns a set of Postgres instances into an HA (High Availability) cluster: it is responsible for replication management and failovers/switchovers. When Patroni is deployed by the Postgres operator, it uses the Kubernetes control plane (kube-apiserver) as its DCS (Distributed Config Store) for leader election and for storing the cluster state.
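If you want to look at this machinery from the inside, the Spilo image used by the operator ships patronictl, so you can view the cluster state that Patroni keeps in the DCS from any of the database pods (pod names follow the StatefulSet naming, e.g. db-main-0):

$ kubectl exec -it db-main-0 -- patronictl list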
For each cluster, the operator creates two Kubernetes services for incoming client connections: <cluster-name>, which points to the primary, and <cluster-name>-repl, which points to the replicas.
$ kubectl get services -l cluster-name=db-main -l spilo-role
NAME           TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
db-main        ClusterIP   10.96.17.243   <none>        5432/TCP   6d18h
db-main-repl   ClusterIP   10.99.79.27    <none>        5432/TCP   6d18h

$ kubectl get endpoints -l cluster-name=db-main -l spilo-role
NAME           ENDPOINTS                           AGE
db-main        10.244.3.64:5432                    6d18h
db-main-repl   10.244.1.92:5432,10.244.2.82:5432   6d18h
Let's start with the simplest failure scenario: a replica becomes unavailable. In such scenarios, I prefer to use network partitioning because this type of failure is harder to detect than Pod faults.
kind: NetworkChaos
apiVersion: chaos-mesh.org/v1alpha1
metadata:
  name: replica-net-partition
spec:
  selector:
    labelSelectors:
      application: spilo
      spilo-role: replica
  mode: one
  action: partition
  direction: both
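To run the experiment, save the manifest to a file (the file name below is arbitrary) and apply it; deleting the NetworkChaos resource stops the fault injection:

$ kubectl apply -f replica-net-partition.yaml
$ kubectl delete networkchaos replica-net-partition   # stop the experiment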
The operator uses a hardcoded readiness probe configuration for the database pods. The probe makes a request to /readiness every 10 seconds and marks the pod as Unready after 3 unsuccessful attempts. Thus, we can expect that Kubernetes will stop forwarding connections to the failed pod within 30 seconds.
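For reference, here is roughly what that probe looks like in the generated pod spec. This is a sketch: the period and failure threshold are the hardcoded values mentioned above, while the port is assumed to be Patroni's default REST API port (8008):

readinessProbe:
  httpGet:
    path: /readiness
    port: 8008          # assumed: Patroni's default REST API port
  periodSeconds: 10     # probe every 10 seconds
  failureThreshold: 3   # mark the pod Unready after 3 failed attempts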
Let’s check it out!
There were errors due to the inability to connect to the fenced replica, but these errors disappeared after 30 seconds when the replica was marked as Unready.
Now let’s see how the operator handles the failure of the primary Postgres instance using Patroni.
By default, Patroni acquires the leader lock in the DCS with a TTL (Time-To-Live) of 30 seconds. In other words, the leader must refresh its lock every 30 seconds. If there is no network connectivity between the primary and the DCS (kube-apiserver in this case), the primary will not be able to refresh its lock, so the failover process should start within this time interval.
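These timings are part of Patroni's dynamic configuration (the relevant settings are ttl, loop_wait, and retry_timeout), so it is easy to double-check what a particular cluster is actually running with:

$ kubectl exec -it db-main-0 -- patronictl show-config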
The following NetworkChaos manifest isolates the primary Postgres pod from the rest of the network:
kind: NetworkChaos
apiVersion: chaos-mesh.org/v1alpha1
metadata:
  name: primary-net-partition
spec:
  selector:
    labelSelectors:
      application: spilo
      spilo-role: master
  mode: all
  action: partition
  direction: both
Let’s apply it and see the automatic failover mechanism in action!
Wow, it works exactly as expected! The failover process was initiated 23 seconds after the primary Postgres had become unavailable. From the app's perspective, the outage lasted only 27 seconds.
Another thing worth noting is that the new primary instance performed slowly for a while after being promoted because several processes were causing high disk and CPU utilization:
Read replicas are useful for scaling out read load. However, with asynchronous replication, it takes some time for new data to reach the replicas, so an application that executes queries on replicas must be tolerant of some lag. On the other hand, if data stops replicating altogether, it can affect the users of the application.
To illustrate such a scenario, we implemented a freshness check in our test application: if the data fetched from a replica lags behind the primary by more than 10 seconds, the app reports an error.
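The check itself is application-specific, but a similar freshness signal can be obtained directly on a replica by looking at how far behind the last replayed transaction is. This is only a sketch of the idea, not the exact query our test app runs; it assumes db-main-1 is currently a replica and that local trust auth works inside the pod (the Spilo default):

$ kubectl exec -it db-main-1 -- psql -U postgres -c \
    "SELECT now() - pg_last_xact_replay_timestamp() AS replication_delay"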
Our next step is to make the replication lag grow. Roughly speaking, streaming replication in Postgres has 3 stages:
- the primary sends WAL (Write-Ahead Log) records to the replica over the network;
- the replica writes and flushes the received WAL to disk;
- the replica applies (replays) the WAL, making the changes visible to queries.
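Each of these stages can be observed on the primary via pg_stat_replication, which makes it easy to see where exactly the lag accumulates (assuming db-main-0 is currently the primary; the *_lsn and replay_lag columns are available in Postgres 10+):

$ kubectl exec -it db-main-0 -- psql -U postgres -c \
    "SELECT application_name, sent_lsn, write_lsn, flush_lsn, replay_lsn, replay_lag FROM pg_stat_replication"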
So, a network attack seems like the simplest way to cause high replication lag. Let’s simulate an 80% packet loss between the primary and a replica.
kind: NetworkChaos
apiVersion: chaos-mesh.org/v1alpha1
metadata:
  name: replica-to-primary-packet-loss
spec:
  selector:
    labelSelectors:
      spilo-role: replica
  mode: one
  action: loss
  loss:
    loss: '80'
  direction: both
  target:
    selector:
      labelSelectors:
        spilo-role: master
    mode: all
As you can see in the manifest above, Chaos Mesh can simulate a network fault between specific Kubernetes pods. It’s quite useful when we want to simulate partial network outages.
Unfortunately, the operator cannot handle such situations yet. Patroni does provide the /replica?lag=<max-lag> health check endpoint, which succeeds only if the instance is a replica whose lag is below the specified <max-lag>. However, it cannot be used as the readiness probe in the StatefulSet, because the same probe applies to every pod, including the primary, which would always fail it. To address this, a compound health check such as /readiness?lag=<max-lag> could be implemented in Patroni to check the lag only when the instance is running as a replica.
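For reference, the existing check can be queried directly from Patroni's REST API (port 8008 by default): it returns HTTP 200 only if the instance is a replica and its lag is below the threshold, and 503 otherwise. The example assumes curl is available in the Spilo image:

$ kubectl exec -it db-main-1 -- curl -s -o /dev/null -w '%{http_code}\n' \
    'http://localhost:8008/replica?lag=10MB'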
Now, let’s add a connection pooler (PgBouncer) for the primary:
spec:
  enableConnectionPooler: true
  connectionPooler:
    numberOfInstances: 2
    mode: "session"
After applying the changes, the operator created the db-main-pooler Deployment and the relevant Kubernetes service.
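You can list what the operator created for the pooler by the connection-pooler label (the same label we'll use in the chaos experiment below):

$ kubectl get deployments,services -l connection-pooler=db-main-pooler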
The readiness probe of the pooler pods is tcp-socket :5432 delay=0s timeout=1s period=10s #success=1 #failure=3. So, if we completely isolate a pod, the failure will be handled by Kubernetes within 30 seconds. Let’s perform a more interesting experiment in which a pooler pod loses connectivity with Postgres but is still reachable by its clients.
kind: NetworkChaos
apiVersion: chaos-mesh.org/v1alpha1
metadata:
  name: pooler-net-partition
spec:
  selector:
    labelSelectors:
      connection-pooler: db-main-pooler
  mode: one
  action: partition
  direction: both
  target:
    selector:
      labelSelectors:
        application: spilo
        spilo-role: master
    mode: all
As expected, such a failure is not handled automatically. However, it is fair to say that the probability of such an issue occurring is extremely low.
The operator can handle major Postgres upgrades automatically. Depending on the major_version_upgrade_mode option, the operator can upgrade a particular cluster (manual) or all clusters (full).
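This option lives in the operator's own configuration rather than in the cluster manifest. With the CRD-based OperatorConfiguration it looks roughly like this (a sketch; the resource name and the rest of the configuration depend on your deployment):

apiVersion: "acid.zalan.do/v1"
kind: OperatorConfiguration
metadata:
  name: postgres-operator
configuration:
  major_version_upgrade:
    major_version_upgrade_mode: "manual"   # or "full" / "off"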
Below are the main steps the operator performs to upgrade a cluster:
- a rolling update of the database pods to the desired Postgres version, during which the primary is switched over to one of the replicas;
- the upgrade itself: pg_upgrade is executed on the primary, and the replicas are then synced with the upgraded data directory.
Let's upgrade the cluster to Postgres 14 and see how long it stays unavailable:
$ kubectl patch postgresql db-main --type='json' --patch='[{"op": "replace", "path": "/spec/postgresql/version", "value": "14"}]'
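Once the operator reports the upgrade as finished, the running version can be confirmed directly on the primary (again assuming local trust auth inside the pod):

$ kubectl exec -it db-main-0 -- psql -U postgres -c "SHOW server_version"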
The initial rolling update caused 10 seconds of downtime due to the switchover. The downtime caused by the upgrade itself lasted 22 seconds.
Chaos engineering is fun! It not only allows you to test how well a service is able to tolerate failures but also to find out how observable your infrastructure is.
The experiments show that the Postgres operator and Patroni handle most failures gracefully. I hope this gives you more confidence to use them in production.
If you want to improve the observability of your infrastructure, give Coroot a chance!