Kubernetes operators for databases have become a real alternative to cloud DBaaS (Database-as-a-Service) offerings.
Operators handle most “Day 2” database operations such as high availability, backups, and disaster recovery.
Moreover, running databases in Kubernetes can be cost-effective and helps you avoid vendor lock-in.
Despite the high level of automation, no operator can guarantee 100% availability or flawless performance of your database,
so it is very important to get visibility into each database cluster.
Coroot can help you with that.
For Postgres clusters managed by the Postgres Operator,
it not only highlights performance issues of a
particular instance but also detects cluster-level issues.
Thanks to the eBPF-based connection metrics,
Coroot can track all communications between cluster components such as
Patroni, Postgres, PgBouncer, and the Kubernetes API server.
To test Coroot, we reproduce real failure scenarios in our lab and record their telemetry. This allows us to play back any scenario second by second.
In this post, I'll use such recordings to show you how the Postgres operator and its underlying components handle various failures.
The lab
apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: db-main
  namespace: default
spec:
  teamId: db
  volume:
    size: 10Gi
  numberOfInstances: 3
  users:
    coroot: [superuser] # TODO: grant pg_monitor to coroot;
    app: []
  databases:
    app: app
  postgresql:
    version: "14"
  sidecars:
  - name: coroot-pg-agent
    image: ghcr.io/coroot/coroot-pg-agent:1.1.0
    env:
    - name: LISTEN
      value: ":9091"
    - name: PG_SCRAPE_INTERVAL
      value: "1s"
    - name: PGPASSWORD
      valueFrom: {secretKeyRef: {name: coroot.db-main.credentials.postgresql.acid.zalan.do, key: password}}
    - name: DSN
      value: "host=127.0.0.1 port=5432 user=coroot password=$(PGPASSWORD) dbname=postgres connect_timeout=1 statement_timeout=10000"
  podAnnotations:
    prometheus.io/scrape: 'true'
    prometheus.io/port: '9091'
In terms of Kubernetes, the cluster is a StatefulSet of 3 pods. Postgres within each pod is managed by Patroni.
Patroni turns a bunch of Postgres instances into an HA (High Availability) cluster and is responsible for replication management and failovers/switchovers.
When Patroni is deployed by the Postgres operator, it uses the Kubernetes control plane (kube-apiserver)
as the DCS (Distributed Configuration Store) for leader election and storing the cluster state.
The operator creates two Kubernetes services for incoming client connections to each cluster:
<cluster-name> pointing to the primary
and <cluster-name>-repl pointing to the replicas.
$ kubectl get services -l cluster-name=db-main,spilo-role
NAME           TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
db-main        ClusterIP   10.96.17.243   <none>        5432/TCP   6d18h
db-main-repl   ClusterIP   10.99.79.27    <none>        5432/TCP   6d18h
$ kubectl get endpoints -l cluster-name=db-main,spilo-role
NAME           ENDPOINTS                           AGE
db-main        10.244.3.64:5432                    6d18h
db-main-repl   10.244.1.92:5432,10.244.2.82:5432   6d18h
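From the application's side, using these two services simply means picking the right host name. Here is a minimal Python sketch of that choice. The helper names are hypothetical, and the `cluster.local` cluster domain is an assumption (it is the usual Kubernetes default, but your cluster may use a different one).

```python
def service_host(cluster: str, namespace: str = "default",
                 read_only: bool = False,
                 cluster_domain: str = "cluster.local") -> str:
    """Return the in-cluster DNS name of the service to connect to.

    Writes go to <cluster-name> (the primary),
    reads may go to <cluster-name>-repl (the replicas).
    """
    name = f"{cluster}-repl" if read_only else cluster
    return f"{name}.{namespace}.svc.{cluster_domain}"


def dsn(cluster: str, dbname: str, user: str, read_only: bool = False) -> str:
    """Build a libpq-style connection string for the chosen service."""
    host = service_host(cluster, read_only=read_only)
    return f"host={host} port=5432 dbname={dbname} user={user}"


if __name__ == "__main__":
    print(dsn("db-main", "app", "app"))                  # primary
    print(dsn("db-main", "app", "app", read_only=True))  # replicas
```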
A replica failure
Let's start with the simplest failure scenario: a replica becomes unavailable.
In such scenarios, I prefer to use network partitioning because
this type of failure is harder to detect than Pod faults.
kind: NetworkChaos
apiVersion: chaos-mesh.org/v1alpha1
metadata:
  name: replica-net-partition
spec:
  selector:
    labelSelectors:
      application: spilo
      spilo-role: replica
  mode: one
  action: partition
  direction: both
The operator uses a hardcoded
readiness probe configuration for the database pods.
The probe makes a request to /readiness
every 10 seconds and marks the pod as Unready after 3 unsuccessful attempts.
Thus, we can expect that Kubernetes will stop forwarding connections to the failed pod within 30 seconds.
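The arithmetic behind this estimate can be sketched in a few lines of Python. This is a rough model rather than the kubelet's actual logic. The `timeout_s` parameter is an assumption: the probe timeout used by the operator is not covered here, so it defaults to zero.

```python
def unready_delay_bounds(period_s: float, failure_threshold: int,
                         timeout_s: float = 0.0):
    """Return (best, worst) seconds between a failure and the Unready mark.

    Best case: the failure happens right before a probe fires and
    every probe fails immediately.
    Worst case: the failure happens right after a successful probe and
    the last failing probe takes the full timeout to give up.
    """
    best = period_s * (failure_threshold - 1)
    worst = period_s * failure_threshold + timeout_s
    return best, worst


if __name__ == "__main__":
    best, worst = unready_delay_bounds(period_s=10, failure_threshold=3)
    print(f"the pod is marked as Unready {best:.0f}-{worst:.0f}s after the failure")
```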
Let's check it out!
There were errors due to the inability to connect to the fenced replica,
but these errors disappeared after 30 seconds when the replica was marked as Unready.
Failover
Now let's see how the operator handles the failure of the primary
Postgres instance using Patroni.
By default, Patroni acquires the leader lock in the DCS with
a TTL
(Time-To-Live) of 30 seconds. In other words, the leader must refresh its lock before these 30 seconds elapse, otherwise the lock expires.
If there is no network connectivity between the primary and the DCS (kube-apiserver in this case),
the primary will not be able to refresh its lock.
Therefore, the failover process should be started within this time interval.
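To make the lock semantics more tangible, here is a toy in-memory model of a leader lock with a TTL. It is only an illustration of the idea and not Patroni's actual code. The `LeaderLock` class and the member names `db-main-0` and `db-main-1` are made up for this example.

```python
from typing import Optional


class LeaderLock:
    """Toy leader lock with a TTL (an illustration, not Patroni's code)."""

    def __init__(self, ttl_s: float = 30.0):
        self.ttl_s = ttl_s
        self.holder: Optional[str] = None
        self.expires_at = 0.0

    def try_acquire(self, member: str, now: float) -> bool:
        """Take or refresh the lock.

        Fails while another member holds a lock that has not expired yet.
        """
        if self.holder not in (None, member) and now < self.expires_at:
            return False
        self.holder = member
        self.expires_at = now + self.ttl_s
        return True


if __name__ == "__main__":
    lock = LeaderLock(ttl_s=30)
    lock.try_acquire("db-main-0", now=0)    # the primary takes the lock
    lock.try_acquire("db-main-0", now=10)   # ...and refreshes it
    # At t=10 the primary gets isolated and cannot refresh the lock anymore.
    print(lock.try_acquire("db-main-1", now=25))  # the lock is still valid
    print(lock.try_acquire("db-main-1", now=40))  # the lock has expired
```

In this model a replica can take over only after the last successful refresh plus the TTL, which is why the failover cannot start immediately.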
The following NetworkChaos manifest isolates the primary Postgres pod from the rest of the network:
kind: NetworkChaos
apiVersion: chaos-mesh.org/v1alpha1
metadata:
  name: primary-net-partition
spec:
  selector:
    labelSelectors:
      application: spilo
      spilo-role: master
  mode: all
  action: partition
  direction: both
Let's apply it and see the automatic failover mechanism in action!
Wow, it works exactly as expected! The failover process was initiated
23 seconds after the primary Postgres had become unavailable.
From the app's perspective, the outage lasted only 27 seconds.
Another thing worth noting is that the new primary instance performed slowly for a while after being promoted
because several processes were causing high disk and CPU utilization:
- CHECKPOINT initiated by Patroni on the primary after a switchover/failover
- ANALYZE updating table statistics
- warming up of the buffer and page caches
Replication lag
Read replicas are useful for scaling out read load. However, when using asynchronous replication,
it takes some time for new data to reach the replicas. So, an application that executes queries on replicas
must tolerate a small lag.
On the other hand, if data stops replicating, it can affect the users of the application.
To illustrate such a scenario, we implemented a freshness check in our test application: if the data fetched from a replica lags behind
the primary by more than 10 seconds, the app reports an error.
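Such a check could look roughly like the sketch below. It is not the code of our test application: the function and exception names are hypothetical, and it assumes the app can read a "last updated" timestamp of the same record from both the primary and a replica.

```python
from datetime import datetime, timedelta, timezone

MAX_LAG = timedelta(seconds=10)  # the threshold mentioned above


class StaleReplicaError(Exception):
    """Raised when a replica returns data that is too old."""


def check_freshness(primary_ts: datetime, replica_ts: datetime,
                    max_lag: timedelta = MAX_LAG) -> timedelta:
    """Compare timestamps of the same record on the primary and a replica."""
    lag = primary_ts - replica_ts
    if lag > max_lag:
        raise StaleReplicaError(
            f"replica is {lag.total_seconds():.0f}s behind the primary")
    return lag


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(check_freshness(now, now - timedelta(seconds=3)))  # within the limit
    try:
        check_freshness(now, now - timedelta(seconds=42))
    except StaleReplicaError as e:
        print("error:", e)
```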
Our next step is making the replication lag grow.
Roughly speaking, streaming replication in Postgres has 3 stages:
- Transferring WAL (Write-Ahead Log) records to the replica through the network (this can be affected by a network attack)
- Writing WAL records to the disk on the replica (this can be affected by an I/O injection, but it might cause many side effects)
- Replaying WAL records on the replica (this can be affected by executing a long query on the replica,
  but this query will be terminated after max_standby_streaming_delay)
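These stages roughly map to the LSN columns that Postgres exposes in pg_stat_replication (sent_lsn, flush_lsn, replay_lsn) and to pg_current_wal_lsn() on the primary. The Python sketch below shows how the lag of each stage could be estimated in bytes from those values. The function names are made up for this example, and the mapping of LSN differences to stages is a simplification.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a textual pg_lsn such as '16/B374D848' to a byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) + int(lo, 16)


def lag_by_stage(current_lsn: str, sent_lsn: str,
                 flush_lsn: str, replay_lsn: str) -> dict:
    """Estimate how many bytes of WAL are stuck at each replication stage."""
    current, sent, flush, replay = map(
        lsn_to_int, (current_lsn, sent_lsn, flush_lsn, replay_lsn))
    return {
        "transfer": current - sent,  # generated on the primary, not sent yet
        "write": sent - flush,       # sent, but not flushed to the replica's disk
        "replay": flush - replay,    # on disk, but not applied yet
    }


if __name__ == "__main__":
    print(lag_by_stage("0/3000000", "0/2000000", "0/2000000", "0/1000000"))
```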
So, a network attack seems like the simplest way to cause high replication lag.
Let's simulate an 80% packet loss between the primary and a replica.
kind: NetworkChaos
apiVersion: chaos-mesh.org/v1alpha1
metadata:
  name: replica-to-primary-packet-loss
spec:
  selector:
    labelSelectors:
      spilo-role: replica
  mode: one
  action: loss
  loss:
    loss: '80'
  direction: both
  target:
    selector:
      labelSelectors:
        spilo-role: master
    mode: all
As you can see in the manifest above, Chaos Mesh can simulate a network fault between specific Kubernetes pods.
It's quite useful when we want to simulate partial network outages.
Unfortunately, the operator cannot handle such situations yet.
In Patroni, there is the
/replica?lag=<max-lag> health check endpoint which verifies that the
lag of a particular replica is below the specified <max-lag>.
However, it cannot be used as the readiness probe in the StatefulSet because the same probe is applied to every pod, and this endpoint reports success only on replicas, so the primary would be marked as Unready.
To address this, a compound health check like /readiness?lag=<max-lag> could be implemented in Patroni
to check the lag only when the instance is running as a replica.
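The decision logic of such a compound check is simple. Here is a sketch of it in Python. Keep in mind that this is a hypothetical illustration of the proposal and not an existing Patroni endpoint or its code. The lag is assumed to be measured in bytes.

```python
def is_ready(postgres_running: bool, role: str,
             lag_bytes: int, max_lag_bytes: int) -> bool:
    """Hypothetical compound readiness check.

    The lag is taken into account only for replicas,
    so the primary is never marked as Unready because of it.
    """
    if not postgres_running:
        return False
    if role != "replica":
        return True
    return lag_bytes <= max_lag_bytes


if __name__ == "__main__":
    limit = 16 * 1024 * 1024  # an arbitrary threshold for this example
    print(is_ready(True, "master", 0, limit))           # the primary
    print(is_ready(True, "replica", 1024, limit))       # a healthy replica
    print(is_ready(True, "replica", 10 * limit, limit)) # a lagging replica
```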
Connection pooler
Now, let's add a connection pooler (PgBouncer) for the primary:
spec:
  enableConnectionPooler: true
  connectionPooler:
    numberOfInstances: 2
    mode: "session"
After applying the changes, the operator created the db-main-pooler Deployment and the relevant Kubernetes service.
The readiness probe of the pooler pods is tcp-socket :5432 delay=0s timeout=1s period=10s #success=1 #failure=3.
So, if we completely isolate a pod, the failure will be handled by Kubernetes within 30 seconds.
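It's worth keeping in mind what a tcp-socket probe actually verifies: only that the port accepts TCP connections. A rough Python equivalent is shown below. It is a sketch and not the kubelet's implementation, and the `tcp_probe` name is made up for this example.

```python
import socket


def tcp_probe(host: str, port: int, timeout_s: float = 1.0) -> bool:
    """Succeed if a TCP connection to host:port can be established.

    Nothing is sent or read, so the probe says nothing about whether
    the pooler can reach Postgres on the other side.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # A local listener stands in for the pooler's port in this example.
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    port = listener.getsockname()[1]
    print(tcp_probe("127.0.0.1", port))  # the port is open
    listener.close()
    print(tcp_probe("127.0.0.1", port))  # the port is closed
```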
Let's perform a more interesting experiment in which a pooler pod loses connectivity with Postgres but is still reachable by its clients.
kind: NetworkChaos
apiVersion: chaos-mesh.org/v1alpha1
metadata:
  name: pooler-net-partition
spec:
  selector:
    labelSelectors:
      connection-pooler: db-main-pooler
  mode: one
  action: partition
  direction: both
  target:
    selector:
      labelSelectors:
        application: spilo
        spilo-role: master
    mode: all
As expected, such a failure is not handled automatically.
However, it is fair to say that the probability of such an issue occurring is extremely low.
Bonus track: a major Postgres upgrade
The operator can handle major Postgres upgrades automatically.
Depending on the major_version_upgrade_mode option, the operator can upgrade a
particular cluster (manual) or all clusters (full).
Below are the main steps the operator performs to upgrade a cluster:
- changes the desired Postgres version in the database pods:
  - performs a rolling update of the replicas [replicas downtime]
  - initiates a switchover (planned failover) of the primary to one of the replicas [downtime]
- executes the /scripts/inplace_upgrade.py script (a part of Spilo) within the primary pod:
  - checks the cluster health and runs sanity checks
  - pauses Patroni for the cluster to disable automatic failover
  - shuts down the primary Postgres [downtime]
  - runs CHECKPOINT on the replicas [downtime]
  - executes pg_upgrade on the primary to convert PGDATA to the desired Postgres version [downtime]
  - shuts down the replicas [downtime]
  - rsyncs PGDATA to the replicas [downtime]
  - starts up the primary and replicas
  - resumes Patroni for the cluster
  - executes ANALYZE
Let's upgrade the cluster to Postgres 14 and see how long it is unavailable for:
kubectl patch postgresql db-main --type='json' --patch='[{"op": "replace", "path": "/spec/postgresql/version", "value": "14"}]'
The initial rolling update caused 10 seconds of downtime due to the switchover.
The downtime caused by the upgrade itself lasted 22 seconds.
Conclusion
Chaos engineering is fun!
It not only allows you to test how well a service is able to tolerate failures
but also to find out how observable your infrastructure is.
The conducted experiments verify that the Postgres operator and Patroni handle most
failures perfectly. I hope this gives you more confidence to use them in production.
If you want to improve the observability of your infrastructure, give Coroot a chance!