At Coroot, we work on a product that detects the root cause of a service outage and provides our users with a list of possible fixes. Databases are crucial parts of any infrastructure, so our product must be able to identify common pitfalls in this area. According to numerous database rankings, Postgres is the most popular database these days, so it made sense to start there.
Initially, we planned to use metrics collected by existing open-source Prometheus exporters, such as postgres_exporter (1.7k stars on GitHub). Unfortunately, we realized that the metrics it collects are useless for identifying the most common Postgres issues, such as CPU-intensive queries, I/O-heavy queries, running out of client connections because of idle transactions, and lock contention.
Proposed changes to postgres_exporter would be too massive, so we decided to build another exporter focusing on query performance statistics — pg-agent. It’s open-source and distributed under the Apache 2.0 license.
While working on pg-agent, we reproduced the failure scenarios mentioned above and checked whether the collected metrics identified each particular failure. Here are a few things we realized while testing the agent on real failures:
Let’s take a look at the resulting metrics and how they can help you deal with Postgres-related issues.
The pg_top_query_time_per_second metric is quite useful when you need to find the most CPU-intensive queries. It combines the time spent by currently running queries (clock_timestamp() - pg_stat_activity.query_start) with the cumulative execution time reported in pg_stat_statements.total_time for each query.
topk(10, pg_top_query_time_per_second{rds_instance="us-west-1/pg-main-1", db="products"})
The pg_top_query_io_time_per_second metric helps to investigate I/O issues because it makes it easy to find the most I/O-intensive Postgres queries.
topk(10, pg_top_query_io_time_per_second{rds_instance="us-west-1/pg-main-1", db="products"})
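One way to see per-query I/O time directly in Postgres (and presumably the data behind this metric) is the blk_read_time and blk_write_time columns of pg_stat_statements. They are only populated when track_io_timing is enabled. A quick ad-hoc check might look like this:

-- I/O time (in milliseconds) accumulated by each normalized statement;
-- requires track_io_timing = on, otherwise both columns stay at zero
SELECT query, blk_read_time, blk_write_time
FROM pg_stat_statements
ORDER BY blk_read_time + blk_write_time DESC
LIMIT 10;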
The number of client connections that a Postgres server can handle is always limited by its settings (max_connections). This is because Postgres forks a new process for each client connection. Every process consumes memory and CPU resources, even when idle. If the number of connections reaches max_connections, Postgres will reject clients’ attempts to connect.
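If you want to check this by hand, a simple ad-hoc query (independent of pg-agent) can compare the current number of client backends with the configured limit:

-- how close the server is to max_connections
-- (the backend_type filter is available since Postgres 10)
SELECT count(*) AS client_connections,
       current_setting('max_connections')::int AS max_connections
FROM pg_stat_activity
WHERE backend_type = 'client backend';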
The pg_connections metric reports the number of connections broken down by state, db, user, wait_event_type, and query. This level of detail allows you to drill down and see the most recent queries executed within active or idle transactions.
For example, if a Postgres server is rejecting connections, we can aggregate connections by their states:
sum by(state) (pg_connections{rds_instance="us-west-1/pg-main-1"})
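If you only need a quick look at a live server, roughly the same breakdown is available straight from pg_stat_activity, which is essentially what the metric snapshots:

-- connections by state; background processes have a NULL state
SELECT state, count(*)
FROM pg_stat_activity
WHERE state IS NOT NULL
GROUP BY state;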
As seen above, most connections are in the idle in transaction state. Now, we can drill down to see the last executed queries before the transactions became idle:
sum by(query) (pg_connections{rds_instance="us-west-1/pg-main-1", state="idle in transaction"})
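On a live server you can get the same picture, plus how long each session has been idle, with an ad-hoc query against pg_stat_activity (shown here only as an illustration, independent of the exporter):

-- sessions stuck in "idle in transaction" and the last query they ran
SELECT pid, query, now() - state_change AS idle_for
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY idle_for DESC;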
Knowing which queries were executed right before the transactions became idle helps us figure out the cause of this behavior. In this scenario, for example, the application was calling a remote service while keeping its transactions open.
To ensure data consistency, all databases use locks. The time spent waiting to acquire a lock is usually negligible, but there may still be cases where locking becomes an issue. To find the queries waiting for locks, we can use the pg_connections metric:
sum by(query) (pg_connections{rds_instance="us-west-1/pg-main-1", state="active", wait_event_type="Lock"})
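The equivalent ad-hoc check against pg_stat_activity (wait_event_type and wait_event are available since Postgres 9.6):

-- sessions currently waiting on a lock
SELECT pid, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';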
It is now clear that the issue is caused by lock contention. The next step is to figure out which query is holding the lock. To do so, we can use the pg_lock_awaiting_queries metric, which shows the number of queries awaiting locks held by a particular query. It is gathered by calling pg_blocking_pids() for every locked query and then identifying the query being executed by each returned PID.
topk(10, pg_lock_awaiting_queries{rds_instance="us-west-1/pg-main-1"})
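A plain-SQL sketch of the same idea, based on pg_blocking_pids() and pg_stat_activity (not necessarily the exact query the agent runs):

-- map each blocked session to the session(s) holding the lock it is waiting for
SELECT blocked.pid    AS blocked_pid,
       blocked.query  AS blocked_query,
       blocking.pid   AS blocking_pid,
       blocking.query AS blocking_query
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocking
  ON blocking.pid = ANY (pg_blocking_pids(blocked.pid));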
The main concern about using queries as metric labels is the cardinality of the resulting metrics. Handling high-cardinality metrics is challenging for most Time-Series Databases (TSDBs). Pg-agent addresses this problem in the following ways:
Pg-agent can gather metrics from Postgres servers starting from version 9.6. It requires a database user with the pg_monitor role assigned. You will also need to enable the pg_stat_statements extension.
create role <USER> with login password '<PASSWORD>';
grant pg_monitor to <USER>;
create extension pg_stat_statements;
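Note that pg_stat_statements also has to be listed in shared_preload_libraries, which requires a server restart. You can verify the setup with:

SHOW shared_preload_libraries;
SELECT extname, extversion FROM pg_extension WHERE extname = 'pg_stat_statements';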
Then, run the agent:
docker run --detach --name coroot-pg-agent \
    --env DSN="postgresql://<USER>:<PASSWORD>@<HOST>:5432/postgres?connect_timeout=1&statement_timeout=30000" \
    ghcr.io/coroot/coroot-pg-agent
Flags:
usage: coroot-pg-agent [<flags>] <dsn>

Flags:
  --listen="0.0.0.0:80"    Listen address (env: LISTEN) - "<ip>:<port>" or ":<port>".
  --scrape-interval=15s    How often to snapshot system views (env: PG_SCRAPE_INTERVAL).
  --version                Show application version.

Args:
  <dsn>  Data source name (env: DSN) - "postgresql://<user>:<password>@<host>:5432/postgres?connect_timeout=1&statement_timeout=30000".
If you run Postgres in a Kubernetes cluster, you can deploy pg-agent as a sidecar container. Here is an example of the sidecars spec section for the Postgres Operator by Zalando:
...
podAnnotations:
  prometheus.io/scrape: 'true'
  prometheus.io/port: '80'
sidecars:
  - name: pg-agent
    image: "ghcr.io/coroot/coroot-pg-agent:1.0.5"
    ports:
      - containerPort: 80
        protocol: TCP
    resources:
      requests:
        cpu: 100m
        memory: 100Mi
    env:
      - name: PGUSER
        valueFrom:
          secretKeyRef:
            name: postgres.product-pg.credentials
            key: username
      - name: PGPASSWORD
        valueFrom:
          secretKeyRef:
            name: postgres.product-pg.credentials
            key: password
      - name: DSN
        value: postgresql://$(PGUSER):$(PGPASSWORD)@127.0.0.1:5432/postgres
Aws-agent is a dedicated agent that discovers Postgres and Aurora-for-PostgreSQL RDS instances running in a specified AWS region and instruments them with the built-in pg-agent. In addition to Postgres metrics, it gathers:
You can find step-by-step instructions on how to install aws-agent and set up an IAM role for it on GitHub.
We hope that the metrics collected by pg-agent and aws-agent will make troubleshooting Postgres-related issues easier not only for experienced DBAs but also for engineers not specialized in databases.