Menu
There are few data sources in Postgres where you can get query statistics.
As seen in the diagrams above, we can’t get a full picture based on just one of these sources.
Coroot-pg-agent combines statistics from pg_stat_activity and pg_stat_statements, so the resulting metrics reflect the actual latency of both completed queries and those still in progress. Implementation of such a combination is challenging for several reasons:
Having combined the statistics from pg_stat_statements and pg_stat_activity, now it’s time to calculate a summary of Postgres latency. For a particular type of query, we can only calculate the average execution time as total_execution_time / calls. However, each Postgres server usually performs many different queries, so we can use the average execution time of each query to calculate overall latency quantiles. Let’s see how it works using the example below.
A server performed four types of queries. We know the number of queries of each type and the total execution time, so we can calculate the average query time for each group:
Query | Calls | Total Time | Average Time |
---|---|---|---|
query-1 | 100 | 500ms | 5ms |
query-2 | 150 | 300ms | 2ms |
query-3 | 200 | 2000ms | 10ms |
query-4 | 500 | 500ms | 1ms |
In this case, the server served 100+150+200+500=950 queries. We can assume, that the latency of each individual query from a group is the average latency of this group. As a result, we get:
This rather simple approach works more accurately in a real environment where the number of query groups is much higher than in my example.
Unfortunately, Postgres does not report in any system view the number of errors that occurred. Therefore, the only way to count errors is to parse the log.
Our Prometheus exporters node-agent and aws-agent parse logs and expose the log-based metrics container_log_messages_total and aws_rds_log_messages_total. Since these metrics are broken down by message severity and pattern, we can easily calculate the number of errors in the Postgres log regardless of whether the instance is RDS or self-managed.
# HELP aws_rds_log_messages_total Number of messages grouped by the automatically extracted repeated pattern # TYPE aws_rds_log_messages_total counter aws_rds_log_messages_total{rds_instance_id: "us-east-2/pg-main-1", level: "critical", sample: "FATAL: connection to client lost", pattern_hash: "4174a0b78f3a79718ab011317c9f413b"} 11 aws_rds_log_messages_total{rds_instance_id: "us-east-2/pg-main-1", level: "error", sample: "ERROR: canceling statement due to user request", pattern_hash: "13f3416616c667c6d36d9a4bcf742008"} 22
Now we have metrics that are the basic SLIs (Service Level Indicators) of a Postgres server. Let’s try to define SLOs (Service Level Objectives) — the target levels of each indicator.
Choosing a universal threshold for database latency is a complicated task since application requirements can significantly vary from one app to another. For example, some apps need to execute queries in less than a millisecond, while for others 100ms is more than enough.
At Coroot, we realized that the SLOs of any database depends on the SLOs of its consumers. For each database client not meeting its SLOs, we check the correlation between their affected SLIs and the SLIs of the database. The existence of a strong correlation between them means that this database has an issue affecting an app.
On the one hand, this approach detects even the slightest performance degradations that are affecting the database clients. On the other hand, the probability of false positives is slim since such checks are not fired if all database clients meet their SLOs.
The example below shows how the SLIs of the client application correlate with the SLI of the Postgres server:
SRE practices, such as SLIs & SLOs, are quite applicable to databases. The success rate and latency of queries are great basic SLIs of any database server. Our open-source Prometheus exporters can help you in collecting these and many other useful metrics.