Apr 16, 2025 - 07:10
What counts as a "slow database query"? It's surprisingly hard to say

There's more than one right answer

It seems like a simple question with an obvious answer, so it came as a bit of a surprise when we started getting inconsistent SQL after asking ChatGPT and Claude to help us find the "top slow queries" in a Postgres database.

Sometimes we would get the equivalent of:

SELECT query, total_exec_time FROM pg_stat_statements
ORDER BY total_exec_time DESC LIMIT 10;

other times we would get something more like:

SELECT query, mean_exec_time FROM pg_stat_statements
ORDER BY mean_exec_time DESC LIMIT 10;

Debating which of these is right misses the point. One ranks queries by the total time spent running them across all calls; the other ranks them by the average time a single execution takes to complete.

Both interpretations are reasonable, useful, and consistent with the question, "find me the top slow queries":

  1. Which queries do applications spend the most time waiting on?
  2. Which queries take the longest from start to finish?
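To see how the two readings diverge, here is a toy illustration with invented numbers (the stats and query texts are hypothetical, not from a real pg_stat_statements):

```python
# Hypothetical per-query stats; times in milliseconds, values invented
stats = [
    {"query": "SELECT * FROM sessions WHERE id = $1", "calls": 100000, "total_exec_time": 50000.0},
    {"query": "SELECT count(*) FROM events",          "calls": 12,     "total_exec_time": 36000.0},
]
for s in stats:
    s["mean_exec_time"] = s["total_exec_time"] / s["calls"]

# Interpretation 1: most aggregate time -> the cheap but very frequent lookup
by_total = max(stats, key=lambda s: s["total_exec_time"])
# Interpretation 2: slowest per call -> the rare but expensive count
by_mean = max(stats, key=lambda s: s["mean_exec_time"])
```

The frequent lookup tops one list while the expensive count tops the other, even though both rankings come from the same data.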

As developers of the Postgres MCP Pro, we have to contend with developers who mean one interpretation and those who mean the other. Some may also know they have a problem without really knowing, or even caring, which form the question takes.

How to find problem queries

In this case, the best thing we can do is look for the question behind the question. When investigating a performance problem, people usually want to answer a question like:

What queries are using significant system resources or causing performance problems for application users?

To identify queries that are causing problems for application users, ranking by mean_exec_time is a good way to go. We still need to distinguish between queries run by an interactive application and those run by a background job of some kind. If the MCP Client is a code editor, it may be able to help with this.

Let us now turn our attention to those queries that consume a lot of resources.

In Postgres, the pg_stat_statements view gives us a lot of data, which allows us to answer this question in several ways.

In Postgres version 17, the pg_stat_statements view has 49 columns, not only mean_exec_time and total_exec_time, but also rows, calls, shared_blks_read, shared_blks_hit, shared_blks_dirtied, wal_bytes, and others. Each of these measures is a candidate for sorting the top queries list.

Time is doubly relevant: it reflects user experience, and whenever the system does anything at all, it takes time. Other measures provide granular insight into resources that could become bottlenecks. For example, shared_blks_read tells us how many block reads a query generates, while shared_blks_dirtied, along with wal_bytes, tells us about the writes it generates. A DBA will often consider all of these measures, in context, to determine whether it makes sense to tune a query.

Time as a proxy for resource consumption

Early efforts at self-tuning databases encountered a problem: how to reason systematically about system resources when they are expressed as non-comparable metrics. Oracle described the problem in 2006:

Traditionally, performance of various subsystems of the database is measured using different metrics. For example, the efficiency of the data-block buffer cache is expressed as a percentage in buffer hit-ratio; the I/O subsystem is measured using average read and write latencies.

They introduced a common metric, DbTime, to serve as a uniform measure of resource consumption, describing it as:

DbTime serves as a common currency for the measurement of a subsystem’s performance impact. For example, the performance impact of an under-sized buffer cache would be measured as the total database time spent in performing additional I/O requests that could have been avoided if the buffer cache was larger.

The intuition here is that regardless of whether the session is using CPU, doing I/O, or waiting for a lock, it is taking time.

This leads to a remarkably straightforward way to reason about system resource consumption: simply measure the time spent inside the database.

The key takeaway is that total_exec_time in pg_stat_statements is a good proxy for overall resource consumption. The pg_stat_statements view does not break down time with as much granularity as Oracle does. The oldest Postgres versions offer no breakdown at all. Prior to Postgres 15, the breakdown is limited to planning time, block read time, and block write time. Postgres 15 separates temp block read and write time from overall block read and write time, and Postgres 17 further separates shared block read time from shared block write time.

Getting insights about other resources, say time spent waiting on locks, requires sampling wait events from the pg_stat_activity view.

Amdahl's law and what's worth optimizing

While time is a useful proxy for resource consumption, Postgres maintains a number of counters that can help us understand resource consumption more granularly. The challenge is how to make use of them.

Our approach is intuitive: if optimizing a query is going to give us a meaningful improvement in overall system performance, then that query must be responsible for a similarly meaningful portion of at least one system metric.

This intuition is formalized in Amdahl's law, which provides a model for calculating how the completion time of a task is affected by optimizing part of the workload. In the original formulation, it gives the speedup gained from parallelism, but you can use it as a model for performance improvements more generally.

In this context, Amdahl's law is:

Overall Speedup = 1 / ((1 - Optimized Portion) + (Optimized Portion / Speedup of Optimized Portion))

Here, the Optimized Portion is the fraction of the workload that is optimized and the Speedup of Optimized Portion is the speedup in completion time achieved by optimizing that portion of the workload. The Overall Speedup is the impact on the overall completion time of the task.

For example, if we optimize a query accounting for 25% of the workload, reducing its completion time by 5x, the overall speedup is:

1 / ((1 - 0.25) + (0.25 / 5)) = 1 / (0.75 + 0.05) = 1.25

That is, the overall system improves by 25%.
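The calculation above can be packaged as a small helper (a sketch; the function name is our own):

```python
def amdahl_speedup(optimized_portion: float, portion_speedup: float) -> float:
    """Overall speedup when `optimized_portion` of the workload is made
    `portion_speedup` times faster (Amdahl's law)."""
    return 1.0 / ((1.0 - optimized_portion) + optimized_portion / portion_speedup)

# A 5x win on 25% of the workload yields roughly a 1.25x overall speedup
speedup = amdahl_speedup(0.25, 5)
```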

Now consider the case where the query accounts for 5% of the workload rather than 25%. Then the overall speedup is:

1 / ((1 - 0.05) + (0.05 / 5)) = 1 / (0.95 + 0.01) ≈ 1.04

In this case, we improved the performance of one query by 400%, but the overall system improves by only about 4%.

If tuning a query is going to have a large impact on overall system performance, it must account for a meaningful fraction of at least one system metric.

The way we measure this in Postgres MCP Pro is to calculate the fraction of each resource that a query accounts for, then apply a 5% threshold to decide whether the query is worth optimizing.

In a simplified example, we can look at blocks read and blocks written. The SQL is the following:

SELECT query, shared_blks_read_frac, shared_blks_dirtied_frac
FROM (
    SELECT
        query,
        shared_blks_read::float8 / NULLIF(SUM(shared_blks_read) OVER (), 0) AS shared_blks_read_frac,
        shared_blks_dirtied::float8 / NULLIF(SUM(shared_blks_dirtied) OVER (), 0) AS shared_blks_dirtied_frac
    FROM pg_stat_statements
) AS per_query
WHERE shared_blks_read_frac > 0.05 OR shared_blks_dirtied_frac > 0.05
ORDER BY shared_blks_read_frac + shared_blks_dirtied_frac DESC;

In this query, we use the OVER () clause in conjunction with the SUM function to compute the total of shared_blks_read and shared_blks_dirtied for all queries. Dividing each query's shared_blks_read and shared_blks_dirtied by this denominator gives us the fraction of the total that each query represents. Finally, we filter the results to only include queries that account for at least 5% of one resource or the other.
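The same fraction-and-threshold logic is easy to see outside SQL. Here is a minimal Python sketch, with invented counter values shaped like pg_stat_statements rows (the helper name is ours):

```python
THRESHOLD = 0.05  # flag queries accounting for more than 5% of a resource

# Invented sample rows shaped like pg_stat_statements counters
rows = [
    {"query": "q1", "shared_blks_read": 900, "shared_blks_dirtied": 10},
    {"query": "q2", "shared_blks_read": 50,  "shared_blks_dirtied": 400},
    {"query": "q3", "shared_blks_read": 50,  "shared_blks_dirtied": 5},
]

def flag_heavy_queries(rows, metrics, threshold=THRESHOLD):
    # Totals play the role of SUM(...) OVER () in the SQL version
    totals = {m: sum(r[m] for r in rows) or 1 for m in metrics}
    flagged = []
    for r in rows:
        fracs = {m: r[m] / totals[m] for m in metrics}
        if any(f > threshold for f in fracs.values()):
            flagged.append((r["query"], fracs))
    # Heaviest combined consumers first
    flagged.sort(key=lambda item: sum(item[1].values()), reverse=True)
    return flagged

heavy = flag_heavy_queries(rows, ["shared_blks_read", "shared_blks_dirtied"])
# q1 dominates block reads (90%), q2 dominates dirtied blocks (~96%),
# and q3 falls below the threshold on both metrics
```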

Summary

Looking at the "top N queries" is a common starting point for database performance analysis. It seems simple, yet the concept is surprisingly slippery.

In this post, we have described two ways of interpreting this question, and three meaningful ways to answer it.

  1. If we are interested in end-user experience, then it makes sense to look at the queries that take the longest on a per-call basis.
  2. If we are interested in resource consumption, as we may be if we want to reduce costs or alleviate a system bottleneck, then we can a) start out by looking at time consumed in aggregate, rather than on a per-call basis, and, b) look at queries that consume a meaningful fraction of any one system resource, like blocks read or blocks written.

Postgres MCP Pro incorporates all of these approaches. It integrates with Claude Desktop, Cursor, and other tools, allowing its performance analysis to feed AI-driven performance tuning. Try it out with your database to see what improvements you can get.