Postgres: Ordered Queries and the Planner

May 2, 2025 - 00:19
Most SQL queries require their results to be ordered in some way. If the user does not apply an ordering of their own, the returned rows may still happen to come back in primary-key order (if a primary key is present), but without an ORDER BY clause the order is not guaranteed. In Postgres and other SQL databases, the way to order the result set of a query is the ORDER BY clause.

Sorting rows has an additional cost attached to it, which can lead to interesting plan choices when executing a query. The planner generates multiple paths for query execution, each path representing a different way to get the results for the query. The path with the lowest cost is checked for whether it already produces sorted results; if not, a sort step has to be added and the final costs re-compared against the other paths. If it is already sorted, it remains the cheapest path and is used to execute the query.

B+tree indexes keep their entries sorted by construction, so a table with a primary key always has a sorted index on that field (the heap itself is stored unordered). When building index paths, the planner considers both forward and backward scans of the index, so sorting in ascending versus descending order makes no significant difference. Sorting on an indexed field in most cases performs better than sorting on a non-indexed one.
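As a quick sanity check of the backward-scan point (using the bar table and idx_id1_id2_id3 index created below), ordering descending on the leading index column typically yields a backward scan of the same index rather than an explicit sort:

postgres=# EXPLAIN SELECT * FROM bar WHERE id1 >= 1000000 AND id1 < 2000001 ORDER BY id1 DESC;
-- typically: Index Scan Backward using idx_id1_id2_id3 on bar
--            Index Cond: ((id1 >= 1000000) AND (id1 < 2000001))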

postgres=# CREATE TABLE bar(id1 INT, id2 INT, id3 INT, id4 INT, descr TEXT);  -- new table
postgres=# INSERT INTO bar SELECT i, i*3, i+i, i*2, 'hello' || i FROM generate_series(1, 10000000) i; -- 10M records
postgres=# CREATE INDEX idx_id1_id2_id3 ON bar(id1, id2, id3);

postgres=# explain analyze select * from bar where id1 >= 1000000 and id1 < 2000001 order by id1; -- there is an index on id1
                                                                QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_id1_id2_id3 on bar  (cost=0.43..168152.90 rows=998886 width=28) (actual time=0.046..139.920 rows=1000001.00 loops=1)
   Index Cond: ((id1 >= 1000000) AND (id1 < 2000001))
   Index Searches: 1
   Buffers: shared hit=11188
 Planning Time: 0.126 ms
 Execution Time: 187.269 ms
(6 rows)

At a minimum, two kinds of IO are issued: the index is scanned, then heap lookups fetch the columns that are not in the index. There is no explicit sort step, and CPU usage is low because no key comparisons are needed -- the returned data is already ordered on the sort key.
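If the query selects only columns that are stored in the index, even the heap lookups can often be skipped via an index-only scan (subject to the visibility map being up to date, hence the VACUUM):

postgres=# VACUUM bar;  -- refresh the visibility map so the index-only scan can skip heap visibility checks
postgres=# EXPLAIN SELECT id1, id2, id3 FROM bar WHERE id1 >= 1000000 AND id1 < 2000001 ORDER BY id1;
-- typically: Index Only Scan using idx_id1_id2_id3 on bar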

Using a non-indexed field

postgres=# explain analyze select * from bar where id1 >= 1000000 and id1 < 2000001 order by id4;
                                                                   QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=291593.20..294090.41 rows=998886 width=28) (actual time=404.509..509.841 rows=1000001.00 loops=1)
   Sort Key: id4
   Sort Method: external merge  Disk: 38208kB
   Buffers: shared hit=11188, temp read=9547 written=9564
   ->  Index Scan using idx_id1_id2_id3 on bar  (cost=0.43..168152.90 rows=998886 width=28) (actual time=0.033..142.248 rows=1000001.00 loops=1)
         Index Cond: ((id1 >= 1000000) AND (id1 < 2000001))
         Index Searches: 1
         Buffers: shared hit=11188
 Planning Time: 0.112 ms
 Execution Time: 565.804 ms
(10 rows)

In addition to the index scan and heap lookups, there is now an explicit sort step at the top level, leading to higher CPU usage for the key comparisons. Also, each sort only has work_mem worth of memory to work with. If the returned data does not fit in the allocated work_mem, the sort spills to disk and additional IOs are issued: here the sort step reads 9547 temp pages from disk and writes 9564 temp pages out.
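Spilling sorts can also be spotted outside EXPLAIN: the log_temp_files setting (threshold in kB; 0 logs every temporary file) makes Postgres log each temporary file when it is deleted, so queries that sort on disk show up in the server log:

postgres=# SET log_temp_files = 0;  -- log all temporary file usage
-- a spilling sort then produces a server log line of the form:
-- LOG:  temporary file: path "base/pgsql_tmp/...", size ...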

In an attempt to reduce the disk IO during sorting, we can increase the work_mem size (remember work_mem is allocated per sort or hash operation, and a single query can run several of them, so be careful)

postgres=# SET work_mem = 102400;
SET
postgres=# SHOW work_mem;
 work_mem
----------
 100MB
(1 row)
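To keep such a change from leaking into the rest of the session, work_mem can also be raised with SET LOCAL inside a transaction; the setting reverts automatically at COMMIT or ROLLBACK:

postgres=# BEGIN;
postgres=# SET LOCAL work_mem = '100MB';
postgres=# explain analyze select * from bar where id1 >= 1000000 and id1 < 2000001 order by id4;
postgres=# COMMIT;  -- work_mem reverts to its previous value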

postgres=# explain analyze select * from bar where id1 >= 1000000 and id1 < 2000001 order by id4;
                                                                    QUERY PLAN

--------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=209251.10..211748.31 rows=998886 width=28) (actual time=245.392..296.690 rows=1000001.00 loops=1)
   Sort Key: id4
   Sort Method: quicksort  Memory: 71452kB
   Buffers: shared hit=10089
   ->  Bitmap Heap Scan on bar  (cost=21199.02..109712.31 rows=998886 width=28) (actual time=26.995..129.239 rows=1000001.00 loops=1)
         Recheck Cond: ((id1 >= 1000000) AND (id1 < 2000001))
         Heap Blocks: exact=7354
         Buffers: shared hit=10089
         ->  Bitmap Index Scan on idx_id1_id2_id3  (cost=0.00..20949.30 rows=998886 width=0) (actual time=25.790..25.791 rows=1000001.00 loops=1)
               Index Cond: ((id1 >= 1000000) AND (id1 < 2000001))
               Index Searches: 1
               Buffers: shared hit=2735
 Planning Time: 0.141 ms
 Execution Time: 348.114 ms
(14 rows)

There is some improvement in execution time since the sorting is now done entirely in memory, and the plan also changes to a bitmap scan (larger work_mem makes the in-memory bitmap cheaper to build). It is not advisable to 'blindly' increase work_mem: it can mislead the planner into choosing an inferior plan, and it leads to sub-optimal memory usage since each sort gets a work_mem-sized chunk of memory. It is used here only for demo purposes.

Using both indexed and non-indexed fields

postgres=# explain analyze select * from bar where id1 >= 1000000 and id1 < 2000001 order by id1, id4;
                                                                   QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------------------
 Incremental Sort  (cost=0.61..208698.77 rows=998886 width=28) (actual time=0.061..293.947 rows=1000001.00 loops=1)
   Sort Key: id1, id4
   Presorted Key: id1
   Full-sort Groups: 31251  Sort Method: quicksort  Average Memory: 26kB  Peak Memory: 26kB
   Buffers: shared hit=10089
   ->  Index Scan using idx_id1_id2_id3 on bar  (cost=0.43..163748.90 rows=998886 width=28) (actual time=0.041..144.617 rows=1000001.00 loops=1)
         Index Cond: ((id1 >= 1000000) AND (id1 < 2000001))
         Index Searches: 1
         Buffers: shared hit=10089
 Planning Time: 0.112 ms
 Execution Time: 341.344 ms
(11 rows)

The planner opts for an incremental sort plan, which applies when the input is already sorted on a prefix of the required sort key, but not all of it. In this case the sort key is on two fields, id1 and id4. There is an index on id1, so the rows arrive already sorted on that column; instead of a full sort on both columns, each group of rows with equal id1 only needs to be sorted on id4. This saves CPU cycles and performs better than sorting both fields from scratch. There have been past cases where incremental sort produced plans with issues; it can be turned off with SET enable_incremental_sort = off.
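To compare against a plain full sort of the same query, incremental sort can be disabled for the session (plan output elided; a top-level Sort node on (id1, id4) would be expected instead):

postgres=# SET enable_incremental_sort = off;
postgres=# explain analyze select * from bar where id1 >= 1000000 and id1 < 2000001 order by id1, id4;
postgres=# SET enable_incremental_sort = on;  -- restore the default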

In short, it helps to sort on columns that are already ordered, and where only a prefix of the sort key is ordered, an incremental sort plan can still help. A helpful way to build indexes for ordered queries is a multi-column index (or even a covering index), with the first key column used to limit the data returned and the following columns used to aid the sorting.

SELECT * FROM test WHERE fld1 >= [start_limit] AND fld1 < [end_limit] ORDER BY fld2;
-- index on both fld1 and fld2:
CREATE INDEX idx_f1_f2 ON test(fld1, fld2);
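If the query also reads a few non-sort columns, those can be attached as non-key INCLUDE columns (available since Postgres 11) so an index-only scan stays possible; fld3 here is an illustrative column name:

CREATE INDEX idx_f1_f2_cov ON test(fld1, fld2) INCLUDE (fld3);
-- fld3 is stored in the index leaf pages but does not affect the index sort order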

Unused indexes can lead to unnecessary bloat and slow down writes, since every INSERT and UPDATE must maintain them. It is important to only add useful columns to an index, to keep it as small as possible. Test and measure to determine what works best for your case.
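Candidate unused indexes can be found via the cumulative statistics views; a common starting point is indexes with zero scans since the statistics were last reset:

postgres=# SELECT relname, indexrelname, idx_scan FROM pg_stat_user_indexes WHERE idx_scan = 0 ORDER BY pg_relation_size(indexrelid) DESC;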