PostgreSQL JSONB Indexing Limitations with B-Tree and GIN

In a previous post, I highlighted a significant limitation of relational databases: a One-to-Many relationship requires two tables, yet no index can optimize access with selective predicates on both of them, resulting in unnecessary rows that must be joined before elimination. A comment from @dogacel suggested continuing with PostgreSQL's JSON approach, so here it is. SQL databases have evolved into general-purpose or converged databases capable of storing documents, like JSONB in PostgreSQL. However, they face challenges with indexing compared to document databases like MongoDB, which offer multi-key indexes for optimizing equality, sort, and range filtering. Let's explore JSONB documents and their indexing options. Instead of an Order Details table, I embed the details into the Orders table: create table orders( primary key(id) , id bigserial , country_id int , created_at timestamptz default clock_timestamp() , details jsonb ); I insert one million orders similar to the previous post: insert into orders (country_id, details) select (10 * random())::int as country_id, -- generate a random country_id ( select jsonb_agg( jsonb_build_object( 'line', gs, 'product_id', ((log(2, (1 + o.id * random())::numeric)::int)), 'quantity', (100 * random())::int ) ) from generate_series(1, 10) as gs -- create 10 lines per order ) as details from generate_series(1, 1000000) as o(id); -- generate 1,000,000 orders Here is an example of Order with its details: postgres=# select * from orders limit 1; -[ RECORD 1 ]-------------------------------------------------------- id | 1 country_id | 8 created_at | 2025-04-15 19:12:27.525731+00 details | [ {"line": 1, "quantity": 6, "product_id": 0}, {"line": 2, "quantity": 91, "product_id": 1}, {"line": 3, "quantity": 66, "product_id": 1}, {"line": 4, "quantity": 100, "product_id": 1}, {"line": 5, "quantity": 15, "product_id": 0}, {"line": 6, "quantity": 80, "product_id": 0}, {"line": 7, "quantity": 96, "product_id": 1}, {"line": 8, "quantity": 9, "product_id": 0}, {"line": 9, "quantity": 89, "product_id": 0}, {"line": 10, "quantity": 14, "product_id": 0} ] I want to run a simple query to retrieve the last ten orders from a specific country that contain a specific product: SELECT country_id, details FROM orders WHERE country_id = 1 AND details @> '[{"product_id": 15}]' ORDER BY created_at DESC LIMIT 10 ; Be aware of filters at various levels: on the order for country and date, and on the details for the product. With a normalized model and two tables, it was impossible to have one index covering all those filters. However, using MongoDB made it straightforward to create an index on "country_id", "order_details.product_id", and "created_at", resulting in only ten documents being read. Expression index on scalar values The normalized columns in the row can easily be indexed with a B-Tree: CREATE INDEX orders1 ON orders ("country_id", "created_at" desc) ; A similar B-Tree index can be created on the JSONB attributes as long as they are top-level, or sub-documents without repeating groups (without JSON array). However, it is impossible to create such index when the JSON path has an array because in PostgreSQL a B-Tree index cannot have multiple index key for the same row. Of course, it is possible to create any index, as there's no schema declared for the JSONB content: CREATE INDEX orders0 ON orders ( (("details"->>'product_id')::int) ) ; However, such index will never find a document where "product_id" is an array as it indexes a scalar expression. A query on the content of the array cannot use the index: postgres=# set enable_seqscan=off; SET postgres=# EXPLAIN (ANALYZE, COSTS OFF, BUFFERS, SUMMARY OFF) SELECT * from orders WHERE details @> '[{"product_id": 15}]' ; QUERY PLAN --------------------------------------------------------------------- Seq Scan on orders (actual time=38.631..1293.388 rows=446355 loops=1) Filter: (details @> '[{"product_id": 15}]'::jsonb) Rows Removed by Filter: 553645 Buffers: shared hit=16128 read=94984 Generalized Inverted Index (GIN) for arrays In order to index though an array, we need an inverted index and PostgreSQL has GIN: CREATE INDEX orders2 ON orders using GIN ( details ) ; This index is used for the predicate on the array: postgres=# set enable_seqscan=off; SET postgres=# EXPLAIN (ANALYZE, COSTS OFF, BUFFERS, SUMMARY OFF) SELECT * from orders WHERE details @> '[{"product_id": 15}]' ; QUERY PLAN ---------------------------------------------------

Apr 15, 2025 - 22:14

PostgreSQL JSONB Indexing Limitations with B-Tree and GIN

In a previous post, I highlighted a significant limitation of relational databases: a One-to-Many relationship requires two tables, yet no index can optimize access with selective predicates on both of them, resulting in unnecessary rows that must be joined before elimination. A comment from @dogacel suggested continuing with PostgreSQL's JSON approach, so here it is.

SQL databases have evolved into general-purpose or converged databases capable of storing documents, like JSONB in PostgreSQL. However, they face challenges with indexing compared to document databases like MongoDB, which offer multi-key indexes for optimizing equality, sort, and range filtering. Let's explore JSONB documents and their indexing options.

Instead of an Order Details table, I embed the details into the Orders table:

create table orders(
 primary key(id)
 , id bigserial 
 , country_id int
 , created_at timestamptz default clock_timestamp()
 , details jsonb
);

I insert one million orders similar to the previous post:

insert into orders (country_id, details)
select
 (10 * random())::int as country_id,  -- generate a random country_id
 (
  select jsonb_agg(
   jsonb_build_object(
    'line', gs,
    'product_id', ((log(2, (1 + o.id * random())::numeric)::int)),
    'quantity', (100 * random())::int
   )
  )
  from generate_series(1, 10) as gs  -- create 10 lines per order
 ) as details
from generate_series(1, 1000000) as o(id);  -- generate 1,000,000 orders

Here is an example of Order with its details:

postgres=# select * from orders limit 1;
-[ RECORD 1 ]--------------------------------------------------------
id         | 1
country_id | 8
created_at | 2025-04-15 19:12:27.525731+00
details    | [
                {"line": 1,  "quantity": 6,   "product_id": 0},
                {"line": 2,  "quantity": 91,  "product_id": 1},
                {"line": 3,  "quantity": 66,  "product_id": 1},
                {"line": 4,  "quantity": 100, "product_id": 1},
                {"line": 5,  "quantity": 15,  "product_id": 0},
                {"line": 6,  "quantity": 80,  "product_id": 0},
                {"line": 7,  "quantity": 96,  "product_id": 1},
                {"line": 8,  "quantity": 9,   "product_id": 0},
                {"line": 9,  "quantity": 89,  "product_id": 0},
                {"line": 10, "quantity": 14,  "product_id": 0}
               ]

I want to run a simple query to retrieve the last ten orders from a specific country that contain a specific product:

SELECT    country_id, details
 FROM     orders
 WHERE    country_id = 1 
 AND      details @> '[{"product_id": 15}]'
 ORDER BY created_at DESC
 LIMIT    10
;

Be aware of filters at various levels: on the order for country and date, and on the details for the product. With a normalized model and two tables, it was impossible to have one index covering all those filters. However, using MongoDB made it straightforward to create an index on "country_id", "order_details.product_id", and "created_at", resulting in only ten documents being read.

Expression index on scalar values

The normalized columns in the row can easily be indexed with a B-Tree:

CREATE INDEX orders1 ON orders ("country_id", "created_at" desc)
;

A similar B-Tree index can be created on the JSONB attributes as long as they are top-level, or sub-documents without repeating groups (without JSON array). However, it is impossible to create such index when the JSON path has an array because in PostgreSQL a B-Tree index cannot have multiple index key for the same row.

Of course, it is possible to create any index, as there's no schema declared for the JSONB content:

CREATE INDEX orders0 ON orders ( (("details"->>'product_id')::int) )
;

However, such index will never find a document where "product_id" is an array as it indexes a scalar expression.

A query on the content of the array cannot use the index:

postgres=# set enable_seqscan=off;
SET
postgres=# EXPLAIN (ANALYZE, COSTS OFF, BUFFERS, SUMMARY OFF)
SELECT * from orders WHERE details @> '[{"product_id": 15}]'
;
                              QUERY PLAN                               
---------------------------------------------------------------------
 Seq Scan on orders (actual time=38.631..1293.388 rows=446355 loops=1)
   Filter: (details @> '[{"product_id": 15}]'::jsonb)
   Rows Removed by Filter: 553645
   Buffers: shared hit=16128 read=94984

Generalized Inverted Index (GIN) for arrays

In order to index though an array, we need an inverted index and PostgreSQL has GIN:

CREATE INDEX orders2 ON orders using GIN ( details )
;

This index is used for the predicate on the array:

postgres=# set enable_seqscan=off;
SET
postgres=# EXPLAIN (ANALYZE, COSTS OFF, BUFFERS, SUMMARY OFF)
SELECT * from orders WHERE details @> '[{"product_id": 15}]'
;
                                     QUERY PLAN                                      
-------------------------------------------------------------------------------------
 Bitmap Heap Scan on orders (actual time=95.272..1196.876 rows=446355 loops=1)
   Recheck Cond: (details @> '[{"product_id": 15}]'::jsonb)
   Rows Removed by Index Recheck: 303886
   Heap Blocks: exact=42718 lossy=65930
   Buffers: shared hit=371 read=108648 written=3415
   ->  Bitmap Index Scan on orders2 (actual time=63.730..63.730 rows=499460 loops=1)
         Index Cond: (details @> '[{"product_id": 15}]'::jsonb)
         Buffers: shared hit=371

I disabled Seq Scan because this predicate alone is not selective enough, returning half of the rows, but it proves that the GIN index can be used.

However, when I run the full query with additional WHERE and ORDER BY LIMIT filters, which are covered by the B-Tree index, the GIN index is no longer used:

postgres=# EXPLAIN (ANALYZE, COSTS OFF, BUFFERS, SUMMARY OFF)
SELECT    country_id, details
 FROM     orders
 WHERE    country_id = 1 
 AND      details @> '[{"product_id": 15}]'
 ORDER BY created_at DESC
 LIMIT    10
;
                                     QUERY PLAN                                      
-------------------------------------------------------------------------------------
 Limit (actual time=0.036..0.117 rows=10 loops=1)
   Buffers: shared hit=32
   ->  Index Scan using orders1 on orders (actual time=0.035..0.115 rows=10 loops=1)
         Index Cond: (country_id = 1)
         Filter: (details @> '[{"product_id": 15}]'::jsonb)
         Rows Removed by Filter: 38
         Buffers: shared hit=32

To understand the reason, I remove the LIMIT clause:

postgres=# EXPLAIN (ANALYZE, COSTS OFF, BUFFERS, SUMMARY OFF)
SELECT    country_id, details
 FROM     orders
 WHERE    country_id = 1 
 AND      details @> '[{"product_id": 15}]'
 ORDER BY created_at DESC
 -- LIMIT    10
;
                                              QUERY PLAN                                               
-------------------------------------------------------------------------------------------------------
 Gather Merge (actual time=237.895..257.523 rows=44864 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=158 read=61738, temp read=4584 written=4592
   ->  Sort (actual time=222.683..226.142 rows=14955 loops=3)
         Sort Key: created_at DESC
         Sort Method: external merge  Disk: 12688kB
         Buffers: shared hit=158 read=61738, temp read=4584 written=4592
         Worker 0:  Sort Method: external merge  Disk: 11616kB
         Worker 1:  Sort Method: external merge  Disk: 12368kB
         ->  Parallel Bitmap Heap Scan on orders (actual time=74.822..211.433 rows=14955 loops=3)
               Recheck Cond: ((country_id = 1) AND (details @> '[{"product_id": 15}]'::jsonb))
               Rows Removed by Index Recheck: 96641
               Heap Blocks: exact=9701 lossy=11292
               Buffers: shared hit=145 read=61733
               ->  BitmapAnd (actual time=78.189..78.190 rows=0 loops=1)
                     Buffers: shared hit=145 read=613
                     ->  Bitmap Index Scan on orders1 (actual time=11.496..11.497 rows=100362 loops=1)
                           Index Cond: (country_id = 1)
                           Buffers: shared read=387
                     ->  Bitmap Index Scan on orders2 (actual time=64.445..64.445 rows=499460 loops=1)
                           Index Cond: (details @> '[{"product_id": 15}]'::jsonb)
                           Buffers: shared hit=145 read=226

PostgreSQL can utilize both indexes, but it requires a Bitmap Scan to combine them, which does not preserve the B-Tree index order. Consequently, it must perform a Sort operation that reads all rows before returning the first one, making it unsuitable for the LIMIT clause due to excessive row reading.

Composite B-Tree + GIN with extension

With the btree_gin extension available in contrib, I can create a single index that is supposed to combine the advantages of B-Tree and GIN indexes:

postgres=# CREATE EXTENSION BTREE_GIN;
CREATE EXTENSION

postgres=# CREATE INDEX orders3 ON orders
           using GIN (country_id , details, created_at);
CREATE INDEX

Running it without LIMIT shows that it doesn't solve the problem:

postgres=# EXPLAIN (ANALYZE, COSTS OFF, BUFFERS, SUMMARY OFF)
SELECT    country_id, details
 FROM     orders
 WHERE    country_id = 1 
 AND      details @> '[{"product_id": 15}]'
 ORDER BY created_at DESC
 -- LIMIT    10
;
                                            QUERY PLAN                                             
---------------------------------------------------------------------------------------------------
 Gather Merge (actual time=123.675..143.726 rows=44864 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=237 read=40117, temp read=4585 written=4594
   ->  Sort (actual time=109.979..113.574 rows=14955 loops=3)
         Sort Key: created_at DESC
         Sort Method: external merge  Disk: 12456kB
         Buffers: shared hit=237 read=40117, temp read=4585 written=4594
         Worker 0:  Sort Method: external merge  Disk: 11720kB
         Worker 1:  Sort Method: external merge  Disk: 12504kB
         ->  Parallel Bitmap Heap Scan on orders (actual time=18.096..98.799 rows=14955 loops=3)
               Recheck Cond: ((country_id = 1) AND (details @> '[{"product_id": 15}]'::jsonb))
               Rows Removed by Index Recheck: 1760
               Heap Blocks: exact=13486
               Buffers: shared hit=226 read=40110
               ->  Bitmap Index Scan on orders3 (actual time=25.197..25.197 rows=50144 loops=1)
                     Index Cond: ((country_id = 1) AND (details @> '[{"product_id": 15}]'::jsonb))
                     Buffers: shared hit=226 read=251

Only one index was scanned, with all conditions in its Index Cond, but it is still a Bitmap Scan which doesn't preserve the index key order, because that's how PostgreSQL inverted index works. A GIN index, even with the BTREE_GIN extension, is not an equivalent to MongoDB multi-key indexes which have their index entries ordered.

Even if my new index has all columns, it is not used by the pagination query:

postgres=# EXPLAIN (ANALYZE, COSTS OFF, BUFFERS, SUMMARY OFF)
SELECT    country_id, details
 FROM     orders
 WHERE    country_id = 1 
 AND      details @> '[{"product_id": 15}]'
 ORDER BY created_at DESC
 LIMIT    10
;
                                     QUERY PLAN                                      
-------------------------------------------------------------------------------------
 Limit (actual time=0.034..0.117 rows=10 loops=1)
   Buffers: shared hit=32
   ->  Index Scan using orders1 on orders (actual time=0.034..0.115 rows=10 loops=1)
         Index Cond: (country_id = 1)
         Filter: (details @> '[{"product_id": 15}]'::jsonb)
         Rows Removed by Filter: 38
         Buffers: shared hit=32

If I drop the B-Tree index, the B-Tree GIN is used, but not efficient as it has to read and sort all rows before returning the ten row for the result:

postgres=# drop index orders1;
DROP INDEX
postgres=# EXPLAIN (ANALYZE, COSTS OFF, BUFFERS, SUMMARY OFF)
SELECT    country_id, details
 FROM     orders
 WHERE    country_id = 1 
 AND      details @> '[{"product_id": 15}]'
 ORDER BY created_at DESC
 LIMIT    10
;
                                            QUERY PLAN                                             
---------------------------------------------------------------------------------------------------
 Limit (actual time=215.701..215.704 rows=10 loops=1)
   Buffers: shared hit=229 read=40107 written=5
   ->  Sort (actual time=215.700..215.701 rows=10 loops=1)
         Sort Key: created_at DESC
         Sort Method: top-N heapsort  Memory: 46kB
         Buffers: shared hit=229 read=40107 written=5
         ->  Bitmap Heap Scan on orders (actual time=31.466..204.149 rows=44864 loops=1)
               Recheck Cond: ((country_id = 1) AND (details @> '[{"product_id": 15}]'::jsonb))
               Rows Removed by Index Recheck: 5280
               Heap Blocks: exact=39859
               Buffers: shared hit=229 read=40107 written=5
               ->  Bitmap Index Scan on orders3 (actual time=24.544..24.544 rows=50144 loops=1)
                     Index Cond: ((country_id = 1) AND (details @> '[{"product_id": 15}]'::jsonb))
                     Buffers: shared hit=229 read=248

Seeing all conditions in the Index Cond is not sufficient, especially with GIN indexes because they return false positives that must be rechecked later. Even if you don't have an ORDER BY LIMIT, the scan itself reads more rows than necessary, as reported by Rows Removed by Index Recheck.

Conclusion

PostgreSQL's JSONB with GIN indexing allows for schema-on-read capabilities within a schema-on-write relational model. While it enables document storage, it does not convert PostgreSQL into a document database.
Although GIN indexes efficiently handle equality predicates on array items, they are less effective for range or sorting operations compared to MongoDB's multi-key indexes, which support equality, sorting, and range queries transparently. You may use JSON datatypes to provide more flexibility to the relational model, but SQL databases should be used as relational databases because their engine is optimized for it. Always check the execution plan to avoid surprises.