Behind the Query: The Secret Life of Database Indexes

Hi developers, I hope your database queries aren't taking too long to find their records. But if they are... well, that's why I'm writing this article.
Before diving into the secret life of our database indexes, let's first understand why we even need them in the first place.
Can't we use our database without them?
Are my queries slow if I don't use them?
The answer is both "Yes" and "No". Let's look at why.
Why should you even care about indexes?
A database can work without indexes, but think of an index as a separate lookup table that stores only the indexed fields along with a pointer to the memory location of the actual record.
Simply put, an index is a data structure that helps find rows fast without scanning the whole table.
Think of it like this:
Suppose you have a 1000-page dictionary. If you want to look up a specific word, you don't scan every page; you jump through it alphabetically, which drastically reduces the number of lookups needed to find the right word.
Indexes work the same way. When you create an index on a field, that field is copied into a separate structure that holds only the indexed field's values and a pointer to the row. When a query runs, the database decides which indexes to use: if it finds an index on the filtered or sorted field, it uses the index table to locate the record instead of walking through a million rows.
And obviously, it is fast.
In a nutshell we can say,
"Indexes are the difference between instant results and your app timing out."
But how does it make my query faster?
To simplify this, let's take an example. As we know, all data is eventually written to disk, and on disk, data is stored in blocks. For our example, let's say one block stores 4 kilobytes of data (just assume).
Now, let's look at the database table we will use for the demonstration.
The table looks something like this:
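(A sketch of such a students table; the exact columns are assumptions, chosen so that one row works out to roughly 400 bytes.)
CREATE TABLE students (
  id      INT PRIMARY KEY,
  name    VARCHAR(100),
  email   VARCHAR(150),
  address VARCHAR(150),
  age     INT
  -- ...a few more columns, enough for one row to take ~400 bytes
);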
Let's assume one row of this table takes 400 bytes of disk storage. If we do the math, that means ~10 rows/records fit in one disk block: rows 1-10 are stored in block 1, rows 11-20 in block 2, and so on.
Our table has 1000 rows, which means 100 blocks are required to store our table.
Every time we need to read even a single record, we have to read the whole block, load it into memory, and then operate on it.
Okay, now our user wants all the students whose age is 16.
If the user runs this query without an index, we have to go through every block: load it into memory, read its 10 rows, and check each one for students with age 16.
Hypothetically, let's assume one block read takes 1 second. To read 100 blocks we then need 100 seconds, which means this simple query now takes 100 seconds to fetch its results.
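Sketched out as SQL, using the hypothetical students table and the numbers from above:
-- The query our user runs:
SELECT * FROM students WHERE age = 16;
-- Without an index: 1000 rows / 10 rows per block = 100 block reads
-- 100 block reads * 1 second each = 100 seconds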
What if we now have an index on the age field?
The indexed field is copied into a new collection like this:
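(Creating the index is one statement; the entries shown in the comments are a conceptual sketch, not real output.)
CREATE INDEX idx_students_age ON students (age);
-- Conceptually, the new structure holds only (age, pointer) pairs, e.g.:
--   age 14 -> pointer to the block holding that row
--   age 16 -> pointer to the block holding that row
--   ...
-- Each entry is tiny compared to the full 400-byte row.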
Now, let's say each index entry takes 40 bytes of storage. With 1000 rows, that's 40,000 bytes in total.
To store this on disk, we need ~10 blocks.
Now consider the same query, this time using the index.
Read the whole index table: 10 blocks, so 10 seconds.
Let's say we find 10 matching records and, in the worst case, they are spread across 10 different data blocks. To fetch them, we have to read each of those blocks: 10 seconds again.
If we calculate,
10 s (to read the index table) + 10 s (to fetch the actual records) = 20 seconds.
That's a significant gain in our case: from 100 seconds down to 20 seconds, roughly a 5x speedup in fetching results, and that's huge. Scale this to 1 million rows and the effect is even more dramatic.
Again, these numbers are purely hypothetical; they're only meant to show that indexed lookups are faster, and why.
But hey, I heard that indexes use a B-tree, and this is just a lookup table.
Yes, let's get into some real optimization.
Right. Since indexes are typically stored as a B-tree, we can organize this lookup table as a B-tree, where one node (holding multiple values) acts like one block of storage.
Now, imagine arranging the index table's 10 blocks into a B-tree-like structure (you've probably always seen a B-tree drawn vertically; here, think of it as mounted horizontally).
If the value is between 8 and 9, look into block 1; if it's between 9 and 10, look into block 2; and so on.
And to store this new index lookup table (the B-tree root), we only need 1 block, since there are only 10 entries in it.
Now, in this case, the user is searching for age = 16. We read only the root node of the B-tree (1 block) to find out which index block holds the data, then read that exact index block, and then the actual records:
Load the B-tree root into memory (1 block)
Find the required pointer (to the right index table block) in log(n) time
Load that block into memory
Finally, load the required blocks from the main table
Finally, this is what our whole structure looks like:
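(Roughly sketched, using the block counts from our example:)
B-tree root: 1 block, holding the age ranges
  -> index table: ~10 blocks, holding (age, pointer) entries
     -> main table: ~100 blocks, holding the full 400-byte rows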
Now, if we walk through it: 1 block read for the B-tree root, which tells us which index block covers the user's filtered range; then 1 more block read for that index block; and finally the 10 matching rows that are (worst case) split across 10 different data blocks.
Calculations:
1 block (B-tree) + 1 block (indexed records) + 10 blocks (full records) = results
1s + 1s + 10s = 12 seconds, all the way from 100 seconds.
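To recap our hypothetical numbers side by side:
Full table scan: 100 block reads = 100 seconds
Flat index table: 10 + 10 block reads = 20 seconds
B-tree index: 1 + 1 + 10 block reads = 12 seconds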
Phew, that's a lot going on there. I hope I was able to show you how indexes improve search time.
I get the gist, but can you explain B-Tree more?
Gotcha, friend. An index is not stored in the table it is created on; it is a separate database structure in itself, a whole collection where only the indexed fields are stored. The most common way to store indexes is in B-Trees, a kind of balanced tree.
Data is stored like this:
Image source: qwertee.io
If you want to learn more about B-tree index in detail, refer this blog by qwertee.io.
Why a B-tree? Because search time in a B-tree is very low compared to most other data structures, especially when the data lives on disk.
You might argue that a balanced BST and a B-Tree have similar asymptotic search times, so what's the difference?
Think of a B-Tree node as a block (like a disk page): reading one node gets you many keys at once, minimizing costly I/O. The tree's height is much lower than a balanced BST's because B-Tree nodes are broader.
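To put rough numbers on it (purely hypothetical: ~1 million keys, and ~100 keys fitting in one node/block):
Balanced BST: height ≈ log2(1,000,000) ≈ 20, so roughly 20 block reads per lookup
B-Tree: height ≈ log100(1,000,000) = 3, so roughly 3 block reads per lookup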
How does the DB decide between using an index and a full table scan?
SQL: A Multi-Stage Process
Relational databases like PostgreSQL, MySQL, and SQL Server follow a classic architecture for query execution:
Parsing – SQL is parsed into a syntax tree.
Planning/Optimization – The query planner analyzes different strategies, using available indexes, stats, and constraints.
Execution – The most efficient plan is executed, often using indexes to avoid full table scans.
Example: If there's an index on email, the planner chooses an index scan instead of a sequential scan of the whole table.
SQL: Parser → Optimizer → Executor → Storage Engine
NoSQL: Indexing is Not Optional
In NoSQL databases like MongoDB or Cassandra, you often can't rely on a table scan as a fallback (it either isn't available or is extremely slow), so efficient querying requires indexing.
MongoDB uses a query planner similar to SQL engines but works with JSON-like documents.
Cassandra relies heavily on partition keys and secondary indexes, designed around high-throughput, distributed access.
If there's no index in MongoDB, a full collection scan is performed—which can be expensive.
NoSQL (e.g., MongoDB): Query Engine → Query Planner → Cursor Traversal.
How can I tell if my query is benefiting from an index?
MySQL – Use EXPLAIN
Syntax:
EXPLAIN SELECT * FROM users WHERE email = 'test@example.com';
What it shows:
Whether it's using an index (the type column shows something like ref, range, or index instead of ALL)
What index is being used (key)
How many rows are being scanned
To see all indexes on a table:
SHOW INDEXES FROM users;
MongoDB – Use .explain()
Syntax:
db.users.find({ email: "test@example.com" }).explain("executionStats")
If you see "stage": "IXSCAN" → you're using an index
If you see "stage": "COLLSCAN" → full collection scan ❌ (slow!)
To see all indexes:
db.users.getIndexes()
Some common pitfalls
If you think you can slap an index on every field and your database will suddenly become blazingly fast, querying in nanoseconds, fellow dev, that's not the case.
- Too many indexes = slower writes. Every index you add means another index tree that must be maintained: every time the indexed field's value is inserted or updated, the tree has to be updated as well to place the new value at the right leaf.
- Wrong order in composite indexes. You can create composite indexes that include multiple fields, but the order of the columns matters: the index is only used when your query filters on a leftmost prefix of those columns (see the sketch after this list).
- Indexing low-cardinality columns (e.g., gender, boolean). If a field has only a handful of distinct values, such as a gender field or an isActive flag, the index can only split the data into a few large groups, so each lookup still touches a big chunk of the table. It's faster than no index at all, but not by much.
- Forgetting to revisit indexes after big migrations or bulk operations. After adding new fields or running data migrations, review your indexes again to make sure the new fields and query patterns are covered (if required).
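To make the composite-index pitfall concrete, here's a minimal SQL sketch (the table, columns, and index name are hypothetical):
CREATE INDEX idx_users_name ON users (last_name, first_name);
-- Can use the index: the query filters on the leftmost column (or both):
SELECT * FROM users WHERE last_name = 'Sharma';
SELECT * FROM users WHERE last_name = 'Sharma' AND first_name = 'Umang';
-- Usually cannot use it efficiently: the leftmost column is skipped:
SELECT * FROM users WHERE first_name = 'Umang';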
When NOT to Use an Index
Let's see when you should not use indexes:
- Small tables where a full scan is faster. If the table is small, with only a couple of thousand records, creating and maintaining an index is heavy work compared to the lightweight scan it replaces.
- Frequently updated or deleted fields (index maintenance overhead). Pretty self-explanatory: if an indexed field gets updated or deleted frequently, indexing it might not be a great idea, since every change adds index maintenance overhead on the database.
- Write-heavy workloads where read speed isn't crucial. Indexes slow down write operations. If a field sits on a write-heavy path and read speed isn't crucial, you're better off leaving it unindexed.
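And if you spot an index that falls into one of these buckets and isn't earning its keep, dropping it is a one-liner (MySQL-style syntax, hypothetical index name):
-- Remove an index whose write overhead outweighs its read benefit:
DROP INDEX idx_users_last_login ON users;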
So, let's wrap up the indexes section here. I hope you learned something new or brushed up on what you already knew about the secret life of indexes.
Let's end this post with an un-indexed line:
An unindexed query walks into a bar. It takes forever to find a seat.
That's Umang, signing off!!
Thanks for reading this post.