A Deep Dive into Apache Doris Indexes

A Deep Dive into Apache Doris Indexes Developers in the big data field know that quickly retrieving data from a vast amount of information is like searching for a specific star in the constellations—extremely challenging. But don't worry! Database indexes are our “positioning magic tools,” capable of significantly boosting query efficiency. Take Apache Doris, a popular analytical database, for example. It supports several types of indexes, each with its own unique features, enabling it to excel in various query scenarios. Today, let's explore Apache Doris indexes in detail and uncover the secrets behind their remarkable performance. I. Classification and Principles of Indexes (A) Point Query Indexes: Precise Targeting of Data Points Prefix Indexes: Shortcuts on Sorted Keys Apache Doris stores data in an ordered structure similar to SSTable, sorted by specified columns. For the three data models—Aggregate, Unique, and Duplicate—when creating a table, they are sorted based on the Aggregate Key, Unique Key, and Duplicate Key specified in the table creation statements. These sorted keys are like the category labels on a well - organized bookshelf. Prefix indexes, on the other hand, are sparse indexes built on these sorted keys. Imagine that every 1,024 rows of data form a logical data block, similar to a partition on a bookshelf. Each partition has an index entry in the prefix index table. This index entry serves as a “mini - directory” for the partition, with its content being the prefix formed by the sorted columns of the first row in the partition. When queries involve these sorted columns, the system can quickly locate the relevant data block through this compact “directory,” much like finding the required book category through the partition directory on a bookshelf. This significantly reduces the search range and accelerates queries. Note ⚠️ The length of Doris prefix indexes does not exceed 36 bytes. For example, if the sorted columns of a table are user_id (8 Bytes), age (4 Bytes), message (VARCHAR (100)), the prefix index might consist of user_id + age + the first 20 bytes of message (if the total length does not exceed 36 bytes). When the query condition is SELECT * FROM table WHERE user_id = 1829239 and age = 20;, the prefix index can quickly locate the logical data block containing the matching data. The query efficiency is much higher than SELECT * FROM table WHERE age = 20; because the latter cannot effectively utilize the prefix index. Inverted Indexes: Keyword Locators for Information Retrieval Since version 2.0.0, Doris has introduced inverted indexes, a powerful tool that plays a crucial role in the field of information retrieval. In the world of Doris, a row in a table is like a document, and a column is a field within the document. Inverted indexes are like highly efficient “keyword locators,” breaking down text into individual words and constructing an index from words to document numbers (i.e., rows in the table). For example, for a table containing user comments, after creating an inverted index on the comment column, when we want to query comments containing a specific keyword (such as “OLAP”), the inverted index can quickly locate the rows containing the keyword. It not only accelerates full - text retrieval for string types, supporting various keyword matching methods like matching multiple keywords simultaneously (MATCH_ALL), matching any one of the keywords (MATCH_ANY), and phrase queries (MATCH_PHRASE), but also accelerates ordinary equality and range queries, replacing the previous BITMAP index function. In terms of storage, inverted indexes use independent files, physically separated from data files. This allows indexes to be created and deleted without rewriting data files, greatly reducing processing overhead. Note ⚠️ Floating - point types with precision issues (FLOAT and DOUBLE) and some complex data types (such as MAP, STRUCT, etc.) do not currently support inverted indexes. (B) Skip - Indexes: Smartly Skipping “Irrelevant Data Blocks” ZoneMap Indexes: Statistical Detectives for Data Blocks ZoneMap indexes are like silent “statistical detectives,” automatically maintaining statistical information for each column. For each data file (Segment) and data block (Page), they record the maximum value, minimum value, and whether there are NULL values. When performing equality queries, range queries, or IS NULL queries, they can quickly determine whether the data file or data block is likely to contain data that meets the conditions based on this statistical information. If it is determined not to contain such data, just like a detective eliminating an irrelevant clue, the file or data block is skipped without being read, reducing I/O operations and accelerating queries. For example, in a table containing user ages, when querying data within a certain age range, the ZoneMap index can quick

Mar 31, 2025 - 11:03
 0
A Deep Dive into Apache Doris Indexes

A Deep Dive into Apache Doris Indexes

Developers in the big data field know that quickly retrieving data from a vast amount of information is like searching for a specific star in the constellations—extremely challenging. But don't worry! Database indexes are our “positioning magic tools,” capable of significantly boosting query efficiency.

Take Apache Doris, a popular analytical database, for example. It supports several types of indexes, each with its own unique features, enabling it to excel in various query scenarios. Today, let's explore Apache Doris indexes in detail and uncover the secrets behind their remarkable performance.

I. Classification and Principles of Indexes

(A) Point Query Indexes: Precise Targeting of Data Points

Prefix Indexes: Shortcuts on Sorted Keys

Apache Doris stores data in an ordered structure similar to SSTable, sorted by specified columns. For the three data models—Aggregate, Unique, and Duplicate—when creating a table, they are sorted based on the Aggregate Key, Unique Key, and Duplicate Key specified in the table creation statements.

These sorted keys are like the category labels on a well - organized bookshelf. Prefix indexes, on the other hand, are sparse indexes built on these sorted keys.

Imagine that every 1,024 rows of data form a logical data block, similar to a partition on a bookshelf. Each partition has an index entry in the prefix index table. This index entry serves as a “mini - directory” for the partition, with its content being the prefix formed by the sorted columns of the first row in the partition.

When queries involve these sorted columns, the system can quickly locate the relevant data block through this compact “directory,” much like finding the required book category through the partition directory on a bookshelf. This significantly reduces the search range and accelerates queries.

Note ⚠️ The length of Doris prefix indexes does not exceed 36 bytes.

For example, if the sorted columns of a table are user_id (8 Bytes), age (4 Bytes), message (VARCHAR (100)), the prefix index might consist of user_id + age + the first 20 bytes of message (if the total length does not exceed 36 bytes).

When the query condition is SELECT * FROM table WHERE user_id = 1829239 and age = 20;, the prefix index can quickly locate the logical data block containing the matching data. The query efficiency is much higher than SELECT * FROM table WHERE age = 20; because the latter cannot effectively utilize the prefix index.

Inverted Indexes: Keyword Locators for Information Retrieval

Since version 2.0.0, Doris has introduced inverted indexes, a powerful tool that plays a crucial role in the field of information retrieval. In the world of Doris, a row in a table is like a document, and a column is a field within the document.

Inverted indexes are like highly efficient “keyword locators,” breaking down text into individual words and constructing an index from words to document numbers (i.e., rows in the table).

For example, for a table containing user comments, after creating an inverted index on the comment column, when we want to query comments containing a specific keyword (such as “OLAP”), the inverted index can quickly locate the rows containing the keyword.

It not only accelerates full - text retrieval for string types, supporting various keyword matching methods like matching multiple keywords simultaneously (MATCH_ALL), matching any one of the keywords (MATCH_ANY), and phrase queries (MATCH_PHRASE), but also accelerates ordinary equality and range queries, replacing the previous BITMAP index function.

In terms of storage, inverted indexes use independent files, physically separated from data files. This allows indexes to be created and deleted without rewriting data files, greatly reducing processing overhead.

Note ⚠️ Floating - point types with precision issues (FLOAT and DOUBLE) and some complex data types (such as MAP, STRUCT, etc.) do not currently support inverted indexes.

(B) Skip - Indexes: Smartly Skipping “Irrelevant Data Blocks”

ZoneMap Indexes: Statistical Detectives for Data Blocks

ZoneMap indexes are like silent “statistical detectives,” automatically maintaining statistical information for each column. For each data file (Segment) and data block (Page), they record the maximum value, minimum value, and whether there are NULL values.

When performing equality queries, range queries, or IS NULL queries, they can quickly determine whether the data file or data block is likely to contain data that meets the conditions based on this statistical information. If it is determined not to contain such data, just like a detective eliminating an irrelevant clue, the file or data block is skipped without being read, reducing I/O operations and accelerating queries.

For example, in a table containing user ages, when querying data within a certain age range, the ZoneMap index can quickly exclude data blocks that clearly do not meet the conditions based on the maximum and minimum ages of the data blocks, improving query efficiency.

BloomFilter Indexes: Probabilistic Fast Sieves

BloomFilter indexes are skip - indexes based on the BloomFilter algorithm, acting like highly efficient “fast sieves.” BloomFilter is a space - efficient probabilistic data structure consisting of an extremely long binary array and a series of hash functions.

In Doris, BloomFilter indexes are constructed on a per - data - block (page) basis. When writing data, each value in the data block is hashed and stored in the corresponding BloomFilter. During queries, based on the value of the equality condition, it is determined whether the BloomFilter contains the value. If not, the corresponding data block is skipped.

For example, in a table containing a large number of user IDs, after creating a BloomFilter index on the user ID column, when querying for a specific user ID, if the BloomFilter determines that the user ID is not in the BloomFilter corresponding to a certain data block, the data block can be skipped without being read, greatly reducing I/O.

Note ⚠️ It is only effective for IN and = equality queries. It does not support Tinyint, Float, or Double type columns, and has limited acceleration effects for low - cardinality fields.

NGram BloomFilter Indexes: Boosters for Text LIKE Queries

NGram BloomFilter indexes are specifically designed for text LIKE queries, serving as “boosters” for text queries. They are similar to BloomFilter indexes, but instead of storing the original text values in the BloomFilter, each word obtained by NGram tokenization of the text is stored.

For LIKE queries, the LIKE pattern is also tokenized using NGram, and it is determined whether each word is in the BloomFilter. If a word is not present, the corresponding data block does not meet the LIKE condition, and the data block can be skipped.

For example, in a table storing product descriptions, after creating an NGram BloomFilter index on the description column, when querying for product descriptions containing a specific phrase (such as “super awesome”), it can quickly filter out data blocks that may contain the phrase, accelerating the query.

However, it only supports string columns and requires the number of consecutive characters in the LIKE pattern to be greater than or equal to N in the NGram defined in the index.

II. Detailed Comparison of Index Characteristics

Different types of indexes have their own advantages and limitations. Let's compare them intuitively through the following table:

Type Index Advantages Limitations
Point Query Indexes Prefix Indexes Built - in index, best performance Only one set of prefix indexes per table
Point Query Indexes Inverted Indexes Supports tokenization and keyword matching, indexes can be created on any column, supports multi - condition combinations, and continuously adds function acceleration Large index storage space, approximately equivalent to the original data
Skip - Indexes ZoneMap Indexes Built - in index, small index storage space Limited supported query types, only supports equality and range queries
Skip - Indexes BloomFilter Indexes More refined than ZoneMap, moderate index space Limited supported query types, only supports equality queries
Skip - Indexes NGram BloomFilter Indexes Supports LIKE acceleration, moderate index space Limited supported query types, only supports LIKE acceleration

III. List of Operators and Functions Accelerated by Indexes

Understanding the support of indexes for different operators and functions helps us better utilize indexes to accelerate queries:

Operators / Functions Prefix Indexes Inverted Indexes ZoneMap Indexes BloomFilter Indexes NGram BloomFilter Indexes
= YES YES YES YES NO
!= YES YES NO NO NO
IN YES YES YES YES NO
NOT IN YES YES NO NO NO
>, >=, <, <=, BETWEEN YES YES YES NO NO
IS NULL YES YES YES NO NO
IS NOT NULL YES YES NO NO NO
LIKE NO NO NO NO YES
MATCH, MATCH_* NO YES NO NO NO
array_contains NO YES NO NO NO
array_overlaps NO YES NO NO NO
is_ip_address_in_range NO YES NO NO NO

IV. Guide to Index Usage

(A) Suggestions for Selecting Prefix Indexes

Select the Most Frequently Filtered Fields

Since there is only one set of prefix indexes per table, it is advisable to use the fields most frequently used in WHERE filtering conditions as the Key.

For example, in a user behavior analysis table, if queries are often made based on user IDs, it is a wise choice to use user ID as the Key column of the prefix index.

The Order of Fields Matters

The more frequently used fields should be placed at the front. Prefix indexes are only effective when the fields in the WHERE condition are in the prefix of the Key.

For instance, if the query condition is often WHERE user_id = 123 AND age = 25, it is better to place user_id before age as the sorted columns when creating the table to make better use of the prefix index for query acceleration.

(B) Suggestions for Selecting Other Indexes

Filtering of Non - Key Fields

For non - Key fields that require filtering acceleration, it is advisable to create inverted indexes first because of their wide applicability and support for multi - condition combinations.

For example, in a table containing user comments and ratings, if queries need to be filtered based on both comment content and rating range, an inverted index can meet the requirements effectively.

String LIKE Matching

If there is a need for string LIKE matching, an NGram BloomFilter index can be added. For example, in a product description search scenario, using an NGram BloomFilter index can effectively accelerate LIKE queries.

Sensitivity to Index Storage Space

When sensitivity to index storage space is high, inverted indexes can be replaced with BloomFilter indexes.

For example, in a table storing a massive amount of low - cardinality user attribute data, BloomFilter indexes can reduce storage space while meeting the acceleration requirements for equality queries.

(C) Performance Optimization and Analysis

If the performance does not meet expectations, analyze the amount of data filtered by indexes and the time consumed through QueryProfile. Refer to the detailed documentation of each index for specific analysis.

For example, evaluate the filtering effect of indexes by checking indicators such as RowsKeyRangeFiltered (the number of rows filtered by prefix indexes) and RowsInvertedIndexFiltered (the number of rows filtered by inverted indexes), and then optimize the index design.

V. Management and Usage of Indexes

(A) Prefix Indexes

Management

Prefix indexes do not require a specific syntax for definition. When creating a table, the first 36 bytes of the table's Key are automatically taken as the prefix index.

Usage

They are used to accelerate equality and range queries in WHERE conditions. They take effect automatically when applicable, with no special syntax required.

For example, in a query like SELECT * FROM table WHERE user_id = 123 AND age > 20;, if user_id and age are sorted columns, the prefix index will automatically play its role.

(B) Inverted Indexes

Management

Definition at Table Creation: In the table creation statement, define the index after the COLUMN definition. For example,

CREATE TABLE table_name (

   column_name1 TYPE1,

   column_name2 TYPE2,

   INDEX idx_name1(column_name1) USING INVERTED [PROPERTIES(...)] [COMMENT 'your comment']

);

Specify the index column name, index type (USING INVERTED), and additional attributes such as tokenizers.

Adding to Existing Tables: Both CREATE INDEX and ALTER TABLE ADD INDEX syntaxes are supported. After adding a new index definition, new data written will generate inverted indexes. For existing data, use BUILD INDEX to trigger index construction. For example,

CREATE INDEX idx_name ON table_name(column_name) USING INVERTED;

BUILD INDEX index_name ON table_name;

Deleting from Existing Tables: Use DROP INDEX idx_name ON table_name; or ALTER TABLE table_name DROP INDEX idx_name; to delete inverted indexes.

Usage

Full - Text Retrieval Keyword Matching: Achieved through MATCH_ANY, MATCH_ALL, etc. For example, SELECT * FROM table_name WHERE column_name MATCH_ANY 'keyword1 ...';

Full - Text Retrieval Phrase Matching: Achieved through MATCH_PHRASE. For example, SELECT * FROM table_name WHERE content MATCH_PHRASE 'keyword1 keyword2'; Note that the support_phrase attribute needs to be set.

Ordinary Equality, Range, IN, NOT IN Queries: Use normal SQL statements. For example, SELECT * FROM table_name WHERE id = 123; Analyze the acceleration effect of inverted indexes through Query Profile indicators such as RowsInvertedIndexFiltered and InvertedIndexFilterTime.

(C) BloomFilter Indexes

Management

Creation at Table Creation: Specify which fields to create BloomFilter indexes on through the table's PROPERTIES "bloom_filter_columns", for example, PROPERTIES ("bloom_filter_columns" = "column_name1,column_name2");

Adding and Deleting from Existing Tables: Modify the bloom_filter_columns property of the table through ALTER TABLE. For example,

ALTER TABLE table_name SET ("bloom_filter_columns" = "column_name1,column_name2,column_name3");

ALTER TABLE table_name SET ("bloom_filter_columns" = "column_name2,column_name3");

The former is to add indexes, and the latter is to delete indexes.

Usage

They are used to accelerate equality queries in WHERE conditions, taking effect automatically with no special syntax required. Analyze the acceleration effect through Query Profile indicators such as RowsBloomFilterFiltered and BlockConditionsFilteredBloomFilterTime.

(D) NGram BloomFilter Indexes

Management

Creation: Define the index after the COLUMN definition in the table creation statement. For example,

INDEX `idx_column_name` (`column_name`) USING NGRAM_BF

PROPERTIES("gram_size"="3", "bf_size"="1024")

COMMENT 'username ngram_bf index'

Specify the index column name, index type (USING NGRAM_BF), and tokenization - related attributes.

Viewing: Use SHOW CREATE TABLE table_name; or SHOW INDEX FROM idx_name; to view indexes.

Deleting: Use ALTER TABLE table_ngrambf DROP INDEX idx_ngrambf; to delete indexes.

Modifying: Use CREATE INDEX or ALTER TABLE ADD INDEX syntax to modify index definitions.

Usage

They are used to accelerate LIKE queries, for example, SELECT count() FROM table1 WHERE message LIKE '%error%'; Analyze the acceleration effect through Query Profile indicators such as RowsBloomFilterFiltered and BlockConditionsFilteredBloomFilterTime.

VI. Conclusion

In conclusion, the Apache Doris index system is rich and powerful, with various indexes having their own strengths. Prefix indexes locate data based on the sorted structure, inverted indexes facilitate full - text retrieval, ZoneMap indexes skip irrelevant data blocks using statistical information, and BloomFilter indexes and NGram BloomFilter indexes accelerate equality and text LIKE queries respectively.

By thoroughly understanding their principles, application scenarios, and usage methods, users can make accurate selections according to their needs, maximizing the performance of Doris in data queries. Whether it's point queries on massive data or complex text retrievals, Doris can handle them with ease.

If you're really stuck, check the QueryProfile to see if the indexes are taking effect. There's nothing worse than implementing indexes that don't work!