Beyond LIKE: Mastering PostgreSQL's Built-in Full-Text Search

A few months ago, my manager assigned me to a project that required improving its search functionality. The team initially considered Elasticsearch as the primary solution, though they were open to exploring alternatives. I researched the topic and discovered PostgreSQL's full-text search. Since our data was already stored in PostgreSQL, I compared the two tools. The application I worked on served private users and had strict privacy regulations, with complex conditions determining who could view or edit specific data within the company.

These intricate queries proved challenging for Elasticsearch to handle effectively, while PostgreSQL's full-text search, being built on top of SQL syntax, required no extra work to handle this. It also integrates seamlessly with our existing database, requires no additional infrastructure, and has no licensing costs, unlike Elasticsearch. Additionally, PostgreSQL efficiently manages relational data, while Elasticsearch can be less flexible outside its ecosystem and demands more system resources.

But what the heck is full-text search? It’s a way to dig through text data fast, finding matches based on words or phrases, not just exact strings.
Think of it like googling something, but for your own database.
Instead of saying "find me this exact sentence," you can ask "show me anything with ‘book’ and ‘read’ in it," and it'll pull up stuff like "I read a book" or "books for reading." It ignores stop words like "the", "and", "I", etc., to pull more relevant results. In PostgreSQL, it uses tricks like turning text into searchable chunks and matching them up, so you’re not stuck with slow, basic searches that miss the point. It’s all about making search feel natural, not robotic.

PostgreSQL’s full-text search is powered by two core concepts: tsvector and tsquery. A tsvector is a data type that stores pre-processed, searchable data. Think of it as a transformed, indexed version of your text. When you run a search, PostgreSQL compares your search terms with these indexed values rather than the raw text. This allows for faster and more efficient lookups. A tsquery, on the other hand, is essentially a search query. It represents the words and phrases you're looking for, possibly enriched with operators to define how terms should be combined or modified. Examples include using & for logical AND, | for OR, and ! for NOT, giving you the power to craft complex search conditions.
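
You can see what these two types actually look like by calling the conversion functions directly; the output is shown as comments:

SELECT to_tsvector('english', 'I read a book');
-- 'book':4 'read':2   (stop words dropped, words stemmed, positions kept)

SELECT to_tsquery('english', 'book & read');
-- 'book' & 'read'
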
Here’s a quick example to illustrate how these work together in practice:
Say you have a table documents with a column named content, and you want to know which documents contain both "book" and "read" in the content.

SELECT * FROM documents 
WHERE to_tsvector('english', content) @@ to_tsquery('english', 'book & read');

Okay, so there is a lot of stuff going on after the WHERE keyword. Let's break it down:

  • to_tsvector('english', content): This function converts the content column into a tsvector. The english argument specifies the language of the text, which is important for language-specific stemming and stopword lists.
  • to_tsquery('english', 'book & read'): This function converts the search query into a tsquery. Again, the english argument specifies the language of the text.
  • Finally, the @@ operator checks if the tsvector matches the tsquery and returns TRUE or FALSE (you can see this in the quick check below).
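
You can test the match on a literal string to see the boolean for yourself:

SELECT to_tsvector('english', 'I read a book') @@ to_tsquery('english', 'book & read');
-- returns true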

But what if we want to search for documents containing either "book" or "read" in the content? Then we can use the | operator like this:

SELECT * FROM documents 
WHERE to_tsvector('english', content) @@ to_tsquery('english', 'book | read');

If you want to search for documents containing "book" and "read" but not "fantasy", you can use the ! operator like this:

... @@ to_tsquery('english', 'book & read & !fantasy');

You can craft more complex queries by grouping terms with parentheses. For example, to find documents which must contain "book" and either "read" or "fantasy", you can use:

... @@ to_tsquery('english', 'book & (read | fantasy)');

All these operators are great for building "Advanced Search" features (like Gmail's or Wikipedia's) in your application, but most of the time users just want a simple search.
For that, you can use the plainto_tsquery function. It's similar to to_tsquery but doesn't interpret special characters as operators.

SELECT * FROM documents 
WHERE to_tsvector('english', content) @@ plainto_tsquery('english', 'book read');

In plainto_tsquery, each space is treated as an & operator. So, the query above will return documents containing both "book" and "read".
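
You can run plainto_tsquery on its own to see the tsquery it produces:

SELECT plainto_tsquery('english', 'book read');
-- 'book' & 'read'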

Searching Across Multiple Columns

Often, you'll want to search across multiple columns in a table. For example, you might want to search both the title and content of documents. PostgreSQL makes this easy through vector concatenation.

When searching across multiple columns, you can concatenate tsvectors using the || operator. This combines the lexemes from different columns into a single tsvector that can be searched with a tsquery.

Here's how you might search across both the title and content columns of a documents table:

SELECT * FROM documents 
WHERE to_tsvector('english', title) || to_tsvector('english', content) 
  @@ to_tsquery('english', 'test');

In this example, test is the term you're searching for. The || operator concatenates the tsvectors built from the title and content columns into a single tsvector, which is then matched against the query with the @@ operator.
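
One caveat: if either column can be NULL, the whole concatenated tsvector becomes NULL and those rows silently drop out of the results. A common workaround is to wrap each column in coalesce:

SELECT * FROM documents 
WHERE to_tsvector('english', coalesce(title, '')) || to_tsvector('english', coalesce(content, '')) 
  @@ to_tsquery('english', 'test');
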
You might be thinking, why not just do the string concatenation in the query like this:

 SELECT * FROM documents 
 WHERE to_tsvector('english', title || ' ' || content)
   @@ to_tsquery('english', 'test');

This approach is not recommended because the parser treats the title and content as one continuous string, which can lead to unexpected results.
Moreover, column-level weighting (which we will discuss later) can't be applied.

Speeding Up The Searches

While these examples work, calling to_tsvector() on every query is inefficient.
Each time you run the query, PostgreSQL has to process and transform the entire text content into a tsvector.
For large tables or frequent searches, this can significantly impact performance.
A better approach is to store the tsvector in a separate column and keep it updated
using database triggers or your backend code.

ALTER TABLE documents ADD COLUMN "vector" tsvector;
UPDATE documents 
SET "vector" = to_tsvector('english', title) || to_tsvector('english', content);

To further speed up the searches, you can create a GIN index on the vector column.
A GIN index is a specialized index type that efficiently handles queries on composite data types, such as arrays, jsonb and tsvector.

CREATE INDEX idx_vector ON documents USING GIN (vector);
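
To confirm the index is actually being used, check the query plan; on a table with enough rows you should see a bitmap index scan on idx_vector instead of a sequential scan:

EXPLAIN ANALYZE
SELECT * FROM documents 
WHERE vector @@ to_tsquery('english', 'book & read');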

If you want to search on columns spread across multiple tables, you can use a materialized view to store the concatenated tsvector.

CREATE MATERIALIZED VIEW documents_search AS
SELECT 
    documents.id,
    documents.title,
    to_tsvector('english', documents.title) || to_tsvector('english', authors.name) AS vector
FROM documents
LEFT JOIN authors ON documents.author_id = authors.id;

You can also create a GIN index on the vector column of the materialized view.
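
For example (keep in mind that a materialized view is a snapshot, so it has to be refreshed when the underlying tables change):

CREATE INDEX idx_documents_search_vector ON documents_search USING GIN (vector);

-- Re-run this (e.g. from a scheduled job) whenever documents or authors change:
REFRESH MATERIALIZED VIEW documents_search;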

Setting Weights

PostgreSQL's Full-Text Search allows you to assign different weights to different parts of your document,
influencing how search results are ranked. This is achieved using the setweight function.

The setweight function takes two arguments: a tsvector and a weight label ('A', 'B', 'C', or 'D').
'A' is the highest weight, and 'D' is the lowest. By default, lexemes are assigned weight 'D'.

Here's how you can use setweight to prioritize matches in the title over matches in the content:

UPDATE documents
SET "vector" = setweight(to_tsvector('english', title), 'A') || setweight(to_tsvector('english', content), 'B');

Setting weights is only useful if you rank the results. There are two ranking functions in PostgreSQL:

  • ts_rank: Basic ranking that considers the number of matching lexemes and their weights
  • ts_rank_cd: Also considers the distance between matching lexemes (coverage density)

Here's how to use them:

SELECT * FROM documents
WHERE vector @@ to_tsquery('english', 'test')
ORDER BY ts_rank_cd(vector, to_tsquery('english', 'test')) DESC;

This query will return documents that match the query "test" and rank them by how closely they match the query.
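
A small variation exposes the score as a column and avoids writing the tsquery twice (the id and title columns here are assumed for illustration):

SELECT id, title, ts_rank(vector, query) AS rank
FROM documents, to_tsquery('english', 'test') AS query
WHERE vector @@ query
ORDER BY rank DESC
LIMIT 10;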

Earlier, we talked about why you shouldn't concatenate columns as plain strings before creating the tsvector: if you do that, you can't assign weights to individual columns.

Highlighting Matches

PostgreSQL provides the ts_headline function to highlight matching terms in the search results.
This function takes the original text and a tsquery as input, and returns the text with matching terms wrapped in tags (<b> and </b> by default).

Here's the basic syntax:

WITH cte AS (
  SELECT to_tsquery('english', 'test') AS query
)
SELECT
  ts_headline(
    'english',
    title,
    query
  ) AS title_match,
  ts_headline(
    'english',
    content,
    query
  ) AS content_match
FROM documents, cte
WHERE vector @@ query;

The result will look like this:
| title_match | content_match |
|-------------|---------------|
| A <b>test</b> document | This is a <b>test</b> document that contains the word <b>test</b> multiple times to <b>test</b> the highlighting feature |
| A title | The second document also contains a <b>test</b> word |

If you want to replace the default tags with something else, you can define them in the fourth argument of the ts_headline function.

ts_headline(
  'english',
  content,
  query,
  'StartSel={, StopSel=}'
)

Here we replaced <b> with { and </b> with }.
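
The options string also accepts settings such as MaxWords and MinWords to control snippet length. For example, to wrap matches in HTML <mark> tags and keep the snippets short (the numbers here are arbitrary):

ts_headline(
  'english',
  content,
  query,
  'StartSel=<mark>, StopSel=</mark>, MaxWords=20, MinWords=5'
)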

Conclusion

PostgreSQL's full-text search feature is simple, flexible and powerful. It can fulfill most of your search needs.
That said, every tool has limitations, and PostgreSQL's FTS is no exception. There are scenarios where it might not be the optimal choice, and dedicated search engines like Elasticsearch, OpenSearch, or Solr could be more suitable:

  • Extremely Large Datasets and High Query Throughput: For applications dealing with terabytes or petabytes of data, or requiring tens of thousands of queries per second, distributed search engines are designed for horizontal scalability and can offer better performance. PostgreSQL FTS operates within the context of the database server, which might become a bottleneck at such scales for search-heavy workloads.
  • Complex Relevancy Tuning and Machine Learning: If your application demands highly sophisticated ranking algorithms, personalized search results, "learning to rank" (LTR) models, or advanced features like "more like this" that go beyond PostgreSQL's ts_rank and ts_rank_cd capabilities, dedicated search platforms provide more extensive tools and flexibility.
  • Requirement for a Decoupled Search Service: Architecturally, you might prefer to decouple your search infrastructure from your primary database. This allows for independent scaling, resource allocation, different maintenance windows, and potentially different technology choices for the search component. Using a dedicated search service facilitates this separation.

In these situations, the additional complexity and operational overhead of managing a separate search system can be justified by the advanced capabilities and scalability they provide. However, for a vast majority of applications, PostgreSQL's built-in FTS offers an excellent balance of power, simplicity, and integration directly within your database.