A Practical Guide to BERTopic for Transformer-Based Topic Modeling

A deep dive into BERTopic’s six modules to transform financial news into insightful topics.


Topic modeling has a wide range of use cases in the natural language processing (NLP) domain, such as document tagging, survey analysis, and content organization. As an unsupervised learning technique, it is very cost-effective and reduces the resources required to collect human-annotated data. We will dive deeper into BERTopic, a popular Python library for transformer-based topic modeling, to help us process financial news faster and reveal how trending topics change over time.
BERTopic consists of 6 core modules that can be customized to suit different use cases. In this article, we’ll examine and experiment with each module individually, and explore how they work together coherently to produce the end results.

BERTopic: Transformer-Based Topic Modeling (unless otherwise noted, all images are by the author)

At a high level, a typical BERTopic architecture is composed of:

  • Embeddings: transform text into vector representations (i.e. embeddings) that capture semantic meaning using sentence-transformer models.
  • Dimensionality Reduction: reduce the high-dimensional embeddings to a lower-dimensional space while preserving important relationships, using techniques such as PCA, UMAP …
  • Clustering: group similar documents together based on their reduced embeddings to form distinct topics, using algorithms such as HDBSCAN, K-Means …
  • Vectorizers: after topic clusters are formed, vectorizers convert text into numerical features that can be used for topic analysis, including count vectorizer, online vectorizer …
  • c-TF-IDF: calculate importance scores for words within and across topic clusters to identify key terms.
  • Representation Model: leverage semantic similarity between the embedding of candidate keywords and the embedding of documents to find the most representative topic keywords, including KeyBERT, LLM-based techniques …

Project Overview

In this practical application, we will use topic modeling to identify trending topics in Apple financial news. Using NewsAPI, we collect daily top-ranked Apple stock news from Google Search and compile it into a dataset of 250 documents, each containing the financial news for one specific day. However, data collection is not the main focus of this article, so feel free to replace it with your own dataset. The objective is to demonstrate how to transform raw text documents containing top Google search results into meaningful topic keywords and refine those keywords to be more representative.
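
For reference, below is a rough sketch of how a comparable dataset could be pulled from NewsAPI’s everything endpoint and aggregated into one document per day. The query, fields, and aggregation logic are illustrative assumptions, and you will need your own API key.

import requests
from collections import defaultdict

# Hypothetical sketch: fetch Apple-related news from NewsAPI and group articles by day
API_KEY = "YOUR_NEWSAPI_KEY"  # assumption: replace with your own key
response = requests.get(
    "https://newsapi.org/v2/everything",
    params={
        "q": "Apple stock",   # assumed query
        "language": "en",
        "sortBy": "relevancy",
        "apiKey": API_KEY,
    },
)
articles = response.json().get("articles", [])

# Combine the headline and description of articles published on the same day into one document
daily_news = defaultdict(list)
for article in articles:
    day = article["publishedAt"][:10]  # "YYYY-MM-DD"
    daily_news[day].append(f'{article["title"]} {article.get("description") or ""}')

docs = [" ".join(texts) for _, texts in sorted(daily_news.items())]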


BERTopic’s 6 Fundamental Modules

1. Embeddings


BERTopic uses sentence transformer models as its first building block, converting sentences into dense vector representations (i.e. embeddings) that capture semantic meaning. These models are based on transformer architectures like BERT and are specifically trained to produce high-quality sentence embeddings. The semantic similarity between two sentences can then be computed as the cosine similarity between their embeddings. Common models include:

  • all-MiniLM-L6-v2: lightweight, fast, good general performance
  • BAAI/bge-base-en-v1.5: larger model with stronger semantic understanding, at the cost of slower training and inference.

There is a massive range of pre-trained sentence transformers to choose from on the “Sentence Transformers” website and the Hugging Face model hub. We can use a few lines of code to load a sentence transformer model and encode text sequences into high-dimensional numerical embeddings.

from sentence_transformers import SentenceTransformer

# Initialize model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Convert sentences to embeddings
sentences = ["First sentence", "Second sentence"]
embeddings = model.encode(sentences)  # Returns numpy array of embeddings
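
To make the semantic similarity mentioned above concrete, we can compare two of these embeddings with cosine similarity using the util helper shipped with sentence-transformers. This is a small illustrative sketch building on the block above:

from sentence_transformers import util

# Cosine similarity close to 1 indicates the two sentences are semantically similar
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(similarity.item())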

In this instance, we feed a collection of financial news data from October 2024 to March 2025 into the sentence transformer “bge-base-en-v1.5”. As shown in the result below, these text documents are transformed into vector embeddings with 250 rows, each with 384 dimensions.

embeddings result

We can then feed this sentence transformer into the BERTopic pipeline and keep all other modules at their default settings.

from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

emb_minilm = SentenceTransformer("all-MiniLM-L6-v2")
topic_model = BERTopic(
    embedding_model=emb_minilm,
)

topic_model.fit_transform(docs)
topic_model.get_topic_info()

As the end result, we get the following topic representation.

topic result

Switching to the larger and more powerful “bge-base-en-v1.5” model, we get the following result, which is slightly more meaningful than that of the smaller “all-MiniLM-L6-v2” model but still leaves large room for improvement.

One area for improvement is reducing the dimensionality, because sentence transformers typically produce high-dimensional embeddings. As BERTopic relies on the spatial proximity between embeddings to form meaningful clusters, it is crucial to apply a dimensionality reduction technique to make the embeddings less sparse. Therefore, we are going to introduce various dimensionality reduction techniques in the next section.

2. Dimensionality Reduction


After converting the financial news documents into embeddings, we face the problem of high dimensionality. Since each embedding contains 384 dimensions, the vector space becomes too sparse to create meaningful distance measurements between two vector embeddings. Principal Component Analysis (PCA) and Uniform Manifold Approximation and Projection (UMAP) are common techniques to reduce dimensionality while preserving important structure in the data. We will look at UMAP, BERTopic’s default dimensionality reduction technique, in more detail. It is a non-linear algorithm adapted from topological data analysis that seeks to preserve structure within the data. It works by extending a radius outwards from each data point and connecting points with their close neighbors. You can dive deeper into UMAP visualizations on the website “Understanding UMAP“.

UMAP n_neighbors Experimentation

An important UMAP parameter is n_neighbors, which controls how UMAP balances local and global structure in the data. Low values of n_neighbors force UMAP to concentrate on local structure, while large values look at larger neighborhoods around each point.
The diagram below shows multiple scatterplots demonstrating the effect of different n_neighbors values, with each plot visualizing the embeddings in a 2-dimensional space after applying UMAP dimensionality reduction.

With smaller n_neighbors values (e.g. n=2, n=5), the plots show more tightly coupled micro clusters, indicating a focus on local structure. As n_neighbors increases (towards n=100, n=150), the points form more cohesive global patterns, demonstrating how larger neighborhood sizes help UMAP capture broader relationships in the data.

UMAP experimentation
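
The sweep behind these plots can be reproduced with a few lines of code. Here is a minimal sketch, assuming embeddings holds the sentence-transformer output from step 1:

import umap
import matplotlib.pyplot as plt

# Sketch: project the embeddings to 2D for several n_neighbors values and compare the shapes
n_neighbors_values = [2, 5, 10, 50, 100, 150]
fig, axes = plt.subplots(1, len(n_neighbors_values), figsize=(24, 4))
for ax, n in zip(axes, n_neighbors_values):
    reduced = umap.UMAP(n_neighbors=n, n_components=2, random_state=0).fit_transform(embeddings)
    ax.scatter(reduced[:, 0], reduced[:, 1], s=10)
    ax.set_title(f"n_neighbors={n}")
plt.show()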


UMAP min_dist Experimentation

The min_dist parameter in UMAP controls how tightly points are allowed to be packed together in the lower-dimensional representation. It sets the minimum distance between points in the embedding space. A smaller min_dist allows points to be packed very closely together, whereas a larger min_dist forces points to be more scattered and evenly spread out. The diagram below shows an experiment with min_dist values from 0.0001 to 1, with n_neighbors fixed at 5. When min_dist is set to smaller values, UMAP emphasizes preserving local structure, whereas larger values spread the embeddings into a more circular shape.

UMAP experimentation

We decide to set n_neighbors=5 and min_dist=0.01 based on the hyperparameter tuning results, as it forms more distinct data clusters that are easier for the subsequent clustering model to process.

import umap

UMAP_N = 5
UMAP_DIST = 0.01
umap_model = umap.UMAP(
    n_neighbors=UMAP_N,
    min_dist=UMAP_DIST, 
    random_state=0
)

3. Clustering


Following the dimensionality reduction module comes the process of grouping embeddings with close proximity into clusters. This process is fundamental to topic modeling, as it categorizes related text documents together by looking at their semantic relationships. BERTopic employs the HDBSCAN model by default, which has the advantage of capturing structures with diverse densities. Additionally, BERTopic provides the flexibility to choose other clustering models based on the nature of the dataset, such as K-Means (for spherical, equally-sized clusters) or agglomerative clustering (for hierarchical clusters).

HDBSCAN Experimentation

We will explore how two important parameters, min_cluster_size and min_samples, impact the behavior of the HDBSCAN model.
min_cluster_size determines the minimum number of data points required to form a cluster; groups not meeting the threshold are treated as outliers. When min_cluster_size is set too low, you might get many small, unstable clusters that are mostly noise. When it is set too high, you might merge multiple clusters into one, losing their distinct characteristics.

min_samples determines how strict the cluster formation process is: the core distance of each point is measured as the distance to its min_samples-th nearest neighbor. The larger the min_samples value, the more conservative the clustering becomes, as clusters will be restricted to dense areas and sparse points will be classified as noise.

The condensed tree is a useful technique to help us decide appropriate values for these two parameters. Clusters that persist over a large range of lambda values (shown on the left vertical axis of a condensed tree plot) are considered stable and more meaningful. We prefer selected clusters that are both tall (more stable) and wide (larger cluster size). We use condensed_tree_ from HDBSCAN to compare min_cluster_size values from 3 to 50, then visualize the data points in their vector space, color coded by the predicted cluster labels. As we progress through different min_cluster_size values, we can identify the ones that group close data points together.
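
A minimal sketch of this procedure for a single candidate value, assuming umap_model and embeddings come from the previous sections, might look like this:

import hdbscan
import matplotlib.pyplot as plt

# Reduce the embeddings with the UMAP model from the previous section
reduced_embeddings = umap_model.fit_transform(embeddings)

# Fit HDBSCAN for one candidate min_cluster_size
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=15,
    metric='euclidean',
    cluster_selection_method='eom'
)
labels = clusterer.fit_predict(reduced_embeddings)

# Condensed tree: stable clusters persist over a wide range of lambda values
clusterer.condensed_tree_.plot(select_clusters=True)
plt.show()

# Scatterplot of the reduced embeddings, color coded by predicted cluster label (-1 = outlier)
plt.scatter(reduced_embeddings[:, 0], reduced_embeddings[:, 1], c=labels, cmap='tab10', s=10)
plt.show()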

In this experimentation, we selected min_cluster_size=15 as it generates 4 clusters (highlighted in red in the condensed tree plot below) with good stability and cluster size. Additionally, the scatterplot indicates reasonable cluster formation based on proximity and density.

Condensed Trees for HDBSCAN min_cluster_size Experimentation
Scatterplots for HDBSCAN min_cluster_size Experimentation

We then carry out a similar exercise to compare min_samples from 1 to 80 and selected min_samples=5. As you can observe from the visuals, the parameters min_samples and min_cluster_size exert distinct impacts on the clustering process.

Condensed Trees for HDBSCAN min_samples Experimentation
Scatterplots for HDBSCAN min_samples Experimentation

import hdbscan

MIN_CLUSTER_SIZE = 15
MIN_SAMPLES = 5
clustering_model = hdbscan.HDBSCAN(
    min_cluster_size=MIN_CLUSTER_SIZE,
    metric='euclidean',
    cluster_selection_method='eom',
    min_samples=MIN_SAMPLES
)

# Use the bge embedding model introduced in section 1
emb_bge = SentenceTransformer("BAAI/bge-base-en-v1.5")

topic_model = BERTopic(
    embedding_model=emb_bge,
    umap_model=umap_model,
    hdbscan_model=clustering_model,
)

topic_model.fit_transform(docs)
topic_model.get_topic_info()

K-Means Experimentation

Compared to HDBSCAN, K-Means clustering allows us to generate more granular topics by specifying the n_clusters parameter, thereby controlling the number of topics generated from the text documents.

This image shows a series of scatterplots demonstrating different clustering results when varying the number of clusters (n_clusters) from 3 to 50 using K-Means. With n_clusters=3, the data is divided into just three large groups. As n_clusters increases (5, 8, 10, etc.), the data points are split into more granular groupings. Overall, K-Means forms rounded clusters compared to HDBSCAN. We selected n_clusters=8, where the clusters are neither too broad (losing important distinctions) nor too granular (creating artificial divisions). Additionally, this is a reasonable number of topics for categorizing 250 days of financial news. However, feel free to adjust the code snippet to your requirements if you need to identify more granular or broader topics.
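
A rough sketch of how such a sweep could be generated, again assuming reduced_embeddings holds the 2-dimensional UMAP output from the previous section:

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Sketch: vary the number of K-Means clusters and compare the resulting groupings
cluster_counts = [3, 5, 8, 10, 20, 50]
fig, axes = plt.subplots(1, len(cluster_counts), figsize=(24, 4))
for ax, k in zip(axes, cluster_counts):
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(reduced_embeddings)
    ax.scatter(reduced_embeddings[:, 0], reduced_embeddings[:, 1], c=labels, cmap='tab10', s=10)
    ax.set_title(f"n_clusters={k}")
plt.show()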

Scatterplots for K-Means n_clusters Experimentation

from sklearn.cluster import KMeans

N_CLUSTER = 8
clustering_model = KMeans(
    n_clusters=N_CLUSTER,
    random_state=0
)

topic_model = BERTopic(
    embedding_model=emb_bge,
    umap_model=umap_model,
    hdbscan_model=clustering_model, 
)

topic_model.fit_transform(docs)
topic_model.get_topic_info()

Comparing the topic cluster results of K-Means and HDBSCAN reveals that K-Means produces more distinct and meaningful topic representations. However, both methods still generate many stop words, indicating that subsequent modules are critical to refine the topic representations.

HDBSCAN Output
K-Means Output

4. Vectorizer


The previous modules serve the role of grouping documents into semantically similar clusters; starting from this module, the main focus is to fine-tune the topics by choosing more representative and meaningful keywords. BERTopic offers various Vectorizer options, from the basic CountVectorizer to the more advanced OnlineCountVectorizer, which incrementally updates topic representations. For this exercise, we will experiment with CountVectorizer, a text processing tool that creates a matrix of token counts out of a collection of documents. Each row in the matrix represents a document and each column represents a term from the vocabulary, with the values showing how many times each term appears in each document. This matrix representation enables machine learning algorithms to process the text data mathematically.
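
As a quick illustration of the document-term matrix CountVectorizer produces, here is a tiny sketch on a few made-up sentences:

from sklearn.feature_extraction.text import CountVectorizer

demo_docs = ["apple stock rises", "apple earnings beat estimates", "stock market rises"]
vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(demo_docs)

print(vectorizer.get_feature_names_out())  # vocabulary terms (columns)
print(matrix.toarray())                    # one row of token counts per document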

Vectorizer Experimentation

We will go through a few important parameters of the CountVectorizer and see how they might affect the topic representations.

  • ngram_range specifies how many words to combine together into topic phrases. It is particularly useful for documents consisting of short phrases, which is not needed in this situation.
    Example output if we set ngram_range=(1, 3):
0                -1_apple nasdaq aapl_apple stock_apple nasdaq_nasdaq aapl   
1  0_apple warren buffett_apple stock_berkshire hathaway_apple nasdaq aapl   
2           1_apple nasdaq aapl_nasdaq aapl apple_apple stock_apple nasdaq   
3              2_apple aapl stock_apple nasdaq aapl_apple stock_aapl stock   
4           3_apple nasdaq aapl_cramer apple aapl_apple nasdaq_apple stock 
  • stop_words determines whether stop words are removed from the topics, which significantly improves topic representations.
  • min_df and max_df determine the frequency thresholds for terms to be included in the vocabulary. min_df sets the minimum number of documents a term must appear in, while max_df sets the maximum document frequency above which terms are considered too common and discarded.
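
These parameters can also be tried out without re-running the embedding, dimensionality reduction, and clustering steps, since BERTopic can recompute topic keywords in place. A minimal sketch, assuming topic_model and docs from the previous steps:

from sklearn.feature_extraction.text import CountVectorizer

# Sketch: recompute topic keywords with a different vectorizer, keeping the existing clusters
new_vectorizer = CountVectorizer(ngram_range=(1, 3), stop_words="english", max_df=0.8)
topic_model.update_topics(docs, vectorizer_model=new_vectorizer)
topic_model.get_topic_info()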

We explore the effect of adding CountVectorizer with max_df=0.8 (i.e. ignore words appearing in more than 80% of the documents) to both HDBSCAN and K-Means models from the previous step.

from sklearn.feature_extraction.text import CountVectorizer

vectorizer_model = CountVectorizer(
    max_df=0.8,
    stop_words="english"
)

topic_model = BERTopic(
    embedding_model=emb_bge,
    umap_model=umap_model,
    hdbscan_model=clustering_model, 
    vectorizer_model=vectorizer_model
)

Both show improvements after introducing the CountVectorizer, significantly reducing keywords that appear frequently across all documents without adding extra value, such as “aapl”, “stock”, and “apple”.

HDBSCAN Output with Vectorizer
K-Means Output with Vectorizer

5. c-TF-IDF


While the Vectorizer module focuses on adjusting the topic representation at the document level, c-TF-IDF mainly looks at the cluster level, down-weighting words that are frequently encountered across clusters. This is achieved by treating all documents belonging to one cluster as a single document and calculating keyword importance based on the traditional TF-IDF approach.
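
As a rough reference, following the formulation in the BERTopic documentation, the importance of a word x in a topic class c can be written as W(x, c) = tf(x, c) · log(1 + A / f(x)), where tf(x, c) is the frequency of x within the class, f(x) is the frequency of x across all classes, and A is the average number of words per class.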

c-TF-IDF Experimentation

  • reduce_frequent_words: determines whether to down-weight frequently occurring words across topics
  • bm25_weighting: when set to True, uses BM25 weighting instead of standard TF-IDF, which can help better handle document length variations. In smaller datasets, this variant can be more robust to stop words.

We use the following code snippet to add c-TF-IDF (with bm25_weighting=True) into our BERTopic pipeline.

from bertopic.vectorizers import ClassTfidfTransformer

ctfidf_model = ClassTfidfTransformer(bm25_weighting=True)
topic_model = BERTopic(
    embedding_model=emb_bge,
    umap_model=umap_model,
    hdbscan_model=clustering_model, 
    vectorizer_model=vectorizer_model,
    ctfidf_model=ctfidf_model
)

The topic cluster outputs below show that adding c-TF-IDF has no major impact on the end results when CountVectorizer has already been added. This is potentially because our CountVectorizer has already set a high bar by eliminating words appearing in more than 80% of documents. Subsequently, this already reduces overlapping vocabularies at the topic cluster level, which is what c-TF-IDF is intended to achieve.

HDBSCAN Output with Vectorizer and c-TF-IDF
K-Means Output with Vectorizer and c-TF-IDF

However, if we replace CountVectorizer with c-TF-IDF, although the result below shows slight improvements compared to when neither is added, there are too many stop words present, making the topic representations less valuable. Therefore, it appears that for the documents we are dealing with in this scenario, the c-TF-IDF module does not bring extra value.

HDBSCAN Output with c-TF-IDF only
K-Means Output with c-TF-IDF only

6. Representation Model

The last module is the representation model, which we observed to have a significant impact on tuning the topic representations. Instead of using a frequency-based approach like the Vectorizer and c-TF-IDF, it leverages semantic similarity between the embeddings of candidate keywords and the embeddings of documents to find the most representative topic keywords. This can result in more semantically coherent topic representations and reduce the number of synonymous keywords. BERTopic also offers various customization options for representation models, including but not limited to the following:

  • KeyBERTInspired: employ the KeyBERT technique to extract topic words based on semantic similarity.
  • ZeroShotClassification: make the most of open-source transformers in the Hugging Face model hub to assign labels to topics.
  • MaximalMarginalRelevance: decrease synonyms in topics (e.g. stock and stocks).
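
For instance, if near-duplicate keywords (e.g. stock and stocks) dominate a topic, MaximalMarginalRelevance can be swapped in as the representation model. A minimal sketch, where the diversity value is just an illustrative choice:

from bertopic.representation import MaximalMarginalRelevance

# Higher diversity pushes the selected keywords to be less similar to each other
representation_model = MaximalMarginalRelevance(diversity=0.3)

topic_model = BERTopic(
    embedding_model=emb_bge,
    umap_model=umap_model,
    hdbscan_model=clustering_model,
    vectorizer_model=vectorizer_model,
    representation_model=representation_model
)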

KeyBERTInspired Experimentation

We found that KeyBERTInspired is a very cost-effective approach, as it significantly improves the end result with just a few extra lines of code and without the need for extensive hyperparameter tuning.

from bertopic.representation import KeyBERTInspired

representation_model = KeyBERTInspired()

topic_model = BERTopic(
    embedding_model=emb_bge,
    umap_model=umap_model,
    hdbscan_model=clustering_model, 
    vectorizer_model=vectorizer_model,
    representation_model=representation_model
)

After incorporating the KeyBERTInspired representation model, we observe that both clustering approaches generate noticeably more coherent and valuable themes.

HDBSCAN Output with KeyBERTInspired
HDBSCAN Output with KeyBERTInspired
K-Means Output with KeyBERTInspired
K-Means Output with KeyBERTInspired

Take-Home Message

This article explores the BERTopic technique and its implementation for topic modeling, detailing its six key modules with practical examples using Apple stock market news data to demonstrate each component’s impact on the quality of topic representations.

  • Embeddings: use transformer-based embedding models to convert documents into numerical representations that capture semantic meaning and contextual relationships in text.
  • Dimensionality Reduction: employ UMAP or other dimensionality reduction techniques to reduce high-dimensional embeddings while preserving both local and global structure of the data.
  • Clustering: compare HDBSCAN (density-based) and K-Means (centroid-based) clustering algorithms to group similar documents into coherent topics.
  • Vectorizers: use CountVectorizer to create document-term matrices and refine topics based on a statistical approach.
  • c-TF-IDF: update topic representations by analyzing term frequency at the cluster level (topic class) and reduce common words across different topics.
  • Representation Model: refine topic keywords using semantic similarity, offering options like KeyBERTInspired and MaximalMarginalRelevance for better topic descriptions.
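
Putting it all together, here is a sketch of the full pipeline with all six modules configured, using the parameter values selected throughout this article:

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.representation import KeyBERTInspired
import umap

# 1. Embeddings
emb_bge = SentenceTransformer("BAAI/bge-base-en-v1.5")

# 2. Dimensionality reduction
umap_model = umap.UMAP(n_neighbors=5, min_dist=0.01, random_state=0)

# 3. Clustering (K-Means with 8 clusters; swap in HDBSCAN if preferred)
clustering_model = KMeans(n_clusters=8, random_state=0)

# 4. Vectorizer
vectorizer_model = CountVectorizer(max_df=0.8, stop_words="english")

# 5. c-TF-IDF
ctfidf_model = ClassTfidfTransformer(bm25_weighting=True)

# 6. Representation model
representation_model = KeyBERTInspired()

topic_model = BERTopic(
    embedding_model=emb_bge,
    umap_model=umap_model,
    hdbscan_model=clustering_model,
    vectorizer_model=vectorizer_model,
    ctfidf_model=ctfidf_model,
    representation_model=representation_model,
)

topics, probs = topic_model.fit_transform(docs)
topic_model.get_topic_info()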
