Part 2: Syncing Normalized PostgreSQL Data to Denormalized ClickHouse Using Airbyte + DBT

From Transactional Trenches to Analytical Ascent: PostgreSQL to ClickHouse with Airbyte and DBT

In Part 1, we delved into the fundamental reasons why shoehorning your PostgreSQL data model directly into ClickHouse is a recipe for analytical sluggishness. We highlighted the contrasting strengths of row-oriented OLTP databases like PostgreSQL and column-oriented OLAP powerhouses like ClickHouse.

Now, let's roll up our sleeves and translate that theory into a tangible, real-world solution. In this article, we'll embark on a journey to build a robust data pipeline that seamlessly syncs your normalized Online Transaction Processing (OLTP) data residing in PostgreSQL into a highly performant, denormalized schema optimized for Online Analytical Processing (OLAP) within ClickHouse.

Our trusty companions on this expedition will be Airbyte for Change Data Capture (CDC) based ingestion and dbt (data build tool) for elegant transformations and nimble schema evolution.

Our Architectural Blueprint

Here's a visual representation of the data flow we'll be constructing:

[Architecture diagram: PostgreSQL (normalized OLTP) → Airbyte CDC sync → ClickHouse raw tables → dbt transformations → denormalized ClickHouse tables]

Step 1: Laying the Foundation - Defining Your Source Schema in PostgreSQL

Let's consider a common scenario: a basic e-commerce application. Our transactional data in PostgreSQL is structured in a normalized fashion, ensuring data integrity and minimizing redundancy for efficient writes.

-- users
CREATE TABLE users (
  id UUID PRIMARY KEY,
  name TEXT,
  email TEXT
);

-- orders
CREATE TABLE orders (
  id UUID PRIMARY KEY,
  user_id UUID REFERENCES users(id),
  total_amount NUMERIC,
  created_at TIMESTAMPTZ
);

This normalized structure, with separate users and orders tables linked by foreign keys, is ideal for handling transactional operations.

Step 2: Setting Sail with Airbyte - Ingesting Data from Postgres

Airbyte steps in as our reliable vessel for data ingestion. Its robust support for CDC (Change Data Capture) via the PostgreSQL Write-Ahead Log (WAL) allows us to stream changes in near real-time into ClickHouse. This approach ensures low latency and captures every modification made to our source data.

To get this working, you'll need to configure Airbyte with the following:

  • Source: Connect to your PostgreSQL instance. Ensure you've enabled logical replication and created a replication slot, as these are prerequisites for CDC (a sketch of these prerequisites follows this list).
  • Destination: Configure your ClickHouse instance as the destination. Leverage the HTTP destination with compression for efficient data transfer.
  • Sync Mode: Choose a sync mode that supports incremental updates with change tracking. "Incremental + Append" or a dedicated "CDC" mode (if available for the Postgres connector) are suitable options.
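
As mentioned in the Source bullet, Postgres needs logical decoding enabled before Airbyte can do CDC. A minimal sketch of those prerequisites is shown below; the user, publication, and slot names are illustrative, and Airbyte's Postgres CDC reads the slot via the pgoutput plugin:

-- Requires a server restart after changing wal_level
ALTER SYSTEM SET wal_level = logical;

-- Dedicated replication user for Airbyte (name and password are illustrative)
CREATE USER airbyte_user WITH REPLICATION PASSWORD 'change_me';
GRANT SELECT ON users, orders TO airbyte_user;

-- Publication covering the tables to sync
CREATE PUBLICATION airbyte_publication FOR TABLE users, orders;

-- Logical replication slot read by the Airbyte source connector
SELECT pg_create_logical_replication_slot('airbyte_slot', 'pgoutput');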

Upon successful configuration, Airbyte will land the raw data in ClickHouse within tables named something like:

  • _airbyte_raw_users
  • _airbyte_raw_orders

The data within these tables is typically stored as a raw JSON blob per record, alongside a couple of Airbyte metadata columns (in the legacy raw format: _airbyte_ab_id, _airbyte_emitted_at, and _airbyte_data). Conceptually, each row looks like this:

{
  "_airbyte_ab_id": "a unique identifier for the Airbyte record",
  "_airbyte_emitted_at": "timestamp of when Airbyte processed the record",
  "_airbyte_data": {
    "id": "...",
    "user_id": "...",
    "...": "..."
  }
}
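
As a quick sanity check, you can inspect a few of the landed records directly in ClickHouse (the column names below assume the legacy Airbyte raw format described above):

-- Peek at the most recently landed raw order records
SELECT _airbyte_ab_id, _airbyte_emitted_at, _airbyte_data
FROM _airbyte_raw_orders
ORDER BY _airbyte_emitted_at DESC
LIMIT 5;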

Step 3: Crafting Insights with DBT - Transformation and Denormalization

Now comes the crucial step of shaping this raw data into an analytical powerhouse. This is where dbt shines. By connecting dbt to your ClickHouse instance (using an adapter such as dbt-clickhouse), you can write SQL-based models that transform the raw Airbyte output into your desired denormalized schema.

Let's look at an example dbt model, orders_flat.sql, that denormalizes the orders data by joining it with relevant information from the users table:

-- models/orders_flat.sql
-- (In a full dbt project the raw tables would usually be referenced via dbt's
--  source() macro; they are named directly here to keep the example self-contained.)
WITH raw_orders AS (
  -- Pull typed columns out of the raw JSON payload landed by Airbyte
  SELECT
    JSONExtractString(_airbyte_data, 'id') AS order_id,
    JSONExtractString(_airbyte_data, 'user_id') AS user_id,
    -- toDecimal128OrZero requires an explicit scale; 2 decimal places assumed for currency
    toDecimal128OrZero(JSONExtractString(_airbyte_data, 'total_amount'), 2) AS total_amount,
    parseDateTimeBestEffort(JSONExtractString(_airbyte_data, 'created_at')) AS created_at
  FROM _airbyte_raw_orders
),

enriched_orders AS (
  -- Denormalize: attach user attributes to every order row
  SELECT
    o.order_id,
    o.user_id,
    u.name AS user_name,
    u.email AS user_email,
    o.total_amount,
    o.created_at
  FROM raw_orders o
  LEFT JOIN (
    SELECT
      JSONExtractString(_airbyte_data, 'id') AS user_id,
      JSONExtractString(_airbyte_data, 'name') AS name,
      JSONExtractString(_airbyte_data, 'email') AS email
    FROM _airbyte_raw_users
  ) u ON o.user_id = u.user_id
)

SELECT * FROM enriched_orders

In this model:

  • We first extract the relevant fields from the raw JSON data ingested by Airbyte using ClickHouse's JSON functions like JSONExtractString. We also perform basic type casting.
  • Then, we join the extracted orders data with the relevant fields from the users data based on the user_id. This denormalizes the data, bringing related information into a single table.
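
One caveat worth noting: with an append-style CDC sync, the raw tables gain a new row for every change, so the same order or user id can show up more than once. A minimal sketch of collapsing the raw users down to their latest version (assuming the legacy _airbyte_emitted_at column and ClickHouse's argMax aggregate):

-- Keep only the most recent raw record per user id; append-style syncs land
-- one row per change, so ids can repeat in the raw table
SELECT
  JSONExtractString(_airbyte_data, 'id') AS user_id,
  argMax(JSONExtractString(_airbyte_data, 'name'), _airbyte_emitted_at) AS name,
  argMax(JSONExtractString(_airbyte_data, 'email'), _airbyte_emitted_at) AS email
FROM _airbyte_raw_users
GROUP BY user_id;

The same pattern (applied to the raw orders as well) can be folded into the model's CTEs before the join.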

Step 4: Optimizing for Speed in ClickHouse - Partitioning and Materialization

To truly unlock ClickHouse's analytical prowess, we need to structure our tables for optimal query performance. Partitioning and ordering are key techniques. Let's materialize the output of our dbt model (the enriched_orders result set) into a ClickHouse table with these optimizations:

-- One-off DDL shown for illustration; enriched_orders stands for the output of the
-- orders_flat model above (for example, first materialized as a view by dbt)
CREATE TABLE orders_flat
ENGINE = MergeTree
PARTITION BY toYYYYMM(created_at)
ORDER BY (user_id, created_at)
AS
SELECT * FROM enriched_orders;

Here's why this is important:

  • ENGINE = MergeTree: This is a family of powerful table engines in ClickHouse designed for high-performance data processing and analytics.
  • PARTITION BY toYYYYMM(created_at): Partitioning the data by year and month of the created_at column allows ClickHouse to efficiently skip irrelevant data during queries that filter by date ranges.
  • ORDER BY (user_id, created_at): Specifying an order key helps ClickHouse organize the data within each partition, enabling faster data retrieval for queries that filter or sort by these columns.
  • AS SELECT * FROM enriched_orders: This creates the table and populates it with the results of our dbt transformation.
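
Alternatively, rather than hand-writing that DDL, the dbt-clickhouse adapter lets the model declare its own engine, partitioning, and ordering, so dbt manages the physical layout for you. A sketch of what the top of orders_flat.sql could look like (config keys as exposed by dbt-clickhouse, values taken from the CREATE TABLE above):

-- models/orders_flat.sql: let dbt build the MergeTree table directly
{{ config(
    materialized = 'table',
    engine       = 'MergeTree()',
    order_by     = '(user_id, created_at)',
    partition_by = 'toYYYYMM(created_at)'
) }}

-- ...followed by the raw_orders / enriched_orders SQL from Step 3...

With this in place, each dbt run rebuilds orders_flat with the same partitioning and ordering, keeping the physical layout versioned alongside the transformation logic.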

Step 5: Unleashing Analytical Power - Querying Your Denormalized Data

With our denormalized and optimized data now residing in ClickHouse, we can execute analytical queries that would be prohibitively slow on our normalized PostgreSQL database, especially on large datasets.

For example, to count daily orders and calculate total revenue over the last 30 days:

SELECT
  toDate(created_at) AS order_date,
  COUNT(*) AS order_count,
  SUM(total_amount) AS total_revenue
FROM orders_flat
WHERE created_at >= now() - INTERVAL 30 DAY
GROUP BY order_date
ORDER BY order_date;

This query, leveraging ClickHouse's columnar storage and sparse primary index, typically returns in a fraction of the time the equivalent join-heavy query would need in PostgreSQL, providing valuable business insights.
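
And because user_name and user_email already live on every order row, queries that would have needed a join against users in PostgreSQL stay join-free here. For example, the top ten customers by revenue over the same window:

SELECT
  user_id,
  any(user_name) AS user_name,
  SUM(total_amount) AS total_revenue
FROM orders_flat
WHERE created_at >= now() - INTERVAL 30 DAY
GROUP BY user_id
ORDER BY total_revenue DESC
LIMIT 10;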

Automating Your Data Pipeline with CI/CD

To ensure a smooth and reliable data flow, consider automating your dbt transformations. You can achieve this by:

  • Using dbt Cloud: This managed service provides a web-based interface for developing, scheduling, and monitoring your dbt projects.
  • Implementing CI/CD Pipelines: Integrate your dbt runs into your Continuous Integration/Continuous Deployment (CI/CD) pipelines (e.g., using GitLab CI, GitHub Actions) to automatically trigger transformations whenever new code is merged or on a scheduled basis.
  • Data Contracts and Schema Registry (Optional): For more complex environments, consider implementing data contracts or using a schema registry to track and manage schema changes across your PostgreSQL and ClickHouse systems, preventing breaking changes.

Gotchas to Navigate

While this pattern is powerful, here are some common pitfalls and their solutions:

  • ClickHouse schema drift: Let dbt re-materialize the models and avoid relying on dynamic columns.
  • Timestamp mismatches: Normalize timestamps to UTC early in the pipeline (ideally within dbt).
  • NULL vs. empty string in JSON: Handle missing values explicitly, e.g. with ifNull() or assumeNotNull(), in ClickHouse queries (see the example below).
  • Data bloat in Airbyte raw tables: Apply retention policies or periodically drop/truncate the raw staging tables.
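
For the NULL-handling gotcha in particular, here is a minimal example (the field and table names follow the running example) of coalescing a missing JSON field to an empty string during extraction:

-- Coalesce a missing or NULL JSON field to an empty string
SELECT
  ifNull(JSONExtract(_airbyte_data, 'email', 'Nullable(String)'), '') AS email
FROM _airbyte_raw_users;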

Final Thoughts: A Powerful Paradigm

This architecture, leveraging the strengths of PostgreSQL for transactional integrity and ClickHouse for analytical speed, orchestrated by Airbyte for seamless ingestion and dbt for elegant transformation, offers a compelling solution for modern data pipelines.

It allows you to:

  • Safeguard your OLTP workloads in PostgreSQL without compromising analytical performance.
  • Offload demanding analytics to the lightning-fast columnar engine of ClickHouse.
  • Maintain a clear separation of concerns, avoiding the complexities of trying to fit an analytical workload onto a transactional database.
  • Future-proof your data infrastructure by adopting decoupled and specialized tools.

With Airbyte and dbt as your allies, your data becomes a fluid asset, readily transformed and analyzed to drive meaningful insights.

Next Up: Sharing Your Analytical Treasures

In Part 3, we'll explore how to expose your meticulously crafted ClickHouse OLAP layer as an API or embeddable dashboard for your customers, all while implementing robust access controls and cost attribution strategies. Stay tuned!