From Views to Tables: How I Optimized dbt Models with Delta on Databricks

When I started using dbt with Databricks, I was mostly focused on writing models and building transformations. But as the data volume grew, I realized the performance and cost implications of materializations and storage formats.

In this post, I want to share some real lessons learned while choosing between view, table, and incremental materializations—and why switching from Parquet to Delta Lake made a huge difference.

Views vs Tables vs Incremental: What Actually Works?

Here’s a quick breakdown of how I use each materialization in real projects:

  • view: lightweight models (renaming, filtering). Great for development, but recomputes every time it's queried.
  • table: medium-size dimensions. Useful for reference tables that rarely change.
  • incremental: large facts and high-frequency data. Only processes new or changed rows, which is ideal for big datasets (see the sketch after this list).
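
To make the incremental case concrete, here's a minimal sketch of what such a model can look like. stg_orders and order_date come from examples later in this post; order_id, order_status, and amount are hypothetical columns, and the max(order_date) watermark filter is just one common pattern for picking up new rows.

-- A minimal incremental model sketch (column names are illustrative)
{{ config(
    materialized='incremental',
    unique_key='order_id'
) }}

select
    order_id,
    order_date,
    order_status,
    amount
from {{ ref('stg_orders') }}

{% if is_incremental() %}
  -- On incremental runs, only pick up rows newer than what the target table already holds.
  -- {{ this }} resolves to the table built by previous runs.
  where order_date > (select max(order_date) from {{ this }})
{% endif %}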

DAG and ref() – How dbt Manages Dependencies

Instead of hardcoding table names, I use ref() to link models. For example:

SELECT *
FROM {{ ref('stg_orders') }}

This function does more than just alias a table. Behind the scenes, ref() helps dbt:

  • Build a Directed Acyclic Graph (DAG) of your models
  • Determine the correct run order during dbt run
  • Resolve fully qualified table names dynamically across dev/prod environments
  • Track lineage for documentation and testing

This becomes incredibly useful on Databricks, where the catalog and schema can differ between dev and prod workspaces.

For example, {{ ref('stg_orders') }} might compile into:

`dev_catalog`.`analytics_schema`.`stg_orders`

This makes your project environment-agnostic and easier to manage with Git-based workflows.
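
To make that concrete, here's roughly how the same one-line model compiles under different targets. The prod catalog name below is purely illustrative; the actual values come from your connection profile and target, not from the model SQL.

-- Source SQL, written once
select *
from {{ ref('stg_orders') }}

-- Compiled against a dev target (the example above)
select *
from `dev_catalog`.`analytics_schema`.`stg_orders`

-- Compiled against a prod target (catalog name is illustrative)
select *
from `prod_catalog`.`analytics_schema`.`stg_orders`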

Real Problem I Faced with Views on Databricks

In one pipeline, I had several dbt models materialized as view, assuming it would keep things fast and light. But as models started chaining together—one ref() leading to another—the final model failed due to out-of-memory errors.

That’s because dbt views are purely logical: they store no data, so Databricks has to re-execute all of the upstream SQL every time the final model runs. As the chain grew, so did the size of the compiled query and the memory needed to execute it.

Solution: Switching from view to table

I changed key intermediate models to use table materialization:

{{ config(materialized='table') }}

This persisted the results, reduced recomputation, and stabilized the pipeline.
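
For context, a converted intermediate model ended up looking roughly like the sketch below. The model name, the stg_customers reference, and the columns are hypothetical stand-ins; the point is where the config block sits and that the result is persisted.

-- models/intermediate/int_orders_enriched.sql (hypothetical name)
{{ config(materialized='table') }}

select
    o.order_id,
    o.order_date,
    o.amount,
    c.customer_segment
from {{ ref('stg_orders') }} as o
left join {{ ref('stg_customers') }} as c
    on o.customer_id = c.customer_id

-- Downstream models now read this persisted table instead of
-- re-running the whole upstream view chain on every query.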

Why I Use Delta Format on Databricks

If you're using Databricks and still materializing models as regular Parquet files, you're missing out.

Delta Lake adds:

  • ACID transactions (safe concurrent reads and writes)
  • Time travel (query past versions)
  • Efficient upserts and merges
  • Z-Ordering (for faster filtering)

In my dbt config, I now explicitly define:

{{ config(materialized='table', file_format='delta') }}

And for large tables, I follow up with:

OPTIMIZE analytics.fct_orders ZORDER BY (order_date)

This alone made queries run significantly faster for date-filtered dashboards.
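
Putting it together, a large Delta fact model can combine incremental materialization, the merge strategy, and the OPTIMIZE step in a single file. This is a sketch rather than my exact model: the columns are illustrative, and it assumes the dbt-databricks adapter's incremental_strategy='merge' along with the common pattern of running OPTIMIZE from a post_hook that references {{ this }}.

-- Sketch of a large Delta fact model (column names are illustrative).
-- The merge strategy upserts rows by unique_key using Delta's MERGE INTO,
-- and the post_hook re-clusters the table for date-filtered queries after each run.
{{ config(
    materialized='incremental',
    file_format='delta',
    incremental_strategy='merge',
    unique_key='order_id',
    post_hook="OPTIMIZE {{ this }} ZORDER BY (order_date)"
) }}

select
    order_id,
    order_date,
    order_status,
    amount
from {{ ref('stg_orders') }}

{% if is_incremental() %}
  -- Reprocessing the boundary date is safe here because merge upserts on order_id.
  where order_date >= (select max(order_date) from {{ this }})
{% endif %}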

Summary

  • Use view when developing, but be cautious of deep chains.
  • Switch to table when performance matters or complexity grows.
  • Always use ref() to link models—hardcoding paths will break your project at scale.
  • On Databricks, prefer Delta over Parquet for better reliability, query speed, and flexibility.

Tags:

#dbt #databricks #dataengineering #delta #sql #analyticsengineering