Study Note dlt Fundamentals Course - Lesson 2: dlt Sources and Resources, Create First dlt Pipeline

Overview

In this lesson, you learned how to declare dlt resources, group them into a source, and run them through a dlt pipeline. You also learned about dlt transformers and how to use them to perform additional steps in the pipeline.

Key Concepts

  • dlt Resources: A resource is a logical grouping of data within a data source, typically holding data of similar structure and origin.
  • dlt Sources: A source is a logical grouping of resources, e.g., endpoints of a single API.
  • dlt Transformers: Special dlt resources that can be fed data from another resource to perform additional steps in the pipeline.

Creating a dlt Pipeline

A dlt pipeline loads data from your resources into a destination. The simplest resource is a function decorated with @dlt.resource that returns or yields data; here, a list of dictionaries:

import dlt

@dlt.resource
def my_dict_list():
    # each dictionary becomes one row in the destination table
    return [
        {"id": 1, "name": "Pikachu"},
        {"id": 2, "name": "Charizard"}
    ]
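To turn this into a pipeline, pass the resource to pipeline.run(). A minimal sketch; the pipeline and dataset names below are arbitrary placeholders:

pipeline = dlt.pipeline(
    pipeline_name="pokemon_pipeline",  # hypothetical name
    destination="duckdb",
    dataset_name="pokemon_data",       # hypothetical name
)
load_info = pipeline.run(my_dict_list)
print(load_info)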

Using dlt Sources

A source is a logical grouping of resources. You can declare a source by decorating a function that returns or yields one or more resources with @dlt.source.

@dlt.source
def my_source():
    # other_resource stands in for any additional @dlt.resource defined elsewhere
    return [
        my_dict_list(),
        other_resource()
    ]
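A source runs through a pipeline the same way a single resource does (reusing the pipeline object from the sketch above):

load_info = pipeline.run(my_source())
print(load_info)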

Using dlt Transformers

dlt transformers are special resources that can be fed data from another resource to perform additional steps in the pipeline.

import dlt
import requests

@dlt.transformer(data_from=my_dict_list)  # data_from wires the transformer to its parent resource
def get_pokemon_info(data):
    # the parent returns a whole list, so `data` is the full batch of rows
    for pokemon in data:
        # the original URL was lost; this assumes the public PokeAPI as the lookup endpoint
        response = requests.get(f"https://pokeapi.co/api/v2/pokemon/{pokemon['id']}")
        pokemon["info"] = response.json()
    return data
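Besides the data_from argument (an addition to the original snippet), a transformer can be connected to its parent with the pipe operator when running the pipeline:

load_info = pipeline.run(my_dict_list | get_pokemon_info)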

Exercise 1: Create a Pipeline for GitHub API - Repos Endpoint

  • Explore the GitHub API and find the endpoint that lists an organization's public repositories.
  • Build the pipeline using dlt.pipeline, dlt.resource, and dlt.source to extract the data and load it into a destination; a sketch follows this list.
  • Use a duckdb connection, sql_client, or pipeline.dataset() to check the number of columns in the github_repos table.
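A minimal sketch of one possible solution, assuming the dlt-hub organization and fetching only the first page of results:

import dlt
from dlt.sources.helpers import requests  # requests wrapper bundled with dlt

@dlt.resource(table_name="github_repos")
def github_repos(org="dlt-hub"):
    # GitHub REST endpoint listing an organization's public repositories
    response = requests.get(f"https://api.github.com/orgs/{org}/repos")
    response.raise_for_status()
    yield response.json()

pipeline = dlt.pipeline(
    pipeline_name="github_pipeline",  # hypothetical name
    destination="duckdb",
    dataset_name="github_data",       # hypothetical name
)
print(pipeline.run(github_repos))

# one way to count columns: read the table back through the dataset accessor
print(len(pipeline.dataset().github_repos.df().columns))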

Exercise 2: Create a Pipeline for GitHub API - Stargazers Endpoint

  • Create a dlt.transformer for the "stargazers" endpoint for the dlt-hub organization; a sketch follows this list.
  • Use the github_repos resource as the parent resource for the transformer.
  • Use a duckdb connection, sql_client, or pipeline.dataset() to check the number of columns in the github_stargazer table.
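A sketch building on the github_repos resource and pipeline above; the endpoint and table name reflect assumptions, not a prescribed solution:

@dlt.transformer(data_from=github_repos, table_name="github_stargazer")
def github_stargazer(repos):
    # github_repos yields a page (a list) of repos at a time
    for repo in repos:
        response = requests.get(
            f"https://api.github.com/repos/dlt-hub/{repo['name']}/stargazers"
        )
        response.raise_for_status()
        yield response.json()

print(pipeline.run(github_stargazer))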

Reducing Nesting Level of Generated Tables

You can limit how deep dlt goes when generating nested tables and flattening dicts into columns by setting max_table_nesting on the source or resource decorator. By default, the library descends and generates nested tables for all nested lists, without limit.

@dlt.source(max_table_nesting=1)
def my_source():
    return [
        my_dict_list()
    ]

Typical Settings

  • max_table_nesting=0: do not generate nested tables at all; all nested data is serialized into JSON columns.
  • max_table_nesting=1: generate nested tables only for lists directly under the root table; anything nested deeper is serialized into JSON.
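The limit can also be changed after a source is created, without touching the decorator (a sketch reusing my_source and pipeline from above):

source = my_source()
source.max_table_nesting = 0  # serialize all nested data as JSON instead of child tables
pipeline.run(source)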

Next Steps

  • Proceed to the next lesson to learn more about dlt pipelines and how to use them to extract and load data into a destination.