DUCKDB, S3 Tables with iceberg using Iceberg Rest API
I wrote my previous article about Duckberg, a combination of PyIceberg with DuckDB to access Iceberg tables on AWS.
But on Pi Day (14 March) DuckDB released a preview with native Iceberg and Glue integration for read and write. I'll show you how this works.
AWS, create S3 table bucket
- Open S3
- Select Table Bucket
- Enable the integration with AWS analytics services (this only needs to be done once).
- Create Table Bucket
- Give it a name like
testtablebucket
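The same table bucket can also be created from code. Below is a minimal sketch using the boto3 S3 Tables client; the client name and response field are based on the AWS SDK for S3 Tables, and the bucket name is the one used in this post.
import boto3

# Sketch: create the table bucket programmatically instead of via the console.
# Assumes the boto3 "s3tables" client; region and bucket name as used in this post.
s3tables = boto3.client("s3tables", region_name="eu-west-1")
response = s3tables.create_table_bucket(name="testtablebucket")
print(response["arn"])  # ARN of the new table bucket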
AWS, create S3 Table
- Select the bucket you created before
- Click Create Table with Athena
- Create a namespace or select an existing one. This is similar to a Glue database, but not the same. (A programmatic sketch for this step follows the CREATE TABLE statement below.) Create
testnamespace
- This opens Athena
- If you haven't done this before, create an S3 bucket and edit the Athena workgroup to use this bucket as the query output location
- Execute the CREATE TABLE statement:
CREATE TABLE `testnamespace`.daily_sales (
sale_date date,
product_category string,
sales_amount double)
PARTITIONED BY (month(sale_date))
TBLPROPERTIES ('table_type' = 'iceberg')
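As referenced above, the namespace itself can also be created programmatically instead of through the Athena dialog. A minimal sketch with the boto3 S3 Tables client, assuming the parameter names of the CreateNamespace API; <account_id> in the ARN is a placeholder.
import boto3

# Sketch: create the namespace from code; replace <account_id> with your account ID.
s3tables = boto3.client("s3tables", region_name="eu-west-1")
s3tables.create_namespace(
    tableBucketARN="arn:aws:s3tables:eu-west-1:<account_id>:bucket/testtablebucket",
    namespace=["testnamespace"],
)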
- Create a new query to insert some test values
INSERT INTO daily_sales
VALUES
(DATE '2024-01-15', 'Laptop', 900.00),
(DATE '2024-01-15', 'Monitor', 250.00),
(DATE '2024-01-16', 'Laptop', 1350.00),
(DATE '2024-02-01', 'Monitor', 300.00),
(DATE '2024-02-01', 'Keyboard', 60.00),
(DATE '2024-02-02', 'Mouse', 25.00),
(DATE '2024-02-02', 'Laptop', 1050.00),
(DATE '2024-02-03', 'Laptop', 1200.00),
(DATE '2024-02-03', 'Monitor', 375.00);
- Test that the data is correctly loaded
SELECT
product_category,
COUNT(*) as units_sold,
SUM(sales_amount) as total_revenue,
AVG(sales_amount) as average_price
FROM daily_sales
WHERE sale_date BETWEEN DATE '2024-02-01' AND DATE '2024-02-29'
GROUP BY product_category
ORDER BY total_revenue DESC;
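For completeness, the Athena statements above can also be submitted from code. This is only a sketch: the catalog name s3tablescatalog/testtablebucket is an assumption about how the S3 Tables integration shows up in Athena, and the output location is a placeholder.
import boto3

# Sketch: run the verification query through the Athena API.
athena = boto3.client("athena", region_name="eu-west-1")

query = """
SELECT product_category, SUM(sales_amount) AS total_revenue
FROM daily_sales
WHERE sale_date BETWEEN DATE '2024-02-01' AND DATE '2024-02-29'
GROUP BY product_category
"""

execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={
        "Catalog": "s3tablescatalog/testtablebucket",  # assumption, adjust to your setup
        "Database": "testnamespace",
    },
    ResultConfiguration={"OutputLocation": "s3://<your-athena-output-bucket>/"},
)
print(execution["QueryExecutionId"])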
DuckDB, setup
This follows the DuckDB blog post.
Set the credentials in your shell, for example with export AWS_PROFILE=myuser, or with the AWS access key environment variables.
Make sure you have DuckDB 1.2.1 or newer installed.
Run the new DuckDB web interface with: duckdb -ui
FORCE INSTALL aws FROM core_nightly;
FORCE INSTALL httpfs FROM core_nightly;
FORCE INSTALL iceberg FROM core_nightly;
CREATE SECRET (
TYPE s3,
PROVIDER credential_chain
);
ATTACH 'arn:aws:s3tables:eu-west-1:<account_id>:bucket/testtablebucket'
AS s3_tables_db (
TYPE iceberg,
ENDPOINT_TYPE s3_tables
);
Replace <account_id> with your account ID and use the correct S3 table bucket name.
Run SHOW ALL TABLES; to view the internal tables and the available S3 tables.
DuckDB, Query Data
Run: FROM s3_tables_db.testnamespace.daily_sales;
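The same setup and query can also be driven from Python with the duckdb package. A minimal sketch, assuming the nightly extensions as above; <account_id> is a placeholder.
import duckdb

# Sketch: reproduce the CLI steps with the duckdb Python package.
con = duckdb.connect()
for ext in ("aws", "httpfs", "iceberg"):
    con.execute(f"FORCE INSTALL {ext} FROM core_nightly")
    con.execute(f"LOAD {ext}")

# Pick up credentials from the environment (AWS_PROFILE or access keys)
con.execute("CREATE SECRET (TYPE s3, PROVIDER credential_chain)")

# Attach the table bucket; replace <account_id> with your account ID
con.execute("""
    ATTACH 'arn:aws:s3tables:eu-west-1:<account_id>:bucket/testtablebucket'
    AS s3_tables_db (TYPE iceberg, ENDPOINT_TYPE s3_tables)
""")

print(con.sql("SHOW ALL TABLES"))
print(con.sql("FROM s3_tables_db.testnamespace.daily_sales"))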
DuckDB, Alter Data
In the blog post there is an option to use ALTER TABLE, but unfortunately I still get: Not implemented Error: Alter Schema Entry
ALTER TABLE s3_tables_db.testnamespace.daily_sales
ADD COLUMN test STRING;
Maybe this works only with the Iceberg REST method.
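Until ALTER TABLE works through the DuckDB extension, a possible workaround is to evolve the schema with PyIceberg instead. This is only a sketch: the S3 Tables Iceberg REST endpoint, signing name and warehouse value are assumptions and may need adjusting for your setup; <account_id> is a placeholder.
from pyiceberg.catalog import load_catalog
from pyiceberg.types import StringType

region = "eu-west-1"

# Assumed: the S3 Tables Iceberg REST endpoint with SigV4 signing;
# replace <account_id> with your account ID.
catalog = load_catalog(
    "s3tables",
    **{
        "type": "rest",
        "uri": f"https://s3tables.{region}.amazonaws.com/iceberg",
        "warehouse": f"arn:aws:s3tables:{region}:<account_id>:bucket/testtablebucket",
        "rest.sigv4-enabled": "true",
        "rest.signing-name": "s3tables",
        "rest.signing-region": region,
    },
)

table = catalog.load_table("testnamespace.daily_sales")

# Add the column through PyIceberg's schema evolution API
with table.update_schema() as update:
    update.add_column("test", StringType())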
DuckDB, Attach Catalog via Iceberg REST
The steps above access the tables directly on S3 Tables, but the latest extension also supports the Iceberg REST API of the Glue Data Catalog / AWS SageMaker Lakehouse, which exposes the data through a standardized interface.
ATTACH '311141556126:s3tablescatalog/testnamespace'
AS glue_db (
TYPE iceberg,
ENDPOINT_TYPE glue
);
SHOW ALL TABLES;
Gives me:
IO Error: Failed to query https://glue.eu-central-1.amazonaws.com/iceberg/v1/catalogs/311141556126:s3tablescatalog:testnamespace/namespaces, http error 404 thrown. Message: {"error":{"code":404,"message":"Catalog not found.","type":"EntityNotFoundException"}}
My PyIceberg script also only sees the Glue databases, not the S3 Tables namespace.
from pyiceberg.catalog import load_catalog

region = 'eu-west-1'

# REST catalog against the Glue Iceberg REST endpoint
rest_catalog = load_catalog(
    "testnamespace",
    **{
        "type": "rest",
        "uri": f"https://glue.{region}.amazonaws.com/iceberg",
        "rest.sigv4-enabled": "true",
        "rest.signing-name": "glue",
        "rest.signing-region": region,
    },
)
print(rest_catalog.list_namespaces())
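One thing I still want to try: the Glue Iceberg REST endpoint appears to scope requests to a specific catalog via a warehouse property, so passing the S3 Tables catalog ID there might surface the namespace. A sketch only; the exact warehouse format is an assumption.
from pyiceberg.catalog import load_catalog

region = 'eu-west-1'
account_id = '311141556126'

# Assumed: scoping the REST catalog to the S3 Tables catalog via "warehouse"
rest_catalog = load_catalog(
    "s3tablescatalog",
    **{
        "type": "rest",
        "uri": f"https://glue.{region}.amazonaws.com/iceberg",
        "warehouse": f"{account_id}:s3tablescatalog/testtablebucket",
        "rest.sigv4-enabled": "true",
        "rest.signing-name": "glue",
        "rest.signing-region": region,
    },
)
print(rest_catalog.list_namespaces())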
Time for some more investigation.