mirror of https://github.com/dlt-hub/dlt.git synced 2025-12-17 19:31:30 +00:00

Rakesh V. 34669f1ac7 Feat/iceberg advanced partitioning (#3053 )

* feat: implement advanced Iceberg partitioning with explicit ordering

- Add support for advanced partition transforms (year, month, day, hour, bucket, truncate)
- Implement explicit partition ordering via index property
- Add custom partition naming support
- Implement priority system: advanced partitioning overrides legacy partition: True
- Add comprehensive validation for partition specifications
- Add graceful error handling for PyIceberg limitations
- Add performance optimization with early exit for non-partitioned schemas
- Update schema typing to support dict/list partition syntax
- Add pyiceberg-core>=0.6.0 dependency for advanced transforms
- Add comprehensive test suite with 22+ test cases covering all scenarios

Backward compatible: existing partition: True syntax continues to work
Resolves partition ordering limitations in Iceberg table format

* Port iceberg_partition and build_iceberg_partition_spec to dlt core

* update type hint in IcebergLoadFilesystemJob

* Add tests for Iceberg advanced partitioning; remove unused partition extraction code

* Add docs for iceberg_adapter

---------

Co-authored-by: Anton Burnashev <anton.burnashev@gmail.com>

2025-12-12 10:57:56 +01:00

.github

docs: add runtime docs to CLI reference (#3445 )

2025-12-09 17:30:53 +01:00

deploy/dlt

docs: add runtime docs to CLI reference (#3445 )

2025-12-09 17:30:53 +01:00

dlt

Feat/iceberg advanced partitioning (#3053 )

2025-12-12 10:57:56 +01:00

docs

Feat/iceberg advanced partitioning (#3053 )

2025-12-12 10:57:56 +01:00

tests

Feat/iceberg advanced partitioning (#3053 )

2025-12-12 10:57:56 +01:00

tools

migrate to uv (#2766 )

2025-06-19 10:11:24 +02:00

.dockerignore

adds tests to build containers, removes psutil

2023-06-02 20:10:43 +02:00

.editorconfig

adds basic docstrings

2022-12-11 21:54:04 +01:00

.git-blame-ignore-revs

Create .git-blame-ignore-revs

2023-11-23 10:34:25 +01:00

.gitignore

implements run artifacts sync to a bucket using filesystem (#3339 )

2025-12-04 15:48:39 +01:00

compiled_packages.txt

drop python 3.8, enable python 3.13, and enable full linting for 3.12 (#2194 )

2025-01-12 16:40:41 +01:00

CONTRIBUTING.md

refresh docs intro (#3270 )

2025-10-31 17:14:49 +01:00

LICENSE.txt

docs/removes dlt plus docs and adds eula (#3079 )

2025-09-21 00:15:08 +02:00

Makefile

(chore) adds hub extra (#3428 )

2025-12-05 16:15:19 +01:00

mypy.ini

ingests parquet into mssql, mysql and sqlite via ADBC (#3333 )

2025-11-28 17:13:19 +01:00

pyproject.toml

Feat/iceberg advanced partitioning (#3053 )

2025-12-12 10:57:56 +01:00

README.md

docs: update incorrect LLM-native workflow link (404 error) (#3294 )

2025-11-06 15:43:18 +01:00

tox.ini

fixes leaking datasets tests (#2730 )

2025-06-11 22:17:05 +02:00

uv.lock

Feat/iceberg advanced partitioning (#3053 )

2025-12-12 10:57:56 +01:00

README.md

data load tool (dlt) — the open-source Python library that automates all your tedious data loading tasks

Be it a Google Colab notebook, AWS Lambda function, an Airflow DAG, your local laptop,
or a GPT-4 assisted development playground—dlt can be dropped in anywhere.

🚀 Join our thriving community of like-minded developers and build the future together!

Installation

dlt supports Python 3.9 through Python 3.14. Note that some optional extras are not yet available for Python 3.14, so support for this version is considered experimental.

pip install dlt

Quick Start

Load chess player data from the Chess.com API and save it in DuckDB:

import dlt
from dlt.sources.helpers import requests

# Create a dlt pipeline that will load
# chess player data to the DuckDB destination
pipeline = dlt.pipeline(
    pipeline_name='chess_pipeline',
    destination='duckdb',
    dataset_name='player_data'
)

# Grab some player data from Chess.com API
data = []
for player in ['magnuscarlsen', 'rpragchess']:
    response = requests.get(f'https://api.chess.com/pub/player/{player}')
    response.raise_for_status()
    data.append(response.json())

# Extract, normalize, and load the data
pipeline.run(data, table_name='player')

Try it out in our Colab Demo or directly on our wasm-based playground in our docs.

Features

dlt is an open-source Python library that loads data from various, often messy data sources into well-structured datasets. It provides lightweight Python interfaces to extract, load, inspect, and transform data. dlt and dlt docs are built from the ground up to be used with LLMs: the LLM-native workflow will take your pipeline code to data in a notebook for over 5000 sources.

dlt is designed to be easy to use, flexible, and scalable:

dlt extracts data from REST APIs, SQL databases, cloud storage, Python data structures, and many more.
dlt infers schemas and data types, normalizes the data, and handles nested data structures.
dlt supports a variety of popular destinations and has an interface to add custom destinations to create reverse ETL pipelines.
dlt automates pipeline maintenance with incremental loading, schema evolution, and schema and data contracts.
dlt supports Python and SQL data access, transformations, pipeline inspection, and visualizing data in Marimo Notebooks.
dlt can be deployed anywhere Python runs, be it on Airflow, serverless functions, or any other cloud deployment of your choice.

Documentation

For detailed usage and configuration, please refer to the official documentation.

Examples

You can find examples for various use cases in the examples folder, or in the code examples section of our docs page.

Adding as dependency

dlt follows semantic versioning with the MAJOR.MINOR.PATCH pattern.

major means breaking changes and removed deprecations
minor means new features and, sometimes, automatic migrations
patch means bug fixes

We suggest that you allow only patch level updates automatically:

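Using the Compatible Release Specifier. For example, dlt~=1.0.0 allows only versions >=1.0.0 and <1.1.0 (note that dlt~=1.0 would also allow new minor versions, up to <2.0)
Poetry tilde requirements. For example, ~1.0 allows only versions >=1.0.0 and <1.1.0 (the caret form ^1.0 would allow everything up to <2.0.0)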
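
Under PEP 440, the compatible release operator `~=` drops the last version component and bumps the one before it to get the upper bound, so the range depends on how many components you write. The snippet below is a minimal, standard-library-only sketch of that rule. The helper name `compatible_release_bounds` is hypothetical and not part of dlt or pip, and it handles only plain numeric versions (no pre-releases, post-releases, or epochs).

```python
from typing import Tuple


def compatible_release_bounds(spec: str) -> Tuple[str, str]:
    """Return (inclusive lower bound, exclusive upper bound) for a '~=' specifier.

    Hypothetical helper for illustration; supports plain numeric versions only.
    """
    if not spec.startswith("~="):
        raise ValueError("expected a specifier of the form '~=X.Y' or '~=X.Y.Z'")
    parts = [int(p) for p in spec[2:].strip().split(".")]
    if len(parts) < 2:
        raise ValueError("'~=' needs at least two version components")
    # Drop the last component and bump the one before it to get the upper bound.
    prefix = parts[:-1]
    prefix[-1] += 1
    lower = ".".join(str(p) for p in parts)
    upper = ".".join(str(p) for p in prefix)
    return lower, upper


print(compatible_release_bounds("~=1.0.0"))  # ('1.0.0', '1.1'): patch updates only
print(compatible_release_bounds("~=1.0"))    # ('1.0', '2'): minor updates allowed too
```

In a requirements file, the three-component form (for example `dlt~=1.0.0`) is therefore the one that restricts automatic updates to patch releases.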

Please also see our release notes for notable changes between versions.

Get Involved

The dlt project is quickly growing, and we're excited to have you join our community! Here's how you can get involved:

Connect with the community: Join other dlt users and contributors on our Slack.
Report issues and suggest features: Please use GitHub Issues to report bugs or suggest new features. Before creating a new issue, make sure to search the tracker for possible duplicates and add a comment if you find one.
Track progress of our work and our plans: Please check out our public GitHub project.
Improve documentation: Help us enhance the dlt documentation.

Contribute code

Please read CONTRIBUTING before you make a PR.

📣 New destinations are unlikely to be merged due to high maintenance cost (but we are happy to improve SQLAlchemy destination to handle more dialects)
Significant changes require tests and docs, and in many cases writing tests will be more laborious than writing code.
Bugfixes and improvements are welcome! You'll get help with writing tests and docs + a decent review.

License

dlt is released under the Apache 2.0 License.