mirror of
https://github.com/dlt-hub/dlt.git
synced 2025-12-17 19:31:30 +00:00
@@ -1,7 +1,7 @@
|
||||
---
|
||||
title: "Data quality 🧪"
|
||||
description: Validate your data and control its quality
|
||||
keywords: ["dlthub", "data quality", "contracts"]
|
||||
keywords: ["dlthub", "data quality", "contracts", "check", "metrics"]
|
||||
---
|
||||
|
||||
:::warning
|
||||
@@ -18,3 +18,178 @@ With dltHub, you will be able to:
|
||||
* Enforce constraints on distributions, boundaries, and expected values.
|
||||
|
||||
Stay tuned for updates as we expand these capabilities! 🚀
|
||||
|
||||
## Checks
|
||||
A **data quality check** or **check** is a function applied to data that returns a **check result** or **result** (can be boolean, integer, float, etc.). The result that is converted to a success / fail **check outcome** or **outcome** (boolean) based on a **decision**.
|
||||
|
||||
:::info
|
||||
A **test** verifies that **code** behaves as expected. A **check** verifies that the **data** meets some expectations. Code tests enable you to make changes with confidence and data checks help monitor your live systems.
|
||||
:::
|
||||
|
||||
For example, the check `is_in(column_name, accepted_values)` verifies that the column only includes accepted values. Running the check counts successful records to compute the success rate (**result**). The **outcome** will be a success if success rate is 100%, i.e., all records succeeded the check (**decision**).
|
||||
|
||||
This snippet shows a single `is_in()` check being ran against the `orders` table in the `point_of_sale` dataset.
|
||||
|
||||
```py
|
||||
import dlt
|
||||
from dlt.hub import data_quality as dq
|
||||
|
||||
dataset = dlt.dataset("duckdb", "point_of_sale")
|
||||
checks = [
|
||||
dq.checks.is_in("payment_methods", ["card", "cash", "voucher"])
|
||||
]
|
||||
|
||||
# prepare a query to execute all checks
|
||||
check_plan = dq.prepare_checks(dataset["orders"], checks=checks)
|
||||
# execute checks and get results in-memory
|
||||
check_plan.df()
|
||||
```
|
||||
|
||||
|
||||
### Check level
|
||||
The **check level** indicates the granularity of the **check result**. For instance:
|
||||
- **Row-level** checks produce a result per record. It's possible to inspect which specific records pass / failed the check.
|
||||
|
||||
- **Table-level** checks produce a result per table (e.g., result is "the number of unique values" and decision is "is this greater than 5?").
|
||||
|
||||
These checks can often be rewritten as row-level checks (e.g., "is this value unique?")
|
||||
|
||||
- **Dataset-level** checks produce a result per dataset. This typically involves multiple tables, temporal comparisons, or pipeline information (e.g., "the number of rows in 'orders' is higher than the number of rows in 'customers')
|
||||
|
||||
:::important
|
||||
Notice that the **check level** relates to the result and not the **input data** of the check. For instance, a row-level check can involve multiple tables as input.
|
||||
:::
|
||||
|
||||
|
||||
### Built-in checks
|
||||
The library `dlthub` includes many built-in checks: `is_in()`, `is_unique()`, `is_primary_key()`, and more. The built-in `case()` and `where()` simplify custom row and table-level checks respectively.
|
||||
|
||||
For example, the following are equivalent:
|
||||
|
||||
```py
|
||||
from dlt.hub import data_quality as dq
|
||||
|
||||
dq.checks.is_unique("foo")
|
||||
dq.checks.case("COUNT(*) OVER (PARTITION BY foo) > 1")
|
||||
```
|
||||
|
||||
|
||||
### Custom checks (WIP)
|
||||
Can be implemented as a `dlt.hub.transformation` that matches a specific schema or as subclass of `_BaseCheck` for full control. This allows to use any language supported by transformations, allowing eager/lazy and in-memory/on-backend execution.
|
||||
|
||||
Notes:
|
||||
- Should have utilities to statically validate check definitions (especially lazy)
|
||||
- Should have testing utilities that makes it easy to unit test checks (same utilities as transformations)
|
||||
|
||||
|
||||
### Lifecycle
|
||||
Data quality checks can be executed at different stages of the pipeline lifecycle. This choice has several impacts, including:
|
||||
- the **input data** available for the check
|
||||
- the compute resources used
|
||||
- the **actions** available (e.g., drop invalid load)
|
||||
|
||||
<!--How does this affect transactions? How do we handle errors in the data quality part-->
|
||||
|
||||
#### Post-load
|
||||
The post-load execution is the simplest option. The pipeline goes through `Extract -> Normalize -> Load` as usual. Then, the checks are executed on the destination.
|
||||
|
||||
Properties:
|
||||
- Failed records can't be dropped or quarantined before load. All records must be written, checked, and then handled. This only works with `write_disposition="append"` or destinations supporting snapshots (e.g. `iceberg`, `ducklake`).
|
||||
- Checks have access to the full dataset. This includes current and past loads + internal dlt tables.
|
||||
- Computed directly on the destination. This scales well with the size of the data and the complexity of the checks.
|
||||
- Results and outcome are directly stored on the dataset. No data movement is required.
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
participant Resource
|
||||
participant Pipeline
|
||||
participant Dataset
|
||||
|
||||
Resource->>Pipeline: Extract
|
||||
Pipeline->>Pipeline: Normalize
|
||||
Pipeline->>Dataset: Load
|
||||
Dataset->>Dataset: Run Checks
|
||||
```
|
||||
|
||||
#### Pre-load (staging)
|
||||
The pre-load execution via staging dataset allows to execute checks on the destination and trigger actions before data is loaded into the dataset. This is effectively using **post-load** checks before a 2nd load phase.
|
||||
|
||||
:::info
|
||||
`dlt` uses staging datasets for other features such as `merge` and `replace` write dispositions.
|
||||
:::
|
||||
|
||||
Properties:
|
||||
- Failed records can be dropped or quarantined before load. This works with all `write_disposition`
|
||||
- Requires a destination that supports staging datasets.
|
||||
- Checks have access to the current load.
|
||||
- If the staging dataset is on the same destination, checks can access the full dataset.
|
||||
- If the staging dataset is on a different destination, communication between the staging dataset and the dataset.
|
||||
- Computed on the staging destination. This scales well with the size of the data and the complexity of the checks.
|
||||
- Data and checks results & outcome can be safely stored on the staging dataset until review. This helps human-in-the-loop workflows without reprocessing the full pipeline.
|
||||
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
participant Resource
|
||||
participant Pipeline
|
||||
participant Staging as Staging Dataset
|
||||
participant Dataset
|
||||
|
||||
Resource->>Pipeline: Extract
|
||||
Pipeline->>Pipeline: Normalize
|
||||
Pipeline->>Staging: Load
|
||||
Staging->>Staging: Run Checks
|
||||
Staging->>Dataset: Load
|
||||
```
|
||||
|
||||
|
||||
#### Pre-load (in-memory)
|
||||
|
||||
The pre-load execution in-memory will execute checks using `duckdb` against the load packages (i.e., temporary files) stored on the machine that runs `dlt`. This allows to trigger actions before data is loaded into the destination.
|
||||
|
||||
:::note
|
||||
This is equivalent to using a staging destination that is the local filesystem. This section highlights the trade-offs of this choice.
|
||||
:::
|
||||
|
||||
Properties:
|
||||
- Failed records can be dropped or quarantined before load. This works with all `write_disposition
|
||||
- Checks only have access to the current load. Checking against the full dataset requires communication between the staging destination and the main destination.
|
||||
- Computed on the machine running the pipeline. The resource need to match the compute requirements.
|
||||
- Data and checks results & outcome may be lost if the runtime is ephemeral (e.g., AWS Lambda timeout). In this case, the pipeline must process the data again.
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
participant Resource
|
||||
participant Pipeline
|
||||
participant Dataset
|
||||
|
||||
Resource->>Pipeline: Extract
|
||||
Pipeline->>Pipeline: Normalize
|
||||
Pipeline->>Pipeline: Run checks
|
||||
Pipeline->>Dataset: Load
|
||||
```
|
||||
|
||||
## Migration and versioning (WIP)
|
||||
|
||||
As the real-world change, their can be addition, removal, or modification of data quality checks for your pipeline / dataset. This is require for proper auditing.
|
||||
|
||||
For example, the check `is_in("division", ["europe", "america"])` defined in 2024 could evolve to `is_in("division", ["europe", "america", "asia"])` in 2026.
|
||||
|
||||
Notes:
|
||||
- checks need to be serialized and hashed (trivial for lazy checks)
|
||||
- checks can be stored on schema (consequently on the destination too)
|
||||
- this is the same challenge as versioning transformations
|
||||
|
||||
## Action (WIP)
|
||||
After running checks, **actions** can be triggered based on the **check result** or **check outcome**.
|
||||
|
||||
Notes:
|
||||
- actions can be configured globally or per-check
|
||||
- planned actions: drop data, quarantine data (move to a special dataset), resolve (e.g., fill value, set default, apply transformation), fail (prevents load), raise/alert (sends notification)
|
||||
- This needs to be configurable from outside the source code (e.g., via `config.toml`). The same checks would require different action during development vs. prod
|
||||
|
||||
|
||||
## Metrics
|
||||
|
||||
|
||||
|
||||
|
||||
Reference in New Issue
Block a user