docs: data_quality concept page (#3341)

* wrote data quality docs page
Thierry Jean
2025-11-26 10:18:11 -05:00
committed by GitHub
parent 91eacbff4c
commit ff6d28185d


@@ -1,7 +1,7 @@
---
title: "Data quality 🧪"
description: Validate your data and control its quality
keywords: ["dlthub", "data quality", "contracts"]
keywords: ["dlthub", "data quality", "contracts", "check", "metrics"]
---
:::warning
@@ -18,3 +18,178 @@ With dltHub, you will be able to:
* Enforce constraints on distributions, boundaries, and expected values.
Stay tuned for updates as we expand these capabilities! 🚀
## Checks
A **data quality check** or **check** is a function applied to data that returns a **check result** or **result** (a boolean, integer, float, etc.). The result is then converted into a pass / fail **check outcome** or **outcome** (boolean) based on a **decision**.
:::info
A **test** verifies that **code** behaves as expected. A **check** verifies that the **data** meets some expectations. Code tests enable you to make changes with confidence and data checks help monitor your live systems.
:::
For example, the check `is_in(column_name, accepted_values)` verifies that the column only includes accepted values. Running the check counts successful records to compute the success rate (**result**). The **outcome** is a success if the success rate is 100%, i.e., all records passed the check (**decision**).
This snippet shows a single `is_in()` check being run against the `orders` table in the `point_of_sale` dataset.
```py
import dlt
from dlt.hub import data_quality as dq
# access the existing dataset on the duckdb destination
dataset = dlt.dataset("duckdb", "point_of_sale")
checks = [
dq.checks.is_in("payment_methods", ["card", "cash", "voucher"])
]
# prepare a query to execute all checks
check_plan = dq.prepare_checks(dataset["orders"], checks=checks)
# execute checks and get results in-memory
check_plan.df()
```
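Conceptually, the **decision** step turns the numeric result into a boolean outcome. Here is a minimal plain-Python sketch of the `is_in()` example above (an illustration of the result → decision → outcome flow, not the `dlthub` API):
```py
# plain-Python illustration of result -> decision -> outcome (not the dlthub API)
records = ["card", "cash", "voucher", "paypal"]
accepted_values = {"card", "cash", "voucher"}

# result: the success rate of the row-level check
success_rate = sum(value in accepted_values for value in records) / len(records)

# decision: the outcome is a success only if every record passed
outcome = success_rate == 1.0
print(success_rate, outcome)  # 0.75 False
```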
### Check level
The **check level** indicates the granularity of the **check result**. For instance:
- **Row-level** checks produce a result per record. It's possible to inspect which specific records passed / failed the check.
- **Table-level** checks produce a result per table (e.g., the result is "the number of unique values" and the decision is "is this greater than 5?").
These checks can often be rewritten as row-level checks (e.g., "is this value unique?").
- **Dataset-level** checks produce a result per dataset. This typically involves multiple tables, temporal comparisons, or pipeline information (e.g., "the number of rows in 'orders' is higher than the number of rows in 'customers'").
:::note
Notice that the **check level** relates to the result and not the **input data** of the check. For instance, a row-level check can involve multiple tables as input.
:::
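To make the levels concrete, here is an illustrative sketch of what each level computes (the table, column names, and queries are assumptions for illustration, not generated by `dlthub`):
```py
import duckdb

# toy orders table; column names are assumptions for illustration
con = duckdb.connect()
con.sql(
    "CREATE TABLE orders AS "
    "SELECT * FROM (VALUES (1, 'card'), (2, 'cash'), (3, 'cash')) t(order_id, payment_method)"
)

# row-level: one result per record
con.sql("SELECT order_id, payment_method IN ('card', 'cash', 'voucher') AS passed FROM orders").show()

# table-level: one result for the whole table
con.sql("SELECT COUNT(DISTINCT payment_method) > 5 AS passed FROM orders").show()

# dataset-level checks would additionally involve other tables
# (e.g., comparing row counts of orders vs. customers)
```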
### Built-in checks
The library `dlthub` includes many built-in checks: `is_in()`, `is_unique()`, `is_primary_key()`, and more. The built-in `case()` and `where()` simplify custom row-level and table-level checks respectively.
For example, the following are equivalent:
```py
from dlt.hub import data_quality as dq
# built-in row-level uniqueness check
dq.checks.is_unique("foo")
# the same condition written with case(): the value appears exactly once
dq.checks.case("COUNT(*) OVER (PARTITION BY foo) = 1")
```
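As with any check, built-ins plug into `prepare_checks()`. For example, reusing the dataset from the first snippet (the `order_id` column is a hypothetical example; use a column from your own schema):
```py
import dlt
from dlt.hub import data_quality as dq

dataset = dlt.dataset("duckdb", "point_of_sale")

# "order_id" is a hypothetical column name used for illustration
checks = [
    dq.checks.is_unique("order_id"),
    dq.checks.is_primary_key("order_id"),
]
dq.prepare_checks(dataset["orders"], checks=checks).df()
```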
### Custom checks (WIP)
Custom checks can be implemented as a `dlt.hub.transformation` that matches a specific schema, or as a subclass of `_BaseCheck` for full control. This allows using any language supported by transformations, enabling eager/lazy and in-memory/on-backend execution.
Notes:
- Should have utilities to statically validate check definitions (especially lazy ones)
- Should have testing utilities that make it easy to unit test checks (same utilities as transformations); one possible pattern is sketched below
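Until such utilities exist, one pattern for unit testing the logic behind a row-level check is to run its SQL condition against a tiny in-memory fixture. A minimal sketch using plain `duckdb` (not a `dlthub` testing utility):
```py
import duckdb

# illustrative only: evaluate a row-level condition against a small fixture table
con = duckdb.connect()
con.sql("CREATE TABLE fixture(foo VARCHAR)")
con.executemany("INSERT INTO fixture VALUES (?)", [("a",), ("b",), ("b",)])

# the uniqueness condition: each value of foo appears exactly once
passed = dict(con.sql("SELECT foo, COUNT(*) OVER (PARTITION BY foo) = 1 FROM fixture").fetchall())
assert passed == {"a": True, "b": False}
```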
### Lifecycle
Data quality checks can be executed at different stages of the pipeline lifecycle. This choice has several impacts, including:
- the **input data** available for the check
- the compute resources used
- the **actions** available (e.g., drop invalid load)
<!--How does this affect transactions? How do we handle errors in the data quality part-->
#### Post-load
The post-load execution is the simplest option. The pipeline goes through `Extract -> Normalize -> Load` as usual. Then, the checks are executed on the destination.
Properties:
- Failed records can't be dropped or quarantined before load. All records must be written, checked, and then handled. This only works with `write_disposition="append"` or destinations supporting snapshots (e.g. `iceberg`, `ducklake`).
- Checks have access to the full dataset. This includes current and past loads + internal dlt tables.
- Computed directly on the destination. This scales well with the size of the data and the complexity of the checks.
- Results and outcome are directly stored on the dataset. No data movement is required.
```mermaid
sequenceDiagram
participant Resource
participant Pipeline
participant Dataset
Resource->>Pipeline: Extract
Pipeline->>Pipeline: Normalize
Pipeline->>Dataset: Load
Dataset->>Dataset: Run Checks
```
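In code, a post-load run could look like the following sketch. `dlt.pipeline()`, `pipeline.run()`, and `pipeline.dataset()` are standard `dlt` calls; the `dq` calls mirror the first snippet on this page, assuming they can be pointed at the destination dataset after the load completes:
```py
import dlt
from dlt.hub import data_quality as dq

# a simple pipeline loading into the duckdb dataset used in the examples above
pipeline = dlt.pipeline("point_of_sale", destination="duckdb", dataset_name="point_of_sale")
pipeline.run([{"order_id": 1, "payment_methods": "card"}], table_name="orders")

# checks run against the destination once the load has completed
checks = [dq.checks.is_in("payment_methods", ["card", "cash", "voucher"])]
dq.prepare_checks(pipeline.dataset()["orders"], checks=checks).df()
```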
#### Pre-load (staging)
The pre-load execution via a staging dataset lets you execute checks on the destination and trigger actions before data is loaded into the main dataset. This is effectively running **post-load** checks before a second load phase.
:::info
`dlt` uses staging datasets for other features such as `merge` and `replace` write dispositions.
:::
Properties:
- Failed records can be dropped or quarantined before load. This works with all `write_disposition` values.
- Requires a destination that supports staging datasets.
- Checks have access to the current load.
  - If the staging dataset is on the same destination, checks can also access the full dataset.
  - If the staging dataset is on a different destination, checking against the full dataset requires communication between the staging dataset and the main dataset.
- Computed on the staging destination. This scales well with the size of the data and the complexity of the checks.
- Data, check results, and outcomes can be safely stored on the staging dataset until review. This helps human-in-the-loop workflows without reprocessing the full pipeline.
```mermaid
sequenceDiagram
participant Resource
participant Pipeline
participant Staging as Staging Dataset
participant Dataset
Resource->>Pipeline: Extract
Pipeline->>Pipeline: Normalize
Pipeline->>Staging: Load
Staging->>Staging: Run Checks
Staging->>Dataset: Load
```
#### Pre-load (in-memory)
The pre-load in-memory execution runs checks using `duckdb` against the load packages (i.e., temporary files) stored on the machine that runs `dlt`. This allows triggering actions before data is loaded into the destination.
:::note
This is equivalent to using a staging destination that is the local filesystem. This section highlights the trade-offs of this choice.
:::
Properties:
- Failed records can be dropped or quarantined before load. This works with all `write_disposition` values.
- Checks only have access to the current load. Checking against the full dataset requires communication between the staging destination and the main destination.
- Computed on the machine running the pipeline. Its resources need to match the compute requirements of the checks.
- Data, check results, and outcomes may be lost if the runtime is ephemeral (e.g., an AWS Lambda timeout). In this case, the pipeline must process the data again.
```mermaid
sequenceDiagram
participant Resource
participant Pipeline
participant Dataset
Resource->>Pipeline: Extract
Pipeline->>Pipeline: Normalize
Pipeline->>Pipeline: Run checks
Pipeline->>Dataset: Load
```
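Conceptually, this option amounts to pointing `duckdb` at the local load package files before they are shipped to the destination. A rough illustration (the file path and format are hypothetical; `dlt` manages load packages internally):
```py
import duckdb

# hypothetical local load package file; dlt manages these paths and formats internally
package_file = "/tmp/dlt/load_packages/orders.parquet"

# the is_in() check expressed as a query over the not-yet-loaded data
duckdb.sql(
    "SELECT AVG(CASE WHEN payment_methods IN ('card', 'cash', 'voucher') "
    f"THEN 1 ELSE 0 END) AS success_rate FROM read_parquet('{package_file}')"
).show()
```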
## Migration and versioning (WIP)
As the real world changes, data quality checks for your pipeline / dataset may be added, removed, or modified. Tracking these changes is required for proper auditing.
For example, the check `is_in("division", ["europe", "america"])` defined in 2024 could evolve to `is_in("division", ["europe", "america", "asia"])` in 2026.
Notes:
- checks need to be serialized and hashed (trivial for lazy checks)
- checks can be stored on schema (consequently on the destination too)
- this is the same challenge as versioning transformations
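A minimal illustration of the serialize-and-hash idea (plain Python, not the planned `dlthub` mechanism):
```py
import hashlib
import json

# illustrative only: a check definition reduced to a serializable form
check_2024 = {"check": "is_in", "column": "division", "values": ["europe", "america"]}
check_2026 = {"check": "is_in", "column": "division", "values": ["europe", "america", "asia"]}

def check_hash(check: dict) -> str:
    # a stable hash lets you detect when a check definition changed between runs
    return hashlib.sha256(json.dumps(check, sort_keys=True).encode()).hexdigest()[:16]

print(check_hash(check_2024) == check_hash(check_2026))  # False: the check evolved
```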
## Action (WIP)
After running checks, **actions** can be triggered based on the **check result** or **check outcome**.
Notes:
- actions can be configured globally or per-check
- planned actions: drop data, quarantine data (move to a special dataset), resolve (e.g., fill value, set default, apply transformation), fail (prevents load), raise/alert (sends notification)
- This needs to be configurable from outside the source code (e.g., via `config.toml`). The same checks would require different actions during development vs. production (see the hypothetical sketch below)
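A hypothetical `config.toml` sketch of what environment-specific action configuration could look like (the section and key names are invented for illustration; this is not a released configuration schema):
```toml
# hypothetical keys, for illustration only
[data_quality.actions]
default_on_fail = "quarantine"   # development: keep failing records for review

[data_quality.actions.checks.is_in_payment_methods]
on_fail = "fail"                 # production: prevent the load entirely
```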
## Metrics