{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "y0sqFhxJnH5r"
      },
      "source": [
        "# **Introduction** [Open in molab](https://molab.marimo.io/github/dlt-hub/dlt/blob/master/docs/education/dlt-advanced-course/lesson_8_logging_and_tracing.py) [Open in Colab](https://colab.research.google.com/github/dlt-hub/dlt/blob/master/docs/education/dlt-advanced-course/lesson_8_logging_and_tracing.ipynb) [View on GitHub](https://github.com/dlt-hub/dlt/blob/master/docs/education/dlt-advanced-course/lesson_8_logging_and_tracing.ipynb)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "_T-Syi9IGjH5"
      },
      "source": [
        "In this notebook, we focus on pipeline metadata and how to use it to trace and debug our pipelines.\n",
        "\n",
        "First, we create the pipeline we'll inspect throughout this notebook."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "z6IXlZgLHMIf"
      },
      "source": [
        "## Create the pipeline we will inspect"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "lFKYz_nwHO-o"
      },
      "outputs": [],
      "source": [
        "%%capture\n",
        "!pip install dlt"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "6b80QMawHKQW"
      },
      "outputs": [],
      "source": [
        "from typing import Iterable, Union\n",
        "\n",
        "import dlt\n",
        "from dlt.extract import DltResource\n",
        "from dlt.common.typing import TDataItems\n",
        "from dlt.sources.helpers.rest_client import RESTClient\n",
        "from dlt.sources.helpers.rest_client.auth import BearerTokenAuth\n",
        "from dlt.sources.helpers.rest_client.paginators import HeaderLinkPaginator\n",
        "\n",
        "from google.colab import userdata\n",
        "\n",
        "dlt.secrets[\"SOURCES__SECRET_KEY\"] = userdata.get(\"ACCESS_TOKEN\")\n",
        "\n",
        "\n",
        "@dlt.source\n",
        "def github_source(secret_key: str = dlt.secrets.value) -> Iterable[DltResource]:\n",
        "    client = RESTClient(\n",
        "        base_url=\"https://api.github.com\",\n",
        "        auth=BearerTokenAuth(token=secret_key),\n",
        "        paginator=HeaderLinkPaginator(),\n",
        "    )\n",
        "\n",
        "    @dlt.resource\n",
        "    def github_pulls(\n",
        "        cursor_date: dlt.sources.incremental[str] = dlt.sources.incremental(\n",
        "            \"updated_at\", initial_value=\"2024-12-01\"\n",
        "        )\n",
        "    ) -> TDataItems:\n",
        "        params = {\"since\": cursor_date.last_value, \"status\": \"open\"}\n",
        "        for page in client.paginate(\"repos/dlt-hub/dlt/pulls\", params=params):\n",
        "            yield page\n",
        "\n",
        "    return github_pulls\n",
        "\n",
        "\n",
        "# define a new dlt pipeline\n",
        "pipeline = dlt.pipeline(\n",
        "    pipeline_name=\"github_pipeline\",\n",
        "    destination=\"duckdb\",\n",
        "    dataset_name=\"github_data\",\n",
        ")\n",
        "\n",
        "\n",
        "# run the pipeline with the new resource\n",
        "load_info = pipeline.run(github_source())\n",
        "print(load_info)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "KzJkSJe0LqEX"
      },
      "source": [
        "## Look at the data"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "tfl6kHFQLsLe"
      },
      "outputs": [],
      "source": [
        "import duckdb\n",
        "\n",
        "conn = duckdb.connect(f\"{pipeline.pipeline_name}.duckdb\")\n",
        "\n",
        "conn.sql(\"SHOW ALL TABLES\").df()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "__-fKtLPzf0z"
      },
      "source": [
        "More importantly, let's look at the saved load info:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "7Y223fvDMdYe"
      },
      "outputs": [],
      "source": [
        "conn.sql(\"select * from github_data._dlt_loads\").df()"
      ]
    },
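    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Beyond the `_dlt_loads` table, the pipeline object also keeps a trace of its most recent run. As a minimal sketch (attribute names taken from the dlt run-trace docs; verify against your dlt version):"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# inspect the trace of the last pipeline run (step timings, resolved config values)\n",
        "print(pipeline.last_trace)\n",
        "\n",
        "# the normalize step info includes row counts per table\n",
        "print(pipeline.last_trace.last_normalize_info)"
      ]
    },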
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "I8UWx_rpnK7u"
      },
      "source": [
        "# **Tracing with Sentry**"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "OAGZ0Y-jyLbZ"
      },
      "source": [
        "You can enable tracing through Sentry.\n",
        "\n",
        "## What is `Sentry` 🤔\n",
        "\n",
        "`Sentry` is an open-source error tracking and performance monitoring tool that helps developers **identify**, **monitor**, and **fix issues** in their applications in real time."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "rFvQG8oFyP4N"
      },
      "outputs": [],
      "source": [
        "%%capture\n",
        "!pip install sentry-sdk"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Qj5VYSfKGOt5"
      },
      "outputs": [],
      "source": [
        "import sentry_sdk"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "rEhAANOpPVKx"
      },
      "source": [
        "### Sentry needs to be initialized in normal scripts\n",
        "\n",
        "```\n",
        "import sentry_sdk\n",
        "import os\n",
        "\n",
        "sentry_sdk.init(\n",
        "    dsn=os.getenv(\"RUNTIME__SENTRY_DSN\"),\n",
        "    traces_sample_rate=1.0,  # adjust this for performance monitoring if needed\n",
        ")\n",
        "```"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "YEJcHL_EO8aW"
      },
      "source": [
        "### Say you make an error and catch it with Sentry:\n",
        "\n",
        "```\n",
        "try:\n",
        "    1 / 0\n",
        "except ZeroDivisionError as e:\n",
        "    sentry_sdk.capture_exception(e)\n",
        "```"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "It will then show up on your Sentry dashboard:"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "*(Sentry dashboard screenshot)*"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "pkEv4goZWb_8"
      },
      "source": [
        "Even when an error arises after Sentry has been initialized, your program keeps executing normally, but the error is sent to your dashboard so it can be tracked!"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "SwYMd4U4Rr7C"
      },
      "source": [
        "### In dlt, you can enable Sentry quite easily\n",
        "\n",
        "You can configure the `DSN` in `config.toml`:\n",
        "\n",
        "```\n",
        "[runtime]\n",
        "sentry_dsn=\"https://<...>\"\n",
        "```\n",
        "\n",
        "Alternatively, you can use environment variables. **This is what we'll be doing**:\n",
        "```\n",
        "RUNTIME__SENTRY_DSN=\"https://<...>\"\n",
        "```\n",
        "The Sentry client is configured after the first pipeline is created with `dlt.pipeline()`. Feel free to call `sentry_sdk.init()` again to cover your specific needs."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Bi1yh7CKtTW_"
      },
      "source": [
        "Let's now introduce an error and watch it reach Sentry."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "vUJrYa6pUWrl"
      },
      "outputs": [],
      "source": [
        "from google.colab import userdata\n",
        "\n",
        "dlt.config[\"RUNTIME__SENTRY_DSN\"] = userdata.get(\"SENTRY_TOKEN\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "lY0fH-svxG3o"
      },
      "outputs": [],
      "source": [
        "# run with a malformed record to trigger a load error\n",
        "data = {12: 34}\n",
        "\n",
        "info = pipeline.run([data], table_name=\"issues\")\n",
        "info"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "And that comes up in Sentry as well:"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "*(Sentry dashboard screenshot)*"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ck0pELZQZJdK"
      },
      "source": [
        "The message sent to Sentry is:\n",
        "```\n",
        "Job for issues.a3f927c556.insert_values failed terminally in load 1723645286.6510239 with message Constraint Error: NOT NULL constraint failed: issues.id\n",
        "```"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "RVUKMuNcnLLP"
      },
      "source": [
        "# **Logging**"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "r3nA0DSpdSJO"
      },
      "source": [
        "There are many environments where we would be completely lost without logs.\n",
        "\n",
        "Debugging any system would be incredibly hard if we didn't know what was going on, or at what point the program ran into an error."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "8D_cKzYWfVaL"
      },
      "source": [
        "### Setting log levels in `dlt`"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ImSsdFbdeIsu"
      },
      "source": [
        "You can set the log level in your `config.toml` file:\n",
        "\n",
        "```\n",
        "[runtime]\n",
        "log_level=\"INFO\"\n",
        "```\n",
        "\n",
        "`log_level` accepts the Python standard logging level names.\n",
        "\n",
        "The default log level is `WARNING`.\n",
        "\n",
        "**`INFO` log level is useful when diagnosing problems in production.**\n",
        "\n",
        "**`CRITICAL` will disable logging.**\n",
        "\n",
        "**`DEBUG` should not be used in production.**"
      ]
    },
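    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To see what level filtering means in practice, here is a small standard-library sketch (plain `logging`, independent of dlt): records below the configured level are simply dropped."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import logging\n",
        "\n",
        "demo_logger = logging.getLogger(\"level_demo\")\n",
        "demo_logger.setLevel(logging.WARNING)  # same as dlt's default\n",
        "\n",
        "# INFO is below WARNING, so INFO records would be filtered out\n",
        "print(demo_logger.isEnabledFor(logging.INFO))   # False\n",
        "# WARNING and above pass through\n",
        "print(demo_logger.isEnabledFor(logging.ERROR))  # True"
      ]
    },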
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "1xhMKleauYpM"
      },
      "source": [
        "We'll be setting the log level programmatically via `dlt.config`:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "IK07cQE7aERW"
      },
      "outputs": [],
      "source": [
        "import dlt\n",
        "\n",
        "dlt.config[\"RUNTIME__LOG_LEVEL\"] = \"INFO\""
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "GT8_xifTuv1A"
      },
      "source": [
        "dlt logs to a logger named `dlt`.\n",
        "\n",
        "It is a regular Python logger, so you can configure its handlers to fit your requirements."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "GWMDXIRIuwi2"
      },
      "outputs": [],
      "source": [
        "import logging\n",
        "\n",
        "# Get the logger named \"dlt\"\n",
        "logger = logging.getLogger(\"dlt\")\n",
        "\n",
        "# Set the log level\n",
        "logger.setLevel(logging.INFO)\n",
        "\n",
        "# Create a file handler\n",
        "handler = logging.FileHandler(\"dlt.log\")\n",
        "\n",
        "# Add the handler to the logger\n",
        "logger.addHandler(handler)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "shyhxkeAqKx4"
      },
      "outputs": [],
      "source": [
        "# using the standard logging file handler\n",
        "pipeline = dlt.pipeline(\n",
        "    pipeline_name=\"github_issues_merge_logger\",\n",
        "    destination=\"duckdb\",\n",
        "    dataset_name=\"github_data_merge\",\n",
        ")\n",
        "load_info = pipeline.run(github_source())\n",
        "print(load_info)\n",
        "# INFO-level records also end up in dlt.log via the file handler above"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "8p9G2lMsDX6d"
      },
      "source": [
        "### Logging via `Loguru` in our GitHub example"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "N3v7Azj-otzc"
      },
      "outputs": [],
      "source": [
        "%%capture\n",
        "!pip install loguru"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "IMT138h4pX6t"
      },
      "source": [
        "Let's set the logging level:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "HPTPzyD4paZ_"
      },
      "outputs": [],
      "source": [
        "import dlt\n",
        "\n",
        "dlt.config[\"RUNTIME__LOG_LEVEL\"] = \"INFO\""
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "W-oFszpnQGwi"
      },
      "outputs": [],
      "source": [
        "import logging\n",
        "import sys\n",
        "from typing import Union\n",
        "\n",
        "from loguru import logger as loguru_logger\n",
        "\n",
        "\n",
        "# subclass logging.Handler so records from the stdlib logger are forwarded to loguru\n",
        "class InterceptHandler(logging.Handler):\n",
        "    # loguru decorator that catches any exception raised in emit() and logs it\n",
        "    @loguru_logger.catch(default=True, onerror=lambda _: sys.exit(1))\n",
        "    def emit(self, record: logging.LogRecord) -> None:\n",
        "        # Get the corresponding loguru level if it exists.\n",
        "        try:\n",
        "            level: Union[str, int] = loguru_logger.level(record.levelname).name\n",
        "        except ValueError:\n",
        "            level = record.levelno\n",
        "\n",
        "        # Find the caller (call frame) from which the logged message originated.\n",
        "        frame, depth = sys._getframe(6), 6\n",
        "        while frame and frame.f_code.co_filename == logging.__file__:\n",
        "            frame = frame.f_back\n",
        "            depth += 1\n",
        "\n",
        "        # log the message via loguru with the level, exception info, and depth\n",
        "        loguru_logger.opt(depth=depth, exception=record.exc_info).log(\n",
        "            level, record.getMessage()\n",
        "        )\n",
        "\n",
        "\n",
        "logger_dlt = logging.getLogger(\"dlt\")\n",
        "logger_dlt.addHandler(InterceptHandler())\n",
        "\n",
        "# all logs will be written to dlt_loguru.log\n",
        "loguru_logger.add(\"dlt_loguru.log\");"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "00IQqzbcQQam"
      },
      "outputs": [],
      "source": [
        "# using the loguru-backed logger\n",
        "pipeline = dlt.pipeline(\n",
        "    pipeline_name=\"github_issues_merge_loguru\",\n",
        "    destination=\"duckdb\",\n",
        "    dataset_name=\"github_data_merge\",\n",
        ")\n",
        "load_info = pipeline.run(github_source())\n",
        "print(load_info)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "SzOlc7B7sBLu"
      },
      "source": [
        "## **Logs for monitoring progress**"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "SMOQqbsBwIfR"
      },
      "outputs": [],
      "source": [
        "import dlt\n",
        "\n",
        "dlt.config[\"RUNTIME__LOG_LEVEL\"] = \"WARNING\"\n",
        "\n",
        "\n",
        "# progress=\"log\" periodically logs the progress of the extract, normalize, and load steps\n",
        "pipeline = dlt.pipeline(\n",
        "    pipeline_name=\"github_issues_progress\",\n",
        "    destination=\"duckdb\",\n",
        "    dataset_name=\"github_data_merge\",\n",
        "    progress=\"log\",\n",
        ")\n",
        "load_info = pipeline.run(github_source())\n",
        "print(load_info)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "AH3F46PaJZe4"
      },
      "source": [
        "✅ ▶ Proceed to the [next lesson](https://github.com/dlt-hub/dlt/blob/master/docs/education/dlt-advanced-course/lesson_9_performance_optimisation.ipynb)!"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}