{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "y0sqFhxJnH5r"
      },
      "source": [
        "# **Introduction** [Open in molab](https://molab.marimo.io/github/dlt-hub/dlt/blob/master/docs/education/dlt-advanced-course/lesson_8_logging_and_tracing.py) [Open in Colab](https://colab.research.google.com/github/dlt-hub/dlt/blob/master/docs/education/dlt-advanced-course/lesson_8_logging_and_tracing.ipynb) [View on GitHub](https://github.com/dlt-hub/dlt/blob/master/docs/education/dlt-advanced-course/lesson_8_logging_and_tracing.ipynb)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "_T-Syi9IGjH5"
      },
      "source": [
        "In this notebook, we focus on pipeline metadata and how to use it to trace and debug our pipelines.\n",
        "\n",
        "First, we create the pipeline we'll inspect throughout this notebook."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "z6IXlZgLHMIf"
      },
      "source": [
        "## Create the pipeline we will inspect"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "lFKYz_nwHO-o"
      },
      "outputs": [],
      "source": [
        "%%capture\n",
        "!pip install dlt"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "6b80QMawHKQW"
      },
      "outputs": [],
      "source": [
        "from typing import Iterable, Union\n",
        "\n",
        "import dlt\n",
        "from dlt.extract import DltResource\n",
        "from dlt.common.typing import TDataItems\n",
        "from dlt.sources.helpers.rest_client import RESTClient\n",
        "from dlt.sources.helpers.rest_client.auth import BearerTokenAuth\n",
        "from dlt.sources.helpers.rest_client.paginators import HeaderLinkPaginator\n",
        "\n",
        "from google.colab import userdata\n",
        "\n",
        "dlt.secrets[\"SOURCES__SECRET_KEY\"] = userdata.get(\"ACCESS_TOKEN\")\n",
        "\n",
        "\n",
        "@dlt.source\n",
        "def github_source(secret_key: str = dlt.secrets.value) -> Iterable[DltResource]:\n",
        "    client = RESTClient(\n",
        "        base_url=\"https://api.github.com\",\n",
        "        auth=BearerTokenAuth(token=secret_key),\n",
        "        paginator=HeaderLinkPaginator(),\n",
        "    )\n",
        "\n",
        "    @dlt.resource\n",
        "    def github_pulls(\n",
        "        cursor_date: dlt.sources.incremental[str] = dlt.sources.incremental(\n",
        "            \"updated_at\", initial_value=\"2024-12-01\"\n",
        "        )\n",
        "    ) -> TDataItems:\n",
        "        params = {\"since\": cursor_date.last_value, \"status\": \"open\"}\n",
        "        for page in client.paginate(\"repos/dlt-hub/dlt/pulls\", params=params):\n",
        "            yield page\n",
        "\n",
        "    return github_pulls\n",
        "\n",
        "\n",
        "# define a new dlt pipeline\n",
        "pipeline = dlt.pipeline(\n",
        "    pipeline_name=\"github_pipeline\",\n",
        "    destination=\"duckdb\",\n",
        "    dataset_name=\"github_data\",\n",
        ")\n",
        "\n",
        "\n",
        "# run the pipeline with the new resource\n",
        "load_info = pipeline.run(github_source())\n",
        "print(load_info)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "KzJkSJe0LqEX"
      },
      "source": [
        "## Look at the data"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "tfl6kHFQLsLe"
      },
      "outputs": [],
      "source": [
        "import duckdb\n",
        "\n",
        "conn = duckdb.connect(f\"{pipeline.pipeline_name}.duckdb\")\n",
        "\n",
        "conn.sql(\"SHOW ALL TABLES\").df()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "__-fKtLPzf0z"
      },
      "source": [
        "More importantly, let's look at the saved load info:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "7Y223fvDMdYe"
      },
      "outputs": [],
      "source": [
        "conn.sql(\"select * from github_data._dlt_loads\").df()"
      ]
    },
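    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Beyond the `_dlt_loads` table, the pipeline object also keeps a trace of its most recent run. As a minimal sketch (attribute names taken from the dlt run-trace docs; verify against your dlt version):"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "# inspect the trace of the last pipeline run (step timings, resolved config values)\n",
        "print(pipeline.last_trace)\n",
        "\n",
        "# the normalize step info includes row counts per table\n",
        "print(pipeline.last_trace.last_normalize_info)"
      ]
    },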
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "I8UWx_rpnK7u"
      },
      "source": [
        "# **Tracing with Sentry**"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "OAGZ0Y-jyLbZ"
      },
      "source": [
        "You can enable tracing through Sentry.\n",
        "\n",
        "## What is `Sentry` 🤔\n",
        "\n",
        "`Sentry` is an open-source error tracking and performance monitoring tool that helps developers **identify**, **monitor**, and **fix issues** in their applications in real time."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "rFvQG8oFyP4N"
      },
      "outputs": [],
      "source": [
        "%%capture\n",
        "!pip install sentry-sdk"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Qj5VYSfKGOt5"
      },
      "outputs": [],
      "source": [
        "import sentry_sdk"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "rEhAANOpPVKx"
      },
      "source": [
        "### Sentry needs to be initialized in normal scripts\n",
        "\n",
        "```\n",
        "import sentry_sdk\n",
        "import os\n",
        "\n",
        "sentry_sdk.init(\n",
        "    dsn=os.getenv(\"RUNTIME__SENTRY_DSN\"),\n",
        "    traces_sample_rate=1.0,  # adjust this for performance monitoring if needed\n",
        ")\n",
        "```"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "YEJcHL_EO8aW"
      },
      "source": [
        "### Say you make an error and catch it with Sentry:\n",
        "\n",
        "```\n",
        "try:\n",
        "    1 / 0\n",
        "except ZeroDivisionError as e:\n",
        "    sentry_sdk.capture_exception(e)\n",
        "```"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "It will then show up on your Sentry dashboard:"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "*(Sentry dashboard screenshot)*"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "pkEv4goZWb_8"
      },
      "source": [
        "Even when an error arises after Sentry has been initialized, your program keeps executing normally, but the error is sent to your dashboard so it can be tracked!"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "SwYMd4U4Rr7C"
      },
      "source": [
        "### In dlt, you can enable Sentry quite easily\n",
        "\n",
        "You can configure the `DSN` in `config.toml`:\n",
        "\n",
        "```\n",
        "[runtime]\n",
        "sentry_dsn=\"https://<...>\"\n",
        "```\n",
        "\n",
        "Alternatively, you can use environment variables. **This is what we'll be doing**:\n",
        "```\n",
        "RUNTIME__SENTRY_DSN=\"https://<...>\"\n",
        "```\n",
        "The Sentry client is configured after the first pipeline is created with `dlt.pipeline()`. Feel free to call `sentry_sdk.init()` again to cover your specific needs."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Bi1yh7CKtTW_"
      },
      "source": [
        "Let's now introduce an error and watch it reach Sentry."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "vUJrYa6pUWrl"
      },
      "outputs": [],
      "source": [
        "from google.colab import userdata\n",
        "\n",
        "dlt.config[\"RUNTIME__SENTRY_DSN\"] = userdata.get(\"SENTRY_TOKEN\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "lY0fH-svxG3o"
      },
      "outputs": [],
      "source": [
        "# run with a malformed record to trigger a load error\n",
        "data = {12: 34}\n",
        "\n",
        "info = pipeline.run([data], table_name=\"issues\")\n",
        "info"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "And that comes up in Sentry as well:"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "*(Sentry dashboard screenshot)*"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ck0pELZQZJdK"
      },
      "source": [
        "The message sent to Sentry is:\n",
        "```\n",
        "Job for issues.a3f927c556.insert_values failed terminally in load 1723645286.6510239 with message Constraint Error: NOT NULL constraint failed: issues.id\n",
        "```"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "RVUKMuNcnLLP"
      },
      "source": [
        "# **Logging**"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "r3nA0DSpdSJO"
      },
      "source": [
        "There are many environments where we would be completely lost without logs.\n",
        "\n",
        "Debugging any system would be incredibly hard if we didn't know what was going on, or at what point the program ran into an error."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "8D_cKzYWfVaL"
      },
      "source": [
        "### Setting log levels in `dlt`"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ImSsdFbdeIsu"
      },
      "source": [
        "You can set the log level in your `config.toml` file:\n",
        "\n",
        "```\n",
        "[runtime]\n",
        "log_level=\"INFO\"\n",
        "```\n",
        "\n",
        "`log_level` accepts the Python standard logging level names.\n",
        "\n",
        "The default log level is `WARNING`.\n",
        "\n",
        "**`INFO` log level is useful when diagnosing problems in production.**\n",
        "\n",
        "**`CRITICAL` will disable logging.**\n",
        "\n",
        "**`DEBUG` should not be used in production.**"
      ]
    },
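    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "To see what level filtering means in practice, here is a small standard-library sketch (plain `logging`, independent of dlt): records below the configured level are simply dropped."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import logging\n",
        "\n",
        "demo_logger = logging.getLogger(\"level_demo\")\n",
        "demo_logger.setLevel(logging.WARNING)  # same as dlt's default\n",
        "\n",
        "# INFO is below WARNING, so INFO records would be filtered out\n",
        "print(demo_logger.isEnabledFor(logging.INFO))   # False\n",
        "# WARNING and above pass through\n",
        "print(demo_logger.isEnabledFor(logging.ERROR))  # True"
      ]
    },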
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "1xhMKleauYpM"
      },
      "source": [
        "We'll be setting the log level programmatically via `dlt.config`:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "IK07cQE7aERW"
      },
      "outputs": [],
      "source": [
        "import dlt\n",
        "\n",
        "dlt.config[\"RUNTIME__LOG_LEVEL\"] = \"INFO\""
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "GT8_xifTuv1A"
      },
      "source": [
        "dlt logs to a logger named `dlt`.\n",
        "\n",
        "It is a regular Python logger, so you can configure its handlers to fit your requirements."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "GWMDXIRIuwi2"
      },
      "outputs": [],
      "source": [
        "import logging\n",
        "\n",
        "# Get the logger named \"dlt\"\n",
        "logger = logging.getLogger(\"dlt\")\n",
        "\n",
        "# Set the log level\n",
        "logger.setLevel(logging.INFO)\n",
        "\n",
        "# Create a file handler\n",
        "handler = logging.FileHandler(\"dlt.log\")\n",
        "\n",
        "# Add the handler to the logger\n",
        "logger.addHandler(handler)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "shyhxkeAqKx4"
      },
      "outputs": [],
      "source": [
        "# using the standard logging file handler\n",
        "pipeline = dlt.pipeline(\n",
        "    pipeline_name=\"github_issues_merge_logger\",\n",
        "    destination=\"duckdb\",\n",
        "    dataset_name=\"github_data_merge\",\n",
        ")\n",
        "load_info = pipeline.run(github_source())\n",
        "print(load_info)\n",
        "# INFO-level records also end up in dlt.log via the file handler above"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "8p9G2lMsDX6d"
      },
      "source": [
        "### Logging via `Loguru` in our GitHub example"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "N3v7Azj-otzc"
      },
      "outputs": [],
      "source": [
        "%%capture\n",
        "!pip install loguru"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "IMT138h4pX6t"
      },
      "source": [
        "Let's set the logging level:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "HPTPzyD4paZ_"
      },
      "outputs": [],
      "source": [
        "import dlt\n",
        "\n",
        "dlt.config[\"RUNTIME__LOG_LEVEL\"] = \"INFO\""
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "W-oFszpnQGwi"
      },
      "outputs": [],
      "source": [
        "import logging\n",
        "import sys\n",
        "from typing import Union\n",
        "\n",
        "from loguru import logger as loguru_logger\n",
        "\n",
        "\n",
        "# subclass logging.Handler so records from the stdlib logger are forwarded to loguru\n",
        "class InterceptHandler(logging.Handler):\n",
        "    # loguru decorator that catches any exception raised in emit() and logs it\n",
        "    @loguru_logger.catch(default=True, onerror=lambda _: sys.exit(1))\n",
        "    def emit(self, record: logging.LogRecord) -> None:\n",
        "        # Get the corresponding loguru level if it exists.\n",
        "        try:\n",
        "            level: Union[str, int] = loguru_logger.level(record.levelname).name\n",
        "        except ValueError:\n",
        "            level = record.levelno\n",
        "\n",
        "        # Find the caller (call frame) from which the logged message originated.\n",
        "        frame, depth = sys._getframe(6), 6\n",
        "        while frame and frame.f_code.co_filename == logging.__file__:\n",
        "            frame = frame.f_back\n",
        "            depth += 1\n",
        "\n",
        "        # log the message via loguru with the level, exception info, and depth\n",
        "        loguru_logger.opt(depth=depth, exception=record.exc_info).log(\n",
        "            level, record.getMessage()\n",
        "        )\n",
        "\n",
        "\n",
        "logger_dlt = logging.getLogger(\"dlt\")\n",
        "logger_dlt.addHandler(InterceptHandler())\n",
        "\n",
        "# all logs will be written to dlt_loguru.log\n",
        "loguru_logger.add(\"dlt_loguru.log\");"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "00IQqzbcQQam"
      },
      "outputs": [],
      "source": [
        "# using the loguru-backed logger\n",
        "pipeline = dlt.pipeline(\n",
        "    pipeline_name=\"github_issues_merge_loguru\",\n",
        "    destination=\"duckdb\",\n",
        "    dataset_name=\"github_data_merge\",\n",
        ")\n",
        "load_info = pipeline.run(github_source())\n",
        "print(load_info)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "SzOlc7B7sBLu"
      },
      "source": [
        "## **Logs for monitoring progress**"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "SMOQqbsBwIfR"
      },
      "outputs": [],
      "source": [
        "import dlt\n",
        "\n",
        "dlt.config[\"RUNTIME__LOG_LEVEL\"] = \"WARNING\"\n",
        "\n",
        "\n",
        "# progress=\"log\" periodically logs the progress of the extract, normalize, and load steps\n",
        "pipeline = dlt.pipeline(\n",
        "    pipeline_name=\"github_issues_progress\",\n",
        "    destination=\"duckdb\",\n",
        "    dataset_name=\"github_data_merge\",\n",
        "    progress=\"log\",\n",
        ")\n",
        "load_info = pipeline.run(github_source())\n",
        "print(load_info)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "AH3F46PaJZe4"
      },
      "source": [
        "✅ ▶ Proceed to the [next lesson](https://github.com/dlt-hub/dlt/blob/master/docs/education/dlt-advanced-course/lesson_9_performance_optimisation.ipynb)!"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}