* Add `batch_id` to jinja context of microbatch batches
* Add changie doc
* Update `format_batch_start` to assume `batch_start` is always provided
* Add "runtime only" property `batch_context` to `ModelNode`
By "runtime only" we mean that the property doesn't exist on the artifact schema
and thus won't be written out to the manifest artifact.
* Begin populating `batch_context` during materialization execution for microbatch batches
* Fix circular import
* Fixup MicrobatchBuilder.batch_id property method
* Ensure MicrobatchModelRunner doesn't double compile batches
We were compiling the node for each batch _twice_. Besides making microbatch
models more expensive than they needed to be, the double compilation wasn't
itself causing any issue. However, the first compilation was happening _before_ we
had added the batch context information to the model node for the batch. This
was causing models which try to access the `batch_context` information on the
model to blow up, which was undesirable. As such, we've now skipped
the first compilation, similar to how SavedQuery nodes skip compilation.
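A minimal sketch of the skip, modeled on how SavedQuery nodes bypass compilation (the override shown here is illustrative, not the exact dbt-core code):

```python
from dbt.task.run import ModelRunner  # assumed import path


class MicrobatchModelRunner(ModelRunner):
    def compile(self, manifest):
        # Skip the up-front whole-node compilation; each batch is compiled
        # individually after its batch context has been added to the node.
        return self.node
```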
* Add `__post_serialize__` method to `BatchContext` to ensure correct dict shape
This is weird, but necessary, I apologize. Mashumaro handles the
dictification of this class via a compile-time generated `to_dict`
method based off of the _typing_ of the class. By default `datetime`
types are converted to strings. We don't want that; we want them to
stay datetimes.
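For illustration, a minimal sketch using mashumaro's documented `__post_serialize__` hook (the field names are assumptions, not necessarily the exact `BatchContext` shape):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict

from mashumaro.mixins.dict import DataClassDictMixin


@dataclass
class BatchContext(DataClassDictMixin):
    id: str
    event_time_start: datetime
    event_time_end: datetime

    def __post_serialize__(self, data: Dict[str, Any]) -> Dict[str, Any]:
        # to_dict() stringifies datetimes; put the originals back so the
        # jinja context sees real datetime objects.
        data["event_time_start"] = self.event_time_start
        data["event_time_end"] = self.event_time_end
        return data
```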
* Update tests to check for `batch_context`
* Update `resolve_event_time_filter` to use new `batch_context`
* Stop testing for batchless compiled code for microbatch models
In 45daec72f4 we stopped an extra compilation
that was happening per batch prior to the batch_context being loaded. Stopping
this extra compilation means that compiled sql for the microbatch model without
the event time filter / batch context is no longer produced. We have discussed
this and _believe_ it is okay given that this is a new node type that has not
hit GA yet.
* Rename `ModelNode.batch_context` to `ModelNode.batch`
* Rename `build_batch_context` to `build_jinja_context_for_batch`
The name `build_batch_context` was confusing as
1) We have a `BatchContext` object, which the method was not building
2) The method builds the jinja context for the batch
As such it felt appropriate to rename the method to more accurately
communicate what it does.
* Rename test macro `invalid_batch_context_macro_sql` to `invalid_batch_jinja_context_macro_sql`
This rename was to make it more clear that the jinja context for a
batch was being checked, as a batch_context has a slightly different
connotation.
* Update changie doc
* Rename `batch_info` to `previous_batch_results`
* Exclude `previous_batch_results` from serialization of model node to avoid jinja context bloat
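A sketch of what such an exclusion can look like via mashumaro field metadata (the surrounding node definition is elided and the base class assumed):

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ModelNode(CompiledNode):  # base class assumed
    # Omitted from to_dict()/serialization so large batch-result payloads
    # never land in the jinja context or the manifest.
    previous_batch_results: Optional[BatchResults] = field(
        default=None, metadata={"serialize": "omit"}
    )
```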
* Drop `previous_batch_results` key from `test_manifest.py` unit tests
In 4050e377ec01c2f14dd9600fe704ddb34adb66fa we began excluding
`previous_batch_results` from the serialized representation of the
ModelNode. As such, we no longer need to check for it in `test_manifest.py`.
* Clean up changelog on main
* Bumping version to 1.10.0a1
* Code quality cleanup
* add 1.8,1.9 link
---------
Co-authored-by: Emily Rockman <emily.rockman@dbtlabs.com>
* Allow `dbt show` and `dbt compile` to output JSON without extra logs
* Add `quiet` attribute for ShowNode and CompiledNode messages
* Output of protoc compiler
* Utilize the `quiet` attribute for ShowNode and CompiledNode
* Reuse the `dbt list` approach when the `--quiet` flag is used
* Use PrintEvent to get to stdout even if the logger is set to ERROR
* Functional tests for quiet compile
* Functional tests for quiet show
* Fire event same way regardless if LOG_FORMAT is json or not
* Switch back to firing ShowNode and CompiledNode events
* Make `--inline-direct` to be quiet-compatible
* Temporarily change to dev branch for dbt-common
* Remove extraneous newline
* Functional test for `--quiet` for `--inline-direct` flag
* Update changelog entry
* Update `core_types_pb2.py`
* Restore the original branch in `dev-requirements.txt`
---------
Co-authored-by: Kshitij Aranke <kshitij.aranke@dbtlabs.com>
This is needed for dbt-core + dbt-adapters to work properly with regard to
the microbatch project flag/behavior flag `require_batched_execution_for_custom_microbatch_strategy`
* first pass: replace os env with project flag
* Fix `TestMicrobatchMultipleRetries` to not use `os.env`
* Turn off microbatch project flag for `TestMicrobatchCustomUserStrategyDefault` as it was prior to a9df50f
* Update `BaseMicrobatchTest` to turn on microbatch via project flags
* Add changie doc
* Fix functional tests after merging in main
* Add function that determines whether the new microbatch functionality should be used
The new microbatch functionality is, unfortunately, potentially dangerous. That is,
it adds a new materialization strategy `microbatch` which an end user could have
defined as a custom strategy previously. Additionally, we added config keys to nodes,
and as `config` is just a `Dict[str, Any]`, it could contain anything, meaning
people could already be using the configs we're adding for different purposes. Thus
we need some intelligent gating. Specifically, something that adheres to the following:
cms  = Custom Microbatch Strategy
abms = Adapter Builtin Microbatch Strategy
bf   = Behavior flag
umb  = Use Microbatch Batching
t/f/e = True/False/Error

| cms | abms | bf | umb |
|-----|------|----|-----|
| t   | t    | t  | t   |
| f   | t    | t  | t   |
| t   | f    | t  | t   |
| f   | f    | t  | e   |
| t   | t    | f  | f   |
| f   | t    | f  | t   |
| t   | f    | f  | f   |
| f   | f    | f  | e   |
(The above table assumes that there is a microbatch model present in the project)
In order to achieve this we need to check that either the microbatch behavior
flag is set to true OR the microbatch materialization being used is the _root_ microbatch
materialization (i.e. not custom). The function we added in this commit,
`use_microbatch_batches`, does just that.
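A hedged sketch of that check (the macro name, flag attribute, and helper signature are assumptions, not the exact dbt-core internals):

```python
def use_microbatch_batches(self, project_name: str) -> bool:
    # Behavior flag on: always use batched execution.
    if self.project_flags.require_batched_execution_for_custom_microbatch_strategy:
        return True
    # Behavior flag off: only use batched execution when the microbatch
    # materialization in play is the builtin one (locality "core"), i.e.
    # the user has not supplied a custom strategy macro.
    candidate = self.find_macro_candidate_by_name(
        name="get_incremental_microbatch_sql",  # assumed macro name
        root_project_name=project_name,
        package=None,
    )
    return candidate is not None and candidate.locality == "core"
```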
* Gate microbatch functionality by `use_microbatch_batches` manifest function
* Rename microbatch behavior flag to `require_batched_execution_for_custom_microbatch_strategy`
* Extract logic of `find_macro_by_name` to `find_macro_candidate_by_name`
In 0349968c615444de05360509ddeaf6d75d41d826 I had done this for the function
`find_materialization_macro_by_name`, but that wasn't the right function to
do it to, and will be reverted shortly. `find_materialization_macro_by_name`
is used for finding the general materialization macro, whereas `find_macro_by_name`
is more general. For the work we're doing, we need to find the microbatch
macro, which is not a materialization macro.
* Use `find_macro_candidate_by_name` to find the microbatch macro
* Fix microbatch macro locality check to search for `core` locality instead of `root`
Previously we were checking for a locality of `root`. However, a locality
of `root` means it was provided by a package. We want to check for a locality
of `core`, which basically means "builtin via dbt-core/adapters". There is
another locality, `imported`, which I believe means it comes from another
package.
* Move the evaluation of `use_microbatch_batches` to the last position in boolean checks
The method `use_microbatch_batches` is always invoked to evaluate an `if`
statement. In most instances, it is part of a logic chain (i.e. there are
multiple things being evaluated in the `if` statement). In `if` statements
where multiple things are being evaluated, `use_microbatch_batches`
should come _last_ (or as late as possible). This is because it is likely
the most costly thing to evaluate in the logic chain, and thus any
short-circuiting via other evaluations in the `if` statement failing (and
thus avoiding invoking `use_microbatch_batches`) is desirable.
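For example (names illustrative), the cheap per-node check goes first so Python's short-circuit evaluation can skip the costlier call:

```python
# Cheap attribute check first; `use_microbatch_batches()` only runs when
# the node is actually a microbatch model.
if node.config.incremental_strategy == "microbatch" and manifest.use_microbatch_batches():
    batches = build_batches(node)
```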
* Drop behavior flag setting for BaseMicrobatchTest tests
* Rename 'env_var' to 'project_flag' in test_microbatch.py
* Update microbatch tests to assert when we are/aren't running with batches
* Update `test_resolve_event_time_filter` to use `use_microbatch_batches`
* Fire deprecation warning for custom microbatch macros
* Add microbatch deprecation events to test_events.py
---------
Co-authored-by: Quigley Malcolm <quigley.malcolm@dbtlabs.com>
* Add new `ArtifactWritten` event
* Emit ArtifactWritten event whenever an artifact is written
* Get artifact_type from class name for ArtifactWritten event
* Add changie docs
* Add test to check that ArtifactWritten events are being emitted
* Regenerate core_types_pb2.py using correct protobuf version
* Regen core_types_pb2 again, using a more correct protoc version
* Add unit tests to check how `safe_run_hooks` handles exceptions
* Improve exception handling in `get_execution_status`
Previously in `get_execution_status`, if a non-`DbtRuntimeError` exception was
raised, the `finally` would be entered, but `status`/`message` would not be
set, and thus a `status not defined` exception would get raised on attempting
to return. Tangentially, there is another issue where somehow the `node_status`
is becoming `None`. In all my playing with `get_execution_status` I found that
trying to return an undefined variable in the `finally` caused an undefined
variable exception. However, if some Python version instead just handed
back `None`, then this fix should also solve that.
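A simplified sketch of the shape of the fix (not the exact dbt-core function body; import paths and the response attribute are assumptions): broaden the except so `status` and `message` are always bound before the return.

```python
from typing import Tuple

from dbt.artifacts.schemas.results import RunStatus  # assumed import path
from dbt_common.exceptions import DbtRuntimeError


def get_execution_status(sql: str, adapter) -> Tuple[RunStatus, str]:
    try:
        _, cursor = adapter.connections.add_query(sql, auto_begin=False)
        status = RunStatus.Success
        message = adapter.connections.get_response(cursor)._message
    except DbtRuntimeError as exc:
        status = RunStatus.Error
        message = exc.msg
    except Exception as exc:
        # Previously only DbtRuntimeError was handled, leaving
        # status/message unbound when anything else was raised.
        status = RunStatus.Error
        message = str(exc)
    return status, message
```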
* Add changie docs
* Ensure run_results get written if KeyboardInterrupt happens during end run hooks
* Bump minimum dbt-adapters to 1.8.0
In https://github.com/dbt-labs/dbt-core/pull/10859 we started using the
`get_adapter_run_info` method provided by `dbt-adapters`. However that
function is only available in dbt-adapters >= 1.8.0. Thus 1.8.0 is our
new minimum for dbt-adapters.
* Add changie doc
* Add function to MicrobatchBuilder to get ceiling of timestamp by batch_size
* Update `MicrobatchBuilder.build_end_time` to use `ceiling_timestamp`
* fix TestMicrobatchBuilder.test_build_end_time by specifying a BatchSize + asserting actual is a ceiling timestamp
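A hedged sketch of the ceiling helper, reusing the existing truncate helper mentioned later in this log (exact signatures and the string batch sizes are assumptions):

```python
from datetime import datetime, timedelta


def ceiling_timestamp(timestamp: datetime, batch_size: str) -> datetime:
    """Round up to the next hour/day/month/year boundary, unless the
    timestamp already sits exactly on one."""
    truncated = MicrobatchBuilder.truncate_timestamp(timestamp, batch_size)
    if timestamp == truncated:
        return timestamp
    if batch_size == "hour":
        return truncated + timedelta(hours=1)
    if batch_size == "day":
        return truncated + timedelta(days=1)
    if batch_size == "month":
        # roll December over into January of the next year
        year, month = divmod(truncated.month, 12)
        return truncated.replace(year=truncated.year + year, month=month + 1)
    if batch_size == "year":
        return truncated.replace(year=truncated.year + 1)
    raise ValueError(f"Unknown batch_size: {batch_size}")
```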
* Add changie
---------
Co-authored-by: Michelle Ark <michelle.ark@dbtlabs.com>
* Stop validating that `--event-time-start` is before "current" time
In the next commit we'll be adding a validation that requires that `--event-time-start`
and `--event-time-end` are mutually required. That is, whenever one is specified,
the other is required. In that world, `--event-time-start` will never need to be compared
against the "current" time, because it'll never be run in conjunction with the "current"
time.
* Validate that `--event-time-start` and `--event-time-end` are mutually present
* Add changie doc for validation changes
* Alter functional microbatch tests to work with updated `event_time_start/end` reqs
We made it such that when `event_time_start` is specified, `event_time_end` must also
be specified (and vice versa). This broke numerous tests, in a few different ways:
1. There were tests that used `--event-time-start` without `--event-time-end` but
were using event_time_start essentially as the `begin` time for models being initially
built or full refreshed. These tests could simply drop the `--event-time-start` and
instead rely on the `begin` value.
2. There was a test that was trying to load a subset of the data _excluding_ some
data which would be captured by using `begin`. In this test we added an appropriate
`--event-time-end` as the `--event-time-start` was necessary to satisfy what the
test was testing
3. There was a test which was trying to ensure that two microbatch models would be
given the same "current" time. Because we wanted to ensure the "current" time code
path was used, we couldn't add `--event-time-end` to resolve the problem, thus we
needed to remove the `--event-time-start` that was being used. However, this led to
the test being incredibly slow. This was resolved by switching the relevant microbatch
models from having `batch_size`s of `day` to instead have `year`. This solution should
be good enough for roughly 40 years? We'll figure out a better solution then, so see ya
in 2064. Assuming I haven't died before my 70th birthday, feel free to ping me to get
this taken care of.
---------
Co-authored-by: Michelle Ark <michelle.ark@dbtlabs.com>
* Add adapter telemetry to snowplow event.
* Temporary dev branch switch.
* Set tracking for overrideable adapter method.
* Do safer adapter ref.
* Improve comment.
* Code review comments.
* Don't call the asdict on a dict.
* Bump ci to pull in fix from base adapter.
* Add unit tests for coverage.
* Update field name from base adapter/schema change.
* remove breakpoint.
* Change `lookback` default from `0` to `1`
* Regen jsonschema manifest v12 to include `lookback` default change
* Regen saved state of v12 manifest for functional artifact testing
* Add changie doc for lookback default change
* Avoid a KeyError if `child_unique_id` is not found in the dictionary
* Changelog entry
* Functional test when an exposure references a deprecated model
dbt-adapters updated the incremental_strategy validation of incremental models such that
the validation now _always_ happens when an incremental model is executed. A test in dbt-core
`TestMicrobatchCustomUserStrategyEnvVarTrueInvalid` was previously set to _expect_ buggy behavior
where an incremental model would succeed on its "first"/"refresh" run even if it had an invalid
incremental strategy. Thus we needed to update this test in dbt-core to expect the now correct
behavior of incremental model execution time validation
* [Tidy-First]: Fix `timings` object for hooks and macros, and make types of timings explicit
* cast literal to str
* change test
* change jsonschema to enum
* Discard changes to schemas/dbt/manifest/v12.json
* nits
---------
Co-authored-by: Chenyu Li <chenyu.li@dbtlabs.com>
* Add `order_by` and `limit` fields to saved queries.
* Update JSON schema
* Add change log for #10531.
* Check order by / limit in saved-query parsing test.
* Add test that checks microbatch models are all operating with the same `current_time`
* Set an `invocated_at` on the `RuntimeConfig` and plumb to `MicrobatchBuilder`
* Add changie doc
* Rename `invocated_at` to `invoked_at`
* Simplify conditional logic for setting MicrobatchBuilder.batch_current_time
* Rename `batch_current_time` to `default_end_time` for MicrobatchBuilder
* Begin testing that microbatch execution times are being tracked and set
* Begin tracking the execution time of batches for microbatch models
* Add changie doc
* Additional assertions in microbatch testing
* Validate that `event_time_start` is before `event_time_end` when passed from CLI
Sometimes CLI options have restrictions based on other CLI options. This is the case
for `--event-time-start` and `--event-time-end`. Unfortunately, click doesn't provide
a good way of validating this, at least not that I found. Additionally, I'm not sure
if we have had anything like this previously. In any case, I couldn't find a
centralized validation area for such occurrences. Thus I've gone and added one,
`validate_option_interactions`. Long term, if more validations are added, we should
add this wrapper to each CLI command. For now I've only added it to the commands that
support `event_time_start` and `event_time_end`, specifically `build` and `run`.
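A hedged sketch of what such a wrapper can look like (the name comes from this commit; the body and placement are illustrative):

```python
import functools

import click


def validate_option_interactions(func):
    """Decorate a click command callback with cross-option checks."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = kwargs.get("event_time_start")
        end = kwargs.get("event_time_end")
        if start is not None and end is not None and start >= end:
            raise click.BadParameter(
                "`--event-time-start` must be before `--event-time-end`"
            )
        return func(*args, **kwargs)

    return wrapper
```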
* Add changie doc
* If `--event-time-end` is not specified, ensure `--event-time-start` is less than the current time
* Fixup error message about event_time_start and event_time_end
* Move logic to validate `event_time` cli flags to `flags.py`
* Update validation of `--event-time-start` against `datetime.now` to use UTC
* When retrying microbatch models, propagate prior successful state
* Changie doc for microbatch dbt retry fixes
* Fix test_manifest unit tests for batch_info key
* Add functional test for when a microbatch model has multiple retries
* Add comment about when batch_info will be something other than None
* Add tests to check how microbatch models respect `full_refresh` model configs
* Fix `_is_incremental` to properly respect `full_refresh` model config
In dbt-core, it is generally expected that values passed via CLI flags take
precedence over model level configs. However, `full_refresh` on a model is an
exception to this rule, wherein the model config takes precedence. This
config exists specifically to _prevent_ accidental full refreshes of large
incremental models, as doing so can be costly. **_It is actually best
practice_** to set `full_refresh=False` on incremental models.
Prior to this commit, for microbatch models, the above was not happening. The
CLI flag `--full-refresh` was taking precedence over the model config
`full_refresh`. That meant that if `--full-refresh` was supplied, then the
microbatch model **_would full refresh_** even if `full_refresh=False` was
set on the model. This commit solves that problem.
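An illustrative reduction of the precedence rule (names assumed; the real check lives in the microbatch `_is_incremental` handling):

```python
from typing import Optional


def should_full_refresh(cli_full_refresh: bool, model_full_refresh: Optional[bool]) -> bool:
    # An explicit model-level full_refresh config always wins; the CLI
    # flag only applies when the model leaves it unset (None).
    if model_full_refresh is not None:
        return model_full_refresh
    return cli_full_refresh
```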
* Add changie doc for microbatch `full_refresh` config handling
* Add `PartialSuccess` status type and use it for microbatch models with mixed results
* Handle `PartialSuccess` in `interpret_run_result`
* Add `BatchResults` object to `BaseResult` and begin tracking during microbatch runs
* Ensure batch_results being propagated to `run_results` artifact
* Move `batch_results` from `BaseResult` class to `RunResult` class
* Move `BatchResults` and `BatchType` to separate artifacts file to avoid circular imports
In our next commit we're gonna modify `dbt/contracts/graph/nodes.py` to import the
`BatchType` as part of our work to implement dbt retry for microbatch model nodes.
Unfortunately, the import in `nodes.py` creates a circular dependency because
`dbt/artifacts/schemas/results.py` imports from `nodes.py` and `dbt/artifacts/schemas/run/v5/run.py`
imports from that `results.py`. Now this _shouldn't_ be necessary, as nothing in artifacts
should import from the rest of dbt-core. However, we do. We should fix this, but that is
out of scope for this segment of work.
* Add `PartialSuccess` as a retry-able status, and use batches to retry microbatch models
* Fix BatchType type so that the first datetime is no longer Optional
* Ensure `PartialSuccess` causes skipping of downstream nodes
* Alter `PartialSuccess` status to be considered an error in `interpret_run_result`
* Update schemas and test artifacts to include new batch_results run results key
* Add functional test to check that 'dbt retry' retries 'PartialSuccess' models
* Update partition failure test to assert downstream models are skipped
* Improve `success`/`error`/`partial success` messaging for microbatch models
* Include `PartialSuccess` in status that `--fail-fast` counts as a failure
* Update `LogModelResult` to handle partial successes
* Update `EndOfRunSummary` to handle partial successes
* Cleanup TODO comment
* Raise a DbtInternalError if we get a batch run result without `batch_results`
* When running a microbatch model with supplied batches, force non full-refresh behavior
This is necessary because of retry. Say on the initial run the microbatch model
succeeds on 97% of its batches. Then on retry it does the last 3%. If the retry
of the microbatch model executes in full refresh mode it _might_ blow away the
97% of work that has been done. This edge case seems to be adapter specific.
* Only pass batches to retry for microbatch model when there was a PartialSuccess
In the previous commit we made it so that retries of microbatch models wouldn't
run in full refresh mode when the microbatch model to retry has batches already
specified from the prior run. This is only problematic when the run being retried
was a full refresh AND all the batches for a given microbatch model failed. In
that case WE DO want to do a full refresh for the given microbatch model. To better
outline the problem, consider the following:
* a microbatch model had a begin of `2020-01-01` and has been running this way for a while
* the begin config has changed to `2024-01-01` and `dbt run --full-refresh` gets run
* every batch for the microbatch model fails
* on dbt retry, the relation is said to exist, and the now out-of-range data (2020-01-01 through 2023-12-31) is never purged
To avoid this, all we have to do is ONLY pass the batch information for partially successful microbatch
models. Note: microbatch models only have a partially successful status IFF they have both
successful and failed batches.
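A hedged sketch of the gating in retry (attribute names assumed):

```python
def batches_for_retry(previous_result):
    # Only partially successful microbatch models carry their batch state
    # into the retry; fully failed runs start from scratch so a retried
    # --full-refresh can still rebuild the relation from `begin`.
    if previous_result.status == RunStatus.PartialSuccess:
        return previous_result.batch_results
    return None
```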
* Fix test_manifest unit tests to know about model 'batches' key
* Add some console output assertions to microbatch functional tests
* add batch_results: None to expected_run_results
* Add changie doc for microbatch retry functionality
* maintain protoc version 5.26.1
* Cleanup extraneous comment in LogModelResult
---------
Co-authored-by: Michelle Ark <michelle.ark@dbtlabs.com>
* Test case for `merge_exclude_columns`
* Update expected output for `merge_exclude_columns`
* Skip TestMergeExcludeColumns test
* Enable this test since PostgreSQL 15+ is available in CI now
* Undo modification to expected output
* Remove duplicated constructor for `ResourceTypeSelector`
* Add type annotation for `ResourceTypeSelector`
* Standardize on constructor for `ResourceTypeSelector` where `include_empty_nodes=True`
* Changelog entry
* Adding logic to TestSelector to remove unit tests if they are in excluded_resource_types
* Adding change log
* Respect `--resource-type` and `--exclude-resource-type` CLI flags and associated environment variables
* Test CLI flag for excluding unit tests for the `dbt test` command
* Satisfy isort pre-commit hook
* Fix mypy for positional argument "resource_types" in call to "TestSelector"
* Replace `TestSelector` with `ResourceTypeSelector`
* Add co-author
* Update changelog description
* Add functional tests for new feature
* Compare the actual results, not just the count
* Remove test case covered elsewhere
* Test for `DBT_EXCLUDE_RESOURCE_TYPES` environment variable for `dbt test`
* Update per pre-commit hook
* Restore to original form (until we refactor extraneous `ResourceTypeSelector` references later)
---------
Co-authored-by: Matthew Cooper <asimov.1st@gmail.com>
* initial rough-in with CLI flags
* dbt-adapters testing against event-time-ref-filtering
* fix TestList
* Checkpoint
* fix tests
* add event_time_start params to build
* rename configs
* Gate resolve_event_time_filter via micro batch strategy and fix strptime usage
* Add unit test for resolve_event_time_filter
* Additional unit tests for `resolve_event_time_filter` to ensure lookback + batch_size work
* Remove extraneous comments and print statements from resolve_event_time_filter
* Fixup microbatch functional tests to use microbatch strategy
* Gate microbatch functionality behind env_var while in beta
* Add comment about how _is_incremental should be removed
* Improve `event_time_start/end` cli parameters to auto convert to datetime objects
* for testing: dbt-postgres 'microbatch' strategy
* rough in: chunked backfills
* partial failure of microbatch runs
* decouple run result methods
* initial refactor
* rename configs to __dbt_internal
* update compiled_code in context after re-compilation
* finish rename of context vars
* changelog entry
* fix patch_microbatch_end_time
* refactor into MicrobatchBuilder
* fix provider unit tests + add unit tests for MicrobatchBuilder
* add TestMicrobatchJinjaContextVarsAvailable
* unit test offset + truncate timestamp methods
* Remove pairing.md file
* Add typing to microbatch-specific functions added in `task/run.py`
* Add doc strings to microbatch.py functions and classes
* Set microbatch node status to `ERROR` if all batches for node failed
* Fire an event for batch exceptions instead of directly printing
* Fix firing of failed microbatch log event
---------
Co-authored-by: Quigley Malcolm <quigley.malcolm@dbtlabs.com>
* Update functional tests to cover this case
* Revert "Update functional tests to cover this case"
This reverts commit 4c78e816f6c370b7a39df774f2fab7328db5312b.
* New functional tests to cover the resource_type config
* Separate data tests from unit tests for `resource_types` config of `dbt list` and `dbt build`
* Changelog entry
* Add functional tests for custom incremental strategies named 'microbatch'
* Point dev-requirement of `dbt-adapters` back to the main branch
The associated branch/PR in `dbt-adapters` that we were previously
pointing to has been merged. Thus we can point back to `main` again.
---------
Co-authored-by: Quigley Malcolm <quigley.malcolm@dbtlabs.com>
Non-breaking changes to artifact schemas require an update to the corresponding jsonschemas published to [schemas.getdbt.com](https://schemas.getdbt.com), which are defined in https://github.com/dbt-labs/schemas.getdbt.com. To do so:
Note this must be done AFTER the core pull request is merged, otherwise we may end up with unresolvable conflicts and schemas that are invalid prior to base pull request merge. You may create the schemas.getdbt.com pull request prior to merging the base pull request, but do not merge until afterward.
1. Create a PR in https://github.com/dbt-labs/schemas.getdbt.com which reflects the schema changes to the artifact. The schema can be updated in-place for non-breaking changes. Example PR: https://github.com/dbt-labs/schemas.getdbt.com/pull/39
2. Merge the https://github.com/dbt-labs/schemas.getdbt.com PR
3. Observe the `Artifact Schema Check` CI check pass on the `dbt-core` PR that updates the artifact schemas, and merge the `dbt-core` PR!
Note: Although `jsonschema` validation using the schemas in [schemas.getdbt.com](https://schemas.getdbt.com) is not encouraged or formally supported, `jsonschema` validation should still continue to work once the schemas are updated because they are forward-compatible and can therefore be used to validate previous minor versions of the schema.