Update reapply #5361

dandavison · 2024-01-28T19:36:30Z

WIP: not for review
This PR merges into #5360

dandavison · 2024-01-29T03:27:56Z

service/history/ndc/workflow_resetter.go

 	"go.temporal.io/server/service/history/consts"
 	"go.temporal.io/server/service/history/shard"
 	"go.temporal.io/server/service/history/workflow"
 	wcache "go.temporal.io/server/service/history/workflow/cache"
 )

 type (
-	workflowResetReapplyEventsFn func(ctx context.Context, resetMutableState workflow.MutableState) error
+	workflowResetReapplyEventsFn func(ctx context.Context, resetMutableState workflow.MutableState) ([]*historypb.HistoryEvent, error)


ResetWorkflow now needs to know the events that have been reapplied, so that it can advance the update state machine for UpdateRequested events. Several of the changes in this PR (like this one) are passing the slice of reapplied events up the stack from the low level reapplyEvents function to ResetWorkflow.

## What changed?  MutableState's `SetHistoryTree` has an unused parameter. (not used downstream either) ## Why? `ctx` often implies "side effect", sort of like the IO Monad, but there is no side effect here. ## How did you test it? Compiler. ## Potential risks I cannot see any. The change can easily be reverted if requirements change. ## Is hotfix candidate? No

## What changed?  - Deprecate taskCategoryRegistry usage for DLQ Key on frontend ## Why?  - Frontend should not use task category registry. That component should be used only by history service. ## How did you test it?  - Existing tests. ## Potential risks  ## Is hotfix candidate?  - No.

## What changed?  - Remove unused persistence API for registering queue readers and updating it's progress. ## Why?  - Those APIs are adding for a deprecated design ## How did you test it?  - Existing tests ## Potential risks  ## Is hotfix candidate?

## What changed? Drop the response from scheduled workflow if the response is too big. ## Why? If response is too big, the schedule workflow will fail and never restarted. ## How did you test it? Unit test. ## Potential risks No. ## Is hotfix candidate? No.

## What changed? - Add PR template reminder to update docs - Add first d2 diagram to Matching README (more to follow) - Refactor `docs` folder to contain an `assets` folder where we store images for all the READMEs in the project, instead of using github assets ## Why? To improve how we interact with and update docs ## How did you test it? n/a ## Potential risks n/a ## Is hotfix candidate? no --------- Co-authored-by: Stephan Behnke <[email protected]> Co-authored-by: Dan Davison <[email protected]>

## What changed? Adjust API priorities ## Why? So that when rate limit happens, it only impacting poller APIs which is automatically retried. ## How did you test it? Existing unit tests ## Potential risks No ## Documentation  ## Is hotfix candidate?

## What changed? I've upgraded the version of our Go API in order to allow invalid UTF-8 data in our proto strings. Please review 5cf9790 only ## Why? The version of gogo/protobuf we used allowed invalid UTF-8 in strings which is a violation of the proto3 spec. We were unaware of this until our new google/protobuf based structs began failing to deserialize data we'd used previously ## How did you test it? I added a new end-to-end test to verify that we can create, describe, and delete a namespace that contains invalid UTF-8 in multiple attributes ## Potential risks This is lower risk than commits prior to this as temporal DBs may already contain invalid UTF-8 data ## Documentation I adjusted our docs to mention this behavior and the `protolegacy` tag ## Is hotfix candidate? Yes: this will be going out as part of our 1.23.0 release. In fact, the release is blocked on it! --------- Co-authored-by: David Reiss <[email protected]>

## What changed? Add a hook to Start Workflow, in-between Workflow Execution creation and its persistence. ## Why? In preparation of upcoming Update-with-Start changes. ## How did you test it? The new `withStart` path will be used and tested in subsequent PRs. The existing path is covered by existing tests. ## Potential risks  ## Documentation  ## Is hotfix candidate?

## What changed? After an update, the schedule workflow goes back to the previous action/update and recomputes next times from there, instead of from the current time. ## Why? Fixes #4866. Fixes the situation where an update happens in between the nominal time and the jittered time of an action. ## How did you test it? new unit test

## What changed? Partial revert of commit #5447 (7f11cae). ## Why? We're going to implement it in a slightly different way and want to do a release from main without having to retain this specific workflow logic. ## How did you test it? existing tests

## What changed?  Removed `println`s from test. ## Why? Don't need them. ## How did you test it? Is test. ## Potential risks  ## Documentation  ## Is hotfix candidate?

) ## What changed?  Use `historyrequire.HistoryRequire` to assert on histories in tests. ## Why?  To make tests more readable and unify history assertions. ## How did you test it?  Run all tests. ## Potential risks  No risks. ## Documentation  No. ## Is hotfix candidate?  No.

## What changed? Expanded the lint exclusion rules to include not just functional tests but all tests. ## Why? When using subtests, it's easy to get an unhelpful "cyclomatic complexity" error. ## How did you test it? ## Potential risks  ## Documentation  ## Is hotfix candidate?

## What changed?  (1) Allow to run an Update operation with a custom `WorkflowLease`. (2) Separate the WorkflowLease-specific step from the post-WorkflowLease steps. ## Why? This is a refactoring to enable Update-With-Start. It requires these two abilities mentioned above. ## How did you test it? Existing test. ## Potential risks  ## Documentation  ## Is hotfix candidate?  --------- Co-authored-by: Alex Shtin <[email protected]>

## What changed? Add string validation for a few string fields ## Why? Since we disable utf8 string validation from proto level, we want to enforce minimal validation for some key fields. ## How did you test it? Unit tests ## Potential risks No ## Documentation  ## Is hotfix candidate?

## What changed? Validate normal TQ on sticky poll ## Why? On sticky poll, we fetch normal TQ's UserData. So an invalid normal TQ on a sticky queue would still trigger a load for the invalid normal queue. ## How did you test it? Manual test with invalid TQ. ## Potential risks  ## Documentation  ## Is hotfix candidate?

## What changed? Added a test for Start Workflow + Terminate-if-Running. Also refactored the existing test cases to make them more explicit, have less duplication, less clever looping. ## Why? Adding a test for Terminate-if-Running behavior for changing its behavior in an upcoming PR (related to temporalio/api#359). ## How did you test it? Is test. ## Potential risks  ## Documentation  ## Is hotfix candidate?

This PR is intended to merge 4 months of work in the `nexus` feature branch into `main`. The functionality it brings is: - Dispatching Nexus Tasks by namespace and task queue - Internal implementation of the Incoming Service Registry - not yet exposed - Attaching workflow close callbacks on `StartWorkflowExecutionRequest` and processing of those callbacks There's more to come and some of the callback code will be refactored into a new plugin architecture (#5446) soon. --------- Co-authored-by: PJ Doerner <[email protected]>

## What changed?  Mark Pull Requests - but not Issues - as stale after 120 days. ## Why? Nudge author to either close or resurrect their PR from its slumber. ## How did you test it?  ## Potential risks None. The "stale" label simply can be removed by the author; no PR is closed automatically. ## Documentation  ## Is hotfix candidate?

## What changed? - clearer distinction between the behavior for closed and running workflows - name that isn't particular to workflowIdReusePolicy, but the more general issue of an "ID duplication" ## Why? Preparing for changes from temporalio/api#359 by pulling this refactor in early, reducing the size of the change later. ## How did you test it? Existing tests + added a test for Terminate-if-Running earlier in #5483. ## Potential risks  ## Documentation  ## Is hotfix candidate?

## What changed? In description. ## Why? It's unsafe to use test suite helpers from goroutines. I didn't know that when I originally wrote these tests. See test failure here: https://github.com/temporalio/temporal/actions/runs/8177154337/job/22358369314?pr=5497 ## How did you test it? Ran tests a few times locally.

## What changed?  Moved the non-ringpop, static membership implementation into its own package. ## Why?  First step in making it available to the Temporal dev server. I believe that'll make the dev server less flaky ("Not enough hosts to serve the request"). ## How did you test it?  Is tests. ## Potential risks  ## Documentation  ## Is hotfix candidate?

## What changed? Using the new metricDefinition.With approach. ## Why? The usage of metrics.Handler could be both redundant and unsafe: ``` handler.Histogram( myHisto.Name(), myHisto.Unit() ).Record(...) │ │ └───┐ │ │ │ └─────┐ └───────────┐ │ │ │ │ Redundant and Unsafe: 'Histogram' │ │ is specified despite knowing │ │ 'myHisto' is a histogram. There's │ │ also a risk if 'myHisto.Name()' │ │ refers to a metric definition that │ │ is not a histogram. │ │ └─── Redundant information ``` And using the new With(Handler) method, there's no need to specify redundant information. ## How did you test it? Already exists an ExampleHandler which tests all the new approach methods. ## Potential risks People can still use the old approach, because we are keeping the Handler interface as-is. ## Is hotfix candidate? No.

## What changed?  Fix execution duration calculation in Visibility ## Why?  PR #4961 changed the definition of execution duration by accident. ## How did you test it?  ## Potential risks  ## Documentation  ## Is hotfix candidate?

## What changed?  Update GHA actions version ## Why?  Old versions use node16 which reached EOL. New versions use node20. ## How did you test it?  ## Potential risks  ## Documentation  ## Is hotfix candidate?

## What changed? Attempting to configure a MySQL 8 Visibility database with go-sql-driver/mysql's `interpolateParams` option will now fail. ## Why? There are a number of reasons why this doesn't work right now so we're going to prevent it from happening while we discuss how much we should invest in fixing it. The current work to fix this is in #5428 ## How did you test it? Added a new test ## Potential risks There should be none. Any user who has currently set up their database this way will already have encountered exciting errors like `Cannot create a JSON value from a string with CHARACTER SET 'binary'` and `Unable to parse ExecutionStatus value from DB`. ## Is hotfix candidate? That's up for discussion

…ion (#5506) ## What changed? `go.temporal.io/server/common/concurrent.Map` is now the canonical "locking map" implementation for the server (unless there's another one I'm aware of). ## Why? We recently found a data race in matching's lockableResultMap where the wrong lock type was taken during the `pop` operation, causing panics at run time. It's an easy mistake to make when you're writing multiple implementations like this, so I made a common one we can reuse. ## How did you test it? Added new tests and manually looked at all 40 lines of code to ensure the correct locks were taken. ## Potential risks None ## Documentation No ## Is hotfix candidate? Maybe?

## Notes for reviewers I don't consider this a final approach but I do think it's a step in the right direction, we need to model more state machines on top of this to form a more solid API. Note that mutable state itself is registered as a state machine in the HSM framework and used as the root node in the HSM tree. It's a bit annoying that all of the unit tests need to register the workflow state machine but I think this is ended up being the cleanest approach from the HSM framework perspective. There's still some follow-up work, like collapsing timer tasks and support for interacting with history event, that I'll take on later. Replication support for the framework will also come at a later time. - See discussions in related PRs (#5446, #5495) ## What changed? - Modified the statemachines abstraction to be a bit more generic - Rewrote callbacks as a plugin using this framework ## Why? This centralizes most the callback code in the plugin directory instead of having it spread out the entire project moving common concerns such as staleness checks, task generation, and (in the future) replication into a framework and should generally help speed up feature delivery and maintainability. I plan to leverage this framework when implementing Nexus operations. ## How did you test it? Existing tests from the feature branch and added unit tests.

* Create UpdateRequested events during reapply * Honor update exclude types * Don't demand that accepted message has request

dandavison mentioned this pull request Jan 28, 2024

Update reapply temporalio/samples-python#102

Draft

dandavison force-pushed the update-reapply branch from a83a197 to d8f1bee Compare January 29, 2024 03:15

dandavison commented Jan 29, 2024

View reviewed changes

dandavison force-pushed the oss-2147-reapply-types-prep branch from ba58573 to 9a35ddf Compare January 29, 2024 03:40

dandavison force-pushed the update-reapply branch from d8f1bee to 85e29cb Compare January 29, 2024 03:42

dandavison force-pushed the oss-2147-reapply-types-prep branch from 9a35ddf to 48e0147 Compare February 2, 2024 12:49

dandavison force-pushed the update-reapply branch from 43fd533 to 1645bea Compare February 16, 2024 20:34

dandavison mentioned this pull request Feb 16, 2024

Updates for update reapply temporalio/sdk-core#677

Merged

dandavison force-pushed the update-reapply branch 2 times, most recently from 632bc4d to 3b5c94e Compare February 17, 2024 18:21

dandavison force-pushed the oss-2147-reapply-types-prep branch from 48e0147 to e58ee0f Compare February 17, 2024 18:24

dandavison force-pushed the update-reapply branch 3 times, most recently from 23e7af5 to e29b4c7 Compare February 17, 2024 23:07

dandavison force-pushed the oss-2147-reapply-types-prep branch from e58ee0f to e2a0158 Compare February 17, 2024 23:30

dandavison force-pushed the update-reapply branch 5 times, most recently from aa7c8f8 to 1e0beef Compare February 18, 2024 14:26

dandavison force-pushed the oss-2147-reapply-types-prep branch from e2a0158 to 2170deb Compare February 18, 2024 23:37

dandavison force-pushed the update-reapply branch 2 times, most recently from 09cbc43 to 2109346 Compare February 20, 2024 17:07

dandavison force-pushed the oss-2147-reapply-types-prep branch from 2170deb to 6a37404 Compare February 20, 2024 17:07

stephanos and others added 3 commits February 20, 2024 20:23

dandavison force-pushed the update-reapply branch from 2109346 to be113b3 Compare February 20, 2024 23:48

ast2023 and others added 2 commits February 21, 2024 08:46

yiminc and others added 22 commits March 1, 2024 22:22

Partial revert of #5447 (#5480)

c6e1280

## What changed? Partial revert of commit #5447 (7f11cae). ## Why? We're going to implement it in a slightly different way and want to do a release from main without having to retain this specific workflow logic. ## How did you test it? existing tests

dandavison force-pushed the update-reapply branch from be113b3 to aeea9d1 Compare March 11, 2024 18:06

bergundy and others added 4 commits March 11, 2024 12:59

Introduce ResetReapplyExcludeTypes

13a1f43

Reapply updates after reset

755e502

* Create UpdateRequested events during reapply * Honor update exclude types * Don't demand that accepted message has request

make clean-proto && make proto selective commit

dd7e341

dandavison force-pushed the update-reapply branch from aeea9d1 to dd7e341 Compare March 12, 2024 01:20

dandavison closed this Mar 12, 2024

dandavison deleted the update-reapply branch March 12, 2024 01:20

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Update reapply #5361

Update reapply #5361

dandavison commented Jan 28, 2024 •

edited

Loading

Uh oh!

dandavison Jan 29, 2024 •

edited

Loading

Uh oh!

Uh oh!

Update reapply #5361

Update reapply #5361

Conversation

dandavison commented Jan 28, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dandavison Jan 29, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

dandavison commented Jan 28, 2024 •

edited

Loading

dandavison Jan 29, 2024 •

edited

Loading