-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Update reapply #5361
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update reapply #5361
Conversation
a83a197
to
d8f1bee
Compare
"go.temporal.io/server/service/history/consts" | ||
"go.temporal.io/server/service/history/shard" | ||
"go.temporal.io/server/service/history/workflow" | ||
wcache "go.temporal.io/server/service/history/workflow/cache" | ||
) | ||
|
||
type ( | ||
workflowResetReapplyEventsFn func(ctx context.Context, resetMutableState workflow.MutableState) error | ||
workflowResetReapplyEventsFn func(ctx context.Context, resetMutableState workflow.MutableState) ([]*historypb.HistoryEvent, error) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ResetWorkflow
now needs to know the events that have been reapplied, so that it can advance the update state machine for UpdateRequested
events. Several of the changes in this PR (like this one) are passing the slice of reapplied events up the stack from the low level reapplyEvents
function to ResetWorkflow
.
ba58573
to
9a35ddf
Compare
d8f1bee
to
85e29cb
Compare
9a35ddf
to
48e0147
Compare
43fd533
to
1645bea
Compare
632bc4d
to
3b5c94e
Compare
48e0147
to
e58ee0f
Compare
23e7af5
to
e29b4c7
Compare
e58ee0f
to
e2a0158
Compare
aa7c8f8
to
1e0beef
Compare
e2a0158
to
2170deb
Compare
09cbc43
to
2109346
Compare
2170deb
to
6a37404
Compare
## What changed? <!-- Describe what has changed in this PR --> MutableState's `SetHistoryTree` has an unused parameter. (not used downstream either) ## Why? `ctx` often implies "side effect", sort of like the IO Monad, but there is no side effect here. ## How did you test it? Compiler. ## Potential risks I cannot see any. The change can easily be reverted if requirements change. ## Is hotfix candidate? No
## What changed? <!-- Describe what has changed in this PR --> - Deprecate taskCategoryRegistry usage for DLQ Key on frontend ## Why? <!-- Tell your future self why have you made these changes --> - Frontend should not use task category registry. That component should be used only by history service. ## How did you test it? <!-- How have you verified this change? Tested locally? Added a unit test? Checked in staging env? --> - Existing tests. ## Potential risks <!-- Assuming the worst case, what can be broken when deploying this change to production? --> ## Is hotfix candidate? <!-- Is this PR a hotfix candidate or does it require a notification to be sent to the broader community? (Yes/No) --> - No.
## What changed? <!-- Describe what has changed in this PR --> - Remove unused persistence API for registering queue readers and updating it's progress. ## Why? <!-- Tell your future self why have you made these changes --> - Those APIs are adding for a deprecated design ## How did you test it? <!-- How have you verified this change? Tested locally? Added a unit test? Checked in staging env? --> - Existing tests ## Potential risks <!-- Assuming the worst case, what can be broken when deploying this change to production? --> ## Is hotfix candidate? <!-- Is this PR a hotfix candidate or does it require a notification to be sent to the broader community? (Yes/No) -->
2109346
to
be113b3
Compare
## What changed? Drop the response from scheduled workflow if the response is too big. ## Why? If response is too big, the schedule workflow will fail and never restarted. ## How did you test it? Unit test. ## Potential risks No. ## Is hotfix candidate? No.
## What changed? - Add PR template reminder to update docs - Add first d2 diagram to Matching README (more to follow) - Refactor `docs` folder to contain an `assets` folder where we store images for all the READMEs in the project, instead of using github assets ## Why? To improve how we interact with and update docs ## How did you test it? n/a ## Potential risks n/a ## Is hotfix candidate? no --------- Co-authored-by: Stephan Behnke <[email protected]> Co-authored-by: Dan Davison <[email protected]>
## What changed? Adjust API priorities ## Why? So that when rate limit happens, it only impacting poller APIs which is automatically retried. ## How did you test it? Existing unit tests ## Potential risks No ## Documentation <!-- Have you made sure this change doesn't falsify anything currently stated in `docs/`? If significant new behavior is added, have you described that in `docs/`? --> ## Is hotfix candidate? <!-- Is this PR a hotfix candidate or does it require a notification to be sent to the broader community? (Yes/No) -->
## What changed? I've upgraded the version of our Go API in order to allow invalid UTF-8 data in our proto strings. Please review 5cf9790 only ## Why? The version of gogo/protobuf we used allowed invalid UTF-8 in strings which is a violation of the proto3 spec. We were unaware of this until our new google/protobuf based structs began failing to deserialize data we'd used previously ## How did you test it? I added a new end-to-end test to verify that we can create, describe, and delete a namespace that contains invalid UTF-8 in multiple attributes ## Potential risks This is lower risk than commits prior to this as temporal DBs may already contain invalid UTF-8 data ## Documentation I adjusted our docs to mention this behavior and the `protolegacy` tag ## Is hotfix candidate? Yes: this will be going out as part of our 1.23.0 release. In fact, the release is blocked on it! --------- Co-authored-by: David Reiss <[email protected]>
## What changed? Add a hook to Start Workflow, in-between Workflow Execution creation and its persistence. ## Why? In preparation of upcoming Update-with-Start changes. ## How did you test it? The new `withStart` path will be used and tested in subsequent PRs. The existing path is covered by existing tests. ## Potential risks <!-- Assuming the worst case, what can be broken when deploying this change to production? --> ## Documentation <!-- Have you made sure this change doesn't falsify anything currently stated in `docs/`? If significant new behavior is added, have you described that in `docs/`? --> ## Is hotfix candidate? <!-- Is this PR a hotfix candidate or does it require a notification to be sent to the broader community? (Yes/No) -->
## What changed? After an update, the schedule workflow goes back to the previous action/update and recomputes next times from there, instead of from the current time. ## Why? Fixes #4866. Fixes the situation where an update happens in between the nominal time and the jittered time of an action. ## How did you test it? new unit test
## What changed? <!-- Describe what has changed in this PR --> Removed `println`s from test. ## Why? Don't need them. ## How did you test it? Is test. ## Potential risks <!-- Assuming the worst case, what can be broken when deploying this change to production? --> ## Documentation <!-- Have you made sure this change doesn't falsify anything currently stated in `docs/`? If significant new behavior is added, have you described that in `docs/`? --> ## Is hotfix candidate? <!-- Is this PR a hotfix candidate or does it require a notification to be sent to the broader community? (Yes/No) -->
) ## What changed? <!-- Describe what has changed in this PR --> Use `historyrequire.HistoryRequire` to assert on histories in tests. ## Why? <!-- Tell your future self why have you made these changes --> To make tests more readable and unify history assertions. ## How did you test it? <!-- How have you verified this change? Tested locally? Added a unit test? Checked in staging env? --> Run all tests. ## Potential risks <!-- Assuming the worst case, what can be broken when deploying this change to production? --> No risks. ## Documentation <!-- Have you made sure this change doesn't falsify anything currently stated in `docs/`? If significant new behavior is added, have you described that in `docs/`? --> No. ## Is hotfix candidate? <!-- Is this PR a hotfix candidate or does it require a notification to be sent to the broader community? (Yes/No) --> No.
## What changed? Expanded the lint exclusion rules to include not just functional tests but all tests. ## Why? When using subtests, it's easy to get an unhelpful "cyclomatic complexity" error. ## How did you test it? ## Potential risks <!-- Assuming the worst case, what can be broken when deploying this change to production? --> ## Documentation <!-- Have you made sure this change doesn't falsify anything currently stated in `docs/`? If significant new behavior is added, have you described that in `docs/`? --> ## Is hotfix candidate? <!-- Is this PR a hotfix candidate or does it require a notification to be sent to the broader community? (Yes/No) -->
## What changed? <!-- Describe what has changed in this PR --> (1) Allow to run an Update operation with a custom `WorkflowLease`. (2) Separate the WorkflowLease-specific step from the post-WorkflowLease steps. ## Why? This is a refactoring to enable Update-With-Start. It requires these two abilities mentioned above. ## How did you test it? Existing test. ## Potential risks <!-- Assuming the worst case, what can be broken when deploying this change to production? --> ## Documentation <!-- Have you made sure this change doesn't falsify anything currently stated in `docs/`? If significant new behavior is added, have you described that in `docs/`? --> ## Is hotfix candidate? <!-- Is this PR a hotfix candidate or does it require a notification to be sent to the broader community? (Yes/No) --> --------- Co-authored-by: Alex Shtin <[email protected]>
## What changed? Add string validation for a few string fields ## Why? Since we disable utf8 string validation from proto level, we want to enforce minimal validation for some key fields. ## How did you test it? Unit tests ## Potential risks No ## Documentation <!-- Have you made sure this change doesn't falsify anything currently stated in `docs/`? If significant new behavior is added, have you described that in `docs/`? --> ## Is hotfix candidate? <!-- Is this PR a hotfix candidate or does it require a notification to be sent to the broader community? (Yes/No) -->
## What changed? Validate normal TQ on sticky poll ## Why? On sticky poll, we fetch normal TQ's UserData. So an invalid normal TQ on a sticky queue would still trigger a load for the invalid normal queue. ## How did you test it? Manual test with invalid TQ. ## Potential risks <!-- Assuming the worst case, what can be broken when deploying this change to production? --> ## Documentation <!-- Have you made sure this change doesn't falsify anything currently stated in `docs/`? If significant new behavior is added, have you described that in `docs/`? --> ## Is hotfix candidate? <!-- Is this PR a hotfix candidate or does it require a notification to be sent to the broader community? (Yes/No) -->
## What changed? Added a test for Start Workflow + Terminate-if-Running. Also refactored the existing test cases to make them more explicit, have less duplication, less clever looping. ## Why? Adding a test for Terminate-if-Running behavior for changing its behavior in an upcoming PR (related to temporalio/api#359). ## How did you test it? Is test. ## Potential risks <!-- Assuming the worst case, what can be broken when deploying this change to production? --> ## Documentation <!-- Have you made sure this change doesn't falsify anything currently stated in `docs/`? If significant new behavior is added, have you described that in `docs/`? --> ## Is hotfix candidate? <!-- Is this PR a hotfix candidate or does it require a notification to be sent to the broader community? (Yes/No) -->
This PR is intended to merge 4 months of work in the `nexus` feature branch into `main`. The functionality it brings is: - Dispatching Nexus Tasks by namespace and task queue - Internal implementation of the Incoming Service Registry - not yet exposed - Attaching workflow close callbacks on `StartWorkflowExecutionRequest` and processing of those callbacks There's more to come and some of the callback code will be refactored into a new plugin architecture (#5446) soon. --------- Co-authored-by: PJ Doerner <[email protected]>
## What changed? <!-- Describe what has changed in this PR --> Mark Pull Requests - but not Issues - as stale after 120 days. ## Why? Nudge author to either close or resurrect their PR from its slumber. ## How did you test it? <!-- How have you verified this change? Tested locally? Added a unit test? Checked in staging env? --> ## Potential risks None. The "stale" label simply can be removed by the author; no PR is closed automatically. ## Documentation <!-- Have you made sure this change doesn't falsify anything currently stated in `docs/`? If significant new behavior is added, have you described that in `docs/`? --> ## Is hotfix candidate? <!-- Is this PR a hotfix candidate or does it require a notification to be sent to the broader community? (Yes/No) -->
## What changed? - clearer distinction between the behavior for closed and running workflows - name that isn't particular to workflowIdReusePolicy, but the more general issue of an "ID duplication" ## Why? Preparing for changes from temporalio/api#359 by pulling this refactor in early, reducing the size of the change later. ## How did you test it? Existing tests + added a test for Terminate-if-Running earlier in #5483. ## Potential risks <!-- Assuming the worst case, what can be broken when deploying this change to production? --> ## Documentation <!-- Have you made sure this change doesn't falsify anything currently stated in `docs/`? If significant new behavior is added, have you described that in `docs/`? --> ## Is hotfix candidate? <!-- Is this PR a hotfix candidate or does it require a notification to be sent to the broader community? (Yes/No) -->
## What changed? In description. ## Why? It's unsafe to use test suite helpers from goroutines. I didn't know that when I originally wrote these tests. See test failure here: https://github.com/temporalio/temporal/actions/runs/8177154337/job/22358369314?pr=5497 ## How did you test it? Ran tests a few times locally.
## What changed? <!-- Describe what has changed in this PR --> Moved the non-ringpop, static membership implementation into its own package. ## Why? <!-- Tell your future self why have you made these changes --> First step in making it available to the Temporal dev server. I believe that'll make the dev server less flaky ("Not enough hosts to serve the request"). ## How did you test it? <!-- How have you verified this change? Tested locally? Added a unit test? Checked in staging env? --> Is tests. ## Potential risks <!-- Assuming the worst case, what can be broken when deploying this change to production? --> ## Documentation <!-- Have you made sure this change doesn't falsify anything currently stated in `docs/`? If significant new behavior is added, have you described that in `docs/`? --> ## Is hotfix candidate? <!-- Is this PR a hotfix candidate or does it require a notification to be sent to the broader community? (Yes/No) -->
## What changed? Using the new metricDefinition.With approach. ## Why? The usage of metrics.Handler could be both redundant and unsafe: ``` handler.Histogram( myHisto.Name(), myHisto.Unit() ).Record(...) │ │ └───┐ │ │ │ └─────┐ └───────────┐ │ │ │ │ Redundant and Unsafe: 'Histogram' │ │ is specified despite knowing │ │ 'myHisto' is a histogram. There's │ │ also a risk if 'myHisto.Name()' │ │ refers to a metric definition that │ │ is not a histogram. │ │ └─── Redundant information ``` And using the new With(Handler) method, there's no need to specify redundant information. ## How did you test it? Already exists an ExampleHandler which tests all the new approach methods. ## Potential risks People can still use the old approach, because we are keeping the Handler interface as-is. ## Is hotfix candidate? No.
## What changed? <!-- Describe what has changed in this PR --> Fix execution duration calculation in Visibility ## Why? <!-- Tell your future self why have you made these changes --> PR #4961 changed the definition of execution duration by accident. ## How did you test it? <!-- How have you verified this change? Tested locally? Added a unit test? Checked in staging env? --> ## Potential risks <!-- Assuming the worst case, what can be broken when deploying this change to production? --> ## Documentation <!-- Have you made sure this change doesn't falsify anything currently stated in `docs/`? If significant new behavior is added, have you described that in `docs/`? --> ## Is hotfix candidate? <!-- Is this PR a hotfix candidate or does it require a notification to be sent to the broader community? (Yes/No) -->
## What changed? <!-- Describe what has changed in this PR --> Update GHA actions version ## Why? <!-- Tell your future self why have you made these changes --> Old versions use node16 which reached EOL. New versions use node20. ## How did you test it? <!-- How have you verified this change? Tested locally? Added a unit test? Checked in staging env? --> ## Potential risks <!-- Assuming the worst case, what can be broken when deploying this change to production? --> ## Documentation <!-- Have you made sure this change doesn't falsify anything currently stated in `docs/`? If significant new behavior is added, have you described that in `docs/`? --> ## Is hotfix candidate? <!-- Is this PR a hotfix candidate or does it require a notification to be sent to the broader community? (Yes/No) -->
## What changed? Attempting to configure a MySQL 8 Visibility database with go-sql-driver/mysql's `interpolateParams` option will now fail. ## Why? There are a number of reasons why this doesn't work right now so we're going to prevent it from happening while we discuss how much we should invest in fixing it. The current work to fix this is in #5428 ## How did you test it? Added a new test ## Potential risks There should be none. Any user who has currently set up their database this way will already have encountered exciting errors like `Cannot create a JSON value from a string with CHARACTER SET 'binary'` and `Unable to parse ExecutionStatus value from DB`. ## Is hotfix candidate? That's up for discussion
…ion (#5506) ## What changed? `go.temporal.io/server/common/concurrent.Map` is now the canonical "locking map" implementation for the server (unless there's another one I'm aware of). ## Why? We recently found a data race in matching's lockableResultMap where the wrong lock type was taken during the `pop` operation, causing panics at run time. It's an easy mistake to make when you're writing multiple implementations like this, so I made a common one we can reuse. ## How did you test it? Added new tests and manually looked at all 40 lines of code to ensure the correct locks were taken. ## Potential risks None ## Documentation No ## Is hotfix candidate? Maybe?
be113b3
to
aeea9d1
Compare
## Notes for reviewers I don't consider this a final approach but I do think it's a step in the right direction, we need to model more state machines on top of this to form a more solid API. Note that mutable state itself is registered as a state machine in the HSM framework and used as the root node in the HSM tree. It's a bit annoying that all of the unit tests need to register the workflow state machine but I think this is ended up being the cleanest approach from the HSM framework perspective. There's still some follow-up work, like collapsing timer tasks and support for interacting with history event, that I'll take on later. Replication support for the framework will also come at a later time. - See discussions in related PRs (#5446, #5495) ## What changed? - Modified the statemachines abstraction to be a bit more generic - Rewrote callbacks as a plugin using this framework ## Why? This centralizes most the callback code in the plugin directory instead of having it spread out the entire project moving common concerns such as staleness checks, task generation, and (in the future) replication into a framework and should generally help speed up feature delivery and maintainability. I plan to leverage this framework when implementing Nexus operations. ## How did you test it? Existing tests from the feature branch and added unit tests.
* Create UpdateRequested events during reapply * Honor update exclude types * Don't demand that accepted message has request
aeea9d1
to
dd7e341
Compare
WIP: not for review
This PR merges into #5360