Skip to content

Update reapply #5361

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 52 commits into from
Closed

Update reapply #5361

wants to merge 52 commits into from

Conversation

dandavison
Copy link
Contributor

@dandavison dandavison commented Jan 28, 2024

WIP: not for review
This PR merges into #5360

"go.temporal.io/server/service/history/consts"
"go.temporal.io/server/service/history/shard"
"go.temporal.io/server/service/history/workflow"
wcache "go.temporal.io/server/service/history/workflow/cache"
)

type (
workflowResetReapplyEventsFn func(ctx context.Context, resetMutableState workflow.MutableState) error
workflowResetReapplyEventsFn func(ctx context.Context, resetMutableState workflow.MutableState) ([]*historypb.HistoryEvent, error)
Copy link
Contributor Author

@dandavison dandavison Jan 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ResetWorkflow now needs to know the events that have been reapplied, so that it can advance the update state machine for UpdateRequested events. Several of the changes in this PR (like this one) are passing the slice of reapplied events up the stack from the low level reapplyEvents function to ResetWorkflow.

@dandavison dandavison force-pushed the oss-2147-reapply-types-prep branch from ba58573 to 9a35ddf Compare January 29, 2024 03:40
@dandavison dandavison force-pushed the oss-2147-reapply-types-prep branch from 9a35ddf to 48e0147 Compare February 2, 2024 12:49
@dandavison dandavison force-pushed the update-reapply branch 2 times, most recently from 632bc4d to 3b5c94e Compare February 17, 2024 18:21
@dandavison dandavison force-pushed the oss-2147-reapply-types-prep branch from 48e0147 to e58ee0f Compare February 17, 2024 18:24
@dandavison dandavison force-pushed the update-reapply branch 3 times, most recently from 23e7af5 to e29b4c7 Compare February 17, 2024 23:07
@dandavison dandavison force-pushed the oss-2147-reapply-types-prep branch from e58ee0f to e2a0158 Compare February 17, 2024 23:30
@dandavison dandavison force-pushed the update-reapply branch 5 times, most recently from aa7c8f8 to 1e0beef Compare February 18, 2024 14:26
@dandavison dandavison force-pushed the oss-2147-reapply-types-prep branch from e2a0158 to 2170deb Compare February 18, 2024 23:37
@dandavison dandavison force-pushed the update-reapply branch 2 times, most recently from 09cbc43 to 2109346 Compare February 20, 2024 17:07
@dandavison dandavison force-pushed the oss-2147-reapply-types-prep branch from 2170deb to 6a37404 Compare February 20, 2024 17:07
stephanos and others added 3 commits February 20, 2024 20:23
## What changed?
<!-- Describe what has changed in this PR -->

MutableState's `SetHistoryTree` has an unused parameter. (not used
downstream either)

## Why?

`ctx` often implies "side effect", sort of like the IO Monad, but there
is no side effect here.

## How did you test it?

Compiler.

## Potential risks

I cannot see any. The change can easily be reverted if requirements
change.

## Is hotfix candidate?

No
## What changed?
<!-- Describe what has changed in this PR -->
- Deprecate taskCategoryRegistry usage for DLQ Key on frontend

## Why?
<!-- Tell your future self why have you made these changes -->
- Frontend should not use task category registry. That component should
be used only by history service.

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->
- Existing tests.

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
- No.
## What changed?
<!-- Describe what has changed in this PR -->
- Remove unused persistence API for registering queue readers and
updating it's progress.

## Why?
<!-- Tell your future self why have you made these changes -->
- Those APIs are adding for a deprecated design

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->
- Existing tests

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
ast2023 and others added 2 commits February 21, 2024 08:46
## What changed?
Drop the response from scheduled workflow if the response is too big.

## Why?
If response is too big, the schedule workflow will fail and never
restarted.

## How did you test it?
Unit test.

## Potential risks
No.

## Is hotfix candidate?
No.
## What changed?
- Add PR template reminder to update docs
- Add first d2 diagram to Matching README (more to follow)
- Refactor `docs` folder to contain an `assets` folder where we store
images for all the READMEs in the project, instead of using github
assets

## Why?
To improve how we interact with and update docs

## How did you test it?
n/a

## Potential risks
n/a

## Is hotfix candidate?
no

---------

Co-authored-by: Stephan Behnke <[email protected]>
Co-authored-by: Dan Davison <[email protected]>
yiminc and others added 22 commits March 1, 2024 22:22
## What changed?
Adjust API priorities

## Why?
So that when rate limit happens, it only impacting poller APIs which is
automatically retried.

## How did you test it?
Existing unit tests

## Potential risks
No

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
## What changed?
I've upgraded the version of our Go API in order to allow invalid UTF-8
data in our proto strings.

Please review
5cf9790
only

## Why?
The version of gogo/protobuf we used allowed invalid UTF-8 in strings
which is a violation of the proto3 spec. We were unaware of this until
our new google/protobuf based structs began failing to deserialize data
we'd used previously

## How did you test it?
I added a new end-to-end test to verify that we can create, describe,
and delete a namespace that contains invalid UTF-8 in multiple
attributes

## Potential risks
This is lower risk than commits prior to this as temporal DBs may
already contain invalid UTF-8 data

## Documentation
I adjusted our docs to mention this behavior and the `protolegacy` tag

## Is hotfix candidate?
Yes: this will be going out as part of our 1.23.0 release. In fact, the
release is blocked on it!

---------

Co-authored-by: David Reiss <[email protected]>
## What changed?

Add a hook to Start Workflow, in-between Workflow Execution creation and
its persistence.

## Why?

In preparation of upcoming Update-with-Start changes.

## How did you test it?

The new `withStart` path will be used and tested in subsequent PRs.

The existing path is covered by existing tests.

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
## What changed?
After an update, the schedule workflow goes back to the previous
action/update and recomputes next times from there, instead of from the
current time.

## Why?
Fixes #4866. Fixes the situation where an update happens in between the
nominal time and the jittered time of an action.

## How did you test it?
new unit test
## What changed?
Partial revert of commit #5447 (7f11cae).

## Why?
We're going to implement it in a slightly different way and want to do a
release from main without having to retain this specific workflow logic.

## How did you test it?
existing tests
## What changed?
<!-- Describe what has changed in this PR -->

Removed `println`s from test.

## Why?

Don't need them.

## How did you test it?

Is test.

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
)

## What changed?
<!-- Describe what has changed in this PR -->
Use `historyrequire.HistoryRequire` to assert on histories in tests.

## Why?
<!-- Tell your future self why have you made these changes -->
To make tests more readable and unify history assertions.

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->
Run all tests.

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->
No risks.

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->
No.

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
No.
## What changed?

Expanded the lint exclusion rules to include not just functional tests
but all tests.

## Why?

When using subtests, it's easy to get an unhelpful "cyclomatic
complexity" error.

## How did you test it?

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
## What changed?
<!-- Describe what has changed in this PR -->

(1) Allow to run an Update operation with a custom `WorkflowLease`.
(2) Separate the WorkflowLease-specific step from the post-WorkflowLease
steps.

## Why?

This is a refactoring to enable Update-With-Start. It requires these two
abilities mentioned above.

## How did you test it?

Existing test.

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->

---------

Co-authored-by: Alex Shtin <[email protected]>
## What changed?
Add string validation for a few string fields

## Why?
Since we disable utf8 string validation from proto level, we want to
enforce minimal validation for some key fields.

## How did you test it?
Unit tests

## Potential risks
No

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
## What changed?
Validate normal TQ on sticky poll

## Why?
On sticky poll, we fetch normal TQ's UserData. So an invalid normal TQ
on a sticky queue would still trigger a load for the invalid normal
queue.

## How did you test it?
Manual test with invalid TQ.

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
## What changed?

Added a test for Start Workflow + Terminate-if-Running.

Also refactored the existing test cases to make them more explicit, have
less duplication, less clever looping.

## Why?

Adding a test for Terminate-if-Running behavior for changing its
behavior in an upcoming PR (related to
temporalio/api#359).

## How did you test it?

Is test.

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
This PR is intended to merge 4 months of work in the `nexus` feature
branch into `main`.

The functionality it brings is:
- Dispatching Nexus Tasks by namespace and task queue
- Internal implementation of the Incoming Service Registry - not yet
exposed
- Attaching workflow close callbacks on `StartWorkflowExecutionRequest`
and processing of those callbacks

There's more to come and some of the callback code will be refactored
into a new plugin architecture
(#5446) soon.

---------

Co-authored-by: PJ Doerner <[email protected]>
## What changed?
<!-- Describe what has changed in this PR -->

Mark Pull Requests - but not Issues - as stale after 120 days.

## Why?

Nudge author to either close or resurrect their PR from its slumber.

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->

## Potential risks

None. The "stale" label simply can be removed by the author; no PR is
closed automatically.

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
## What changed?

- clearer distinction between the behavior for closed and running
workflows
- name that isn't particular to workflowIdReusePolicy, but the more
general issue of an "ID duplication"

## Why?

Preparing for changes from temporalio/api#359 by
pulling this refactor in early, reducing the size of the change later.

## How did you test it?

Existing tests + added a test for Terminate-if-Running earlier in
#5483.

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
## What changed?

In description.

## Why?

It's unsafe to use test suite helpers from goroutines. I didn't know
that when I originally wrote these tests.

See test failure here:


https://github.com/temporalio/temporal/actions/runs/8177154337/job/22358369314?pr=5497

## How did you test it?

Ran tests a few times locally.
## What changed?
<!-- Describe what has changed in this PR -->

Moved the non-ringpop, static membership implementation into its own
package.

## Why?
<!-- Tell your future self why have you made these changes -->

First step in making it available to the Temporal dev server. I believe
that'll make the dev server less flaky ("Not enough hosts to serve the
request").

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->

Is tests.

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
## What changed?
Using the new metricDefinition.With approach.

## Why?
The usage of metrics.Handler could be both redundant and unsafe:

```
 handler.Histogram( myHisto.Name(), myHisto.Unit() ).Record(...)
                │             │               └───┐
                │             │                   │
                └─────┐       └───────────┐       │
                      │                   │       │
       Redundant and Unsafe: 'Histogram'  │       │
       is specified despite knowing       │       │
       'myHisto' is a histogram. There's  │       │
       also a risk if 'myHisto.Name()'    │       │
       refers to a metric definition that │       │
       is not a histogram.                │       │
                                          └─── Redundant information
```

And using the new With(Handler) method, there's no need to specify
redundant information.


## How did you test it?
Already exists an ExampleHandler which tests all the new approach
methods.

## Potential risks
People can still use the old approach, because we are keeping
the Handler interface as-is.

## Is hotfix candidate?
No.
## What changed?
<!-- Describe what has changed in this PR -->
Fix execution duration calculation in Visibility

## Why?
<!-- Tell your future self why have you made these changes -->
PR #4961 changed the definition of execution duration by accident.

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
## What changed?
<!-- Describe what has changed in this PR -->
Update GHA actions version

## Why?
<!-- Tell your future self why have you made these changes -->
Old versions use node16 which reached EOL. New versions use node20.

## How did you test it?
<!-- How have you verified this change? Tested locally? Added a unit
test? Checked in staging env? -->

## Potential risks
<!-- Assuming the worst case, what can be broken when deploying this
change to production? -->

## Documentation
<!-- Have you made sure this change doesn't falsify anything currently
stated in `docs/`? If significant
new behavior is added, have you described that in `docs/`? -->

## Is hotfix candidate?
<!-- Is this PR a hotfix candidate or does it require a notification to
be sent to the broader community? (Yes/No) -->
## What changed?
Attempting to configure a MySQL 8 Visibility database with
go-sql-driver/mysql's `interpolateParams` option will now fail.

## Why?

There are a number of reasons why this doesn't work right now so we're
going to prevent it from happening while we discuss how much we should
invest in fixing it. The current work to fix this is in
#5428

## How did you test it?
Added a new test

## Potential risks
There should be none. Any user who has currently set up their database
this way will already have encountered exciting errors like `Cannot
create a JSON value from a string with CHARACTER SET 'binary'` and
`Unable to parse ExecutionStatus value from DB`.

## Is hotfix candidate?

That's up for discussion
…ion (#5506)

## What changed?

`go.temporal.io/server/common/concurrent.Map` is now the canonical
"locking map" implementation for the server (unless there's another one
I'm aware of).

## Why?

We recently found a data race in matching's lockableResultMap where the
wrong lock type was taken during the `pop` operation, causing panics at
run time. It's an easy mistake to make when you're writing multiple
implementations like this, so I made a common one we can reuse.

## How did you test it?

Added new tests and manually looked at all 40 lines of code to ensure
the correct locks were taken.

## Potential risks
None

## Documentation
No

## Is hotfix candidate?
Maybe?
bergundy and others added 4 commits March 11, 2024 12:59
## Notes for reviewers

I don't consider this a final approach but I do think it's a step in
the right direction, we need to model more state machines on top of this
to form a more solid API.

Note that mutable state itself is registered as a state machine in the HSM framework and used as the root node in the HSM tree.

It's a bit annoying that all of the unit tests need to register the workflow state machine but I think this is ended up being the cleanest approach from the HSM framework perspective.

There's still some follow-up work, like collapsing timer tasks and support for interacting with history event, that I'll take on later.

Replication support for the framework will also come at a later time.


- See discussions in related PRs (#5446, #5495)

## What changed?

- Modified the statemachines abstraction to be a bit more generic
- Rewrote callbacks as a plugin using this framework

## Why?

This centralizes most the callback code in the plugin directory instead
of having it spread out the entire project moving common concerns such
as staleness checks, task generation, and (in the future) replication
into a framework and should generally help speed up feature delivery and
maintainability.

I plan to leverage this framework when implementing Nexus operations.

## How did you test it?

Existing tests from the feature branch and added unit tests.
* Create UpdateRequested events during reapply
* Honor update exclude types
* Don't demand that accepted message has request
@dandavison dandavison closed this Mar 12, 2024
@dandavison dandavison deleted the update-reapply branch March 12, 2024 01:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.