Contributor

@gazi-yestemirova gazi-yestemirova commented Nov 28, 2025

What changed?

  • Updated executor so Stop() sends the final heartbeat with "DRAINING" status. This keeps shard ownership metadata consistent when an executor shuts down. We skip this final draining heartbeat if no previous heartbeats succeeded.
  • Fixed the canary app's shutdown ordering: start the yarpc dispatcher before the executors and stop the executors before the dispatcher so the draining heartbeat can reach shard-distributor during shutdown.
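The first bullet can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `sketchExecutor`, `heartbeatSender`, `recordingSender`, the `beat` method, the status strings, and the one-second timeout are all hypothetical names and values chosen for the example. It only shows the idea of remembering whether any heartbeat succeeded and gating the final DRAINING heartbeat on that.

```go
package main

import (
	"context"
	"sync/atomic"
	"time"
)

// Hypothetical status values; the real code uses typed executor statuses.
const (
	statusActive   = "ACTIVE"
	statusDraining = "DRAINING"
)

// heartbeatSender is a hypothetical stand-in for the shard-distributor client.
type heartbeatSender interface {
	Heartbeat(ctx context.Context, status string) error
}

// sketchExecutor remembers whether any heartbeat has succeeded, so Stop can
// tell whether shard-distributor has ever heard from this executor.
type sketchExecutor struct {
	client       heartbeatSender
	drainTimeout time.Duration
	heartbeatOK  atomic.Bool
}

// newSketchExecutor uses an assumed one-second bound for every heartbeat call.
func newSketchExecutor(client heartbeatSender) *sketchExecutor {
	return &sketchExecutor{client: client, drainTimeout: time.Second}
}

// beat sends one regular heartbeat and records that it succeeded.
func (e *sketchExecutor) beat() error {
	ctx, cancel := context.WithTimeout(context.Background(), e.drainTimeout)
	defer cancel()
	if err := e.client.Heartbeat(ctx, statusActive); err != nil {
		return err
	}
	e.heartbeatOK.Store(true)
	return nil
}

// Stop sends a single DRAINING heartbeat bounded by drainTimeout. It is
// skipped when no earlier heartbeat succeeded. The return value reports
// whether the draining heartbeat was delivered.
func (e *sketchExecutor) Stop() bool {
	if !e.heartbeatOK.Load() {
		return false
	}
	ctx, cancel := context.WithTimeout(context.Background(), e.drainTimeout)
	defer cancel()
	return e.client.Heartbeat(ctx, statusDraining) == nil
}

// recordingSender is a test double that records every status it receives.
type recordingSender struct{ statuses []string }

func (r *recordingSender) Heartbeat(_ context.Context, status string) error {
	r.statuses = append(r.statuses, status)
	return nil
}
```

With a `recordingSender`, calling `Stop()` before any `beat()` records nothing, while `beat()` followed by `Stop()` records an active heartbeat and then a draining one.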

Why?
Shard-distributor needs to be notified when executors shut down so that it can reassign their shards and keep ownership tracking accurate.

How did you test it?
Unit tests and local testing with etcd:
[Screenshot 2025-11-28 at 14 41 26]

Potential risks

Release notes

Documentation Changes

Signed-off-by: Gaziza Yestemirova <[email protected]>
gazi-yestemirova changed the title from "feature: [shard-distributor]Send "draining" heartbeat on executer shutdown" to "feat: [shard-distributor]Send "draining" heartbeat on executer shutdown" on Nov 28, 2025
Signed-off-by: Gaziza Yestemirova <[email protected]>
Comment on lines +68 to +92
func registerExecutorLifecycle(params lifecycleParams) {
	params.Lifecycle.Append(fx.Hook{
		OnStart: func(ctx context.Context) error {
			if err := params.Dispatcher.Start(); err != nil {
				return err
			}
			for _, executor := range params.FixedExecutors {
				executor.Start(ctx)
			}
			for _, executor := range params.EphemeralExecutors {
				executor.Start(ctx)
			}
			return nil
		},
		OnStop: func(ctx context.Context) error {
			for _, executor := range params.FixedExecutors {
				executor.Stop()
			}
			for _, executor := range params.EphemeralExecutors {
				executor.Stop()
			}
			return params.Dispatcher.Stop()
		},
	})
}
Member

It is quite unexpected that the canary needs to explicitly call .Start and .Stop on executors and care about the order.
Does that mean other clients (executors) should do that as well?

"github.com/uber-go/tally"
"go.uber.org/fx"
"go.uber.org/fx/fxtest"
ubergomock "go.uber.org/mock/gomock"
Member

I don't think we need both "github.com/golang/mock/gomock" and "go.uber.org/mock/gomock".
I suggest paying attention every time there is an import alias; in many cases it is questionable.

Comment on lines +345 to +346
if e.heartBeatInterval > 0 && e.heartBeatInterval < drainingHeartbeatTimeout {
return e.heartBeatInterval
Member

if e.heartBeatInterval > 0 { return min(e.heartBeatInterval, drainingHeartbeatTimeout) }

Btw, is it possible that e.heartBeatInterval == 0? Would that mean no heartbeat?
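The suggested simplification could look like the sketch below. It relies on the built-in `min`, which requires Go 1.21 or newer. The function name `drainingTimeout`, the five-second value of `drainingHeartbeatTimeout`, and the fallback for a non-positive interval are assumptions, since the excerpt above shows only the first two lines of the original function.

```go
package main

import "time"

// drainingHeartbeatTimeout is given an assumed value for illustration only.
const drainingHeartbeatTimeout = 5 * time.Second

// drainingTimeout returns the smaller of the heartbeat interval and the
// draining timeout. A non-positive interval falls back to the draining
// timeout; that fallback is an assumption about the surrounding code.
func drainingTimeout(heartBeatInterval time.Duration) time.Duration {
	if heartBeatInterval > 0 {
		return min(heartBeatInterval, drainingHeartbeatTimeout)
	}
	return drainingHeartbeatTimeout
}
```

For positive intervals this returns the same values as the two-condition form in the diff: the interval when it is shorter than the timeout, and the timeout otherwise.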

Comment on lines +39 to +40
assert.Equal(t, "test-namespace", req.Namespace)
assert.Equal(t, "test-executor-id", req.ExecutorID)
Member

nit: Maybe move this part into a shared check? I believe having it in the DRAINING call is just as important.

},
MigrationMode: types.MigrationModeONBOARDED,
}, nil)
callCount := 0
Member

nit: typo in the function name "TestHeartBeartLoop"

Comment on lines +569 to +571
assert.Equal(t, types.ExecutorStatusDRAINING, req.Status)
assert.Empty(t, req.ShardStatusReports)
return &types.ExecutorHeartbeatResponse{}, nil
Member

I'd recommend moving this repeated check into an embedded function like checkDrainingCall.
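One possible shape for such a helper is sketched below. To stay dependency-free it returns an error rather than calling testify's `assert`; in the real tests it would more likely take `*testing.T` and assert directly. `heartbeatRequest` is a hypothetical stand-in for the real heartbeat request type, and the string statuses are placeholders for the typed constants.

```go
package main

import "fmt"

// heartbeatRequest is a hypothetical stand-in for the heartbeat request type
// used in the tests; only the fields checked in the excerpts are modeled.
type heartbeatRequest struct {
	Namespace          string
	ExecutorID         string
	Status             string
	ShardStatusReports map[string]string
}

// checkDrainingCall verifies everything a final draining heartbeat should
// carry: the executor identity, DRAINING status, and no shard reports.
func checkDrainingCall(req *heartbeatRequest, namespace, executorID string) error {
	switch {
	case req.Namespace != namespace:
		return fmt.Errorf("namespace: got %q, want %q", req.Namespace, namespace)
	case req.ExecutorID != executorID:
		return fmt.Errorf("executor ID: got %q, want %q", req.ExecutorID, executorID)
	case req.Status != "DRAINING":
		return fmt.Errorf("status: got %q, want %q", req.Status, "DRAINING")
	case len(req.ShardStatusReports) != 0:
		return fmt.Errorf("expected no shard status reports, got %d", len(req.ShardStatusReports))
	}
	return nil
}
```

Each mock expectation for the draining heartbeat can then call the helper once instead of repeating the individual assertions, and the namespace and executor ID checks come along for free.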

assert.Equal(t, mockShardProcessor, processor)
}

func TestStopWithoutHeartbeatDoesNotSendDraining(t *testing.T) {
Member

Not all the functions use goleak. Let's be consistent.

Comment on lines +621 to +628
executor := &executorImpl[*MockShardProcessor]{
	logger:                 log.NewNoop(),
	shardDistributorClient: mockShardDistributorClient,
	namespace:              "test-namespace",
	executorID:             "test-executor-id",
	metrics:                tally.NoopScope,
	stopC:                  make(chan struct{}),
}
Member

I wonder if all these executor constructions could be moved to something like initTest() with one or two params?
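A rough sketch of such a helper, using simplified stand-in types: `fakeClient`, `testExecutor`, `newTestExecutor`, and `withExecutorID` are hypothetical names, and the real helper would build `executorImpl[*MockShardProcessor]` with the logger and metrics fields as well.

```go
package main

// fakeClient stands in for the generated shard-distributor client mock.
type fakeClient struct{}

// testExecutor mirrors a subset of the fields set in the excerpt above.
type testExecutor struct {
	client     *fakeClient
	namespace  string
	executorID string
	stopC      chan struct{}
}

// executorOption overrides one default for a single test.
type executorOption func(*testExecutor)

// withExecutorID replaces the default executor ID.
func withExecutorID(id string) executorOption {
	return func(e *testExecutor) { e.executorID = id }
}

// newTestExecutor builds an executor with the defaults used across the tests
// and then applies any per-test overrides.
func newTestExecutor(client *fakeClient, opts ...executorOption) *testExecutor {
	e := &testExecutor{
		client:     client,
		namespace:  "test-namespace",
		executorID: "test-executor-id",
		stopC:      make(chan struct{}),
	}
	for _, opt := range opts {
		opt(e)
	}
	return e
}
```

Most tests would call `newTestExecutor(client)` with no options, so the eight-line literal shrinks to a single line and only the unusual cases spell out what differs.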

Comment on lines +117 to +120
mockLifecycle := tt.params.Lifecycle.(*mockLifecycle)

registerExecutorLifecycle(tt.params)
assert.Equal(t, tt.expectedHookCount, mockLifecycle.hookCount)
Member

I would recommend checking the .Start and .Stop calls of the executor.
Otherwise it's not even clear why we compare it with expectedHookCount = 1 both times.
Btw, I wonder if fxtest would help with such testing, since you are essentially testing the uber fx integration.
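As a dependency-free illustration of checking the calls and their order (not a replacement for trying fxtest), the sketch below records every Start and Stop. The `hook` type is a hypothetical stand-in that mirrors the two function fields of `fx.Hook`; `recorder`, `fakeComponent`, `buildHook`, and `runLifecycle` are likewise made up for the example.

```go
package main

import "context"

// hook mirrors the OnStart/OnStop fields of fx.Hook without importing fx.
type hook struct {
	OnStart func(context.Context) error
	OnStop  func(context.Context) error
}

// recorder collects lifecycle events in the order they happen.
type recorder struct{ events []string }

// fakeComponent stands in for both the dispatcher and an executor.
type fakeComponent struct {
	name string
	rec  *recorder
}

func (f fakeComponent) Start() { f.rec.events = append(f.rec.events, f.name+".Start") }
func (f fakeComponent) Stop()  { f.rec.events = append(f.rec.events, f.name+".Stop") }

// buildHook applies the ordering from the PR: dispatcher first on start,
// executors first on stop, so a final heartbeat can still be sent.
func buildHook(dispatcher fakeComponent, executors []fakeComponent) hook {
	return hook{
		OnStart: func(context.Context) error {
			dispatcher.Start()
			for _, e := range executors {
				e.Start()
			}
			return nil
		},
		OnStop: func(context.Context) error {
			for _, e := range executors {
				e.Stop()
			}
			dispatcher.Stop()
			return nil
		},
	}
}

// runLifecycle invokes the hook the way a lifecycle would: start, then stop.
func runLifecycle(h hook) error {
	ctx := context.Background()
	if err := h.OnStart(ctx); err != nil {
		return err
	}
	return h.OnStop(ctx)
}
```

A test built this way asserts on the recorded event sequence, which verifies both that the executors were started and stopped and that the dispatcher outlives them.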
