Prevent clearing topic-partitions that are still assigned #648

Problem
To decrease the impact of rebalances during rolling bounces of k8s pods, we changed `partition.assignment.strategy` from the default `RangeAssignor` to `CooperativeStickyAssignor` (the override is sketched below).
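For context, the override we applied looks roughly like the following. This is a sketch, not our exact config: the `consumer.override.` prefix assumes connector-level client overrides are enabled on the worker (`connector.client.config.override.policy=All`).

```properties
# Connector config (sketch). Switches the sink task's consumer from the
# default RangeAssignor to the cooperative assignor.
consumer.override.partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor
```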
After this change we encountered NPEs and the `S3SinkTask` goes into an unrecoverable state. We did not see the same issue with `StickyAssignor`, however. Example of an NPE (this is with v10.0.7):

Solution
`WorkerSinkTask` always sends a list of `topicPartitions` on `close`. We currently clear all of the assigned `topicPartitionWriters` on `close()`. This worked fine with stop-the-world rebalance strategies like `RangeAssignor` or `StickyAssignor`, since the current assignment would be fully closed. With `CooperativeStickyAssignor`, however, only a few `topicPartitions` may be reassigned/closed, and in that scenario clearing out all `topicPartitionWriters` causes NPEs. I am not sure if there is some historical context I might be missing here and the `.clear()` is deliberate; I could not find clues in the commit history. A sketch of the intended behavior is below.
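To make the intent concrete, here is a minimal sketch of what `close()` should do instead, assuming `topicPartitionWriters` is the map of per-partition writers and `TopicPartitionWriter` exposes a `close()` method (names follow the description above; the actual diff may differ):

```java
import java.util.Collection;
import org.apache.kafka.common.TopicPartition;

// Inside S3SinkTask (sketch). Instead of topicPartitionWriters.clear(), only
// remove and close the writers for the partitions the framework is revoking.
@Override
public void close(Collection<TopicPartition> partitions) {
  for (TopicPartition tp : partitions) {
    // Writers for partitions that remain assigned after a cooperative
    // rebalance are left untouched, so later put()/flush() calls still
    // find them and no NPE is thrown.
    TopicPartitionWriter writer = topicPartitionWriters.remove(tp);
    if (writer != null) {
      writer.close();
    }
  }
}
```

With `RangeAssignor`/`StickyAssignor` the two approaches are equivalent, since every assigned partition is passed to `close()` during a rebalance; the difference only shows up with cooperative rebalancing.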

Test Strategy
Testing done:
Did not specifically write any tests for this case, nor am I aware of any existing tests that exercise assignment strategies. Open to ideas on any necessary tests. We have applied this patch for the past few days and don't see the same degradation.