Prevent clearing topic-partitions that are still assigned #648

SatyaKuppam · 2023-05-05T19:53:34Z

Problem

To decrease the impact of rebalances during rolling bounces of k8s pods, we changed the partition.assignment.strategy from the default RangeAssignor to CooperativeStickyAssignor. After this change we encountered NPEs and the S3SinkTask goes into an unrecoverable state. We did not find the same issue with StickyAssignor however.

Example of an NPE (this is with v10.0.7):

"org.apache.kafka.connect.errors.ConnectException: Exiting WorkerSinkTask\
      \ due to unrecoverable exception.\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:611)\n\
      \tat org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:333)\n\
      \tat org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:234)\n\
      \tat org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:203)\n\
      \tat org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:188)\n\
      \tat org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:243)\n\t\
      at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)\n\
      \tat java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)\n\t\
      at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\n\
      \tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\n\
      \tat java.base/java.lang.Thread.run(Thread.java:829)\nCaused by: java.lang.NullPointerException\n\
      \tat io.confluent.connect.s3.S3SinkTask.put(S3SinkTask.java:225)\n\tat org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:581)\n\
      \t... 10 more\n"

Solution

WorkerSinkTask always sends a list of topicPartitions on close. We currently clear all the assigned topicPartitionWriters on close(). This worked fine with stop-the-world rebalance strategies like RangeAssignor or StickyAssignor, since the current assignment would be fully closed. But with CooperativeStickyAssignor only a few topicPartitions could be reassigned/closed. In such a scenario clearing out all topicPartitionWriters is causing NPEs.

I am not sure if there is some historical context that I might be missing here and the .clear() is deliberate, could not find clues from commit history.

Test Strategy

Testing done:

Did not specifically write any tests for this case, nor am I aware of any existing tests that test assignment strategies. Open to ideas on any necessary tests. We have applied this path for the past few days and dont see the same degradation.

Unit tests
Integration tests
System tests
Manual tests

SatyaKuppam · 2023-05-05T20:34:48Z

Jenkins job failing ITs because of IAM perm issues.

https://jenkins.public.confluent.io/job/kafka-connect-storage-cloud/job/PR-648/2/console

subhashiyer9 · 2024-10-14T09:52:38Z

@SatyaKuppam Thanks for changes. The changes looks good to me
Can you help rebase the changes against oldest branch 10.0.x and update the PR?
If the builds are failing even after the same, I will take a look.

SatyaKuppam added 2 commits May 5, 2023 15:08

Prevent clearing topic-partitions that are still assigned

007cd95

typo

a1ebe57

SatyaKuppam marked this pull request as ready for review May 5, 2023 20:00

SatyaKuppam requested a review from a team as a code owner May 5, 2023 20:01

SatyaKuppam mentioned this pull request May 5, 2023

Prevent clearing topic-partitions that are still assigned during a rebalance #649

Open

mhmcdonald mentioned this pull request Feb 3, 2025

feat: prevent clearing assigned topic partitions #818

Open

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Prevent clearing topic-partitions that are still assigned #648

Prevent clearing topic-partitions that are still assigned #648

Uh oh!

SatyaKuppam commented May 5, 2023 •

edited

Loading

Uh oh!

SatyaKuppam commented May 5, 2023

Uh oh!

subhashiyer9 commented Oct 14, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Prevent clearing topic-partitions that are still assigned #648

Are you sure you want to change the base?

Prevent clearing topic-partitions that are still assigned #648

Uh oh!

Conversation

SatyaKuppam commented May 5, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Solution

Test Strategy

Testing done:

Uh oh!

SatyaKuppam commented May 5, 2023

Uh oh!

subhashiyer9 commented Oct 14, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

SatyaKuppam commented May 5, 2023 •

edited

Loading