
Conversation

@garyrussell (Contributor)

Resolves #2195

Add an option to avoid seeks after handling exceptions.

Instead, pause the consumer for one poll() and use the remaining records as the
result of that poll.

New methods on CommonErrorHandler - handleOne for record listeners, returning
a boolean to indicate whether the record was recovered and should not be redelivered.

handleBatchAndReturnRemaining for batch listeners, returning either the complete
set or a subset, e.g. when the DEH receives a BatchListenerExecutionFailedException
and commits a partial batch.

Also includes the classifier refactoring discussed here
#2185 (comment)

The new logic is disabled by default; we can consider enabling it in 3.0 and removing
the deprecations.
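The no-seek strategy described above can be illustrated with a minimal, self-contained sketch. This is plain Java with invented names (NoSeekReplaySketch, invoke, poll, the "recovered" flag standing in for handleOne's boolean result) - it is not the actual spring-kafka implementation, just a model of the idea: on an unrecovered failure, retain the failed record plus the rest of the batch and serve them as the result of the next poll(), instead of seeking back on the broker.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical illustration of "pause for one poll()" error handling;
// all names here are inventions for this sketch, not spring-kafka API.
public class NoSeekReplaySketch {

    static List<String> pending = null;   // records retained after an error
    static boolean paused = false;
    static List<String> processed = new ArrayList<>();

    // Stand-in for Consumer#poll(): serves the retained records when present.
    static List<String> poll(Deque<String> source) {
        if (pending != null) {
            List<String> replay = pending;  // replay instead of re-fetching
            pending = null;
            paused = false;                 // "resume" after one poll
            return replay;
        }
        List<String> batch = new ArrayList<>();
        for (int i = 0; i < 3 && !source.isEmpty(); i++) {
            batch.add(source.poll());
        }
        return batch;
    }

    // Process a batch; on an unrecovered failure (the equivalent of handleOne
    // returning false), keep the failed record plus the remainder - no seek.
    static void invoke(List<String> batch, String failOn, boolean recovered) {
        for (int i = 0; i < batch.size(); i++) {
            String rec = batch.get(i);
            if (rec.equals(failOn) && !recovered) {
                pending = new ArrayList<>(batch.subList(i, batch.size()));
                paused = true;
                return;
            }
            processed.add(rec);
        }
    }

    public static void main(String[] args) {
        Deque<String> topic = new ArrayDeque<>(List.of("r1", "r2", "r3", "r4"));
        invoke(poll(topic), "r2", false);   // r2 fails, not recovered
        List<String> replayed = poll(topic); // replays [r2, r3], no seek
        System.out.println(replayed);
        invoke(replayed, "r2", true);        // recovered this time
        invoke(poll(topic), null, true);     // remaining record r4
        System.out.println(processed);
    }
}
```

The boolean returned by the record-listener handler is what decides between committing past the record (recovered) and replaying it (not recovered).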

@artembilan (Member) left a comment

This looks too big to me to be considered for back-porting.
Why not just target the fix for 3.0 and be free of possible breaking changes?

Thanks

@garyrussell (Contributor, Author)

I agree; maybe this is the final driver to create a 2.9 branch, which we can release sooner (May).

There is a big performance problem when using the retryable topic - see the discussions on the issue.

@tomazfernandes (Contributor)

@garyrussell, I began looking into this today and should continue tomorrow - there's a lot going on in parts of the code I'm not that familiar with. Overall I think the solution looks great; I just want to get a better grasp of it.

Thanks

@artembilan (Member) left a comment

Do we need to add anything to the docs for this change?

Thanks

@artembilan (Member)

So, is this OK now for merging?

@garyrussell (Contributor, Author)

Let @tomazfernandes have a few more days to review; maybe merge on Monday?

@tomazfernandes (Contributor)

That would work nicely for me, thanks!

@garyrussell (Contributor, Author)

We still need to seek on the retry containers because the delay is enabled by pausing the partitions.

I guess you should only configure this on the main container and use a small max.poll.records on the retry containers.
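The suggestion above - keep the no-seek option on the main container only, and bound how much a retry container loses to a seek by using a small max.poll.records there - can be sketched as plain consumer property maps. The property key is the standard Kafka consumer key; the surrounding container setup is omitted, and the chosen values are illustrative only.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative consumer properties: the retry containers still seek after a
// KafkaBackOffException, so small polls keep the discarded remainder cheap.
public class RetryContainerProps {

    static Map<String, Object> mainConsumerProps() {
        Map<String, Object> props = new HashMap<>();
        props.put("max.poll.records", 500);  // normal batch size; no seeks here
        return props;
    }

    static Map<String, Object> retryConsumerProps() {
        Map<String, Object> props = new HashMap<>();
        // small batches: a seek after a back off re-fetches very little
        props.put("max.poll.records", 5);
        return props;
    }

    public static void main(String[] args) {
        System.out.println(mainConsumerProps().get("max.poll.records"));
        System.out.println(retryConsumerProps().get("max.poll.records"));
    }
}
```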

@garyrussell (Contributor, Author)

It would add a little more complexity (in the container), but maybe we could come up with a hybrid approach - perform the seeks if the listener throws a KafkaBackOffException and use the new approach for all other exceptions.

But that can be another PR.
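The hybrid idea can be reduced to a small dispatch decision. This is a hypothetical sketch - the exception class and method names below are stand-ins, not the framework's types - showing the rule: seek only for the back-off exception (because the retry delay is implemented by pausing and re-delivering), and retain the remaining records for every other failure.

```java
// Hypothetical dispatch for the hybrid approach; names are illustrative.
public class HybridErrorHandlingSketch {

    // Stand-in for the framework's KafkaBackOffException.
    static class BackOffException extends RuntimeException { }

    enum Action { SEEK_PARTITIONS, RETAIN_REMAINING }

    static Action afterListenerError(RuntimeException ex) {
        if (ex instanceof BackOffException) {
            // the delay mechanism relies on pause + seek, so keep seeking here
            return Action.SEEK_PARTITIONS;
        }
        // all other failures: pause for one poll and replay the remainder
        return Action.RETAIN_REMAINING;
    }

    public static void main(String[] args) {
        System.out.println(afterListenerError(new BackOffException()));
        System.out.println(afterListenerError(new IllegalStateException()));
    }
}
```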

@tomazfernandes (Contributor) commented Apr 6, 2022

I think the ideal scenario, even for retry containers, would be: we process a record; say it fails and goes through the recovery logic, which succeeds (the message is sent to the next topic). If the next record for that container does not trigger a back off (it's already past its due time), we shouldn't need to seek again. Makes sense?

If it does back off, recovery does not succeed, and the BackOffManager pauses the partition before any of this logic is executed. I'm not sure where in the PR the partition is resumed, but if it doesn't resume a partition it didn't pause, and doesn't serve the failed record while paused, maybe we're good?

@tomazfernandes (Contributor)

I think the trick here is understanding the interaction of this new logic with partition pausing.

If we do seek in case of a KafkaBackOffException, we shouldn't receive any records from the paused partition until it's resumed by the BackOffManager, since resuming the whole container explicitly leaves paused partitions alone. So we're basically polling for records that have nothing to do with the KBE itself.

But we might consider filtering out any records that are from a currently paused partition before serving them again, this way we don't need to seek all partitions due to one backing off.
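The filtering idea above can be sketched in isolation. This is hypothetical code (the Rec record and filterPaused helper are inventions for illustration, not framework API): before re-serving the retained records, drop the ones belonging to currently paused partitions, so that one partition backing off does not force a seek of all the others.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative filter: serve retained records only from partitions that the
// back-off manager has not paused.
public class PausedPartitionFilterSketch {

    record Rec(String topic, int partition, long offset) { }

    static List<Rec> filterPaused(List<Rec> remaining, Set<String> pausedTopicPartitions) {
        return remaining.stream()
                .filter(r -> !pausedTopicPartitions.contains(r.topic() + "-" + r.partition()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Rec> remaining = List.of(
                new Rec("orders", 0, 10),
                new Rec("orders", 1, 7),    // partition 1 is backing off
                new Rec("orders", 0, 11));
        for (Rec r : filterPaused(remaining, Set.of("orders-1"))) {
            System.out.println(r.topic() + "-" + r.partition() + "@" + r.offset());
        }
    }
}
```

The paused partition's records stay behind and would be re-fetched by a seek on that partition alone.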

@tomazfernandes (Contributor)

So summing up, I think that maybe rather than "coupling" the container to the KafkaBackOffException, we should probably figure out how it should work when handling paused partitions, and just let KafkaBackOffManager do its work.

Sorry I kind of thought out loud - still figuring out how these changes work 😄

@tomazfernandes (Contributor)

Just to remember, the BackOffManager pauses the partition before throwing the KafkaBackOffException, and resumes it after receiving a PartitionIdleEvent, if time is due.
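That sequence - pause the partition before throwing, then resume on an idle event only once the time is due - can be simulated in a few lines. Names here are illustrative (BackOffManagerSketch, maybeResume), not the framework's:

```java
import java.util.HashMap;
import java.util.Map;

// Tiny simulation of the back-off sequence described above.
public class BackOffManagerSketch {

    static Map<String, Long> pausedUntil = new HashMap<>();

    static void backOff(String partition, long now, long delayMillis) {
        pausedUntil.put(partition, now + delayMillis); // pause first...
        // ...then the listener would throw the back-off exception
    }

    // Called on a partition-idle event: resume only if the time is due.
    static boolean maybeResume(String partition, long now) {
        Long due = pausedUntil.get(partition);
        if (due != null && now >= due) {
            pausedUntil.remove(partition);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        backOff("orders-1", 1000, 500);
        System.out.println(maybeResume("orders-1", 1200)); // too early
        System.out.println(maybeResume("orders-1", 1600)); // due, resumes
    }
}
```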

@garyrussell (Contributor, Author)

This clearly needs some more thought; so definitely not in the scope of this PR.

@tomazfernandes (Contributor)

Sure! I was really just thinking out loud, sorry about that. I took a better look at these changes yesterday and they look good to me; they will surely increase performance a lot when dealing with errors, and they're also defensive enough that they shouldn't be a problem when disabled.

I'm wondering, considering we'll have milestone exposure to this, do you think it's worth leaving it on by default so we get more feedback? If there's any problem, users can just disable it, and then we can release a new milestone with the fix. Maybe I'm missing something, though - perhaps it's too big a behavior change for a minor release, for example.

Thanks

@garyrussell (Contributor, Author)

That's an interesting thought, but I think we need to resolve the issue with the retry containers first because, if it is used there, yes, we will still pause the relevant partition, but there may be additional records from that partition in the remaining records. We'd need to weed those out somehow, and only seek the paused partition.

@tomazfernandes (Contributor)

Sure. Also, that might impact users pausing partitions themselves. But I'm sure we'll work this out soon enough. As far as retryable topics go, we could disable it in RT setup if necessary.

garyrussell and others added 7 commits April 12, 2022 14:45
Resolves spring-projects#2195

Co-authored-by: Artem Bilan <[email protected]>
- move the resume logic to after the invokes and don't resume if pending records
- don't check `isPaused()` after empty poll due to errors; always restore the pending records
@garyrussell garyrussell changed the title GH-2195: DefaultErrorHandlerImprovements [REVIEW ONLY, DO NOT MERGE YET] GH-2195: DefaultErrorHandlerImprovements Apr 12, 2022
@garyrussell garyrussell changed the title GH-2195: DefaultErrorHandlerImprovements GH-2195: DefaultErrorHandler Improvements Apr 12, 2022
}

invokeIfHaveRecords(records);
if (this.pendingRecordsAfterError == null) {
@tomazfernandes (Contributor)

Just a thought. Since we're not committed to listening to pause/resume calls while in this logic, I wonder if maybe we should also ignore it for partition pausing/resuming, and instead handle back offs relying on the KafkaBackOffException itself. The exception contains information such as the TopicPartition that was backed off. Of course, this or any other solution can be part of a different PR.

Thanks.

@garyrussell (Contributor, Author)

Yeah - we already know we need to come up with a solution for the retry containers if the primary container has this option set; we can cover this in another PR after M1.

@artembilan (Member) left a comment

LGTM.

@tomazfernandes, anything else on your mind that @garyrussell has to fix?

Thanks

@tomazfernandes (Contributor)

> LGTM.
>
> @tomazfernandes, anything else on your mind that @garyrussell has to fix?
>
> Thanks

Not really; at this point I'm mostly nitpicking about implementation details 😄 I think this looks great and will surely be a noticeable gain in error-handling performance.

This was a really interesting journey into this logic, which is perhaps the heart of the framework, so thanks a lot @garyrussell for the opportunity.

Thanks.

@artembilan artembilan merged commit af56eca into spring-projects:main Apr 13, 2022
@artembilan (Member)

Sorry, this does not back-port to 2.9.x cleanly.

Thanks

@garyrussell (Contributor, Author)

...cherry-picked to 2.9.x after resolving conflicts; also removed deprecated error handlers.


Successfully merging this pull request may close these issues.

Avoid unnecessary partition seeks on successful record recovery
