
Conversation

@garyrussell (Contributor)

Resolves #2195

Add an option to avoid seeks after handling exceptions.

Instead, pause the consumer for one poll() and use the remaining records as the
result of that poll.

New methods on CommonErrorHandler - handleOne for record listeners, returning
a boolean to indicate whether the record was recovered and should not be redelivered.

handleBatchAndReturnRemaining for batch listeners, returning either the complete
set or a subset, e.g. when the DEH receives a BatchListenerExecutionFailedException
and commits a partial batch.

Also includes the classifier refactoring discussed here
#2185 (comment)

The new logic is disabled by default; we can consider enabling it in 3.0 and removing
the deprecations.
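The no-seek strategy described above can be illustrated with a minimal, self-contained sketch. This is plain Java with invented names (NoSeekReplaySketch, invoke, poll, the "recovered" flag standing in for handleOne's boolean result) - it is not the actual spring-kafka implementation, just a model of the idea: on an unrecovered failure, retain the failed record plus the rest of the batch and serve them as the result of the next poll(), instead of seeking back on the broker.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical illustration of "pause for one poll()" error handling;
// all names here are inventions for this sketch, not spring-kafka API.
public class NoSeekReplaySketch {

    static List<String> pending = null;   // records retained after an error
    static boolean paused = false;
    static List<String> processed = new ArrayList<>();

    // Stand-in for Consumer#poll(): serves the retained records when present.
    static List<String> poll(Deque<String> source) {
        if (pending != null) {
            List<String> replay = pending;  // replay instead of re-fetching
            pending = null;
            paused = false;                 // "resume" after one poll
            return replay;
        }
        List<String> batch = new ArrayList<>();
        for (int i = 0; i < 3 && !source.isEmpty(); i++) {
            batch.add(source.poll());
        }
        return batch;
    }

    // Process a batch; on an unrecovered failure (the equivalent of handleOne
    // returning false), keep the failed record plus the remainder - no seek.
    static void invoke(List<String> batch, String failOn, boolean recovered) {
        for (int i = 0; i < batch.size(); i++) {
            String rec = batch.get(i);
            if (rec.equals(failOn) && !recovered) {
                pending = new ArrayList<>(batch.subList(i, batch.size()));
                paused = true;
                return;
            }
            processed.add(rec);
        }
    }

    public static void main(String[] args) {
        Deque<String> topic = new ArrayDeque<>(List.of("r1", "r2", "r3", "r4"));
        invoke(poll(topic), "r2", false);   // r2 fails, not recovered
        List<String> replayed = poll(topic); // replays [r2, r3], no seek
        System.out.println(replayed);
        invoke(replayed, "r2", true);        // recovered this time
        invoke(poll(topic), null, true);     // remaining record r4
        System.out.println(processed);
    }
}
```

The boolean returned by the record-listener handler is what decides between committing past the record (recovered) and replaying it (not recovered).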

@artembilan (Member) left a comment

This looks too big to me to be considered for back-porting.
Why not just target the fix for 3.0 and be free of possible breaking changes?

Thanks

@garyrussell (Contributor, Author)

I agree; maybe this is the final driver to create a 2.9 branch, which we can release sooner (May).

There is a big performance problem when using the retryable topic - see the discussions on the issue.

@tomazfernandes (Contributor)

@garyrussell, I began looking into this today and should continue tomorrow - there's a lot going on in parts of the code I'm not that familiar with. Overall I think the solution looks great; I just want to get a better grasp of it.

Thanks

@artembilan (Member) left a comment

Do we need to add anything to the docs for this change?

Thanks

@artembilan (Member)

So, is this OK now for merging?

@garyrussell (Contributor, Author)

Let @tomazfernandes have a few more days to review; maybe merge on Monday?

@tomazfernandes (Contributor)

That would work nicely for me, thanks!

@garyrussell (Contributor, Author)

We still need to seek on the retry containers because the delay is enabled by pausing the partitions.

I guess you should only configure this on the main container and use a small max.poll.records on the retry containers.
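The suggestion above - keep the no-seek option on the main container only, and bound how much a retry container loses to a seek by using a small max.poll.records there - can be sketched as plain consumer property maps. The property key is the standard Kafka consumer key; the surrounding container setup is omitted, and the chosen values are illustrative only.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative consumer properties: the retry containers still seek after a
// KafkaBackOffException, so small polls keep the discarded remainder cheap.
public class RetryContainerProps {

    static Map<String, Object> mainConsumerProps() {
        Map<String, Object> props = new HashMap<>();
        props.put("max.poll.records", 500);  // normal batch size; no seeks here
        return props;
    }

    static Map<String, Object> retryConsumerProps() {
        Map<String, Object> props = new HashMap<>();
        // small batches: a seek after a back off re-fetches very little
        props.put("max.poll.records", 5);
        return props;
    }

    public static void main(String[] args) {
        System.out.println(mainConsumerProps().get("max.poll.records"));
        System.out.println(retryConsumerProps().get("max.poll.records"));
    }
}
```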

@garyrussell (Contributor, Author)

It would add a little more complexity (in the container), but maybe we could come up with a hybrid approach - perform the seeks if the listener throws a KafkaBackOffException and use the new approach for all other exceptions.

But that can be another PR.
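The hybrid idea can be reduced to a small dispatch decision. This is a hypothetical sketch - the exception class and method names below are stand-ins, not the framework's types - showing the rule: seek only for the back-off exception (because the retry delay is implemented by pausing and re-delivering), and retain the remaining records for every other failure.

```java
// Hypothetical dispatch for the hybrid approach; names are illustrative.
public class HybridErrorHandlingSketch {

    // Stand-in for the framework's KafkaBackOffException.
    static class BackOffException extends RuntimeException { }

    enum Action { SEEK_PARTITIONS, RETAIN_REMAINING }

    static Action afterListenerError(RuntimeException ex) {
        if (ex instanceof BackOffException) {
            // the delay mechanism relies on pause + seek, so keep seeking here
            return Action.SEEK_PARTITIONS;
        }
        // all other failures: pause for one poll and replay the remainder
        return Action.RETAIN_REMAINING;
    }

    public static void main(String[] args) {
        System.out.println(afterListenerError(new BackOffException()));
        System.out.println(afterListenerError(new IllegalStateException()));
    }
}
```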

@tomazfernandes (Contributor) commented Apr 6, 2022

I think the ideal scenario, even for retry containers, would be: we process a record; say it fails and goes through the recovery logic, which succeeds (the message is sent to the next topic). If the next record for that container does not trigger a back off (it's already past its due time), we shouldn't need to seek again. Makes sense?

If it does back off, recovery does not succeed, and the BackOffManager pauses the partition before any of this logic is executed. I'm not sure where in the PR the partition is resumed, but if it doesn't resume a partition it didn't pause, and doesn't serve the failed record while paused, maybe we're good?

@tomazfernandes (Contributor)

I think the trick here is understanding the interaction of this new logic with partition pausing.

If we do seek in case of a KafkaBackOffException, we shouldn't receive any records from the paused partition until it's resumed by the BackOffManager, since resuming the whole container explicitly leaves paused partitions alone. So we're basically polling for records that have nothing to do with the KBE itself.

But we might consider filtering out any records that are from a currently paused partition before serving them again, this way we don't need to seek all partitions due to one backing off.
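The filtering idea above can be sketched in isolation. This is hypothetical code (the Rec record and filterPaused helper are inventions for illustration, not framework API): before re-serving the retained records, drop the ones belonging to currently paused partitions, so that one partition backing off does not force a seek of all the others.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative filter: serve retained records only from partitions that the
// back-off manager has not paused.
public class PausedPartitionFilterSketch {

    record Rec(String topic, int partition, long offset) { }

    static List<Rec> filterPaused(List<Rec> remaining, Set<String> pausedTopicPartitions) {
        return remaining.stream()
                .filter(r -> !pausedTopicPartitions.contains(r.topic() + "-" + r.partition()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Rec> remaining = List.of(
                new Rec("orders", 0, 10),
                new Rec("orders", 1, 7),    // partition 1 is backing off
                new Rec("orders", 0, 11));
        for (Rec r : filterPaused(remaining, Set.of("orders-1"))) {
            System.out.println(r.topic() + "-" + r.partition() + "@" + r.offset());
        }
    }
}
```

The paused partition's records stay behind and would be re-fetched by a seek on that partition alone.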

@tomazfernandes (Contributor)

So summing up, I think that maybe rather than "coupling" the container to the KafkaBackOffException, we should probably figure out how it should work when handling paused partitions, and just let KafkaBackOffManager do its work.

Sorry I kind of thought out loud - still figuring out how these changes work 😄

@tomazfernandes (Contributor)

Just to remember, the BackOffManager pauses the partition before throwing the KafkaBackOffException, and resumes it after receiving a PartitionIdleEvent, if time is due.
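That sequence - pause the partition before throwing, then resume on an idle event only once the time is due - can be simulated in a few lines. Names here are illustrative (BackOffManagerSketch, maybeResume), not the framework's:

```java
import java.util.HashMap;
import java.util.Map;

// Tiny simulation of the back-off sequence described above.
public class BackOffManagerSketch {

    static Map<String, Long> pausedUntil = new HashMap<>();

    static void backOff(String partition, long now, long delayMillis) {
        pausedUntil.put(partition, now + delayMillis); // pause first...
        // ...then the listener would throw the back-off exception
    }

    // Called on a partition-idle event: resume only if the time is due.
    static boolean maybeResume(String partition, long now) {
        Long due = pausedUntil.get(partition);
        if (due != null && now >= due) {
            pausedUntil.remove(partition);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        backOff("orders-1", 1000, 500);
        System.out.println(maybeResume("orders-1", 1200)); // too early
        System.out.println(maybeResume("orders-1", 1600)); // due, resumes
    }
}
```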

@garyrussell (Contributor, Author)

This clearly needs some more thought; so definitely not in the scope of this PR.

@tomazfernandes (Contributor)

Sure! I was really just thinking out loud, sorry about that. I took a better look at these changes yesterday and they look good to me; they will surely increase performance a lot when dealing with errors, and they're also defensive enough that they shouldn't be a problem when disabled.

I'm wondering, considering we'll have milestone exposure to this, do you think it's worth leaving it on by default so we get more feedback? If there's any problem, users can just disable it, and then we can release a new milestone with the fix. Maybe I'm missing something, though - perhaps it's too big a behavior change for a minor release, for example.

Thanks

@garyrussell (Contributor, Author)

That's an interesting thought, but I think we need to resolve the issue with the retry containers first because, if it is used there, yes, we will still pause the relevant partition, but there may be additional records from that partition in the remaining records. We'd need to weed those out somehow, and only seek the paused partition.

@tomazfernandes (Contributor)

Sure. Also, that might impact users pausing partitions themselves. But I'm sure we'll work this out soon enough. As far as retryable topics go, we could disable it in RT setup if necessary.

garyrussell and others added 7 commits April 12, 2022 14:45
Resolves spring-projects#2195

Co-authored-by: Artem Bilan <[email protected]>
- move the resume logic to after the invokes and don't resume if pending records
- don't check `isPaused()` after empty poll due to errors; always restore the pending records
@garyrussell garyrussell changed the title GH-2195: DefaultErrorHandlerImprovements [REVIEW ONLY, DO NOT MERGE YET] GH-2195: DefaultErrorHandlerImprovements Apr 12, 2022
@garyrussell garyrussell changed the title GH-2195: DefaultErrorHandlerImprovements GH-2195: DefaultErrorHandler Improvements Apr 12, 2022
}

invokeIfHaveRecords(records);
if (this.pendingRecordsAfterError == null) {
@tomazfernandes (Contributor)

Just a thought. Since we're not committed to listening to pause/resume calls while in this logic, I wonder if maybe we should also ignore it for partition pausing/resuming, and instead handle back offs relying on the KafkaBackOffException itself. The exception contains information such as the TopicPartition that was backed off. Of course, this or any other solution can be part of a different PR.

Thanks.

@garyrussell (Contributor, Author)

Yeah - we already know we need to come up with a solution for the retry containers if the primary container has this option set; we can cover this in another PR after M1.

@artembilan (Member) left a comment

LGTM.

@tomazfernandes, anything else on your mind that @garyrussell has to fix?

Thanks

@tomazfernandes (Contributor)

> LGTM.
>
> @tomazfernandes, anything else on your mind that @garyrussell has to fix?
>
> Thanks

Not really; at this point I'm mostly nitpicking about implementation details 😄 I think this looks great and will surely be a noticeable gain in error-handling performance.

This was a really interesting journey into this logic, which is perhaps the heart of the framework, so thanks a lot @garyrussell for the opportunity.

Thanks.

@artembilan artembilan merged commit af56eca into spring-projects:main Apr 13, 2022
@artembilan (Member)

Sorry, this does not back-port to 2.9.x cleanly.

Thanks

@garyrussell (Contributor, Author)

...cherry-picked to 2.9.x after resolving conflicts; also removed deprecated error handlers.


Successfully merging this pull request may close these issues.

Avoid unnecessary partition seeks on successful record recovery
