HDDS-1403. KeyOutputStream writes fails after max retries while writing to a closed container #753
Conversation
💔 -1 overall
This message was automatically generated.
arp7 left a comment
Thanks Hanisha for updating the patch. The patch adds a retry interval when retrying a client write request. But this may not address the problem holistically, as the client can still get allocated blocks from a container, and by the time the actual write happens on the datanode, the container might get closed. The problem gets aggravated if we have a large number of preallocated blocks but the client write happens much later.
Hi @bshashikant, the retry is before going to the OM. This is to add a bit of throttling to protect the OM, since clients that are failing could otherwise keep spamming the OM in a tight loop.
@mukul1987 did suggest reducing the interval a bit. We could reduce it to 100ms.
What do you think?
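For context, here is a minimal sketch of the kind of throttled retry being discussed: sleep before going back to the OM for a new block so a failing client cannot hammer it in a tight loop. The class, method, and exception names below are illustrative placeholders, not the actual KeyOutputStream code.

```java
import java.util.concurrent.TimeUnit;

// Illustrative sketch only; writeChunk, allocateBlockFromOm and
// ContainerClosedException are placeholders, not the real Ozone client API.
public class ThrottledRetrySketch {

  static class ContainerClosedException extends Exception { }

  // Pretend write that fails while the target container is closed.
  static void writeChunk(int attempt) throws ContainerClosedException {
    if (attempt < 2) {
      throw new ContainerClosedException();
    }
  }

  // Pretend call back to the OM for a new block allocation.
  static void allocateBlockFromOm() {
    System.out.println("asking OM for a new block");
  }

  public static void main(String[] args) throws InterruptedException {
    long retryIntervalMs = 100; // the interval value being discussed in this thread
    int maxRetries = 5;         // placeholder limit for the sketch

    for (int attempt = 0; attempt < maxRetries; attempt++) {
      try {
        writeChunk(attempt);
        break; // write succeeded, stop retrying
      } catch (ContainerClosedException e) {
        // Sleep before going back to the OM so a failing client
        // does not spam it in a tight loop.
        TimeUnit.MILLISECONDS.sleep(retryIntervalMs);
        allocateBlockFromOm();
      }
    }
  }
}
```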
Don't hardcode the unit (ms). We can specify the unit with the config key. See Configuration#getTimeDuration.
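For reference, a minimal sketch of reading the interval via Configuration#getTimeDuration, so users can attach a unit suffix (for example "100ms" or "1s") to the value instead of the code hardcoding milliseconds. The config key name used here is an assumption for illustration, not necessarily the key added by this patch.

```java
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;

public class RetryIntervalConfigSketch {
  // Illustrative key name; the real key is defined in the Ozone config classes.
  private static final String RETRY_INTERVAL_KEY = "ozone.client.retry.interval";

  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // getTimeDuration parses unit suffixes in the configured value and
    // converts the result to the requested unit; default here is 100 ms.
    long retryIntervalMs =
        conf.getTimeDuration(RETRY_INTERVAL_KEY, 100, TimeUnit.MILLISECONDS);
    System.out.println("retry interval = " + retryIntervalMs + " ms");
  }
}
```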
Thanks @arp7. The retry interval should be lower by default because, apart from ContainerCloseExceptions, the Ozone client also retries in cases where a request times out or leader election could not complete, and in those cases Ratis itself already retries for a certain interval of around 10 minutes. This retry interval will again be added to the total time between two successive calls to OM in case of a failure. This is in the actual write path and will affect the write throughput considerably.
Force-pushed from b0926d0 to 62fad22.
💔 -1 overall
This message was automatically generated.
arp7 left a comment
+1 with minor comment.
Nitpick: By default
Force-pushed from 62fad22 to 894da1d.
💔 -1 overall
This message was automatically generated.
The test failures in CI are not related to this PR. Will merge the PR. Thank you @arp7, @bshashikant and @mukul1987 for the reviews.
Increasing the number of client retries to 100 and adding a sleep of 500 ms between retries.
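Put together, the change described in that summary could be expressed roughly as the configuration sketch below. The property names and values here are illustrative assumptions, not necessarily the exact keys introduced by the patch.

```java
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;

public class ClientRetrySettingsSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Illustrative key names; the actual keys live in the Ozone config classes.
    conf.setInt("ozone.client.max.retries", 100);
    conf.setTimeDuration("ozone.client.retry.interval", 500, TimeUnit.MILLISECONDS);

    // Read the values back the way client code might consume them.
    System.out.println("retries  = " + conf.getInt("ozone.client.max.retries", 5));
    System.out.println("interval = "
        + conf.getTimeDuration("ozone.client.retry.interval", 0, TimeUnit.MILLISECONDS)
        + " ms");
  }
}
```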