HDDS-1403. KeyOutputStream writes fails after max retries while writing to a closed container #753
Conversation
💔 -1 overall
This message was automatically generated.
arp7 left a comment
Thanks Hanisha for updating the patch. The patch adds a retry interval when retrying a client write request. But this may not address the problem holistically, as the client can still get allocated blocks from a container, and by the time the actual write happens on the datanode, the container might get closed. The problem gets aggravated if we have a large number of preallocated blocks but the client write happens much later.
Hi @bshashikant, the retry is before going to the OM. This is to add a bit of throttling to protect the OM, since clients that are failing could otherwise keep spamming the OM in a tight loop.
@mukul1987 did suggest reducing the interval a bit. We could reduce it to 100ms.
What do you think?
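For context, here is a minimal sketch of the kind of throttled retry being discussed: sleep before going back to the OM for a new block so a failing client cannot hammer it in a tight loop. The class, method, and exception names below are illustrative placeholders, not the actual KeyOutputStream code.

```java
import java.util.concurrent.TimeUnit;

// Illustrative sketch only; writeChunk, allocateBlockFromOm and
// ContainerClosedException are placeholders, not the real Ozone client API.
public class ThrottledRetrySketch {

  static class ContainerClosedException extends Exception { }

  // Pretend write that fails while the target container is closed.
  static void writeChunk(int attempt) throws ContainerClosedException {
    if (attempt < 2) {
      throw new ContainerClosedException();
    }
  }

  // Pretend call back to the OM for a new block allocation.
  static void allocateBlockFromOm() {
    System.out.println("asking OM for a new block");
  }

  public static void main(String[] args) throws InterruptedException {
    long retryIntervalMs = 100; // the interval value being discussed in this thread
    int maxRetries = 5;         // placeholder limit for the sketch

    for (int attempt = 0; attempt < maxRetries; attempt++) {
      try {
        writeChunk(attempt);
        break; // write succeeded, stop retrying
      } catch (ContainerClosedException e) {
        // Sleep before going back to the OM so a failing client
        // does not spam it in a tight loop.
        TimeUnit.MILLISECONDS.sleep(retryIntervalMs);
        allocateBlockFromOm();
      }
    }
  }
}
```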
Don't hardcode the unit (ms). We can specify the unit with the config key. See Configuration#getTimeDuration.
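For reference, a minimal sketch of reading the interval via Configuration#getTimeDuration, so users can attach a unit suffix (for example "100ms" or "1s") to the value instead of the code hardcoding milliseconds. The config key name used here is an assumption for illustration, not necessarily the key added by this patch.

```java
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;

public class RetryIntervalConfigSketch {
  // Illustrative key name; the real key is defined in the Ozone config classes.
  private static final String RETRY_INTERVAL_KEY = "ozone.client.retry.interval";

  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // getTimeDuration parses unit suffixes in the configured value and
    // converts the result to the requested unit; default here is 100 ms.
    long retryIntervalMs =
        conf.getTimeDuration(RETRY_INTERVAL_KEY, 100, TimeUnit.MILLISECONDS);
    System.out.println("retry interval = " + retryIntervalMs + " ms");
  }
}
```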
Thanks @arp7. The retry interval should be lower by default because, apart from ContainerCloseExceptions, the Ozone client also retries in cases where a request times out or leader election could not complete, and in those cases Ratis itself already retries for a certain interval of around 10 minutes. This retry interval will again be added to the total time between two successive calls to OM in case of a failure. This is in the actual write path and will affect the write throughput considerably.
Force-pushed from b0926d0 to 62fad22.
💔 -1 overall
This message was automatically generated.
arp7 left a comment
+1 with minor comment.
Nitpick: By default
Force-pushed from 62fad22 to 894da1d.
💔 -1 overall
This message was automatically generated.
The test failures in CI are not related to this PR. Will merge the PR. Thank you @arp7, @bshashikant and @mukul1987 for the reviews.
Increasing the number of client retries to 100 and adding a sleep of 500 ms between retries.
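Put together, the change described in that summary could be expressed roughly as the configuration sketch below. The property names and values here are illustrative assumptions, not necessarily the exact keys introduced by the patch.

```java
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;

public class ClientRetrySettingsSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Illustrative key names; the actual keys live in the Ozone config classes.
    conf.setInt("ozone.client.max.retries", 100);
    conf.setTimeDuration("ozone.client.retry.interval", 500, TimeUnit.MILLISECONDS);

    // Read the values back the way client code might consume them.
    System.out.println("retries  = " + conf.getInt("ozone.client.max.retries", 5));
    System.out.println("interval = "
        + conf.getTimeDuration("ozone.client.retry.interval", 0, TimeUnit.MILLISECONDS)
        + " ms");
  }
}
```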