Description
Describe the bug
This is covered in HADOOP-19221.
S3 multipart upload fails to recover from server-side failures.
The SDK repeats the attempted POST but does not supply enough data, so the S3 store returns a 400 response:
"Your socket connection to the server was not read from or written to within the timeout period"
This failure has been observed in a live system.
We have also been able to replicate it ourselves through fault injection, converting the 200 response into a 500.
Once this is done, all subsequent retries fail with 400 + RequestTimeout responses until the retry limit is reached.
This is a regression from the V1 SDK, whose upload requests supported a filename and offset as well as data size. It was clearly able to repeat a request by seeking to the original offset and streaming up the data again.
The new design, while elegant and flexible, seems unable to cope with failures for a foundational use case "upload part of a file". This is unfortunate.
This is a critical enough failure that we have worked around it ourselves by implementing our own content providers for file sources and others; the fault injection used to replicate the failure gives us confidence that we have fixed it in our own code.
We have not attempted fault injection through the TransferManager on the basis that there is nothing we can do to fix it. We are simply going to document the error and recommend against using it to upload data from the local file system.
Expected Behavior
Retrying an upload of a part in a multipart upload should succeed.
Current Behavior
After the first failure (500 etc.) is returned, all retries fail with a 400 error and a message that not enough data was supplied.
org.apache.hadoop.fs.s3a.AWSBadRequestException: upload part #1 upload ID KylDNXbmiZlZE5JI2aKVcVeA66lyRaUZWmKv5K3QfGcjgqjpMAsnbRkJRug8jf3kdjXuL_zVt09813gtOMS9CSxwTDbfGR4xOLPUmZZ9suEJGh.iolwJgMnSXk5KFxnG on test/testCommitOperations[array-wrap-true]: software.amazon.awssdk.services.s3.model.S3Exception: Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed. (Service: S3, Status Code: 400, Request ID: TT17CRYF6HJH2G0Y, Extended Request ID: d7VKx+iXtShSdp/uVQhrDAhC7dOGbZiZE6ul1I/k19RRDc6wRHyKSkBJU82ta22ns+J7QbbL5XY=):RequestTimeout: Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed. (Service: S3, Status Code: 400, Request ID: TT17CRYF6HJH2G0Y, Extended Request ID: d7VKx+iXtShSdp/uVQhrDAhC7dOGbZiZE6ul1I/k19RRDc6wRHyKSkBJU82ta22ns+J7QbbL5XY=)
at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:260)
at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:124)
at org.apache.hadoop.fs.s3a.Invoker.lambda$retry$4(Invoker.java:376)
at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:468)
at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:372)
at org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:347)
at org.apache.hadoop.fs.s3a.WriteOperationHelper.retry(WriteOperationHelper.java:205)
at org.apache.hadoop.fs.s3a.WriteOperationHelper.uploadPart(WriteOperationHelper.java:589)
at org.apache.hadoop.fs.s3a.commit.impl.CommitOperations.uploadFileDataAsInputStream(CommitOperations.java:686)
at org.apache.hadoop.fs.s3a.commit.impl.CommitOperations.uploadFileToPendingCommit(CommitOperations.java:586)
at org.apache.hadoop.fs.s3a.commit.magic.ITestUploadRecovery.testCommitOperations(ITestUploadRecovery.java:280)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:61)
at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299)
at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.lang.Thread.run(Thread.java:750)
Caused by: software.amazon.awssdk.services.s3.model.S3Exception: Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed. (Service: S3, Status Code: 400, Request ID: TT17CRYF6HJH2G0Y, Extended Request ID: d7VKx+iXtShSdp/uVQhrDAhC7dOGbZiZE6ul1I/k19RRDc6wRHyKSkBJU82ta22ns+J7QbbL5XY=)
at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handleErrorResponse(AwsXmlPredicatedResponseHandler.java:156)
at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handleResponse(AwsXmlPredicatedResponseHandler.java:108)
at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handle(AwsXmlPredicatedResponseHandler.java:85)
at software.amazon.awssdk.protocols.xml.internal.unmarshall.AwsXmlPredicatedResponseHandler.handle(AwsXmlPredicatedResponseHandler.java:43)
at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler$Crc32ValidationResponseHandler.handle(AwsSyncClientHandler.java:93)
at software.amazon.awssdk.core.internal.handler.BaseClientHandler.lambda$successTransformationResponseHandler$7(BaseClientHandler.java:279)
at software.amazon.awssdk.core.internal.http.pipeline.stages.HandleResponseStage.execute(HandleResponseStage.java:50)
at software.amazon.awssdk.core.internal.http.pipeline.stages.HandleResponseStage.execute(HandleResponseStage.java:38)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:72)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptTimeoutTrackingStage.execute(ApiCallAttemptTimeoutTrackingStage.java:42)
at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:78)
at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:40)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:55)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallAttemptMetricCollectionStage.execute(ApiCallAttemptMetricCollectionStage.java:39)
at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:81)
at software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:36)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:56)
at software.amazon.awssdk.core.internal.http.StreamManagingStage.execute(StreamManagingStage.java:36)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.executeWithTimer(ApiCallTimeoutTrackingStage.java:80)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:60)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallTimeoutTrackingStage.execute(ApiCallTimeoutTrackingStage.java:42)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:50)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApiCallMetricCollectionStage.execute(ApiCallMetricCollectionStage.java:32)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:37)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ExecutionFailureExceptionReportingStage.execute(ExecutionFailureExceptionReportingStage.java:26)
at software.amazon.awssdk.core.internal.http.AmazonSyncHttpClient$RequestExecutionBuilderImpl.execute(AmazonSyncHttpClient.java:224)
at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.invoke(BaseSyncClientHandler.java:103)
at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.doExecute(BaseSyncClientHandler.java:173)
at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.lambda$execute$1(BaseSyncClientHandler.java:80)
at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.measureApiCallSuccess(BaseSyncClientHandler.java:182)
at software.amazon.awssdk.core.internal.handler.BaseSyncClientHandler.execute(BaseSyncClientHandler.java:74)
at software.amazon.awssdk.core.client.handler.SdkSyncClientHandler.execute(SdkSyncClientHandler.java:45)
at software.amazon.awssdk.awscore.client.handler.AwsSyncClientHandler.execute(AwsSyncClientHandler.java:53)
at software.amazon.awssdk.services.s3.DefaultS3Client.uploadPart(DefaultS3Client.java:11515)
at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$uploadPart$16(S3AFileSystem.java:3345)
at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDurationOfSupplier(IOStatisticsBinding.java:651)
at org.apache.hadoop.fs.s3a.S3AFileSystem.uploadPart(S3AFileSystem.java:3342)
at org.apache.hadoop.fs.s3a.WriteOperationHelper.lambda$uploadPart$9(WriteOperationHelper.java:594)
at org.apache.hadoop.fs.store.audit.AuditingFunctions.lambda$withinAuditSpan$0(AuditingFunctions.java:62)
at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:122)
... 24 more
Suppressed: software.amazon.awssdk.core.exception.SdkClientException: Request attempt 1 failure: null (Service: S3, Status Code: 500, Request ID: TZ0558DJES2Z0HF3, Extended Request ID: NZRAAV8C5lcYmt1knZyj3jQbUxvEQFgOdOI+Od/xwx+jZVTLOTZyXb87acVfI6SoE5IybofBiDY=)
Suppressed: software.amazon.awssdk.core.exception.SdkClientException: Request attempt 2 failure: Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed. (Service: S3, Status Code: 400, Request ID: FRGQYV48RNTQDADX, Extended Request ID: r7drlnxjjdIbZCfmCg5i5Y/a622hxSMO+JK7idhL49zJOvLvCwwDUPxbgW8rbxAWTvoN76cCnp4=)
Reproduction Steps
Steps to replicate:
- Supply the request body from an existing input stream, via RequestBody.fromInputStream(stream, size)
- Submit a POST request such that the request body is successfully uploaded
- Have the far end return a 500 response, which triggers a retry within the SDK
- Wait about 30 seconds
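The underlying mechanics can be demonstrated without the SDK at all: an InputStream handed over once is drained by the first attempt, so a retry that re-reads the same stream supplies fewer bytes than the Content-Length header promises. A minimal sketch (the class name is illustrative, not from the SDK or Hadoop):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

class SingleUseStreamDemo {
    // Returns how many bytes each of two successive "attempts" can read
    // from the same InputStream instance.
    static int[] bytesPerAttempt(byte[] data) throws IOException {
        InputStream stream = new ByteArrayInputStream(data);
        int first = stream.readAllBytes().length;   // first attempt drains the stream
        int second = stream.readAllBytes().length;  // the retry finds nothing left
        return new int[] { first, second };
    }

    public static void main(String[] args) throws IOException {
        int[] counts = bytesPerAttempt("part-data".getBytes());
        System.out.println("attempt 1: " + counts[0] + " bytes, attempt 2: " + counts[1] + " bytes");
    }
}
```

The second attempt reads zero bytes, which matches the observed behaviour: the server waits for the rest of the declared body until its idle timeout fires and it returns 400 RequestTimeout.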
Fragment of the code which generates this:
InputStream fileStream = Files.newInputStream(localFile.toPath());
...
UploadPartRequest part = createUploadRequest().build();
RequestBody body = RequestBody.fromInputStream(fileStream, size);
UploadPartResponse response = s3Client.uploadPart(part, body);
Possible Solution
Implement a ContentProvider which can open a file at a given offset and supply a specified number of bytes; on a retry, repeat this operation. This is what we have done, restoring the behaviour of the V1 SDK.
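A minimal sketch of such a retry-safe provider follows. This is an illustration of the idea, not the Hadoop patch: the class and names are invented here, and only the shape of newStream() (a no-argument method returning a fresh InputStream, as in the SDK's ContentStreamProvider) is taken from the SDK.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical retry-safe provider: each call to newStream() reopens the
// file and seeks to the part's start offset, so a retry re-reads exactly
// the same bytes instead of continuing from a drained stream.
class FilePartProvider {
    private final Path path;
    private final long offset;
    private final long length;

    FilePartProvider(Path path, long offset, long length) {
        this.path = path;
        this.offset = offset;
        this.length = length;
    }

    public InputStream newStream() {
        try {
            FileChannel channel = FileChannel.open(path, StandardOpenOption.READ);
            channel.position(offset);
            return new BoundedStream(Channels.newInputStream(channel), length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Caps reads at the part length so the body matches the Content-Length header.
    private static final class BoundedStream extends InputStream {
        private final InputStream in;
        private long remaining;

        BoundedStream(InputStream in, long remaining) {
            this.in = in;
            this.remaining = remaining;
        }

        @Override
        public int read() throws IOException {
            if (remaining <= 0) {
                return -1;
            }
            int b = in.read();
            if (b >= 0) {
                remaining--;
            }
            return b;
        }

        @Override
        public void close() throws IOException {
            in.close();
        }
    }
}
```

Something of this shape could then be passed to the SDK via RequestBody.fromContentProvider(provider::newStream, length, "application/octet-stream"), so that the retry machinery gets a fresh stream per attempt.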
There is clearly also something failing silently in the SDK on its second and subsequent attempts. The client raises no error about having an insufficient quantity of data; it merely reattempts the POST but supplies fewer bytes than the Content-Length header declares.
Additional Information/Context
No response
AWS Java SDK version used
2.25.53
JDK version used
openjdk version "1.8.0_362"
OpenJDK Runtime Environment (Zulu 8.68.0.21-CA-macos-aarch64) (build 1.8.0_362-b09)
OpenJDK 64-Bit Server VM (Zulu 8.68.0.21-CA-macos-aarch64) (build 25.362-b09, mixed mode)
Operating System and version
macos 23.5.0 Darwin Kernel Version 23.5.0: Wed May 1 20:12:58 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T6000 arm64