HADOOP-19654. Upgrade AWS SDK to 2.35.4 #7882
I found |
|
@pan3793 maybe. What is unrelated is that, out of the box, the SDK doesn't do bulk delete with third party stores which support it (Dell ECS).
|
@pan3793 no, it's lifecycle related. The test needs to set up that minicluster before the test cases, and that's somehow not happening.
5b9a7e3 to
efd34a0
Compare
|
regressions everywhere

No logging. Instead we get

more on this once I've looked at it. If it is an SDK issue, major regression, though it may be something needing changes in the aal library

s3 express

assumption: now that the store has lifecycle rules, you don't get prefix listings when there's an in-progress upload. Fix: change test but also path capability warning of inconsistency. this is good.

Operation costs/auditing count an extra HTTP request, so cost tests fail. I suspect it is always calling CreateSession, but without logging can't be sure.
efd34a0 to
6a7e6d9
Compare
6a7e6d9 to
cc31e5b
Compare
|
💔 -1 overall
This message was automatically generated. |
cc31e5b to
3351e41
Compare
|
Thanks @steveloughran, PR looks good overall. Are the failures in |
  // disable create session so there's no need to
  // add a role policy for it.
- disableCreateSession(conf);
+ //disableCreateSession(conf);
nit: can just cut this instead of commenting it out, since we're skipping these tests if S3 Express is enabled
  // close the stream, should throw RemoteFileChangedException
  RemoteFileChangedException exception = intercept(RemoteFileChangedException.class, stream::close);
- assertS3ExceptionStatusCode(SC_412_PRECONDITION_FAILED, exception);
+ verifyS3ExceptionStatusCode(SC_412_PRECONDITION_FAILED, exception);
do you know what the difference is with the other tests here?
As in, why with S3 Express is it ok to assert that we'll get a 412, whereas the other tests will throw a 200?
Hey, it's your server code. Go see.
checked: the answer is that it's MPU that has the divergence; put object, which is what these tests do, will return 412
|
💔 -1 overall
This message was automatically generated. |
|
I've attached a log of a test run against an s3 express bucket where the test The relevant stuff is at line 564, where a HEAD request fails because the stream is broken. The second request always works. Either the request is being rejected (why?) or the connection has gone stale. But why should it happen at exactly the same place on every single test run? org.apache.hadoop.fs.s3a.statistics.ITestAWSStatisticCollection-output.txt |
|
💔 -1 overall
This message was automatically generated. |
|
@steveloughran discovered completely by accident, but it's something to do with the checksumming code. If you comment out these lines: the test will pass. Could be something to do with s3Express not supporting md5, will look into it. |
|
Specifically, it's this line: Comment that out, or change it to My guess is it's something to do with S3 express not supporting MD5, but for operations where Have asked the SDK team. |
|
ok, so maybe if we don't do the legacy MD5 plugin stuff for s3express stores, all is good?
While on the topic of S3 Express, is it now the case that, because there are lifecycle rules for cleanup, LIST calls don't return prefixes of paths with incomplete uploads? If so I will need to change production code and the test, with a separate JIRA for that for completeness |
@steveloughran confirming with the SDK team, since the MD5 plugin is supposed to restore previous behaviour, the server rejecting the first request seems wrong. let's see what they have to say.
Will check with S3 express team on this |
661dc6e to
aa8e814
Compare
|
thanks. I don't see it on tests against s3 with the 2.29.52 release, so something is changing with the requests made with new SDK + MD5 stuff. |
|
@steveloughran not able to narrow this error down just yet, it looks like it's a combination of S3A's configuration of the S3 client + these new MD5 changes. I see the failure when using the S3A client, and don't see it when I use a newly created client. So it's not just because of Looking into it some more. S3 express team said there have been no changes in LIST behaviour. |
|
able to reproduce the issue outside of S3A. Basically did what would happen when you run a test in S3A:
The head fails, but if you comment out no idea what's going on. but have shared this local reproduction with SDK team. And rules out that it's something in the S3A code. |
* Now need to explicitly turn off checksum validation on downloads (slow)
* Default fs.s3a.create.checksum.algorithm is "" again: nothing. Docs updated to try and explain this.
149e982 to
6416c20
Compare
|
Seeing
This doesn't happen standalone. In the IDE I get java8 errors, probably need to log out and log in again now I've switched my default jvm to 17. I'm not worrying about this. |
* fs.s3a.ext.multipart.commit.consumes.upload.id => fs.s3a.ext.test.multipart.commit.consumes.upload.id
  Makes clear it is for testing and not relevant in production.
* remove some whitespace
* declare "auto", "sdk" and "ec2" as reserved regions.
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/testing.md
|
test failure with ITestConnectionTimeouts and store set to use analytics stream. Fix: make sure we only use classic stream here. |
|
I think I'm done here, @mukund-thakur and @ahmarsuhail .... Testing against corner case deployments is finding corners of test configurations, not actual code failures. I am getting failures of some MR jobs since the JDK/junit updates, with no obvious cause. |
5b4114a to
ff3fade
Compare
|
everything running an MR job can't spawn processes properly, as the launched JVM is always java8. One regression is a JUnit5 regression: the configurable timeouts of scale tests are no longer being picked up, so slow tests are timing out |
|
The intermittent test failures happen on trunk when running with java17; it's related to the parallel test runner. I am not investigating it here. |
ahmarsuhail
left a comment
+1, LGTM
Thanks @steveloughran! good to see the test stabilisation changes
|
|
||
| <property> | ||
| <name>fs.s3a.request.md5.header</name> | ||
| <value>false</value> |
confused - I thought this should be true, otherwise the SDK won't generate the MD5's, which was causing the compatibility issues with third party stores?
yes, you are right. somehow got that wrong in my port. will fix
|
|
||
| ```xml | ||
| <property> | ||
| <name>fs.s3a.region</name> |
cut this
+ added that "null" is also a choice of region name to avoid.
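To illustrate the reserved region names discussed in this review, here is a minimal sketch of a check for them. It is not the S3A implementation: the class `ReservedRegionSketch` and the method `isReservedOrDiscouraged` are hypothetical names, and matching that ignores case and surrounding whitespace is an assumption, not something stated in the PR.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper for illustration; not code from the S3A connector. */
final class ReservedRegionSketch {

  /** Region names the PR declares as reserved. */
  private static final Set<String> RESERVED =
      Collections.unmodifiableSet(new HashSet<>(Arrays.asList("auto", "sdk", "ec2")));

  /** The review adds "null" as a region name to avoid. */
  private static final String DISCOURAGED = "null";

  private ReservedRegionSketch() {
  }

  /**
   * Returns true if the name is reserved or discouraged.
   * Assumption: the comparison ignores case and surrounding whitespace.
   */
  static boolean isReservedOrDiscouraged(String region) {
    if (region == null) {
      return false;
    }
    String name = region.trim().toLowerCase(Locale.ROOT);
    return RESERVED.contains(name) || DISCOURAGED.equals(name);
  }

  public static void main(String[] args) {
    for (String name : new String[] {"auto", "sdk", "ec2", "null", "eu-west-1"}) {
      System.out.println(name + " -> " + isReservedOrDiscouraged(name));
    }
  }
}
```

A check like this could be used in test setup or docs tooling to flag a region value that would not be treated as a literal AWS region name.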
mukund-thakur
left a comment
i was running tests locally. All good other than 2 failures.
//when to calculate request checksums.
final RequestChecksumCalculation checksumCalculation =
    parameters.isChecksumCalculationEnabled()
        ? RequestChecksumCalculation.WHEN_SUPPORTED
I was thinking if we could have some docs around the WHEN_SUPPORTED and WHEN_REQUIRED
It's confusing. What happens if it is required but not supported?
some operations require checksums (bulk delete?) and everything which implemented them has had to expect checksums. This new generation option, "when supported" is what broke things as it really means "generate checksums on all requests". There are only two values in the enum, so the sdk always has to choose one.
when_supported
- doesn't work for most third party stores
- seems to break MPUs if you don't set a content checksum for put/posted data.
I think having a generation "true/false" is simpler for people to understand than the nuances of when_supported vs when_required.
Yes, it should be just true/false. @ahmarsuhail could you please talk to the SDK team about this? Why did they do it this way?
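For readers following the WHEN_SUPPORTED / WHEN_REQUIRED thread, a minimal sketch of the true/false mapping being discussed. The enum `ChecksumCalculationSketch` is a local stand-in for the SDK's `RequestChecksumCalculation` so the example runs without the SDK on the classpath; `ChecksumModeSketch.fromFlag` is a hypothetical name. The `false` branch is inferred from the enum having only two values, since the diff excerpt above is cut off before it.

```java
/** Local stand-in for the SDK enum; it has the same two values named in the thread. */
enum ChecksumCalculationSketch {
  /** Per the thread: in practice, generate checksums on all requests. */
  WHEN_SUPPORTED,
  /** Only generate checksums for operations which require them. */
  WHEN_REQUIRED
}

/** Hypothetical helper for illustration; not code from the S3A connector. */
final class ChecksumModeSketch {

  private ChecksumModeSketch() {
  }

  /**
   * Maps a simple "generate checksums" boolean onto the two-valued enum.
   * Assumption: false maps to WHEN_REQUIRED, as it is the only other value.
   */
  static ChecksumCalculationSketch fromFlag(boolean generationEnabled) {
    return generationEnabled
        ? ChecksumCalculationSketch.WHEN_SUPPORTED
        : ChecksumCalculationSketch.WHEN_REQUIRED;
  }

  public static void main(String[] args) {
    System.out.println("generation=true  -> " + fromFlag(true));
    System.out.println("generation=false -> " + fromFlag(false));
  }
}
```

Exposing a boolean keeps the user-facing choice simple while the SDK still receives one of its two enum values.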
|
this can be ignored.
this one I am wondering what is going on |
AWS SDK upgraded to 2.35.4.

This SDK has changed checksum/checksum headers handling significantly, causing problems with third party stores, and, in some combinations, AWS S3 itself.

The S3A connector has retained old behavior; options to change these settings are now available.

The default settings are chosen for maximum compatibility and performance.

fs.s3a.request.md5.header: true
fs.s3a.checksum.generation: false
fs.s3a.create.checksum.algorithm: ""

Consult the documentation for more details.

Contributed by Steve Loughran
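As a rough illustration of the defaults listed in the commit message, the sketch below records the three option names and values in a plain map and reports which of them a site configuration changes. The map is a stand-in for a Hadoop `Configuration` so the example runs with only the JDK; `S3AChecksumDefaultsSketch`, `defaults` and `overrides` are hypothetical names, not part of S3A.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration; a plain map stands in for a Hadoop Configuration. */
final class S3AChecksumDefaultsSketch {

  private S3AChecksumDefaultsSketch() {
  }

  /** The default values as listed in the commit message. */
  static Map<String, String> defaults() {
    Map<String, String> conf = new LinkedHashMap<>();
    conf.put("fs.s3a.request.md5.header", "true");
    conf.put("fs.s3a.checksum.generation", "false");
    conf.put("fs.s3a.create.checksum.algorithm", "");
    return conf;
  }

  /** Returns the options which the site settings set to a non-default value. */
  static Map<String, String> overrides(Map<String, String> site) {
    Map<String, String> changed = new LinkedHashMap<>();
    for (Map.Entry<String, String> entry : defaults().entrySet()) {
      String value = site.get(entry.getKey());
      if (value != null && !value.equals(entry.getValue())) {
        changed.put(entry.getKey(), value);
      }
    }
    return changed;
  }

  public static void main(String[] args) {
    defaults().forEach((key, value) -> System.out.println(key + " = \"" + value + "\""));
  }
}
```

Listing overrides this way is one option for spotting a deployment that has moved away from the compatibility-oriented defaults before debugging a checksum failure.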
was looking at this in the regions patch as it fails for sdk and ec2 regions. we are trying to issue a create command and need to know the bucket region for the call. The test will have to explicitly ask for it via a HEAD call. (my pr currently skips the test if the region is sdk or ec2, as well as the existing non-aws/non s3-express options). |
|
reading the stack trace, the reason for the failure is that both the s3client and the create bucket request should have the same configured region. But even when setting it and verifying that propagation is happening correctly, it fails for the same reason. Yes, just accepting both errors will make the test pass, but I am still wondering what is going on. |
How was this patch tested?
Testing in progress; still trying to get the ITests working.
JUnit5 update complicates things here, as it highlights that minicluster tests aren't working.
For code changes:
LICENSE,LICENSE-binary,NOTICE-binaryfiles?