Skip to content

Add LegacyMd5Plugin for MD5 checksum calculations in S3 operations requiring checksums #6055

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Apr 28, 2025

Conversation

joviegas
Copy link
Contributor

@joviegas joviegas commented Apr 24, 2025

Motivation and Context

#5802 (comment)

  • The recent S3 changes introduced challenges for customers using S3A (Apache Spark, Iceberg) with various third-party storage providers.

  • While some providers like MinIO have fixes available, many customers use different storage solutions ( or cannot easily upgrade their third-party stores.

  • This plugin provides a backwards-compatible solution allowing customers to maintain MD5 checksum calculations where required while adopting S3 Motorcade features.

  • Note: This Plugin adds MD5 checksum for operations which require checksums along with the default checksum provided by SDK for the operation which require checksum .

 // For synchronous S3 client
 S3Client s3Client = S3Client.builder()
                            .addPlugin(LegacyMd5Plugin.create())
                            .build();

 // For asynchronous S3 client
 S3AsyncClient asyncClient = S3AsyncClient.builder()
                                         .addPlugin(LegacyMd5Plugin.create())
                                         .build();

If you want to add MD5 checksums to the operations that require checksums and want to skip adding of SDK Default checksums for operations that support checksums but not required, then you can enable ClientBuilder options requestChecksumCalculation and responseChecksumValidation as WHEN_REQUIRED, this will add SDK default checksums only to operation that required checksums

// Use LegacyMd5Plugin with requestChecksumCalculation and responseChecksumValidation set to WHEN_REQUIRED
S3AsyncClient asyncClient = S3AsyncClient.builder()
                                         .addPlugin(LegacyMd5Plugin.create())
                                         .requestChecksumCalculation(RequestChecksumCalculation.WHEN_REQUIRED)
                                         .responseChecksumValidation(ResponseChecksumValidation.WHEN_REQUIRED)
                                         .build();

Modifications

Testing

  • Junits

  • One-off integs

  • A Test class was created with following APIs

    public void runAllTests() {

            createBucket();
            putObject();
            listObjects();
            getObject();
            deleteObjects();
            deleteBucket();
            System.out.println("All tests completed successfully!");
    }

Tested with minio version RELEASE.2025-01-18T00-31-37Z (commit-id=4b6eadbd80313711c01039bfa9a05167291a8c50) where the issue existed and all test passed , below config were tested

** With LegacyMd5Plugin **

        S3Client s3Client = S3Client.builder()
            .addPlugin(LegacyMd5Plugin.create())
            .endpointOverride(URI.create("http://127.0.0.1:9000"))
            .credentialsProvider(StaticCredentialsProvider.create(
                         AwsBasicCredentials.create("minioadmin", "minioadmin")))
                          .forcePathStyle(true) // Required for MinIO
                           .build();

** With LegacyMd5Plugin with ChecksumCalculation and ChecksumValidation as when required**

        S3Client s3Client = S3Client.builder()
            .addPlugin(LegacyMd5Plugin.create())
            .requestChecksumCalculation(RequestChecksumCalculation.WHEN_REQUIRED)
            .responseChecksumValidation(ResponseChecksumValidation.WHEN_REQUIRED)
            .endpointOverride(URI.create("http://127.0.0.1:9000"))
            .credentialsProvider(StaticCredentialsProvider.create(
                         AwsBasicCredentials.create("minioadmin", "minioadmin")))
                          .forcePathStyle(true) // Required for MinIO
                           .build();

** Also tested the above test cases and configutation with a real bucket in S3 and test passed **

Screenshots (if appropriate)

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)

License

  • I confirm that this pull request can be released under the Apache 2 license

@joviegas joviegas requested a review from a team as a code owner April 24, 2025 21:39
@joviegas joviegas force-pushed the joviegas/legacy_md5_pluggin branch from 32af381 to aae6a51 Compare April 24, 2025 22:54
@joviegas joviegas changed the title LegacyMd5Plugin which adds MD5 checksum for operations that require i… Add LegacyMd5Plugin for MD5 checksum calculations in S3 operations requiring checksums Apr 24, 2025
Copy link
Contributor

@L-Applin L-Applin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

architecture-tests fails because HttpChecksumUtils (intenal API in sdk-core) is used in another package (s3). We should either mark this utils class protected api or update the archunit store.

@Override
public void configureClient(SdkServiceClientConfiguration.Builder config) {
S3ServiceClientConfiguration.Builder s3Config = (S3ServiceClientConfiguration.Builder) config;
s3Config.responseChecksumValidation(ResponseChecksumValidation.WHEN_REQUIRED);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we not set it if responseChecksumValidation is already configured? Same for RequestChecksumCalculation (I missed it when I wrote the sample code😅)

Copy link
Contributor Author

@joviegas joviegas Apr 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you you please elaborate , the default values of RequestChecksumCalculation and ResponseChecksumValidation is WHEN_SUPPORTED thus this values will always have some values.

Can you please help me with a Scenario where this additional check might be helpful ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought, I believe we should not be doing:

s3Config.responseChecksumValidation(ResponseChecksumValidation.WHEN_REQUIRED);
s3Config.requestChecksumCalculation(RequestChecksumCalculation.WHEN_REQUIRED);
        

This is to maintain the single responsibility of the LegacyMd5Plugin, which is solely added to add the MD5 header.

The responseChecksumValidation and requestChecksumCalculation should instead be set on the client or in the environment variables. What do you think?

*
* <p>Use this plugin only when you need to maintain compatibility with applications that depend on the
* legacy MD5 checksum behavior, particularly for operations that previously calculated MD5 checksums
* automatically.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add sample code of how customers can configure it on the client with @snippet?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@joviegas joviegas force-pushed the joviegas/legacy_md5_pluggin branch from 1a2bace to a7587ea Compare April 25, 2025 16:39
@joviegas
Copy link
Contributor Author

architecture-tests fails because HttpChecksumUtils (intenal API in sdk-core) is used in another package (s3). We should either mark this utils class protected api or update the archunit store.

Done , removed the HttpChecksumUtils apiand made a inline.

@joviegas
Copy link
Contributor Author

joviegas commented Apr 25, 2025

architecture-tests fails because HttpChecksumUtils (intenal API in sdk-core) is used in another package (s3). We should either mark this utils class protected api or update the archunit store.

Done , I removed the HttpChecksumUtils api and made a inline. Lets see if it passes.

*
* <p>This plugin configures the S3 client to:
* <ul>
* <li>Set request checksum calculation to WHEN_REQUIRED mode, which calculates default checksums only when
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we need to update this documentation since we removed configuration

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

*
* // For asynchronous S3 client
* S3AsyncClient asyncClient = S3AsyncClient.builder()
* .addPlugin(create())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LegacyMd5Plugin.create(). Can we configure the request checksum calculation and response checksum validation to when_required in the code snippet?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@joviegas joviegas requested review from zoewangg and L-Applin April 28, 2025 15:12
@joviegas joviegas enabled auto-merge April 28, 2025 16:15
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a test cases with the plugin when user manually set md5 on the request body?

Copy link
Contributor Author

@joviegas joviegas Apr 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a test case both for putObject and deleteObjects

Copy link

Quality Gate Failed Quality Gate failed

Failed conditions
64.1% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube Cloud

@joviegas joviegas added this pull request to the merge queue Apr 28, 2025
Merged via the queue into master with commit 4bf1f8f Apr 28, 2025
17 of 19 checks passed
@steveloughran
Copy link

thank you for this!

have you tried testing it against a third party store? google cloud is easy to set up & documented in the s3a thirdparty and qualification docs. Even if you only run it weekly, it'd restore my confidence in the SDK changes.

thanks

wendigo added a commit to trinodb/trino that referenced this pull request Apr 30, 2025
wendigo added a commit to trinodb/trino that referenced this pull request Apr 30, 2025
@joviegas
Copy link
Contributor Author

joviegas commented Apr 30, 2025

thank you for this!

have you tried testing it against a third party store? google cloud is easy to set up & documented in the s3a thirdparty and qualification docs. Even if you only run it weekly, it'd restore my confidence in the SDK changes.

thanks

HI Steve ,

  • I tested it with minio version RELEASE.2025-01-18T00-31-37Z where this issue was easily repeatable.
  • let me do a one-off test with google cloud , thanks for the reference
  • run it weekly will keep this in consideration and discuss about it (might take some time) .

wendigo added a commit to trinodb/trino that referenced this pull request May 1, 2025
wendigo added a commit to trinodb/trino that referenced this pull request May 1, 2025
@uwolfer
Copy link

uwolfer commented May 14, 2025

Has somebody already successfully tested this change against GCP? I have enabled the new plugin, but putObject still fails with a SignatureDoesNotMatch error while it works with the same code with 2.29.52.

You can also see some sample code in the related issue: #5987

@joviegas
Copy link
Contributor Author

joviegas commented May 19, 2025

Hi @uwolfer ,
I have only tested with minio and it worked fine.

Also along with the pluggin could you please make sure ChecksumCalculation option are set as below

       S3Client s3Client = S3Client.builder()
            .addPlugin(LegacyMd5Plugin.create())
            .requestChecksumCalculation(RequestChecksumCalculation.WHEN_REQUIRED)
            .responseChecksumValidation(ResponseChecksumValidation.WHEN_REQUIRED)

If issue stil persist can you please check wire logs and check good and bad case Wire logs ?
I spent some time using gcp but it was giving me lot of authentication issues.
Can you please share some simple docs/link/steps which can help me to do a quick test with GCP.
@bhoradc did you get a chance to check it with GCP.

@joviegas joviegas deleted the joviegas/legacy_md5_pluggin branch May 29, 2025 15:05
@uwolfer
Copy link

uwolfer commented May 30, 2025

Also along with the pluggin could you please make sure ChecksumCalculation option are set as below

@joviegas Is it really the idea to disable optional checksum checks? If I remember correctly, it did even work without adding the new LegacyMd5Plugin plugin that way. But I would feel more comfortable if we could keep the checks as they were before 2.30.

Can you please share some simple docs/link/steps which can help me to do a quick test with GCP.

Here you can find a simple code snippet to test it against GCP (just insert the access key id and secret and pass some random content): #5987 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants