Conversation


@millems millems commented Jan 29, 2025

This error is sometimes encountered when a customer uses:

  1. The AsyncS3Client
  2. A ChecksumAlgorithm of SHA1 or SHA256 (instead of the default CRC32)
  3. Parallel uploads

The root cause was that the SDK used thread locals to cache the SHA1 or SHA256 message digest implementations. This meant that if a single event loop thread was processing multiple requests, those requests would share the same digest instance when calculating their checksums.

This PR updates the SHA1, SHA256, and MD5 digests (though MD5 isn't used by S3) to use a LIFO cache instead of thread locals.
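To make the failure mode concrete, here is a minimal sketch (not the SDK's actual code; class and variable names are made up) of how a ThreadLocal-cached digest ends up shared by two requests that an async client multiplexes onto the same event loop thread:

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class SharedDigestSketch {
    // Hypothetical per-thread cache, similar in spirit to the pre-fix behavior.
    private static final ThreadLocal<MessageDigest> SHA256 = ThreadLocal.withInitial(() -> {
        try {
            return MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    });

    public static void main(String[] args) {
        MessageDigest digestForRequestA = SHA256.get();
        MessageDigest digestForRequestB = SHA256.get(); // same instance on the same thread

        // The async client interleaves the two uploads' chunks on one event loop thread,
        // so request B's bytes get mixed into request A's running digest.
        digestForRequestA.update("chunk-from-request-A".getBytes());
        digestForRequestB.update("chunk-from-request-B".getBytes());

        byte[] checksumA = digestForRequestA.digest(); // covers bytes from both requests -> checksum mismatch
        System.out.println(checksumA.length);
    }
}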

@millems millems requested a review from a team as a code owner January 29, 2025 00:43

// Avoid over-caching after large traffic bursts. The maximum chosen here is arbitrary. It's also not strictly
// enforced, since these statements aren't synchronized.
if (digestCache.size() <= MAX_CACHED_DIGESTS) {
Contributor

size() may be expensive; should we track the size using an atomic integer instead?


How would you coordinate the atomic size and the concurrent deque?

Contributor Author

I'll switch to a linked blocking deque. It uses locking, but it's still likely faster than creating new checksums, and it has a constant-time size() method. We can do benchmarking later to verify this is the fastest approach.

Is that reasonable?
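For reference, a rough sketch of the bounded LIFO cache shape being discussed here, assuming a capacity-bounded LinkedBlockingDeque so that size is tracked internally and offerFirst() silently drops the digest when the cache is full (class and method names are illustrative, not the SDK's):

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.concurrent.LinkedBlockingDeque;

final class DigestCacheSketch {
    private static final int MAX_CACHED_DIGESTS = 10_000; // illustrative bound

    // Capacity-bounded deque: size() is constant-time, and offerFirst() returns false when full.
    private final LinkedBlockingDeque<MessageDigest> digestCache =
        new LinkedBlockingDeque<>(MAX_CACHED_DIGESTS);
    private final String algorithmName;

    DigestCacheSketch(String algorithmName) {
        this.algorithmName = algorithmName;
    }

    MessageDigest acquire() {
        MessageDigest cached = digestCache.pollFirst(); // LIFO: reuse the most recently released digest
        if (cached != null) {
            return cached;
        }
        try {
            return MessageDigest.getInstance(algorithmName);
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException("Unable to fetch message digest instance for Algorithm "
                                       + algorithmName + ": " + e.getMessage(), e);
        }
    }

    void release(MessageDigest digest) {
        digest.reset();
        // Drop this digest if the cache is full.
        digestCache.offerFirst(digest);
    }
}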

DigestThreadLocal(String algorithmName) {
this.algorithmName = algorithmName;
/**
* Retrieve the message digest bytes. This will close the message digest when invoked. This is because the underlying
Contributor

Question: where do we close messageDigest in this method?

Contributor Author

Good catch, that needs a test added.
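For context, a hedged sketch of the contract that javadoc describes: the wrapper name mirrors the CloseableMessageDigest seen elsewhere in this PR, but the body is illustrative, not the SDK's implementation.

import java.security.MessageDigest;

// Illustrative wrapper: reading the digest bytes also closes the handle, because
// MessageDigest.digest() resets the underlying digest and it must not be reused by this caller.
final class CloseableMessageDigestSketch implements AutoCloseable {
    private final MessageDigest messageDigest;
    private boolean closed;

    CloseableMessageDigestSketch(MessageDigest messageDigest) {
        this.messageDigest = messageDigest;
    }

    void update(byte[] bytes) {
        if (closed) {
            throw new IllegalStateException("Message digest is already closed.");
        }
        messageDigest.update(bytes);
    }

    byte[] digest() {
        byte[] result = messageDigest.digest(); // resets the underlying digest
        close(); // this is the point the review comment is about: close() must actually be called here
        return result;
    }

    @Override
    public void close() {
        closed = true; // in the real class this is where the digest would be released back to the cache
    }
}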


throw new RuntimeException("Unable to fetch message digest instance for Algorithm "
+ algorithmName + ": " + e.getMessage(), e);
return new CloseableMessageDigest((MessageDigest) digest.get().clone());
} catch (CloneNotSupportedException e) { // should never occur

Any time a comment says "should never occur", it seems to happen. Why can't this method declare that it throws CloneNotSupportedException instead?

Contributor Author

CloneNotSupportedException is a checked exception. Because the SDKs don't throw checked exceptions, the callers would need to wrap it in an unchecked exception themselves.

Contributor Author

I've improved the exception message if this scenario does happen.
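To illustrate the pattern being discussed (a sketch with an illustrative message, not the SDK's exact wording), cloning the digest and wrapping the checked CloneNotSupportedException in an unchecked exception looks roughly like this:

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class DigestCloneSketch {
    static MessageDigest cloneDigest(MessageDigest prototype) {
        try {
            // MessageDigest instances from the common JDK providers support clone().
            return (MessageDigest) prototype.clone();
        } catch (CloneNotSupportedException e) {
            // Wrap the checked exception so SDK callers aren't forced to declare or catch it.
            throw new IllegalStateException("Message digest implementation for " + prototype.getAlgorithm()
                                            + " does not support cloning.", e);
        }
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        MessageDigest prototype = MessageDigest.getInstance("SHA-256");
        MessageDigest copy = cloneDigest(prototype);
        System.out.println(copy.getAlgorithm());
    }
}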

Comment on lines 106 to 107
// Avoid over-caching after large traffic bursts. The maximum chosen here is arbitrary. It's also not strictly
// enforced, since these statements aren't synchronized.
Contributor

Would it be possible to have a "prefilled" cache of lazily loaded digests, and then, when we can't use a cached digest, hand out a different instance that just closes itself and doesn't need to interact with the cache?

Contributor Author

That's an option. It might be trickier to decide on the size of the cache, and it would mean that failing to release a message digest back to the cache has long-term performance implications if it's one of those special "cached" digests. The advantage of this implementation is that the odd error that fails to release a digest back to the cache doesn't really hurt.

digestCache.addFirst(digest.get());
}
// Drop this digest if the cache is full.
digestCache.offerFirst(digest.get());
Contributor

Nice
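The switch above matters because, on a capacity-bounded LinkedBlockingDeque, addFirst() throws IllegalStateException when the deque is full, whereas offerFirst() just returns false and the element is dropped. A quick standalone illustration:

import java.util.concurrent.LinkedBlockingDeque;

public class OfferVsAddSketch {
    public static void main(String[] args) {
        LinkedBlockingDeque<String> deque = new LinkedBlockingDeque<>(1); // capacity of one
        deque.offerFirst("first");

        boolean accepted = deque.offerFirst("second"); // returns false; the element is silently dropped
        System.out.println(accepted);                  // false

        try {
            deque.addFirst("third");                   // throws because the deque is full
        } catch (IllegalStateException e) {
            System.out.println("addFirst rejected: " + e.getMessage());
        }
    }
}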


@millems millems added this pull request to the merge queue Jan 29, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 29, 2025
@millems millems added this pull request to the merge queue Jan 29, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 29, 2025
@millems millems added this pull request to the merge queue Jan 30, 2025
Merged via the queue into master with commit c137348 Jan 30, 2025
18 checks passed
ghetelgb pushed a commit to ghetelgb/aws-sdk-java-v2 that referenced this pull request Feb 7, 2025
…s. (aws#5836)

* Fixed an issue that could cause checksum mismatch errors in S3 uploads.

This error is sometimes encountered when a customer uses:
1. The AsyncS3Client
2. A ChecksumAlgorithm of SHA1 or SHA256 (instead of the default CRC32)
3. Parallel uploads

The root cause was the SDK using thread locals to cache the SHA1 or SHA256 message digest implementations. This meant that if a single event loop thread was processing multiple requests, those two requests would use the same digest implementation to calculate the checksum.

* Fixed an issue where SignerUtils was using the internal class of another module.

* Update clone() failure message. Move to LinkedBlockingDeque for DigestAlgorithm cache.

* Added direct testing of the DigestAlgorithm cache.

* Fixed a transient issue that could occur when the first attempt fully sends the payload, and then a retry happens with checksums enabled.
imsayari404 added a commit to imsayari404/presto that referenced this pull request Jul 23, 2025
Disable checksum validation and chunked encoding to ensure compatibility
with MinIO and newer AWS SDK versions per aws/aws-sdk-java-v2#5836
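For readers arriving from those downstream changes, the client configuration being referenced typically looks something like the sketch below. This is an assumption about the shape of such a change, not the actual Presto code: the endpoint is a placeholder, and the requestChecksumCalculation/responseChecksumValidation builder options require AWS SDK for Java 2.30+.

import java.net.URI;

import software.amazon.awssdk.core.checksums.RequestChecksumCalculation;
import software.amazon.awssdk.core.checksums.ResponseChecksumValidation;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.S3Configuration;

public class MinioCompatibleS3ClientSketch {
    public static S3Client build() {
        return S3Client.builder()
                       .region(Region.US_EAST_1)
                       .endpointOverride(URI.create("http://localhost:9000")) // placeholder MinIO endpoint
                       .forcePathStyle(true)
                       // Only compute/validate checksums when an operation requires them.
                       .requestChecksumCalculation(RequestChecksumCalculation.WHEN_REQUIRED)
                       .responseChecksumValidation(ResponseChecksumValidation.WHEN_REQUIRED)
                       // Disable aws-chunked content encoding.
                       .serviceConfiguration(S3Configuration.builder()
                                                            .chunkedEncodingEnabled(false)
                                                            .build())
                       .build();
    }
}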
imsayari404 added a commit to imsayari404/presto that referenced this pull request Jul 29, 2025
Disable checksum validation and chunked encoding to ensure compatibility
with MinIO and newer AWS SDK versions per aws/aws-sdk-java-v2#5836
imsayari404 added a commit to imsayari404/presto that referenced this pull request Aug 11, 2025
Disable checksum validation and chunked encoding to ensure compatibility
with MinIO and newer AWS SDK versions per aws/aws-sdk-java-v2#5836
imsayari404 added a commit to imsayari404/presto that referenced this pull request Aug 18, 2025
Disable checksum validation and chunked encoding to ensure compatibility
with MinIO and newer AWS SDK versions per aws/aws-sdk-java-v2#5836
imsayari404 added a commit to imsayari404/presto that referenced this pull request Aug 20, 2025
Disable checksum validation and chunked encoding to ensure compatibility
with MinIO and newer AWS SDK versions per aws/aws-sdk-java-v2#5836
hantangwangd pushed a commit to prestodb/presto that referenced this pull request Aug 20, 2025
Disable checksum validation and chunked encoding to ensure compatibility
with MinIO and newer AWS SDK versions per aws/aws-sdk-java-v2#5836
lga-zurich pushed a commit to lga-zurich/presto-exchange that referenced this pull request Sep 2, 2025
Disable checksum validation and chunked encoding to ensure compatibility
with MinIO and newer AWS SDK versions per aws/aws-sdk-java-v2#5836
lga-zurich pushed a commit to lga-zurich/presto-exchange that referenced this pull request Sep 8, 2025
Disable checksum validation and chunked encoding to ensure compatibility
with MinIO and newer AWS SDK versions per aws/aws-sdk-java-v2#5836