Contributor

@L-Applin L-Applin commented Sep 16, 2025

Implement parallel download for multipart GetObject in the S3 Async Client and Transfer Manager.

Modifications

  • Add two new classes (Publisher/Subscriber) to orchestrate the non-linear multipart download: NonLinearMultipartDownloaderSubscriber and FileAsyncResponseTransformerPublisher. Note for reviewer: These two classes are the core of this PR's new functionality, and the review should probably start with them.
  • Add support in Transfer-Manager module for Transfer Progress Updater.
    • Note for reviewer: The AsyncResponseTransformer published by FileAsyncResponseTransformerPublisher needs to be wrapped so that it publishes progress to the progress updater. This is done in GenericS3TransferManager and TransferProgressUpdater.
  • New public API, as discussed during design review
    • supportNonSerial on SplitResult
    • ParallelConfiguration new config class in MultipartConfiguration for the maxInFlightParts config
  • New internal API
    • FileAsyncTransformer exposes getters for position, path and FileTransformerConfiguration

Testing

  • Added unit test
  • Added integration test
  • Manual tests using large objects

L-Applin added 28 commits July 22, 2025 18:15
…in the onResponse callback. Keep track of all inflight requests.
Contributor Author

This is a copy of the DownstreamSubscription inner class in SplittingTransformer that has been moved to its own class to be reused.

Contributor

Can we reuse the same class?

Contributor Author

@L-Applin L-Applin Sep 18, 2025

DownstreamSubscription refers to a few variables of SplittingTransformer; it is not a static inner class, so I don't think we can.

Contributor Author

@L-Applin L-Applin Sep 16, 2025

Moved to the utils module because it is used in sdk-core.

@L-Applin L-Applin marked this pull request as ready for review September 16, 2025 17:04
@L-Applin L-Applin requested a review from a team as a code owner September 16, 2025 17:04
Contributor

@zoewangg zoewangg left a comment

Still going through the PR.

/**
* Amount of demand requested but not yet fulfilled by the subscription
*/
private final AtomicInteger outstandingDemand = new AtomicInteger(0);
Contributor

Is there any reason we need to track outstanding demand in a subscriber?

Contributor Author

We track the demand that we have requested but that has not yet been fulfilled by onNext. This prevents us from requesting more than maxInFlight.
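
The demand tracking described in this reply can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: it uses the JDK's java.util.concurrent.Flow interfaces instead of the Reactive Streams types the SDK uses, and the class name BoundedDemandSubscriber, the inFlight counter and the itemCompleted() hook are hypothetical names introduced only for this example.

```java
import java.util.concurrent.Flow;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical subscriber that caps (requested + in-progress) items at maxInFlight. */
final class BoundedDemandSubscriber<T> implements Flow.Subscriber<T> {
    private final int maxInFlight;
    // Demand requested from upstream that onNext has not yet fulfilled.
    private final AtomicInteger outstandingDemand = new AtomicInteger(0);
    // Items received whose processing has not finished yet.
    private final AtomicInteger inFlight = new AtomicInteger(0);
    private Flow.Subscription subscription;

    BoundedDemandSubscriber(int maxInFlight) {
        this.maxInFlight = maxInFlight;
    }

    @Override
    public synchronized void onSubscribe(Flow.Subscription s) {
        this.subscription = s;
        requestUpToLimit();
    }

    @Override
    public synchronized void onNext(T item) {
        outstandingDemand.decrementAndGet();
        inFlight.incrementAndGet();
        // A real implementation would start the work for `item` here.
    }

    /** Call when the work started for one item has finished. */
    synchronized void itemCompleted() {
        inFlight.decrementAndGet();
        requestUpToLimit();
    }

    // Callers hold the monitor, so the check and the request happen atomically.
    private void requestUpToLimit() {
        int available = maxInFlight - inFlight.get() - outstandingDemand.get();
        if (available > 0) {
            outstandingDemand.addAndGet(available);
            subscription.request(available);
        }
    }

    int outstandingDemand() {
        return outstandingDemand.get();
    }

    @Override
    public void onError(Throwable t) { }

    @Override
    public void onComplete() { }
}
```

Without the outstandingDemand counter, a completion callback that fires before earlier requests have been delivered could ask for more items than the limit allows. The counters are AtomicInteger only to mirror the field quoted above; the synchronized methods already serialize access in this sketch.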

* is a 'one-shot' class, it should <em>NOT</em> be reused for more than one multipart download.
*/
@SdkInternalApi
public class NonLinearMultipartDownloaderSubscriber
Contributor

Nit: ParallelMultipartDownloadSubscriber?

Contributor Author

Since we use parallelSplitSupported, we can rename it to ParallelMultipartDownloadSubscriber, yeah.

return false;
}

firstPartLock.lock();
Contributor

Is there any reason we need this? We only request more in the whenComplete callback of the first part's future, right?

MPU has similar logic where we need to wait for createMPU to finish. Can we use similar logic here? https://github.com/aws/aws-sdk-java-v2/blob/master/services/s3/src/main/java/software/amazon/awssdk/services/s3/internal/multipart/UploadWithUnknownContentLengthHelper.java#L191

Contributor Author

We can't really reuse that logic because we need to wait for the first part to complete before we know whether it's a multipart object or not. Or maybe I don't exactly understand what you want to reuse; could you elaborate?
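
The ordering constraint in this reply (nothing else can be requested until the first part has completed) can be expressed with future composition instead of a lock. The sketch below only illustrates that idea and is not the PR's code: FirstPartGate and the fetchPart function are hypothetical, and fetchPart is assumed to complete with the total part count reported by the response (1 for a non-multipart object).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.IntFunction;

final class FirstPartGate {
    /**
     * Fetches part 1 first; only once it completes do we know whether
     * more parts exist, and only then are the remaining parts requested.
     */
    static CompletableFuture<List<Integer>> download(
            IntFunction<CompletableFuture<Integer>> fetchPart) {
        List<Integer> fetched = Collections.synchronizedList(new ArrayList<>());
        return fetchPart.apply(1).thenCompose(totalParts -> {
            fetched.add(1);
            if (totalParts <= 1) {
                // Not a multipart object: the first response was the whole object.
                return CompletableFuture.completedFuture(fetched);
            }
            List<CompletableFuture<Void>> rest = new ArrayList<>();
            for (int part = 2; part <= totalParts; part++) {
                int p = part;
                rest.add(fetchPart.apply(p).thenAccept(ignored -> fetched.add(p)));
            }
            return CompletableFuture.allOf(rest.toArray(new CompletableFuture[0]))
                                    .thenApply(v -> fetched);
        });
    }
}
```

This sketch requests all remaining parts at once; a real implementation would still have to cap them at the configured in-flight limit.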


private DefaultAsyncResponseTransformerSplitResult(Builder<ResponseT, ResultT> builder) {
this.publisher = Validate.paramNotNull(
builder.publisher(), "asyncResponseTransformerPublisher");
this.future = Validate.paramNotNull(
builder.resultFuture(), "future");
this.supportsNonSerial = Validate.getOrDefault(builder.supportsNonSerial(), () -> false);
Contributor

Since these classes are all related to async, and concurrency is good to have, why do we disable concurrent splits by default?

Contributor Author

Because most implementations of AsyncResponseTransformer cannot support it. We send data back to customers serially, and changing that would be a breaking change. We would also probably need to update all of our existing AsyncResponseTransformer implementations, which is out of scope.


@Override
public void onResponse(T response) {
Optional<String> contentRangeList = response.sdkHttpResponse().firstMatchingHeader("x-amz-content-range");
Contributor

Can we do something like

                String contentRange = response.sdkHttpResponse()
                    .firstMatchingHeader("x-amz-content-range")
                    .orElseThrow(() -> new IllegalStateException("Content range header is missing"));

and wrap the entire function in a try/catch:

    public void onResponse(T response) {
        try {
            // existing logic
        } catch (IllegalStateException e) {
            handleError(e.getMessage(), future);
        }
    }

    // Some common handleError function
    private void handleError(String errorMessage, CompletableFuture<T> future) {
        IllegalStateException exception = new IllegalStateException(errorMessage);
        if (subscriber != null) {
            subscriber.onError(exception);
        }
        if (future != null) {
            future.completeExceptionally(exception);
        }
    }

Contributor Author

I usually don't like using exceptions for control flow (remember the createURI thing we had a while ago), but I can use a handleError method 👍
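
One way to combine both positions (a shared error helper, but no exception thrown and caught just to steer control flow) might look like the sketch below. It is an illustration under assumptions: HeaderCheck is a hypothetical stand-in class, onResponse takes the header lookup result directly rather than an SDK response, and the future's type is simplified to String.

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;

final class HeaderCheck {
    private final CompletableFuture<String> future = new CompletableFuture<>();

    /** Stand-in for onResponse: receives the result of the header lookup. */
    void onResponse(Optional<String> contentRange) {
        if (!contentRange.isPresent()) {
            // Report the failure directly; nothing is thrown and caught.
            handleError("Content range header is missing");
            return;
        }
        future.complete(contentRange.get());
    }

    // Single place where failures are propagated to the caller.
    private void handleError(String message) {
        future.completeExceptionally(new IllegalStateException(message));
    }

    CompletableFuture<String> future() {
        return future;
    }
}
```

In the real class the helper would also signal the subscriber, as in the reviewer's suggested handleError.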


@Override
public void subscribe(Subscriber<? super AsyncResponseTransformer<T, T>> s) {
if (s == null) {
Contributor

Can we use
this.subscriber = Validate.notNull(responseTransformer.path(), "subscriber");
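
For context, the Reactive Streams specification requires subscribe to throw a NullPointerException when the subscriber is null, so a one-line null check that throws that exception type is enough. The sketch below shows the shape using only JDK types; SingleSubscriberPublisher is a hypothetical class, and Objects.requireNonNull stands in for the SDK's Validate helper.

```java
import java.util.Objects;
import java.util.concurrent.Flow;

final class SingleSubscriberPublisher<T> implements Flow.Publisher<T> {
    private Flow.Subscriber<? super T> subscriber;

    @Override
    public void subscribe(Flow.Subscriber<? super T> s) {
        // Throws NullPointerException("subscriber") when s is null.
        this.subscriber = Objects.requireNonNull(s, "subscriber");
        // Placeholder subscription; a real publisher would emit on request.
        s.onSubscribe(new Flow.Subscription() {
            @Override
            public void request(long n) { }

            @Override
            public void cancel() { }
        });
    }

    boolean hasSubscriber() {
        return subscriber != null;
    }
}
```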

* transformer will retry independently based on the retry configuration of the client it is used with. We only need to verify
* the completion state of the future of each individually
*/
private class IndividualFileTransformer implements AsyncResponseTransformer<T, T> {
Contributor

Can we move it to a separate class file?

Contributor Author

Hmm, it is convenient to have it as an inner class because it references variables of the outer class. We use a similar pattern in other places for multipart operations.


@Override
public void onResponse(T response) {
Optional<String> contentRangeList = response.sdkHttpResponse().firstMatchingHeader("x-amz-content-range");
Contributor

Since we depend on a header sent by the service, will this cause issues for third-party S3-compatible services such as MinIO or GCP? Because we fail when the header is absent, it would be good to know the impact on those services.
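
To make the dependency concrete, here is a rough sketch of reading the total object size from a content range value. It assumes the value follows the standard HTTP Content-Range form "bytes start-end/total"; ContentRangeParser is a hypothetical helper, not a class from the PR. Returning an empty Optional instead of failing leaves the decision to the caller, for example falling back to a serial download, which is an assumption of this sketch and not behavior the PR describes.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ContentRangeParser {
    // Matches values such as "bytes 0-8388607/104857600".
    private static final Pattern RANGE = Pattern.compile("bytes (\\d+)-(\\d+)/(\\d+)");

    /** Returns the total size, or empty if the header is absent or not in the expected form. */
    static Optional<Long> totalSize(Optional<String> header) {
        return header.map(RANGE::matcher)
                     .filter(Matcher::matches)
                     .map(m -> Long.parseLong(m.group(3)));
    }
}
```

A value with an unknown total, such as "bytes 0-9/*", also yields an empty result here.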

- renamed EmittingSubscription, mark it ThreadSafe
- Added comments
- some other renaming

4 participants