Describe the feature
Please give us a way to track and aggressively retry slower operations when using concurrent range requests to download single large objects ("multi-part download"), in the manner advised by the S3 user guide.
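For context, the mechanism in question is the one s3transfer already uses under the hood: split a single object into byte ranges and fetch them concurrently. A minimal boto3 sketch of that pattern (bucket, key, and sizes are placeholders):

```python
import concurrent.futures

import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "my-bucket", "my-large-object"  # placeholders
CHUNK = 64 * 1024 * 1024  # 64 MiB, matching multipart_chunksize

# Total object size, needed to compute the byte ranges.
size = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]

def fetch(offset):
    # Ranged GET for one chunk; the Range header is inclusive on both ends.
    last = min(offset + CHUNK, size) - 1
    resp = s3.get_object(Bucket=BUCKET, Key=KEY, Range=f"bytes={offset}-{last}")
    return resp["Body"].read()

with concurrent.futures.ThreadPoolExecutor(max_workers=32) as pool:
    parts = list(pool.map(fetch, range(0, size, CHUNK)))
```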
Use Case
On my EC2 instance, I'd like to use awscli to quickly fetch a small number (4) of objects, about 10 GiB in total, from a single S3 bucket in the same region.
Ideally I'd like to saturate the instance's inbound network bandwidth (bursting up to 12.5 Gb/s) and get the job done in a few seconds, though a minute would do. This latency is on a critical path for bootstrapping the instance in an auto scaling scenario; other options for getting the data onto the instance have been ruled out for independent reasons.
My objects were uploaded using multipart upload, and I've experimented with setting multipart_threshold and multipart_chunksize to 16, 32, 64, or 128 MiB. On download I set the same parameters, along with max_concurrency values like 16, 32, or 64, and I connect to a regional endpoint.
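For reference, one of the combinations I tried, in the standard ~/.aws/config form:

```ini
[default]
s3 =
  multipart_threshold = 64MB
  multipart_chunksize = 64MB
  max_concurrency = 32
```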
What I find is that the download proceeds quickly, typically reaching speeds between 150 and 250 MiB/s. That's good, but it's still nowhere near the instance's burst bandwidth limit (about 1600 MiB/s), so the process is not limited by instance network throughput. Downloading to /dev/null produces the same result, ruling out disk write throughput as well.
The bottleneck appears to be either in the S3 client or upstream in the service. On repeated attempts in a loop I do in fact see improvements, as my object's chunks make their way into hotter caches.
Looking for ideas, I went back to the S3 user guide page mentioned above and realized I had not yet considered aggressively retrying laggard requests, as it suggests. If I watch the progress meter during a download, it does indeed start strong and then deteriorate as the client runs out of fresh chunks to fetch while it waits on the slow ones. I suspect eagerly retrying slow connections could recoup 10-15% of the latency in my scenario. Since I don't seriously expect to ultimately saturate the instance's network, that would still be an interesting win.
Looking in the documentation for awscli, and ultimately at the source code for botocore and s3transfer, I could not find anywhere to set a "chunk request timeout" or a percentage of slowest requests to retry.
Proposed Solution
The policy mentioned on the page seems reasonable to me:
For latency-sensitive applications, Amazon S3 advises tracking and aggressively retrying slower operations. When you retry a request, we recommend using a new connection to Amazon S3 and performing a fresh DNS lookup.
When you make large variably sized requests (for example, more than 128 MB), we advise tracking the throughput being achieved and retrying the slowest 5 percent of the requests. When you make smaller requests (for example, less than 512 KB), where median latencies are often in the tens of milliseconds range, a good guideline is to retry a GET or PUT operation after 2 seconds. If additional retries are needed, the best practice is to back off. For example, we recommend issuing one retry after 2 seconds and a second retry after an additional 4 seconds.
If your application makes fixed-size requests to Amazon S3, you should expect more consistent response times for each of these requests. In this case, a simple strategy is to identify the slowest 1 percent of requests and to retry them. Even a single retry is frequently effective at reducing latency.
Defining "slowest" might be tricky in general, but I'm interested in the multipart upload/download scenario, where all chunks have the same known size. Projected chunk completion time, perhaps?
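To make that concrete, here is a rough sketch of the heuristic I have in mind. None of this is existing s3transfer API; all names are hypothetical:

```python
import time

def find_stragglers(inflight, slow_factor=1.5):
    """Pick in-flight chunk downloads worth re-issuing.

    `inflight` maps chunk id -> (start_time, bytes_done, chunk_size).
    Since every chunk has the same known size, the projected completion
    times are directly comparable: a chunk is a straggler when its
    projection exceeds `slow_factor` times the median projection.
    """
    now = time.monotonic()
    projections = {}
    for chunk_id, (start, done, size) in inflight.items():
        elapsed = now - start
        rate = done / elapsed if elapsed > 0 else 0.0
        # A chunk with no bytes yet projects to infinity, so connections
        # that are slow to start are the first candidates for retry.
        projections[chunk_id] = (size - done) / rate if rate else float("inf")
    finite = sorted(p for p in projections.values() if p != float("inf"))
    if not finite:
        return list(projections)
    median = finite[len(finite) // 2]
    return [cid for cid, proj in projections.items() if proj > slow_factor * median]
```

Each straggler would then be re-issued on a new connection with a fresh DNS lookup, per the quoted guidance, with whichever copy finishes first winning.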
I'm not particular about how the policy is expressed in configuration, so long as it can be quickly dropped into the config file like the other tuning parameters. If the behavior appeared but its parameters were hard-coded, that would probably also be fine.
Other Information
No response
Acknowledgements
- I may be able to implement this feature request
- This feature might incur a breaking change
CLI version used
aws-cli/1.29.75 Python/3.10.12 Linux/6.5.4-76060504-generic botocore/1.31.75
Environment details (OS name and version, etc.)
Linux