-
Notifications
You must be signed in to change notification settings - Fork 27
Feat : Multiple download functionality #271
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Feat : Multiple download functionality #271
Conversation
Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>
Avantol13
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't had time to review other Python code, will get to that soon
Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>
533d187 to
97c29eb
Compare
Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>
gen3/cli/download.py
Outdated
| @click.argument("guid") | ||
| @click.option( | ||
| "--download-path", | ||
| default=".", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's make this default more dynamic and a folder by default.
maybe let's do a timestamped folder name like:
from datetime import datetime
...
default=f"download_{datetime.now().strftime("%d_%b_%Y")}",
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do the same thing for multiple download async
gen3/cli/download.py
Outdated
| ) | ||
| @click.option( | ||
| "--max-concurrent-requests", | ||
| default=300, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is really, really high. Let's scale this default to something like 20 and people can manually bump this up as needed.
| type=int, | ||
| ) | ||
| @click.option( | ||
| "--skip-completed", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is working as intended. I get a lot of warnings when I try this:
$ poetry run gen3 -vv download-multiple-async --manifest MIDRC_case_manifest.json --download-path ./downloads --skip-completed
[2025-09-02 14:28:36,954][ DEBUG] Initializing auth..
Found 5178 files to download
Continue with async download? [y/N]: y
Downloading: 0%| | 0/5178 [00:00<?, ?it/s][2025-09-02 14:28:40,004][WARNING] File will be overwritten: downloads/10041569-u_H3HaB1lES6-HXVpiEfMA/2.16.840.1.114274.1818.48858790339993589885669552017090272896/2.16.840.1.114274.1818.56909342758958044433235758152671341713.zip
[2025-09-02 14:28:40,010][WARNING] File will be overwritten: downloads/10041569-FtpnR4GEPEe9JztuDUqrXg/2.16.840.1.114274.1818.544490635779373958610541144202466729913/2.16.840.1.114274.1818.55741368199706394947919032241968638388.zip
[2025-09-02 14:28:40,085][WARNING] File will be overwritten: downloads/10041569-CooqJZVIRkywB6_o9CJi0Q/2.16.840.1.114274.1818.567640554746210778914433462824885428113/2.16.840.1.114274.1818.523353596424499437510054502719882590104.zip
Actually... it seems like maybe it is actually skipping but we need to suppress all these warnings and/or not overwrite existing files, just skip downloading again.
gen3/cli/download.py
Outdated
| ) | ||
| @click.option( | ||
| "--skip-completed", | ||
| is_flag=True, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you can have an is_flag and default to True b/c how do I set this to false then? it just needs to be a boolean but default to true
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you could reverse the logic here and just call it: redownload-completed and then leave is_flag=True but default to false
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make sure to update docs if you change this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't change it, actually. We need to ensure we're matching previous commands exactly.
Here's the help from gen3-client:
Flags:
--download-path string The directory in which to store the downloaded files (default ".")
--filename-format string The format of filename to be used, including "original", "guid" and "combined" (default "original")
-h, --help help for download-multiple
--manifest string The manifest file to read from. A valid manifest can be acquired by using the "Download Manifest" button in Data Explorer from a data common's portal
--no-prompt If set to true, will not display user prompt message for confirmation
--numparallel int Number of downloads to run in parallel (default 1)
--protocol string Specify the preferred protocol with --protocol=s3
--rename Only useful when "--filename-format=original", will rename file by appending a counter value to its filename if set to true, otherwise the same filename will be used
--skip-completed If set to true, will check for filename and size before download and skip any files in "download-path" that matches both
Let's change the overall command name to match too. download-multiple instead of download-multiple-async
Keep skip completed but make it a boolean so I can pass in false.
I'm okay with --numparallel not existing and being different b/c the parallelization scheme in Python is a bit different... maybe we can allow that to be passed in though and just use the value as max-concurrent-requests if that isn't supplied. So people can easily port their commands over and have them work right away
docs/howto/asyncDownloadMultiple.md
Outdated
| - **High-bandwidth networks**: Increase the number of worker processes | ||
| - **Limited memory**: Reduce queue sizes to manage memory usage | ||
|
|
||
| ### Memory Management |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's remove this section, I feel it's redundant given the above config info
docs/howto/asyncDownloadMultiple.md
Outdated
| - **Batch Size**: Balance between memory usage and processing overhead | ||
| - **Process Count**: Match available CPU cores for optimal performance | ||
|
|
||
| ### Network Optimization |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's also remove this section
docs/howto/asyncDownloadMultiple.md
Outdated
|
|
||
| - Check network bandwidth and server limits | ||
| - Reduce concurrent request limits if server is overwhelmed | ||
| - Verify authentication token is valid |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
auth token would not be a reason for slow downloads, let's remove this
Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>
Avantol13
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sure to add unit tests for this new functionality
gen3/cli/__main__.py
Outdated
| main.add_command(drs_pull.drs_pull) | ||
| main.add_command(file.file) | ||
| main.add_command(download.download_single, name="download-single") | ||
| main.add_command(download.download_multiple_async, name="download-multiple-async") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| main.add_command(download.download_multiple_async, name="download-multiple-async") | |
| main.add_command(download.download_multiple, name="download-multiple") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Try to remember to retest after any changes to your code, this was broken b/c the the function doesn't exist anymore
Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>
| ] | ||
|
|
||
|
|
||
| def test_download_single_success(gen3_file): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add a docstring for each test describing what you're testing
docs/howto/asyncDownloadMultiple.md
Outdated
| gen3 --endpoint data.commons.io --auth creds.json download-multiple \ | ||
| --manifest large_dataset.json \ | ||
| --download-path ./large_downloads \ | ||
| --max-concurrent-requests 20 \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's change to:
| --max-concurrent-requests 20 \ | |
| --max-concurrent-requests 64 \ | |
| --max-concurrent-requests 8 |
Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>
Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>
Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>
Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>
Avantol13
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You will also need to update this test. I got it working:
def test_download_single_basic_functionality(gen3_file):
"""
Test download_single basic functionality with synchronous download.
Verifies that download_single downloads a file successfully using
synchronous requests and returns True.
"""
gen3_file._auth_provider._refresh_token = {"api_key": "123"}
with patch.object(gen3_file, 'get_presigned_url') as mock_presigned, \
patch('gen3.file.requests.get') as mock_get, \
patch('gen3.index.Gen3Index.get_record') as mock_index:
mock_presigned.return_value = {"url": "https://fake-url.com/file"}
mock_index.return_value = {"file_name": "test-file.txt"}
mock_response = MagicMock()
mock_response.status_code = 200
mock_response.headers = {"content-length": "12"}
mock_response.iter_content = lambda size: [b"test content"]
mock_get.return_value = mock_response
result = gen3_file.download_single(object_id="test-guid", path="/tmp", protocol="s3")
assert result == True
mock_presigned.assert_called_once_with("test-guid", protocol="s3")
mock_index.assert_called_once_with("test-guid")Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>
…github.com/Dhiren-Mhatre/gen3sdk-python into feat/multiple-download-performance-testing
| @click.option( | ||
| "--numparallel", | ||
| default=None, | ||
| help="Number of downloads to run in parallel (compatibility with gen3-client)", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| help="Number of downloads to run in parallel (compatibility with gen3-client)", | |
| help="Same as max-concurrent-requests, for compatibility with previous gen3-client arguments", |
| click.echo("\nFailed downloads:") | ||
| for failure in result["failed"]: | ||
| click.echo( | ||
| f" - {failure.get('guid', 'unknown')}: {failure.get('error', 'Unknown error')}" | ||
| ) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not working, I'm getting this:
[2026-01-21 15:19:03,719][ ERROR] Async batch download failed: 'str' object has no attribute 'get'
[2026-01-21 15:19:03,719][ ERROR] Async batch download failed: 'str' object has no attribute 'get'
if I provide the command poetry run gen3 download-multiple --manifest MIDRC_case_manifest_small.json --no-progress --no-prompt --protocol foo --skip-completed false
I suggest we remove this listing of failures entirely, as the ERROR logs themselves have the GUIDs that failed in them
| ) | ||
|
|
||
| click.echo( | ||
| "\nTo retry failed downloads, run the same command with --skip-completed flag:" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| "\nTo retry failed downloads, run the same command with --skip-completed flag:" | |
| "\nTo retry failed downloads, run the same command with --skip-completed true" |
| "\nTo retry failed downloads, run the same command with --skip-completed flag:" | ||
| ) | ||
|
|
||
| success_rate = len(result["succeeded"]) / len(manifest_data) * 100 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should rethink this because the way it's working, I get a non 100 success if I skip previously downloaded files.
$ poetry run gen3 -vv download-multiple --manifest MIDRC_case_manifest.json --no-prompt --max-concurrent-requests 64 --num-processes 8
...
Async Download Results:
✓ Succeeded: 4871
- Skipped: 307
Success rate: 94.1%
If --skip-completed is true, we should count skipped towards success imo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is even more apparent when I rerun a successful download and it skips everything
Async Download Results:
✓ Succeeded: 0
- Skipped: 42
Success rate: 0.0%
| auth = ctx.obj["auth_factory"].get() | ||
|
|
||
| # Use numparallel as max_concurrent_requests if provided (for gen3-client compatibility) | ||
| if numparallel is not None and max_concurrent_requests == 20: # 20 is the default |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
create a global for this default value, don't hard code and make a magic number
| help="Protocol for presigned URL (e.g., s3) (default: auto-detect)", | ||
| ) | ||
| @click.pass_context | ||
| def download_single( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like this support the same interface as the gen3-client, and ideally the same params (the ones that translate) from download-multiple. What we could maybe do is just wrap download-multiple behind the scenes but pass a manifest with a single GUID (e.g. don't repeat code we don't need to repeat).
As a reference, this is gen3-client's support:
$ gen3-client download-single --help
Gets a presigned URL for a file from a GUID and then downloads the specified file.
Usage:
gen3-client download-single [flags]
Examples:
./gen3-client download-single --profile=<profile-name> --guid=206dfaa6-bcf1-4bc9-b2d0-77179f0f48fc
Flags:
--download-path string The directory in which to store the downloaded files (default ".")
--filename-format string The format of filename to be used, including "original", "guid" and "combined" (default "original")
--guid string Specify the guid for the data you would like to work with
-h, --help help for download-single
--no-prompt If set to true, will not display user prompt message for confirmation
--protocol string Specify the preferred protocol with --protocol=gs
--rename Only useful when "--filename-format=original", will rename file by appending a counter value to its filename if set to true, otherwise the same filename will be used
--skip-completed If set to true, will check for filename and size before download and skip any files in "download-path" that matches both
Global Flags:
--profile string Specify profile to use
Avantol13
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in general things are looking good but see my comments. It's probably prudent to test using Python 3.13 (I did locally and had to fix a dep issue, I think just a bump, but double check that).
New Features
Dependency updates
Updated fastavro from 1.8.4 to 1.11.1
Updated pypfb to include extras: pypfb = {extras = ["gen3"], version = "^0.5.33"}
Updated importlib-metadata from 8.5.0 to 4.13.0