Skip to content

Conversation

@Dhiren-Mhatre
Copy link

New Features

  • Enhanced Asynchronous Multiple File Download: Introduces a high-performance async_download_multiple method utilizing hybrid concurrency (multiprocessing + asyncio) with memory-efficient queue-based management
  • New CLI Command: Adds gen3 download-multiple-async command with advanced options like resuming downloads (--skip-completed), flexible filename formatting, and progress bars
  • Performance Testing Suite: Added comprehensive benchmarking framework comparing the new implementation against CDIS Data Client for download speed, memory usage, and concurrency efficiency

Dependency updates
Updated fastavro from 1.8.4 to 1.11.1
Updated pypfb to include extras: pypfb = {extras = ["gen3"], version = "^0.5.33"}
Updated importlib-metadata from 8.5.0 to 4.13.0

Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>
Copy link
Contributor

@Avantol13 Avantol13 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't had time to review other Python code, will get to that soon

Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>
@Dhiren-Mhatre Dhiren-Mhatre force-pushed the feat/multiple-download-performance-testing branch from 533d187 to 97c29eb Compare August 26, 2025 13:12
Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>
@click.argument("guid")
@click.option(
"--download-path",
default=".",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's make this default more dynamic and a folder by default.

maybe let's do a timestamped folder name like:

from datetime import datetime

...

default=f"download_{datetime.now().strftime("%d_%b_%Y")}",

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do the same thing for multiple download async

)
@click.option(
"--max-concurrent-requests",
default=300,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is really, really high. Let's scale this default to something like 20 and people can manually bump this up as needed.

type=int,
)
@click.option(
"--skip-completed",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is working as intended. I get a lot of warnings when I try this:

$ poetry run gen3 -vv download-multiple-async --manifest MIDRC_case_manifest.json --download-path ./downloads --skip-completed 
[2025-09-02 14:28:36,954][  DEBUG] Initializing auth..
Found 5178 files to download
Continue with async download? [y/N]: y
Downloading:   0%|                                                                                                                                                                                                                      | 0/5178 [00:00<?, ?it/s][2025-09-02 14:28:40,004][WARNING] File will be overwritten: downloads/10041569-u_H3HaB1lES6-HXVpiEfMA/2.16.840.1.114274.1818.48858790339993589885669552017090272896/2.16.840.1.114274.1818.56909342758958044433235758152671341713.zip
[2025-09-02 14:28:40,010][WARNING] File will be overwritten: downloads/10041569-FtpnR4GEPEe9JztuDUqrXg/2.16.840.1.114274.1818.544490635779373958610541144202466729913/2.16.840.1.114274.1818.55741368199706394947919032241968638388.zip
[2025-09-02 14:28:40,085][WARNING] File will be overwritten: downloads/10041569-CooqJZVIRkywB6_o9CJi0Q/2.16.840.1.114274.1818.567640554746210778914433462824885428113/2.16.840.1.114274.1818.523353596424499437510054502719882590104.zip

Actually... it seems like maybe it is actually skipping but we need to suppress all these warnings and/or not overwrite existing files, just skip downloading again.

)
@click.option(
"--skip-completed",
is_flag=True,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you can have an is_flag and default to True b/c how do I set this to false then? it just needs to be a boolean but default to true

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could reverse the logic here and just call it: redownload-completed and then leave is_flag=True but default to false

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make sure to update docs if you change this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't change it, actually. We need to ensure we're matching previous commands exactly.

Here's the help from gen3-client:

Flags:
      --download-path string     The directory in which to store the downloaded files (default ".")
      --filename-format string   The format of filename to be used, including "original", "guid" and "combined" (default "original")
  -h, --help                     help for download-multiple
      --manifest string          The manifest file to read from. A valid manifest can be acquired by using the "Download Manifest" button in Data Explorer from a data common's portal
      --no-prompt                If set to true, will not display user prompt message for confirmation
      --numparallel int          Number of downloads to run in parallel (default 1)
      --protocol string          Specify the preferred protocol with --protocol=s3
      --rename                   Only useful when "--filename-format=original", will rename file by appending a counter value to its filename if set to true, otherwise the same filename will be used
      --skip-completed           If set to true, will check for filename and size before download and skip any files in "download-path" that matches both

Let's change the overall command name to match too. download-multiple instead of download-multiple-async

Keep skip completed but make it a boolean so I can pass in false.

I'm okay with --numparallel not existing and being different b/c the parallelization scheme in Python is a bit different... maybe we can allow that to be passed in though and just use the value as max-concurrent-requests if that isn't supplied. So people can easily port their commands over and have them work right away

- **High-bandwidth networks**: Increase the number of worker processes
- **Limited memory**: Reduce queue sizes to manage memory usage

### Memory Management
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's remove this section, I feel it's redundant given the above config info

- **Batch Size**: Balance between memory usage and processing overhead
- **Process Count**: Match available CPU cores for optimal performance

### Network Optimization
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's also remove this section


- Check network bandwidth and server limits
- Reduce concurrent request limits if server is overwhelmed
- Verify authentication token is valid
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

auth token would not be a reason for slow downloads, let's remove this

Copy link
Contributor

@Avantol13 Avantol13 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure to add unit tests for this new functionality

main.add_command(drs_pull.drs_pull)
main.add_command(file.file)
main.add_command(download.download_single, name="download-single")
main.add_command(download.download_multiple_async, name="download-multiple-async")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
main.add_command(download.download_multiple_async, name="download-multiple-async")
main.add_command(download.download_multiple, name="download-multiple")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try to remember to retest after any changes to your code, this was broken b/c the the function doesn't exist anymore

@Dhiren-Mhatre Dhiren-Mhatre changed the title Feat : Multiple download functionality with performance testing Feat : Multiple download functionality Sep 9, 2025
]


def test_download_single_success(gen3_file):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add a docstring for each test describing what you're testing

gen3 --endpoint data.commons.io --auth creds.json download-multiple \
--manifest large_dataset.json \
--download-path ./large_downloads \
--max-concurrent-requests 20 \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's change to:

Suggested change
--max-concurrent-requests 20 \
--max-concurrent-requests 64 \
--max-concurrent-requests 8

Dhiren-Mhatre and others added 5 commits October 2, 2025 18:18
Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>
Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>
Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>
Signed-off-by: Dhiren-Mhatre <kp064669@gmail.com>
Copy link
Contributor

@Avantol13 Avantol13 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You will also need to update this test. I got it working:

def test_download_single_basic_functionality(gen3_file):
    """
    Test download_single basic functionality with synchronous download.

    Verifies that download_single downloads a file successfully using
    synchronous requests and returns True.
    """
    gen3_file._auth_provider._refresh_token = {"api_key": "123"}

    with patch.object(gen3_file, 'get_presigned_url') as mock_presigned, \
         patch('gen3.file.requests.get') as mock_get, \
         patch('gen3.index.Gen3Index.get_record') as mock_index:

        mock_presigned.return_value = {"url": "https://fake-url.com/file"}
        mock_index.return_value = {"file_name": "test-file.txt"}
        mock_response = MagicMock()
        mock_response.status_code = 200
        mock_response.headers = {"content-length": "12"}
        mock_response.iter_content = lambda size: [b"test content"]
        mock_get.return_value = mock_response

        result = gen3_file.download_single(object_id="test-guid", path="/tmp", protocol="s3")

        assert result == True
        mock_presigned.assert_called_once_with("test-guid", protocol="s3")
        mock_index.assert_called_once_with("test-guid")

jacob50231
jacob50231 previously approved these changes Jan 20, 2026
@jacob50231 jacob50231 dismissed their stale review January 20, 2026 16:51

requesting review from alex

@click.option(
"--numparallel",
default=None,
help="Number of downloads to run in parallel (compatibility with gen3-client)",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
help="Number of downloads to run in parallel (compatibility with gen3-client)",
help="Same as max-concurrent-requests, for compatibility with previous gen3-client arguments",

Comment on lines +250 to +255
click.echo("\nFailed downloads:")
for failure in result["failed"]:
click.echo(
f" - {failure.get('guid', 'unknown')}: {failure.get('error', 'Unknown error')}"
)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not working, I'm getting this:

[2026-01-21 15:19:03,719][  ERROR] Async batch download failed: 'str' object has no attribute 'get'
[2026-01-21 15:19:03,719][  ERROR] Async batch download failed: 'str' object has no attribute 'get'

if I provide the command poetry run gen3 download-multiple --manifest MIDRC_case_manifest_small.json --no-progress --no-prompt --protocol foo --skip-completed false

I suggest we remove this listing of failures entirely, as the ERROR logs themselves have the GUIDs that failed in them

)

click.echo(
"\nTo retry failed downloads, run the same command with --skip-completed flag:"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"\nTo retry failed downloads, run the same command with --skip-completed flag:"
"\nTo retry failed downloads, run the same command with --skip-completed true"

"\nTo retry failed downloads, run the same command with --skip-completed flag:"
)

success_rate = len(result["succeeded"]) / len(manifest_data) * 100
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should rethink this because the way it's working, I get a non 100 success if I skip previously downloaded files.

$ poetry run gen3 -vv download-multiple --manifest MIDRC_case_manifest.json --no-prompt --max-concurrent-requests 64 --num-processes 8

...

Async Download Results:
✓ Succeeded: 4871
- Skipped: 307

Success rate: 94.1%

If --skip-completed is true, we should count skipped towards success imo

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is even more apparent when I rerun a successful download and it skips everything

Async Download Results:
✓ Succeeded: 0
- Skipped: 42

Success rate: 0.0%

auth = ctx.obj["auth_factory"].get()

# Use numparallel as max_concurrent_requests if provided (for gen3-client compatibility)
if numparallel is not None and max_concurrent_requests == 20: # 20 is the default
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

create a global for this default value, don't hard code and make a magic number

help="Protocol for presigned URL (e.g., s3) (default: auto-detect)",
)
@click.pass_context
def download_single(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like this support the same interface as the gen3-client, and ideally the same params (the ones that translate) from download-multiple. What we could maybe do is just wrap download-multiple behind the scenes but pass a manifest with a single GUID (e.g. don't repeat code we don't need to repeat).

As a reference, this is gen3-client's support:

$ gen3-client download-single --help                                                                        
Gets a presigned URL for a file from a GUID and then downloads the specified file.

Usage:
  gen3-client download-single [flags]

Examples:
./gen3-client download-single --profile=<profile-name> --guid=206dfaa6-bcf1-4bc9-b2d0-77179f0f48fc

Flags:
      --download-path string     The directory in which to store the downloaded files (default ".")
      --filename-format string   The format of filename to be used, including "original", "guid" and "combined" (default "original")
      --guid string              Specify the guid for the data you would like to work with
  -h, --help                     help for download-single
      --no-prompt                If set to true, will not display user prompt message for confirmation
      --protocol string          Specify the preferred protocol with --protocol=gs
      --rename                   Only useful when "--filename-format=original", will rename file by appending a counter value to its filename if set to true, otherwise the same filename will be used
      --skip-completed           If set to true, will check for filename and size before download and skip any files in "download-path" that matches both

Global Flags:
      --profile string   Specify profile to use

Copy link
Contributor

@Avantol13 Avantol13 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in general things are looking good but see my comments. It's probably prudent to test using Python 3.13 (I did locally and had to fix a dep issue, I think just a bump, but double check that).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants