Closed
Description
Describe the feature you'd like
"FastFile" to be an available option for s3_input_mode
in sagemaker.processing.ProcessingInput, in addition to "File" and "Pipe". This S3 input mode has been available for TrainingInput since 2021 and greatly improves speed (-82%) according to an AWS blog post.
How would this feature be used? Please describe.
To speed up processing jobs compared to downloading all data up front, and to allow complex filtering of files before accessing them.
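A rough sketch of the filtering this would enable, assuming "FastFile" exposed the S3 objects as ordinary files under the job's local input directory (as it does for training jobs). The function name iter_matching_files, the directory path in the usage note, and the regex are hypothetical and not part of the SageMaker SDK.

```python
import re
from pathlib import Path
from typing import Iterator


def iter_matching_files(input_dir: str, pattern: str) -> Iterator[Path]:
    """Yield files under input_dir whose relative path matches a regex.

    Only file names are inspected here; file contents are not read, so
    objects that do not match would never need to be fetched.
    """
    regex = re.compile(pattern)
    root = Path(input_dir)
    # Sort for a deterministic order across runs.
    for path in sorted(root.rglob("*")):
        if path.is_file() and regex.search(path.relative_to(root).as_posix()):
            yield path
```

Inside a processing script this could be called as iter_matching_files("/opt/ml/processing/input", r"_train\.csv$"), where the path is whatever destination the input was configured with.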
Describe alternatives you've considered
Other methods like:
- Downloading relevant files as part of the training job with sagemaker.s3.S3Downloader(). Problem: I can't shard by S3 key and have to build my own sharding logic.
- Using S3Prefix as the s3_data_type in sagemaker.processing.ProcessingInput to filter by prefix. Problem: Some data can't be easily filtered by prefix and needs more complex pattern matching.
- Using a ManifestFile.
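For illustration, the "own sharding logic" and manifest workarounds above might look roughly like the sketch below. It only uses the standard library and works on a list of S3 keys that is assumed to have been listed already. The helper names (select_keys, shard_keys, build_manifest) and the bucket name in the usage note are hypothetical. The manifest layout (a prefix entry followed by relative keys) is the one SageMaker documents for the ManifestFile data type.

```python
import fnmatch
import hashlib
import json
from typing import Iterable, List


def select_keys(keys: Iterable[str], pattern: str) -> List[str]:
    """Keep keys matching a shell-style pattern (note: '*' also matches '/')."""
    return [k for k in keys if fnmatch.fnmatchcase(k, pattern)]


def shard_keys(keys: Iterable[str], shard_index: int, shard_count: int) -> List[str]:
    """Assign each key to exactly one shard using a stable hash of the key.

    Python's built-in hash() is salted per process, so a digest is used
    to make every instance compute the same assignment.
    """
    selected = []
    for key in sorted(keys):
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        if int(digest, 16) % shard_count == shard_index:
            selected.append(key)
    return selected


def build_manifest(bucket: str, keys: Iterable[str]) -> str:
    """Serialize keys as a manifest: one prefix entry, then relative keys."""
    return json.dumps([{"prefix": f"s3://{bucket}/"}] + list(keys))
```

A typical use would be build_manifest("my-bucket", select_keys(all_keys, "data/*/part-*.parquet")), uploading the result to S3 and pointing a ProcessingInput with s3_data_type="ManifestFile" at it. This still downloads every listed file in "File" mode, which is the cost the requested "FastFile" mode would avoid.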
Additional context
I know it's not an SDK topic as long as the underlying APIs don't provide that functionality, but I don't know where else I can put the feature request.