### Describe the feature you'd like
Being able to deploy Hugging Face multimodal models to a SageMaker endpoint.
Currently, only language models that take a text prompt as input are supported.
Multimodal models such as Llava or CLIP require both a prompt and an image as input, and this is currently not supported.
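For comparison, a minimal sketch of the text-only flow that works today (the `HF_MODEL_ID` / `HF_TASK` environment variables are the documented way to pull a Hub model; the model id and instance type here are only examples):

```python
import sagemaker
from sagemaker.huggingface.model import HuggingFaceModel

# current text-only flow: the container pulls the model from the Hugging Face Hub
huggingface_model = HuggingFaceModel(
    env={
        "HF_MODEL_ID": "distilbert-base-uncased-finetuned-sst-2-english",  # example model
        "HF_TASK": "text-classification",
    },
    role=sagemaker.get_execution_role(),
    transformers_version="4.28.1",
    pytorch_version="2.0.0",
    py_version="py310",
)

predictor = huggingface_model.deploy(initial_instance_count=1, instance_type="ml.g5.xlarge")

# the payload is text only; there is no supported way to also pass an image
print(predictor.predict({"inputs": "SageMaker is a great service."}))
```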
### How would this feature be used? Please describe.
This is how the feature would be used by the end user:
```python
import sagemaker
from sagemaker.huggingface.model import HuggingFaceModel

huggingface_model = HuggingFaceModel(
    model_data="some-repo/some-multimodal-model",  # <<----- specify the model
    role=sagemaker.get_execution_role(),
    transformers_version="4.28.1",
    pytorch_version="2.0.0",
    py_version="py310",
    model_server_workers=1,
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
    container_startup_health_check_timeout=900,  # increase timeout for large models
    model_data_download_timeout=900,  # increase timeout for large models
)
# deploy() prints progress until the endpoint is in service:
# ----------------!
```
### Call Llava
```python
import base64

# request payload: prompt plus base64-encoded image
data = {
    "image": "some base64 encoded image",  # <---- specify the image
    "question": "Describe the image and color details.",
}

output = predictor.predict(data)
print(output)
```
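The `image` field above is just a placeholder; one way to produce it from a local file (hypothetical path) would be:

```python
import base64

# encode a local image (hypothetical path) into the base64 string expected above
with open("example.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

data["image"] = image_b64
```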
### Describe alternatives you've considered
You can package the model yourself and provide a custom inference.py script, but then you have to download the model and build a model.tar.gz, which takes a lot of time. See the sketch of this workaround below.
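For reference, the workaround roughly amounts to shipping a handler like the following inside the archive. This is only a sketch: the `AutoModelForVision2Seq`/`AutoProcessor` classes, dtype, and generation settings are assumptions, and a real Llava handler needs a transformers version that ships the model.

```python
# code/inference.py -- sketch of the manual workaround: a custom handler packaged
# into model.tar.gz next to the model weights
import base64
import io

import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor


def model_fn(model_dir):
    """Load processor and model from the unpacked model.tar.gz."""
    processor = AutoProcessor.from_pretrained(model_dir)
    model = AutoModelForVision2Seq.from_pretrained(model_dir, torch_dtype=torch.float16)
    model.to("cuda" if torch.cuda.is_available() else "cpu")
    return model, processor


def predict_fn(data, model_and_processor):
    """Decode the base64 image, run generation, and return the answer."""
    model, processor = model_and_processor
    image = Image.open(io.BytesIO(base64.b64decode(data["image"])))
    inputs = processor(text=data["question"], images=image, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=256)
    return {"answer": processor.batch_decode(output_ids, skip_special_tokens=True)[0]}
```

The archive (weights plus `code/inference.py`) then has to be built, uploaded to S3, and passed to `HuggingFaceModel` via `model_data`, which is exactly the slow, manual part this feature request would remove.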
### Additional context
I came up with this idea when I created a tar.gz for llava with an inference.py and made it available to the world. See my LinkedIn post here: https://www.linkedin.com/posts/vincent-claes-0b346337_aws-sagemaker-huggingface-activity-7141776348963885056-Uv0g?utm_source=share&utm_medium=member_desktop