System Info
2.2.0
Information
- [x] Docker
- [ ] The CLI directly
Tasks
- [x] An officially supported command
- [ ] My own modifications
Reproduction
On 4×H100 GPUs:
docker stop llama31-70b-tgi ; docker remove llama31-70b-tgi
sudo docker run -d --restart=always --gpus '"device=0,1,2,3"' \
    --shm-size 10.24gb \
    -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
    -e TRANSFORMERS_CACHE="/.cache/" \
    -p 5005:80 \
    -v $HOME/.cache:/.cache/ \
    -v $HOME/.cache/huggingface/hub/:/data \
    --name llama31-70b-tgi \
    ghcr.io/huggingface/text-generation-inference:2.2.0 \
    --model-id meta-llama/Meta-Llama-3.1-70B-Instruct \
    --max-input-length 131072 \
    --max-total-tokens 139264 \
    --max-stop-sequences 6 \
    --num-shard 4 --sharded true &>> logs.llama3.1-70b.tgi.txt
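Note: with `-d` the shell redirect only captures the container ID printed by the docker client, so the server output (including the failure below) has to be read from the container itself (standard Docker, nothing TGI-specific):

docker logs -f llama31-70b-tgi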
The launch then fails with:
RuntimeError: Not enough memory to handle 131122 prefill tokens. You need to decrease `--max-batch-prefill-tokens`
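For what it's worth, 131122 = 131072 + 50, which matches the launcher defaulting `--max-batch-prefill-tokens` to `--max-input-length` + 50 when the flag is unset (my reading of the launcher defaults, not verified). A workaround sketch, at the cost of the long context this issue is about, is to shrink the window until prefill fits; the values below are illustrative guesses, not tuned:

# Same image, but with a smaller window and an explicit prefill cap (untested values)
sudo docker run -d --gpus '"device=0,1,2,3"' \
    --shm-size 10.24gb \
    -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
    -p 5005:80 \
    -v $HOME/.cache/huggingface/hub/:/data \
    --name llama31-70b-tgi-short \
    ghcr.io/huggingface/text-generation-inference:2.2.0 \
    --model-id meta-llama/Meta-Llama-3.1-70B-Instruct \
    --max-input-length 32768 \
    --max-total-tokens 40960 \
    --max-batch-prefill-tokens 32768 \
    --num-shard 4 --sharded true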
For comparison, vLLM serves the same model on the same hardware without errors.
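(The report doesn't include the vLLM command; for reference, a typical equivalent launch on the same four GPUs would look like the following. The flags are standard vLLM OpenAI-server options, but the exact invocation is my assumption:)

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --max-model-len 131072 \
    --port 5005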
Expected behavior
TGI should launch and serve the model at the configured 131072-token context without error, as vLLM does.