
Can't run llama3.1-70b at full context #2301

@pseudotensor

System Info

text-generation-inference 2.2.0

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

On 4*H100:

docker stop llama31-70b-tgi ; docker rm llama31-70b-tgi
sudo docker run -d --restart=always --gpus '"device=0,1,2,3"' \
             --shm-size 10.24gb \
             -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
             -e TRANSFORMERS_CACHE="/.cache/" \
             -p 5005:80 \
             -v $HOME/.cache:/.cache/ \
             -v $HOME/.cache/huggingface/hub/:/data \
             --name llama31-70b-tgi \
             ghcr.io/huggingface/text-generation-inference:2.2.0 \
             --model-id meta-llama/Meta-Llama-3.1-70B-Instruct \
             --max-input-length 131072 \
             --max-total-tokens 139264 \
             --max-stop-sequences 6 \
             --num-shard 4 --sharded true &>> logs.llama3.1-70b.tgi.txt

The launch fails with:

RuntimeError: Not enough memory to handle 131122 prefill tokens. You need to decrease `--max-batch-prefill-tokens`
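
For reference, `--max-batch-prefill-tokens` is a text-generation-inference launcher argument, so it can be set explicitly on the same launch. The snippet below is only a sketch; the value 8192 is an arbitrary placeholder, not a tested recommendation:

# Same launch as above, with the flag from the error message set explicitly.
# 8192 is an arbitrary placeholder value, not a recommendation.
sudo docker run -d --restart=always --gpus '"device=0,1,2,3"' \
             --shm-size 10.24gb \
             -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
             -e TRANSFORMERS_CACHE="/.cache/" \
             -p 5005:80 \
             -v $HOME/.cache:/.cache/ \
             -v $HOME/.cache/huggingface/hub/:/data \
             --name llama31-70b-tgi \
             ghcr.io/huggingface/text-generation-inference:2.2.0 \
             --model-id meta-llama/Meta-Llama-3.1-70B-Instruct \
             --max-input-length 131072 \
             --max-total-tokens 139264 \
             --max-batch-prefill-tokens 8192 \
             --max-stop-sequences 6 \
             --num-shard 4 --sharded true &>> logs.llama3.1-70b.tgi.txt

As far as I know, 2.2.0 does not chunk prefill, so a prefill budget smaller than `--max-input-length` may simply prevent full-length prompts from being scheduled rather than avoid the memory error; that trade-off is the crux of this issue.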

For comparison, vLLM serves this model at the same context length without errors.
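
The exact vLLM command is not shown in this report; purely as an illustration of the comparison (assuming vLLM's OpenAI-compatible Docker image and settings mirroring the TGI launch above), it would look something like:

# Illustrative only; not the reporter's verbatim vLLM command.
docker run -d --gpus '"device=0,1,2,3"' \
             --shm-size 10.24gb \
             -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
             -v $HOME/.cache/huggingface:/root/.cache/huggingface \
             -p 5005:8000 \
             --name llama31-70b-vllm \
             vllm/vllm-openai:latest \
             --model meta-llama/Meta-Llama-3.1-70B-Instruct \
             --tensor-parallel-size 4 \
             --max-model-len 131072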

Expected behavior

TGI should launch and serve the full 131072-token context without error, just as vLLM does.
