Description
Environment
Runtime environment:
- Target: x86_64-unknown-linux-gnu
- Cargo version: 1.75.0
- Commit sha: c38a7d7
- Docker label: sha-6c4496a
Deployment:
- Kubernetes cluster
- 1× A100 GPU (80 GB)
- 12 CPUs with 32 GB of RAM
- TGI version: 2.0.0
What I am doing
I am running text-generation-benchmark to find the sweet spot between throughput and latency for my hardware. I am trying to maximize the batch tokens by looking at the MAX_BATCH_TOTAL_TOKENS inferred by text-generation-launcher, but I get out-of-memory errors.
When running export LOG_LEVEL=INFO; text-generation-launcher --hostname 0.0.0.0 --port 8080, I see MAX_BATCH_TOTAL_TOKENS inferred to be 425472:
2024-04-30T10:25:51.994120Z INFO text_generation_launcher: Args { model_id: "/model_data/mistral7b-free", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: Some(8000), max_total_tokens: Some(8512), waiting_served_ratio: 0.3, max_batch_prefill_tokens: Some(32768), max_batch_total_tokens: Some(4294967295), max_waiting_tokens: 0, max_batch_size: None, cuda_graphs: None, hostname: "0.0.0.0", port: 8080, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false }
2024-04-30T10:25:51.994178Z INFO text_generation_launcher: Model supports up to 32768 but tgi will now set its default to 4096 instead. This is to save VRAM by refusing large prompts in order to allow more users on the same hardware. You can increase that size using `--max-batch-prefill-tokens=32818 --max-total-tokens=32768 --max-input-tokens=32767`.
2024-04-30T10:25:51.994184Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-04-30T10:25:51.994271Z INFO download: text_generation_launcher: Starting download process.
2024-04-30T10:25:54.856372Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-04-30T10:25:55.330625Z INFO download: text_generation_launcher: Successfully downloaded weights.
2024-04-30T10:25:55.330812Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-04-30T10:26:05.338492Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-04-30T10:26:15.433799Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-04-30T10:26:17.323011Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-04-30T10:26:17.335204Z INFO shard-manager: text_generation_launcher: Shard ready in 22.003836391s rank=0
2024-04-30T10:26:17.431088Z INFO text_generation_launcher: Starting Webserver
2024-04-30T10:26:17.498225Z INFO text_generation_router: router/src/main.rs:250: Using config Some(Mistral)
2024-04-30T10:26:17.498245Z INFO text_generation_router: router/src/main.rs:257: Using local tokenizer config
2024-04-30T10:26:17.498263Z WARN text_generation_router: router/src/main.rs:292: no pipeline tag found for model /model_data/mistral7b-free
2024-04-30T10:26:17.500561Z INFO text_generation_router: router/src/main.rs:311: Warming up model
2024-04-30T10:26:20.987760Z INFO text_generation_launcher: Cuda Graphs are enabled for sizes [1, 2, 4, 8, 16, 32]
2024-04-30T10:26:21.845520Z WARN text_generation_router: router/src/main.rs:333: `--max-batch-total-tokens` is deprecated for Flash Attention models.
2024-04-30T10:26:21.845531Z WARN text_generation_router: router/src/main.rs:337: Inferred max batch total tokens: 425472
2024-04-30T10:26:21.845534Z INFO text_generation_router: router/src/main.rs:348: Setting max batch total tokens to 425472
2024-04-30T10:26:21.845536Z INFO text_generation_router: router/src/main.rs:349: Connected
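For convenience, this is roughly how I pull that number back out of the launcher output (a small sketch; `launcher.log` is just a hypothetical capture of the output above, not a file TGI writes itself):

```python
import re

# Hypothetical capture of the launcher output shown above, e.g.
#   text-generation-launcher --hostname 0.0.0.0 --port 8080 2>&1 | tee launcher.log
LOG_PATH = "launcher.log"

with open(LOG_PATH) as f:
    for line in f:
        # The router logs "Inferred max batch total tokens: <N>" during warmup
        match = re.search(r"Inferred max batch total tokens: (\d+)", line)
        if match:
            print(int(match.group(1)))  # prints 425472 on my setup
            break
```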
Therefore, even though text-generation-benchmark bypasses the router completely, I should be able to process 425472 tokens at the same time without running into out-of-memory errors, right?
So I want to see the latency for this load: 53 requests | 4000 sequence length | 4000 decode length -> 53 requests * (4000 in + 4000 out) = 424000 tokens held concurrently. This is indeed lower than the inferred upper bound (424000 < 425472).
This is the command I am running: text-generation-benchmark --tokenizer-name /model_data/mistral7b-free/ -b 53 --sequence-length 4000 --decode-length 4000
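Just to make the budget check explicit, a quick sketch with the numbers above:

```python
# Token budget I expect this benchmark run to need (all tokens held concurrently)
batch_size = 53
sequence_length = 4000  # prompt tokens per request
decode_length = 4000    # generated tokens per request

concurrent_tokens = batch_size * (sequence_length + decode_length)
inferred_max_batch_total_tokens = 425_472  # from the launcher warmup log above

print(concurrent_tokens)                                     # 424000
print(concurrent_tokens <= inferred_max_batch_total_tokens)  # True
```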
What is the unexpected behavior
However, I get out-of-memory errors. Here are the logs:
text-generation-benchmark
2024-04-30T10:20:12.210371Z INFO text_generation_benchmark: benchmark/src/main.rs:138: Loading tokenizer
2024-04-30T10:20:12.210408Z INFO text_generation_benchmark: benchmark/src/main.rs:144: Found local tokenizer
[text-generation-benchmark TUI output — Model: /model_data/mistral7b-free/ | Sequence Length: 4000 | Decode Length: 4000 | Batch: 53. Total and batch progress stuck at 0 / 1; every prefill and decode latency/throughput panel shows NaN, i.e. no batch completed.]
2024-04-30T10:35:01.733254Z ERROR prefill{id=0 size=53}:prefill{id=0 size=53}: text_generation_client: router/client/src/lib.rs:33: Server error: CANCELLED
text-generation-launcher
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/layers.py", line 159, in forward
return F.linear(input, self.weight, self.bias)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 11.32 GiB. GPU 0 has a total capacty of 79.14 GiB of which 3.98 GiB is free. Process 2218444 has 75.15 GiB memory in use. Of the allocated memory 73.66 GiB is allocated by PyTorch, and 963.77 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
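As a back-of-the-envelope check of where an 11.32 GiB allocation could come from, here is a sketch. It assumes Mistral-7B's MLP intermediate size of 14336, a fused gate/up projection, and fp16 activations; those are my assumptions, not something stated in the logs:

```python
# Back-of-the-envelope: activation size of a single fused gate/up MLP projection
# over the full prefill batch (assumed dims; not taken from the logs above).
prefill_tokens = 53 * 4000                   # 212000 prompt tokens prefetched in one prefill
intermediate_size = 14336                    # Mistral-7B MLP intermediate size (assumption)
fused_out_features = 2 * intermediate_size   # gate_proj + up_proj fused (assumption)
bytes_per_value = 2                          # fp16 activations (assumption)

activation_bytes = prefill_tokens * fused_out_features * bytes_per_value
print(activation_bytes / 2**30)              # ~11.32 GiB, matching the failed allocation
```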
nvidia-smi after OOM error
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100 80GB PCIe On | 00000000:17:00.0 Off | 0 |
| N/A 46C P0 87W / 300W | 76959MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
Extra notes
- I have also tried lower batch sizes (50, 40, 32) and they also lead to an OOM error. I could get batch_size=2 to work, though.
- I have also tried not bypassing the router and sending 60 concurrent requests with sequence + decode lengths = 8000, and this DOES work. So I do NOT understand why it fails when bypassing the router, if the router is only really preventing you from going over MAX_BATCH_TOTAL_TOKENS, which I am explicitly not exceeding when using text-generation-benchmark. What am I missing?
Grafana dashboard of tgi_batch_current_max_tokens when going through the router (423k tokens in a batch, very close to the inferred MAX_BATCH_TOTAL_TOKENS).
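To summarize the numbers that confuse me, a quick sketch using only the figures quoted above:

```python
# Every load I tried, expressed as batch_size * (sequence_length + decode_length)
inferred_max = 425_472  # inferred MAX_BATCH_TOTAL_TOKENS from the launcher

# Benchmark runs (bypassing the router); all but batch_size=2 hit CUDA OOM
for batch_size in (53, 50, 40, 32, 2):
    tokens = batch_size * (4000 + 4000)
    print(batch_size, tokens, tokens <= inferred_max)  # always under the cap

# Going through the router instead: tgi_batch_current_max_tokens peaked around
# 423k (see the Grafana note above), also under the cap, and it worked fine.
router_peak_tokens = 423_000
print(router_peak_tokens <= inferred_max)  # True
```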