
Out of Memory Errors When Running text-generation-benchmark Despite Compliant Batch Token Limit #1831

@martinigoyanes

Description

Environment

Runtime environment:

  • Target: x86_64-unknown-linux-gnu
  • Cargo version: 1.75.0
  • Commit sha: c38a7d7
  • Docker label: sha-6c4496a

Kubernetes cluster deployment:

  • 1x A100 GPU (80 GB)
  • 12 CPUs with 32 GB of RAM
  • TGI version: 2.0.0

What I am doing

I am running text-generation-benchmark to find the sweet spot between throughput and latency for my hardware. I am trying to maximize the number of batch tokens by looking at the MAX_BATCH_TOTAL_TOKENS value inferred by text-generation-launcher, but I get out-of-memory errors.

When running export LOG_LEVEL=INFO; text-generation-launcher --hostname 0.0.0.0 --port 8080, I see MAX_BATCH_TOTAL_TOKENS inferred as 425472.
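
As a sanity check, I also read the value back from the running server. This assumes the router's /info endpoint exposes the field as max_batch_total_tokens (that is my reading of the router, so treat the field name as an assumption):

    # Sanity check: read the limit back from the running server.
    # Assumes GET /info exposes a "max_batch_total_tokens" field (my reading
    # of the router, not verified against the docs).
    import requests

    info = requests.get("http://0.0.0.0:8080/info").json()
    print(info.get("max_batch_total_tokens"))  # expecting 425472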

2024-04-30T10:25:51.994120Z  INFO text_generation_launcher: Args { model_id: "/model_data/mistral7b-free", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: Some(8000), max_total_tokens: Some(8512), waiting_served_ratio: 0.3, max_batch_prefill_tokens: Some(32768), max_batch_total_tokens: Some(4294967295), max_waiting_tokens: 0, max_batch_size: None, cuda_graphs: None, hostname: "0.0.0.0", port: 8080, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false }
2024-04-30T10:25:51.994178Z  INFO text_generation_launcher: Model supports up to 32768 but tgi will now set its default to 4096 instead. This is to save VRAM by refusing large prompts in order to allow more users on the same hardware. You can increase that size using `--max-batch-prefill-tokens=32818 --max-total-tokens=32768 --max-input-tokens=32767`.
2024-04-30T10:25:51.994184Z  INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-04-30T10:25:51.994271Z  INFO download: text_generation_launcher: Starting download process.
2024-04-30T10:25:54.856372Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.

2024-04-30T10:25:55.330625Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2024-04-30T10:25:55.330812Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-04-30T10:26:05.338492Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-04-30T10:26:15.433799Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-04-30T10:26:17.323011Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0

2024-04-30T10:26:17.335204Z  INFO shard-manager: text_generation_launcher: Shard ready in 22.003836391s rank=0
2024-04-30T10:26:17.431088Z  INFO text_generation_launcher: Starting Webserver
2024-04-30T10:26:17.498225Z  INFO text_generation_router: router/src/main.rs:250: Using config Some(Mistral)
2024-04-30T10:26:17.498245Z  INFO text_generation_router: router/src/main.rs:257: Using local tokenizer config
2024-04-30T10:26:17.498263Z  WARN text_generation_router: router/src/main.rs:292: no pipeline tag found for model /model_data/mistral7b-free
2024-04-30T10:26:17.500561Z  INFO text_generation_router: router/src/main.rs:311: Warming up model
2024-04-30T10:26:20.987760Z  INFO text_generation_launcher: Cuda Graphs are enabled for sizes [1, 2, 4, 8, 16, 32]

2024-04-30T10:26:21.845520Z  WARN text_generation_router: router/src/main.rs:333: `--max-batch-total-tokens` is deprecated for Flash Attention models.
2024-04-30T10:26:21.845531Z  WARN text_generation_router: router/src/main.rs:337: Inferred max batch total tokens: 425472
2024-04-30T10:26:21.845534Z  INFO text_generation_router: router/src/main.rs:348: Setting max batch total tokens to 425472
2024-04-30T10:26:21.845536Z  INFO text_generation_router: router/src/main.rs:349: Connected

Therefore, even though text-generation-benchmark bypasses the router completely, I should be able to process 425472 tokens at the same time without running into out-of-memory errors, right?

So I want to see the latency for this load: 53 requests | 4000 sequence length | 4000 decode length -> 53 requests * (4000 in + 4000 out) = 424000 concurrent tokens. This is indeed lower than the inferred upper bound (424000 < 425472).

This is the command I am running: text-generation-benchmark --tokenizer-name /model_data/mistral7b-free/ -b 53 --sequence-length 4000 --decode-length 4000
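
For reference, the budget arithmetic behind picking a batch size of 53 (a minimal sketch; the 425472 figure is the value the launcher printed above):

    # Token budget I am assuming a benchmark run has to stay under.
    inferred_max_batch_total_tokens = 425_472   # from the launcher log above
    sequence_length = 4000                      # input tokens per request
    decode_length = 4000                        # output tokens per request

    # Largest batch whose (input + output) tokens fit the inferred budget.
    max_batch = inferred_max_batch_total_tokens // (sequence_length + decode_length)
    print(max_batch)                                         # 53

    total_tokens = max_batch * (sequence_length + decode_length)
    print(total_tokens)                                      # 424000
    print(total_tokens <= inferred_max_batch_total_tokens)   # True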

What is the unexpected behavior

However, I get out-of-memory errors. Here are the logs:

text-generation-benchmark

2024-04-30T10:20:12.210371Z  INFO text_generation_benchmark: benchmark/src/main.rs:138: Loading tokenizer
2024-04-30T10:20:12.210408Z  INFO text_generation_benchmark: benchmark/src/main.rs:144: Found local tokenizer
[Benchmark TUI: Model: /model_data/mistral7b-free/ | Sequence Length: 4000 | Decode Length: 4000 | Batch: 53 | Total Progress 0 / 1 | Batch Progress 0 / 1 | all latency and throughput panels read NaN, i.e. no prefill or decode completed]
2024-04-30T10:35:01.733254Z ERROR prefill{id=0 size=53}:prefill{id=0 size=53}: text_generation_client: router/client/src/lib.rs:33: Server error: CANCELLED

text-generation-launcher

 File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/layers.py", line 159, in forward
    return F.linear(input, self.weight, self.bias)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 11.32 GiB. GPU 0 has a total capacty of 79.14 GiB of which 3.98 GiB is free. Process 2218444 has 75.15 GiB memory in use. Of the allocated memory 73.66 GiB is allocated by PyTorch, and 963.77 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
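
As a rough sanity check on the 11.32 GiB figure: if this F.linear call is the fused gate/up MLP projection of Mistral-7B (4096 -> 2 x 14336 = 28672 output features; these layer sizes are my assumption from the standard Mistral-7B config, not from the logs) running over the entire 53 x 4000-token prefill in fp16, the output activation alone accounts for it:

    # Hedged estimate: output activation of one fused gate/up projection over
    # the whole prefill batch (Mistral-7B layer sizes assumed, not from the logs).
    tokens = 53 * 4000            # all input tokens prefilled in one forward pass
    out_features = 2 * 14336      # fused gate + up projection width (assumption)
    bytes_per_value = 2           # fp16
    activation_gib = tokens * out_features * bytes_per_value / 2**30
    print(round(activation_gib, 2))   # ~11.32, matching the failed allocation above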

nvidia-smi after OOM error

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100 80GB PCIe          On  |   00000000:17:00.0 Off |                    0 |
| N/A   46C    P0             87W /  300W |   76959MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+

Extra notes

  • I have also tried lower batch sizes (50, 40, 32) and they also lead to an OOM error. I could get batch_size=2 to work, though.
  • I have also tried not bypassing the router, sending 60 concurrent requests with sequence + decode lengths = 8000, and this DOES work. So I do NOT understand why it fails when bypassing the router, if all the router really does is prevent you from exceeding MAX_BATCH_TOTAL_TOKENS, which I am explicitly staying under when using text-generation-benchmark (see the sketch below). What am I missing?
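
To spell out the comparison I am making (a sketch; both limits come from the launcher output above, and the idea that the router additionally splits prefills so that no single forward pass exceeds max_batch_prefill_tokens is my assumption from the Args log, not something I have verified in the code):

    # The two budgets from the launcher output above. Whether the router caps
    # prefill per forward pass is my assumption; the benchmark skips the router
    # and sends the full batch straight to the shard.
    max_batch_total_tokens = 425_472      # inferred by the launcher
    max_batch_prefill_tokens = 32_768     # from the Args log

    batch_size, sequence_length, decode_length = 53, 4000, 4000

    total_tokens = batch_size * (sequence_length + decode_length)   # 424000
    prefill_tokens = batch_size * sequence_length                   # 212000

    print(total_tokens <= max_batch_total_tokens)       # True  -> the limit I checked
    print(prefill_tokens <= max_batch_prefill_tokens)   # False -> the limit the benchmark bypasses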

Grafana dashboard of tgi_batch_current_max_tokens when going through the router (423k tokens in a batch, very close to the inferred MAX_BATCH_TOTAL_TOKENS):

[Screenshot: Grafana dashboard, 2024-04-30 12:42]
