Description
Environment
Runtime environment:
- Target: x86_64-unknown-linux-gnu
- Cargo version: 1.75.0
- Commit sha: c38a7d7
- Docker label: sha-6c4496a
Deployment:
- Kubernetes cluster
- 1× A100 GPU (80 GB)
- 12 CPUs with 32 GB of RAM
- TGI version: 2.0.0
What I am doing
I am running text-generation-benchmark to find the sweet spot between throughput and latency for my hardware. I am trying to maximize the batch tokens by looking at the MAX_BATCH_TOTAL_TOKENS inferred by text-generation-launcher, but I get out-of-memory errors.
When running export LOG_LEVEL=INFO; text-generation-launcher --hostname 0.0.0.0 --port 8080, I see MAX_BATCH_TOTAL_TOKENS inferred to be 425472:
2024-04-30T10:25:51.994120Z INFO text_generation_launcher: Args { model_id: "/model_data/mistral7b-free", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: Some(8000), max_total_tokens: Some(8512), waiting_served_ratio: 0.3, max_batch_prefill_tokens: Some(32768), max_batch_total_tokens: Some(4294967295), max_waiting_tokens: 0, max_batch_size: None, cuda_graphs: None, hostname: "0.0.0.0", port: 8080, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false }
2024-04-30T10:25:51.994178Z INFO text_generation_launcher: Model supports up to 32768 but tgi will now set its default to 4096 instead. This is to save VRAM by refusing large prompts in order to allow more users on the same hardware. You can increase that size using `--max-batch-prefill-tokens=32818 --max-total-tokens=32768 --max-input-tokens=32767`.
2024-04-30T10:25:51.994184Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-04-30T10:25:51.994271Z INFO download: text_generation_launcher: Starting download process.
2024-04-30T10:25:54.856372Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-04-30T10:25:55.330625Z INFO download: text_generation_launcher: Successfully downloaded weights.
2024-04-30T10:25:55.330812Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-04-30T10:26:05.338492Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-04-30T10:26:15.433799Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-04-30T10:26:17.323011Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-04-30T10:26:17.335204Z INFO shard-manager: text_generation_launcher: Shard ready in 22.003836391s rank=0
2024-04-30T10:26:17.431088Z INFO text_generation_launcher: Starting Webserver
2024-04-30T10:26:17.498225Z INFO text_generation_router: router/src/main.rs:250: Using config Some(Mistral)
2024-04-30T10:26:17.498245Z INFO text_generation_router: router/src/main.rs:257: Using local tokenizer config
2024-04-30T10:26:17.498263Z WARN text_generation_router: router/src/main.rs:292: no pipeline tag found for model /model_data/mistral7b-free
2024-04-30T10:26:17.500561Z INFO text_generation_router: router/src/main.rs:311: Warming up model
2024-04-30T10:26:20.987760Z INFO text_generation_launcher: Cuda Graphs are enabled for sizes [1, 2, 4, 8, 16, 32]
2024-04-30T10:26:21.845520Z WARN text_generation_router: router/src/main.rs:333: `--max-batch-total-tokens` is deprecated for Flash Attention models.
2024-04-30T10:26:21.845531Z WARN text_generation_router: router/src/main.rs:337: Inferred max batch total tokens: 425472
2024-04-30T10:26:21.845534Z INFO text_generation_router: router/src/main.rs:348: Setting max batch total tokens to 425472
2024-04-30T10:26:21.845536Z INFO text_generation_router: router/src/main.rs:349: Connected
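For convenience, this is roughly how I pull that number back out of the launcher output (a small sketch; `launcher.log` is just a hypothetical capture of the output above, not a file TGI writes itself):

```python
import re

# Hypothetical capture of the launcher output shown above, e.g.
#   text-generation-launcher --hostname 0.0.0.0 --port 8080 2>&1 | tee launcher.log
LOG_PATH = "launcher.log"

with open(LOG_PATH) as f:
    for line in f:
        # The router logs "Inferred max batch total tokens: <N>" during warmup
        match = re.search(r"Inferred max batch total tokens: (\d+)", line)
        if match:
            print(int(match.group(1)))  # prints 425472 on my setup
            break
```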
Therefore, even though text-generation-benchmark bypasses the router completely, I should be able to process 425472 tokens at the same time without running into out-of-memory errors, right?
So I want to see the latency for this load: 53 requests | 4000 sequence length | 4000 decode length -> 53 requests * (4000 in + 4000 out) = 424000 tokens held concurrently. This is indeed lower than the inferred upper bound (424000 < 425472).
This is the command I am running: text-generation-benchmark --tokenizer-name /model_data/mistral7b-free/ -b 53 --sequence-length 4000 --decode-length 4000
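Just to make the budget check explicit, a quick sketch with the numbers above:

```python
# Token budget I expect this benchmark run to need (all tokens held concurrently)
batch_size = 53
sequence_length = 4000  # prompt tokens per request
decode_length = 4000    # generated tokens per request

concurrent_tokens = batch_size * (sequence_length + decode_length)
inferred_max_batch_total_tokens = 425_472  # from the launcher warmup log above

print(concurrent_tokens)                                     # 424000
print(concurrent_tokens <= inferred_max_batch_total_tokens)  # True
```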
What is the unexpected behavior
However, I get out-of-memory errors. Here are the logs:
text-generation-benchmark
2024-04-30T10:20:12.210371Z INFO text_generation_benchmark: benchmark/src/main.rs:138: Loading tokenizer
2024-04-30T10:20:12.210408Z INFO text_generation_benchmark: benchmark/src/main.rs:144: Found local tokenizer
[text-generation-benchmark TUI output — Model: /model_data/mistral7b-free/ | Sequence Length: 4000 | Decode Length: 4000 | Batch: 53. Total and batch progress stuck at 0 / 1; every prefill and decode latency/throughput panel shows NaN, i.e. no batch completed.]
2024-04-30T10:35:01.733254Z ERROR prefill{id=0 size=53}:prefill{id=0 size=53}: text_generation_client: router/client/src/lib.rs:33: Server error: CANCELLED
text-generation-launcher
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/layers.py", line 159, in forward
return F.linear(input, self.weight, self.bias)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 11.32 GiB. GPU 0 has a total capacty of 79.14 GiB of which 3.98 GiB is free. Process 2218444 has 75.15 GiB memory in use. Of the allocated memory 73.66 GiB is allocated by PyTorch, and 963.77 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
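As a back-of-the-envelope check of where an 11.32 GiB allocation could come from, here is a sketch. It assumes Mistral-7B's MLP intermediate size of 14336, a fused gate/up projection, and fp16 activations; those are my assumptions, not something stated in the logs:

```python
# Back-of-the-envelope: activation size of a single fused gate/up MLP projection
# over the full prefill batch (assumed dims; not taken from the logs above).
prefill_tokens = 53 * 4000                   # 212000 prompt tokens prefetched in one prefill
intermediate_size = 14336                    # Mistral-7B MLP intermediate size (assumption)
fused_out_features = 2 * intermediate_size   # gate_proj + up_proj fused (assumption)
bytes_per_value = 2                          # fp16 activations (assumption)

activation_bytes = prefill_tokens * fused_out_features * bytes_per_value
print(activation_bytes / 2**30)              # ~11.32 GiB, matching the failed allocation
```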
nvidia-smi after OOM error
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100 80GB PCIe On | 00000000:17:00.0 Off | 0 |
| N/A 46C P0 87W / 300W | 76959MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
Extra notes
- I have also tried lower batch sizes (50, 40, 32) and they also lead to an OOM error. I could get batch_size=2 to work, though.
- I have also tried not bypassing the router and sending 60 concurrent requests with sequence + decode lengths = 8000, and this DOES work. So I do NOT understand why it fails when bypassing the router, if the router is only really preventing you from going over MAX_BATCH_TOTAL_TOKENS, which I am explicitly not exceeding when using text-generation-benchmark. What am I missing?
Grafana dashboard of tgi_batch_current_max_tokens when going through the router (423k tokens in a batch, very close to the inferred MAX_BATCH_TOTAL_TOKENS).
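To summarize the numbers that confuse me, a quick sketch using only the figures quoted above:

```python
# Every load I tried, expressed as batch_size * (sequence_length + decode_length)
inferred_max = 425_472  # inferred MAX_BATCH_TOTAL_TOKENS from the launcher

# Benchmark runs (bypassing the router); all but batch_size=2 hit CUDA OOM
for batch_size in (53, 50, 40, 32, 2):
    tokens = batch_size * (4000 + 4000)
    print(batch_size, tokens, tokens <= inferred_max)  # always under the cap

# Going through the router instead: tgi_batch_current_max_tokens peaked around
# 423k (see the Grafana note above), also under the cap, and it worked fine.
router_peak_tokens = 423_000
print(router_peak_tokens <= inferred_max)  # True
```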