Description
Hello,
I'm experiencing significant issues when trying to use Text Generation Inference (TGI) with TensorRT-LLM as the backend.
Problem 1: Version Compatibility
I cannot use the latest version of TGI due to a known bug (see: #3296).
I'm therefore using version: ghcr.io/huggingface/text-generation-inference:3.3.4-trtllm
However, this version uses TensorRT-LLM v0.17.0.post1, while the latest optimum-nvidia release (v0.1.0b9) uses TensorRT-LLM 0.16.0.
When I try to launch TGI with my engine built using optimum-nvidia, I get the following error:
root@5ddf177112d7:/usr/local/tgi/bin# /usr/local/tgi/bin/text-generation-launcher --model-id "/engines/llama-3.2-3b-instruct-optimum/GPU/engines" --tokenizer-name "/models/llama-3.2-3b-instruct" --executor-worker "/usr/local/tgi/bin/executorWorker"
2025-07-27T06:16:40.717109Z INFO text_generation_backends_trtllm: backends/trtllm/src/main.rs:293: Successfully retrieved tokenizer /models/llama-3.2-3b-instruct
[2025-07-27 06:16:40.717] [info] [ffi.hpp:164] Initializing TGI - TensoRT-LLM Backend (v0.17.0.post1)
[2025-07-27 06:16:40.747] [info] [ffi.hpp:173] [FFI] Detected 1 Nvidia GPU(s)
[2025-07-27 06:16:40.758] [info] [backend.cpp:22] Detected single engine deployment, using leader mode
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 64
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 64
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 4096
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (4096) * 28
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 262144
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 4095 = maxSequenceLen - 1 since chunked context is enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: 4096 = maxSequenceLen.
[TensorRT-LLM][INFO] Capacity Scheduler Policy: MAX_UTILIZATION
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 6981 MiB
[TensorRT-LLM][ERROR] IRuntime::deserializeCudaEngine: Error Code 6: API Usage Error (The engine plan file is not compatible with this version of TensorRT, expecting library version 10.8.0.43 got
..)
Error: Runtime("[TensorRT-LLM][ERROR] Assertion failed: Failed to deserialize cuda engine. (/usr/src/text-generation-inference/target/release/build/text-generation-backends-trtllm-479f10d4b58ebb37/out/build/_deps/trtllm-src/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:239)")
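For reference, this is how I confirm the mismatch on my side: the engine's config.json records the builder version, which I compare against the v0.17.0.post1 runtime reported above. This is only a minimal sketch; the path and the top-level "version" key are from my setup and may differ in other releases.

import json

# Engine produced by optimum-cli export trtllm (path from my setup).
engine_config = "/engines/llama-3.2-3b-instruct-optimum/GPU/engines/config.json"

with open(engine_config) as f:
    config = json.load(f)

# The new builder API records the TensorRT-LLM release the engine was built
# with in config.json (assumption on my side: the key is named "version").
print("engine built with TensorRT-LLM:", config.get("version"))
# -> 0.16.0 here, while the TGI backend in the container reports v0.17.0.post1.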
Problem 2: Building Engine with trtllm-build
I attempted to build my engine directly using trtllm-build, but when launching TGI I encounter this error:
2025-07-27T06:15:55.033318Z INFO text_generation_backends_trtllm: backends/trtllm/src/main.rs:293: Successfully retrieved tokenizer /models/llama-3.2-3b-instruct
[2025-07-27 06:15:55.034] [info] [ffi.hpp:164] Initializing TGI - TensoRT-LLM Backend (v0.17.0.post1)
[2025-07-27 06:15:55.101] [info] [ffi.hpp:173] [FFI] Detected 1 Nvidia GPU(s)
terminate called after throwing an instance of 'nlohmann::json_abi_v3_11_3::detail::parse_error'
what(): [json.exception.parse_error.101] parse error at line 1, column 1: attempting to parse an empty input; check that your input string or stream contains the expected JSON
The error suggests the backend is reading an empty or missing JSON file, yet config.json is present in the engine directory:
root@5ddf177112d7:/usr/local/tgi/bin# ls -l /engines/llama-3.2-3b-instruct/
total 3033324
-rw-r--r-- 1 root root 7848 Jul 26 17:21 config.json
-rw-r--r-- 1 root root 3106108276 Jul 26 17:21 rank0.engine
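To rule out a corrupted or empty file, I parse that config.json by hand (a minimal check; path from my setup). If this succeeds, the empty-input parse error presumably comes from some other file or lookup inside the backend.

import json

# Engine directory produced by trtllm-build (path from my setup).
config_path = "/engines/llama-3.2-3b-instruct/config.json"

with open(config_path) as f:
    config = json.load(f)  # would raise if the file were empty or malformed

print("top-level keys:", sorted(config.keys()))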
Environment:
- Model: llama-3.2-3b-instruct
- TGI Version: 3.3.4-trtllm
- TensorRT-LLM Version: v0.17.0.post1
Could you please help resolve these compatibility issues or provide guidance on the correct workflow for using TensorRT-LLM with TGI?
Information
- Docker
- The CLI directly
Tasks
- An officially supported command
- My own modifications
Reproduction
1/ Build your engine:
docker run --rm -it --gpus=1 --shm-size=1g -v "/home/jyce/unmute.mcp/volumes/llm-tgi/engines:/engines" -v "/home/jyce/unmute.mcp/volumes/llm-tgi/models:/models" huggingface/optimum-nvidia:v0.1.0b8-py310 bash
optimum-cli export trtllm \
--tp=1 \
--pp=1 \
--max-batch-size=64 \
--max-input-length 4096 \
--max-output-length 8192 \
--max-beams-width=1 \
--destination /engines/llama-3.2-3b-instruct-optimum /models/llama-3.2-3b-instruct
2/ Start TGI with the engine:
docker run --gpus 1 --shm-size=1g -it --rm -p 8000:8000 -e MODEL="/models/llama-3.2-3b-instruct" -e PORT=8000 -e HF_TOKEN="" -v "/home/jyce/unmute.mcp/volumes/llm-tgi/models:/models" -v "/home/jyce/unmute.mcp/volumes/llm-tgi/engines:/engines" --entrypoint=bash ghcr.io/huggingface/text-generation-inference:3.3.4-trtllm
/usr/local/tgi/bin/text-generation-launcher --model-id "/engines/llama-3.2-3b-instruct-optimum/GPU/engines" --tokenizer-name "/models/llama-3.2-3b-instruct" --executor-worker "/usr/local/tgi/bin/executorWorker"
Expected behavior
TGI serving my engine
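Concretely, once the launcher is up I would expect a standard TGI generate request to succeed, along these lines (port and payload are only an example; the port must match the -p 8000:8000 mapping used above, and the launcher may need --port 8000 if that is not already its listening port):

import requests

# Example call against TGI's /generate route once the server is healthy.
resp = requests.post(
    "http://localhost:8000/generate",
    json={
        "inputs": "Hello, who are you?",
        "parameters": {"max_new_tokens": 32},
    },
    timeout=60,
)
print(resp.status_code, resp.json())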