
Using trtllm-build instead of optimum-nvidia for engine building, or wrong optimum-nvidia version? #3304


Description

@psykokwak-com

Hello,

I'm experiencing significant issues when trying to use Text Generation Inference (TGI) with TensorRT-LLM as the backend.

Problem 1: Version Compatibility
I cannot use the latest version of TGI due to a known bug (see: #3296).

I'm therefore using version: ghcr.io/huggingface/text-generation-inference:3.3.4-trtllm

However, this version ships TensorRT-LLM v0.17.0.post1, while the latest optimum-nvidia release (v0.1.0b9) uses TensorRT-LLM 0.16.0.

When I try to launch TGI with my engine built using optimum-nvidia, I get the following error:

root@5ddf177112d7:/usr/local/tgi/bin# /usr/local/tgi/bin/text-generation-launcher --model-id "/engines/llama-3.2-3b-instruct-optimum/GPU/engines" --tokenizer-name "/models/llama-3.2-3b-instruct" --executor-worker "/usr/local/tgi/bin/executorWorker"
2025-07-27T06:16:40.717109Z  INFO text_generation_backends_trtllm: backends/trtllm/src/main.rs:293: Successfully retrieved tokenizer /models/llama-3.2-3b-instruct
[2025-07-27 06:16:40.717] [info] [ffi.hpp:164] Initializing TGI - TensoRT-LLM Backend (v0.17.0.post1)
[2025-07-27 06:16:40.747] [info] [ffi.hpp:173] [FFI] Detected 1 Nvidia GPU(s)
[2025-07-27 06:16:40.758] [info] [backend.cpp:22] Detected single engine deployment, using leader mode
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 64
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 64
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 4096
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (4096) * 28
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 262144
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 4095  = maxSequenceLen - 1 since chunked context is enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: 4096 = maxSequenceLen.
[TensorRT-LLM][INFO] Capacity Scheduler Policy: MAX_UTILIZATION
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 6981 MiB
[TensorRT-LLM][ERROR] IRuntime::deserializeCudaEngine: Error Code 6: API Usage Error (The engine plan file is not compatible with this version of TensorRT, expecting library version 10.8.0.43 got
..)
Error: Runtime("[TensorRT-LLM][ERROR] Assertion failed: Failed to deserialize cuda engine. (/usr/src/text-generation-inference/target/release/build/text-generation-backends-trtllm-479f10d4b58ebb37/out/build/_deps/trtllm-src/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:239)")
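For reference, the mismatch can be confirmed before launching by comparing the version recorded in the engine's config.json (the value TGI reports above as "Engine version 0.16.0 found in the config file") with the TensorRT library bundled in the TGI image. A minimal sketch, assuming python3 and an importable tensorrt package are present inside the 3.3.4-trtllm container and that the top-level key in config.json is named "version" (not verified here):

# version the engine was built with, as recorded in its config.json
python3 -c "import json; print(json.load(open('/engines/llama-3.2-3b-instruct-optimum/GPU/engines/config.json'))['version'])"
# TensorRT library the runtime ships (the error above expects 10.8.0.43)
python3 -c "import tensorrt; print(tensorrt.__version__)"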

Problem 2: Building Engine with trtllm-build
I attempted to build my engine directly using trtllm-build, but when launching TGI, I encounter this error:

2025-07-27T06:15:55.033318Z  INFO text_generation_backends_trtllm: backends/trtllm/src/main.rs:293: Successfully retrieved tokenizer /models/llama-3.2-3b-instruct
[2025-07-27 06:15:55.034] [info] [ffi.hpp:164] Initializing TGI - TensoRT-LLM Backend (v0.17.0.post1)
[2025-07-27 06:15:55.101] [info] [ffi.hpp:173] [FFI] Detected 1 Nvidia GPU(s)
terminate called after throwing an instance of 'nlohmann::json_abi_v3_11_3::detail::parse_error'
  what():  [json.exception.parse_error.101] parse error at line 1, column 1: attempting to parse an empty input; check that your input string or stream contains the expected JSON

The error suggests the backend is reading an empty or missing JSON file, yet config.json is present in the engine directory:

root@5ddf177112d7:/usr/local/tgi/bin# ls -l /engines/llama-3.2-3b-instruct/
total 3033324
-rw-r--r-- 1 root root       7848 Jul 26 17:21 config.json
-rw-r--r-- 1 root root 3106108276 Jul 26 17:21 rank0.engine
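To rule out a truncated or malformed file (as opposed to a missing one), the config can be validated in place from inside the TGI container. A minimal sketch, assuming python3 is available in the image; the path is the directory listed above:

# exits non-zero and prints a parse error if the file is empty or not valid JSON
python3 -m json.tool /engines/llama-3.2-3b-instruct/config.json > /dev/null \
    && echo "config.json is non-empty, valid JSON"

If that check passes, the parse error presumably comes from some other file or path the backend tries to read, which is part of what this issue is asking the maintainers to clarify.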

Environment:

  • Model: llama-3.2-3b-instruct
  • TGI Version: 3.3.4-trtllm
  • TensorRT-LLM Version: v0.17.0.post1

Could you please help resolve these compatibility issues or provide guidance on the correct workflow for using TensorRT-LLM with TGI?

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

1/ Build your engine:
docker run --rm -it --gpus=1 --shm-size=1g -v "/home/jyce/unmute.mcp/volumes/llm-tgi/engines:/engines" -v "/home/jyce/unmute.mcp/volumes/llm-tgi/models:/models" huggingface/optimum-nvidia:v0.1.0b8-py310 bash

 optimum-cli export trtllm \
    --tp=1 \
    --pp=1 \
    --max-batch-size=64 \
    --max-input-length 4096 \
    --max-output-length 8192 \
    --max-beams-width=1 \
    --destination /engines/llama-3.2-3b-instruct-optimum /models/llama-3.2-3b-instruct
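The export writes the compiled engine into a GPU/engines subdirectory of the destination; that subdirectory is the path passed to --model-id in step 2 below. A quick listing to confirm the layout (sketch only, output not reproduced here):

ls -l /engines/llama-3.2-3b-instruct-optimum/GPU/engines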

2/ Start TGI with the engine:

docker run --gpus 1 --shm-size=1g -it --rm -p 8000:8000 -e MODEL="/models/llama-3.2-3b-instruct" -e PORT=8000 -e HF_TOKEN="" -v "/home/jyce/unmute.mcp/volumes/llm-tgi/models:/models" -v "/home/jyce/unmute.mcp/volumes/llm-tgi/engines:/engines" --entrypoint=bash ghcr.io/huggingface/text-generation-inference:3.3.4-trtllm

/usr/local/tgi/bin/text-generation-launcher --model-id "/engines/llama-3.2-3b-instruct-optimum/GPU/engines" --tokenizer-name "/models/llama-3.2-3b-instruct" --executor-worker "/usr/local/tgi/bin/executorWorker"

Expected behavior

TGI serving my engine
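i.e. the launcher stays up and answers generation requests, for example (assuming the router picks up PORT=8000 from the environment set in the docker run above; otherwise the launcher's default port applies):

curl -s http://localhost:8000/generate \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "Hello, who are you?", "parameters": {"max_new_tokens": 32}}'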
