Description
Hello,
I'm experiencing significant issues when trying to use Text Generation Inference (TGI) with TensorRT-LLM as the backend.
Problem 1: Version Compatibility
I cannot use the latest version of TGI due to a known bug (see: #3296).
I'm therefore using version: ghcr.io/huggingface/text-generation-inference:3.3.4-trtllm
However, this version uses TensorRT-LLM v0.17.0.post1, while the latest optimum-nvidia release (v0.1.0b9) uses TensorRT-LLM 0.16.0.
When I try to launch TGI with my engine built using optimum-nvidia, I get the following error:
root@5ddf177112d7:/usr/local/tgi/bin# /usr/local/tgi/bin/text-generation-launcher --model-id "/engines/llama-3.2-3b-instruct-optimum/GPU/engines" --tokenizer-name "/models/llama-3.2-3b-instruct" --executor-worker "/usr/local/tgi/bin/executorWorker"
2025-07-27T06:16:40.717109Z INFO text_generation_backends_trtllm: backends/trtllm/src/main.rs:293: Successfully retrieved tokenizer /models/llama-3.2-3b-instruct
[2025-07-27 06:16:40.717] [info] [ffi.hpp:164] Initializing TGI - TensoRT-LLM Backend (v0.17.0.post1)
[2025-07-27 06:16:40.747] [info] [ffi.hpp:173] [FFI] Detected 1 Nvidia GPU(s)
[2025-07-27 06:16:40.758] [info] [backend.cpp:22] Detected single engine deployment, using leader mode
[TensorRT-LLM][INFO] Engine version 0.16.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 1, MPI local size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 64
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 64
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 4096
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (4096) * 28
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 262144
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 4095 = maxSequenceLen - 1 since chunked context is enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: 4096 = maxSequenceLen.
[TensorRT-LLM][INFO] Capacity Scheduler Policy: MAX_UTILIZATION
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 6981 MiB
[TensorRT-LLM][ERROR] IRuntime::deserializeCudaEngine: Error Code 6: API Usage Error (The engine plan file is not compatible with this version of TensorRT, expecting library version 10.8.0.43 got
..)
Error: Runtime("[TensorRT-LLM][ERROR] Assertion failed: Failed to deserialize cuda engine. (/usr/src/text-generation-inference/target/release/build/text-generation-backends-trtllm-479f10d4b58ebb37/out/build/_deps/trtllm-src/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:239)")
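For reference, this is how I confirm the mismatch on my side: the engine's config.json records the builder version, which I compare against the v0.17.0.post1 runtime reported above. This is only a minimal sketch; the path and the top-level "version" key are from my setup and may differ in other releases.

import json

# Engine produced by optimum-cli export trtllm (path from my setup).
engine_config = "/engines/llama-3.2-3b-instruct-optimum/GPU/engines/config.json"

with open(engine_config) as f:
    config = json.load(f)

# The new builder API records the TensorRT-LLM release the engine was built
# with in config.json (assumption on my side: the key is named "version").
print("engine built with TensorRT-LLM:", config.get("version"))
# -> 0.16.0 here, while the TGI backend in the container reports v0.17.0.post1.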
Problem 2: Building Engine with trtllm-build
I attempted to build my engine directly using trtllm-build, but when launching TGI I encounter this error:
2025-07-27T06:15:55.033318Z INFO text_generation_backends_trtllm: backends/trtllm/src/main.rs:293: Successfully retrieved tokenizer /models/llama-3.2-3b-instruct
[2025-07-27 06:15:55.034] [info] [ffi.hpp:164] Initializing TGI - TensoRT-LLM Backend (v0.17.0.post1)
[2025-07-27 06:15:55.101] [info] [ffi.hpp:173] [FFI] Detected 1 Nvidia GPU(s)
terminate called after throwing an instance of 'nlohmann::json_abi_v3_11_3::detail::parse_error'
what(): [json.exception.parse_error.101] parse error at line 1, column 1: attempting to parse an empty input; check that your input string or stream contains the expected JSON
The error suggests the backend is reading an empty or missing JSON file, yet config.json is present in the engine directory:
root@5ddf177112d7:/usr/local/tgi/bin# ls -l /engines/llama-3.2-3b-instruct/
total 3033324
-rw-r--r-- 1 root root 7848 Jul 26 17:21 config.json
-rw-r--r-- 1 root root 3106108276 Jul 26 17:21 rank0.engine
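To rule out a corrupted or empty file, I parse that config.json by hand (a minimal check; path from my setup). If this succeeds, the empty-input parse error presumably comes from some other file or lookup inside the backend.

import json

# Engine directory produced by trtllm-build (path from my setup).
config_path = "/engines/llama-3.2-3b-instruct/config.json"

with open(config_path) as f:
    config = json.load(f)  # would raise if the file were empty or malformed

print("top-level keys:", sorted(config.keys()))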
Environment:
- Model: llama-3.2-3b-instruct
- TGI Version: 3.3.4-trtllm
- TensorRT-LLM Version: v0.17.0.post1
Could you please help resolve these compatibility issues or provide guidance on the correct workflow for using TensorRT-LLM with TGI?
Information
- Docker
- The CLI directly
Tasks
- An officially supported command
- My own modifications
Reproduction
1/ Build your engine:
docker run --rm -it --gpus=1 --shm-size=1g -v "/home/jyce/unmute.mcp/volumes/llm-tgi/engines:/engines" -v "/home/jyce/unmute.mcp/volumes/llm-tgi/models:/models" huggingface/optimum-nvidia:v0.1.0b8-py310 bash
optimum-cli export trtllm \
--tp=1 \
--pp=1 \
--max-batch-size=64 \
--max-input-length 4096 \
--max-output-length 8192 \
--max-beams-width=1 \
--destination /engines/llama-3.2-3b-instruct-optimum /models/llama-3.2-3b-instruct
2/ Start TGI with the engine:
docker run --gpus 1 --shm-size=1g -it --rm -p 8000:8000 -e MODEL="/models/llama-3.2-3b-instruct" -e PORT=8000 -e HF_TOKEN="" -v "/home/jyce/unmute.mcp/volumes/llm-tgi/models:/models" -v "/home/jyce/unmute.mcp/volumes/llm-tgi/engines:/engines" --entrypoint=bash ghcr.io/huggingface/text-generation-inference:3.3.4-trtllm
/usr/local/tgi/bin/text-generation-launcher --model-id "/engines/llama-3.2-3b-instruct-optimum/GPU/engines" --tokenizer-name "/models/llama-3.2-3b-instruct" --executor-worker "/usr/local/tgi/bin/executorWorker"
Expected behavior
TGI serving my engine
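Concretely, once the launcher is up I would expect a standard TGI generate request to succeed, along these lines (port and payload are only an example; the port must match the -p 8000:8000 mapping used above, and the launcher may need --port 8000 if that is not already its listening port):

import requests

# Example call against TGI's /generate route once the server is healthy.
resp = requests.post(
    "http://localhost:8000/generate",
    json={
        "inputs": "Hello, who are you?",
        "parameters": {"max_new_tokens": 32},
    },
    timeout=60,
)
print(resp.status_code, resp.json())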