Unable to launch triton server with TP #577

Open
dhruvmullick opened this issue Aug 19, 2024 · 5 comments
Labels: bug (Something isn't working)

dhruvmullick commented Aug 19, 2024

System Info

Built tensorrtllm_backend from source using dockerfile/Dockerfile.trt_llm_backend

tensorrt_llm 0.13.0.dev2024081300
tritonserver 2.48.0
triton image: 24.07
CUDA 12.5

Who can help?

@Tracin @kaiyux @schetlur-nv

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I've built a TensorRT-LLM engine for Meta Llama 3 8B, and the Triton server gets stuck while spawning whenever tensor parallelism > 1 is used.

Things work if I don't use TP when building the engine and spawning the server.

Build the Engine:

python3 quantize.py --model_dir meta_llama_3_8B_instruct_fp16 \
    --dtype float16 \
    --qformat int4_awq \
    --awq_block_size 128 \
    --output_dir /tmp/trt_checkpoint \
    --batch_size 8 \
    --calib_size 32 \
    --tp_size 2

CUDA_VISIBLE_DEVICES=0,1 trtllm-build --checkpoint_dir /tmp/trt_checkpoint \
    --gemm_plugin float16 \
    --gpt_attention_plugin float16 \
    --kv_cache_type=paged \
    --remove_input_padding enable \
    --context_fmha enable \
    --use_paged_context_fmha enable \
    --max_seq_len 8000 \
    --max_num_tokens 4096 \
    --max_batch_size 8 \
    --output_dir trt_model \
    --log_level verbose \
    --multiple_profiles enable \
    --workers 2
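
As a quick sanity check on my side (not a step from the official docs), I'd expect the output directory of a tp_size 2 build to contain one engine per rank plus the builder config:

# quick sanity check (my own step, not from the docs)
ls trt_model
# expected (assumption): config.json  rank0.engine  rank1.engine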

Command used to launch the server:

python3 launch_triton_server.py --model_repo=triton_model_repo_copy \
    --world_size 2 \
    --tensorrt_llm_model_name=meta_llama_3_8B_instruct_trt \
    --log \
    --log-file /tmp/logs.txt \
    --force

Expected behavior

The server should spawn and start serving requests on localhost.
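
For example (a minimal check, assuming the default HTTP port 8000), once the server is up this should report the server as ready:

# readiness probe on the default HTTP port (8000 is an assumption; adjust if remapped)
curl -v localhost:8000/v2/health/ready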

Actual behavior

I see the following logs on the console:

I0819 17:56:19.460307 16867 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
I0819 17:56:19.460335 16867 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 1 with size 67108864"
I0819 17:56:19.752444 16867 model_lifecycle.cc:472] "loading: meta_llama_3_8B_instruct_trt:1"
I0819 17:56:19.918688 16867 libtensorrtllm.cc:55] "TRITONBACKEND_Initialize: tensorrtllm"
I0819 17:56:19.918732 16867 libtensorrtllm.cc:62] "Triton TRITONBACKEND API version: 1.19"
I0819 17:56:19.918737 16867 libtensorrtllm.cc:66] "'tensorrtllm' TRITONBACKEND API version: 1.19"
I0819 17:56:19.918742 16867 libtensorrtllm.cc:86] "backend configuration:\n{\"cmdline\":{\"auto-complete-config\":\"false\",\"backend-directory\":\"/opt/tritonserver/backends\",\"min-compute-capability\":\"6.000000\",\"default-max-batch-size\":\"4\"}}"
[TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000
[TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value
[TensorRT-LLM][WARNING] enable_chunked_context is set to true, will use context chunking (requires building the model with use_paged_context_fmha).
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][WARNING] multiBlockMode is not specified, will be set to false
[TensorRT-LLM][WARNING] enableContextFMHAFP32Acc is not specified, will be set to false
I0819 17:56:19.933735 16867 libtensorrtllm.cc:114] "TRITONBACKEND_ModelInitialize: meta_llama_3_8B_instruct_trt (version 1)"
[TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000
[TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value
[TensorRT-LLM][WARNING] enable_chunked_context is set to true, will use context chunking (requires building the model with use_paged_context_fmha).
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][WARNING] multiBlockMode is not specified, will be set to false
[TensorRT-LLM][WARNING] enableContextFMHAFP32Acc is not specified, will be set to false
[TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0
[TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0
[TensorRT-LLM][WARNING] encoder_model_path is not specified, will be left empty
[TensorRT-LLM][WARNING] encoder_model_path is not specified, will be left empty
[TensorRT-LLM][INFO] Engine version 0.13.0.dev2024081300 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Engine version 0.13.0.dev2024081300 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 3
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Initialized MPI
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] MPI size: 2, MPI local size: 2, rank: 1
[TensorRT-LLM][INFO] Refreshed the MPI local session
[TensorRT-LLM][INFO] Rank 1 is using GPU 1
[TensorRT-LLM][INFO] MPI size: 2, MPI local size: 2, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 8
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 8
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 8000
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (8000) * 32
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 4096
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 7999  = maxSequenceLen - 1 since chunked context is enabled
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 8
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 8
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 8000
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: 8000 = maxSequenceLen.
[TensorRT-LLM][INFO] Capacity Scheduler Policy: MAX_UTILIZATION
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: (8000) * 32
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 4096
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 7999  = maxSequenceLen - 1 since chunked context is enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: 8000 = maxSequenceLen.
[TensorRT-LLM][INFO] Capacity Scheduler Policy: MAX_UTILIZATION
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 2771 MiB
[TensorRT-LLM][INFO] Loaded engine size: 2771 MiB
[TensorRT-LLM][INFO] Detecting local TP group for rank 1
[TensorRT-LLM][INFO] Detecting local TP group for rank 0
[TensorRT-LLM][INFO] TP group is intra-node for rank 0
[TensorRT-LLM][INFO] TP group is intra-node for rank 1
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 240.02 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 240.02 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2723 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2723 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2723 (MiB)
[TensorRT-LLM][INFO] Switching optimization profile from: 0 to 1. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2723 (MiB)
[TensorRT-LLM][INFO] Switching optimization profile from: 0 to 1. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2723 (MiB)
[TensorRT-LLM][INFO] Switching optimization profile from: 0 to 2. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2723 (MiB)
[TensorRT-LLM][INFO] Switching optimization profile from: 0 to 2. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2723 (MiB)
[TensorRT-LLM][INFO] Switching optimization profile from: 0 to 3. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2723 (MiB)
[TensorRT-LLM][INFO] Switching optimization profile from: 0 to 3. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2723 (MiB)
[TensorRT-LLM][INFO] Switching optimization profile from: 0 to 4. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2723 (MiB)
[TensorRT-LLM][INFO] Switching optimization profile from: 0 to 4. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2723 (MiB)
[TensorRT-LLM][INFO] Switching optimization profile from: 0 to 5. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 2723 (MiB)
[TensorRT-LLM][INFO] Switching optimization profile from: 0 to 5. Please ensure there are no enqueued operations pending in this context prior to switching profiles
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 3.44 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 3.44 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 15.05 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.14 GiB, available: 74.69 GiB
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 15.05 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.14 GiB, available: 74.69 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 17210
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 10728, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 17210
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 10728, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 125
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 125
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 67.23 GiB for max tokens in paged KV cache (1101440).
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 67.23 GiB for max tokens in paged KV cache (1101440).
[TensorRT-LLM][WARNING] cancellation_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] stats_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] cancellation_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] stats_check_period_ms is not specified, will be set to 100 (ms)
[TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000
[TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true
[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
[TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value
[TensorRT-LLM][WARNING] enable_chunked_context is set to true, will use context chunking (requires building the model with use_paged_context_fmha).
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][WARNING] multiBlockMode is not specified, will be set to false
[TensorRT-LLM][WARNING] enableContextFMHAFP32Acc is not specified, will be set to false
[TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0
[TensorRT-LLM][WARNING] encoder_model_path is not specified, will be left empty
[TensorRT-LLM][INFO] Engine version 0.13.0.dev2024081300 found in the config file, assuming engine(s) built by new builder API.

In the /tmp/logs.txt file, the last output I see is:

I0819 17:57:02.475316 16866 backend_model_instance.cc:783] "Starting backend thread for meta_llama_3_8B_instruct_trt_0_0 at nice 0 on device 0..."
I0819 17:57:02.475672 16866 backend_model.cc:675] "Created model instance named 'meta_llama_3_8B_instruct_trt_0_0' with device id '0'"

And nothing after this.
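
As a debugging sketch (my own attempt at narrowing this down, not a confirmed fix), relaunching with verbose NCCL logging should show whether the hang happens inside NCCL initialization for the TP group:

# illustrative relaunch with NCCL init logging enabled (same arguments as above)
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT python3 launch_triton_server.py \
    --model_repo=triton_model_repo_copy \
    --world_size 2 \
    --tensorrt_llm_model_name=meta_llama_3_8B_instruct_trt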

Additional notes

N/A

dhruvmullick (Author) commented:

I even tried without quantization, following the steps given in the official examples:

python convert_checkpoint.py --model_dir meta_llama_3_8B_instruct \
    --output_dir /tmp/tllm_checkpoint_2gpu_tp2 \
    --dtype bfloat16 \
    --tp_size 2

trtllm-build --checkpoint_dir /tmp/tllm_checkpoint_2gpu_tp2 \
    --output_dir meta_llama_3_1_8B_instruct/bf16/2-gpu/ \
    --max_batch_size 8 \
    --gemm_plugin auto

Still stuck.

I also tried making the batch size consistent between the Triton model config and the built engine, but it didn't help.
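
For reference, this is roughly how I set the batch size in the model config (a sketch using tools/fill_template.py from this repo; the exact key names depend on the backend version's config.pbtxt template):

# illustrative values; key names depend on the version's config.pbtxt template
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt \
    triton_max_batch_size:8,decoupled_mode:False,engine_dir:/path/to/engines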


dhruvmullick commented Aug 30, 2024

I tried the official image nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3, which was released two days ago, and built the TRT engines from it.

The problem remains, even with reduce_fusion enabled. Logs below:

Logs

root@763cf08503e3:/workspace/dhruv_artificial_agency/inference_service# python3 launch_triton_server.py --world_size=2 --model_repo=models_dhruv --log --tensorrt_llm_model_name=meta_llama_3_1_8B_instruct_vanilla_trt

root@763cf08503e3:/workspace/dhruv_artificial_agency/inference_service# I0830 02:11:11.780256 2914 pinned_memory_manager.cc:277] "Pinned memory pool is created at '0x7fe13e000000' with size 268435456"

I0830 02:11:11.797627 2914 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"

I0830 02:11:11.797656 2914 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 1 with size 67108864"

I0830 02:11:12.083192 2914 model_lifecycle.cc:472] "loading: meta_llama_3_1_8B_instruct_vanilla_trt:1"

I0830 02:11:12.265538 2914 libtensorrtllm.cc:55] "TRITONBACKEND_Initialize: tensorrtllm"

I0830 02:11:12.265587 2914 libtensorrtllm.cc:62] "Triton TRITONBACKEND API version: 1.19"

I0830 02:11:12.265592 2914 libtensorrtllm.cc:66] "'tensorrtllm' TRITONBACKEND API version: 1.19"

I0830 02:11:12.265597 2914 libtensorrtllm.cc:86] "backend configuration:\n{\"cmdline\":{\"auto-complete-config\":\"false\",\"backend-directory\":\"/opt/tritonserver/backends\",\"min-compute-capability\":\"6.000000\",\"default-max-batch-size\":\"4\"}}"

I0830 02:11:12.281716 2914 libtensorrtllm.cc:114] "TRITONBACKEND_ModelInitialize: meta_llama_3_1_8B_instruct_vanilla_trt (version 1)"

[TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000

[TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0

[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value

[TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true

[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)

[TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value

[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64

[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8

[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05

[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB

[TensorRT-LLM][WARNING] multiBlockMode is not specified, will be set to false

[TensorRT-LLM][WARNING] enableContextFMHAFP32Acc is not specified, will be set to false

[TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000

[TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0

[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value

[TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true

[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)

[TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value

[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64

[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8

[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05

[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB

[TensorRT-LLM][WARNING] multiBlockMode is not specified, will be set to false

[TensorRT-LLM][WARNING] enableContextFMHAFP32Acc is not specified, will be set to false

[TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0

[TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0

[TensorRT-LLM][WARNING] encoder_model_path is not specified, will be left empty

[TensorRT-LLM][WARNING] encoder_model_path is not specified, will be left empty

[TensorRT-LLM][INFO] Engine version 0.12.0.dev2024080600 found in the config file, assuming engine(s) built by new builder API.

[TensorRT-LLM][INFO] Engine version 0.12.0.dev2024080600 found in the config file, assuming engine(s) built by new builder API.

[TensorRT-LLM][INFO] Initializing MPI with thread mode 3

[TensorRT-LLM][INFO] Initializing MPI with thread mode 3

[TensorRT-LLM][INFO] Initialized MPI

[TensorRT-LLM][INFO] Initialized MPI

[TensorRT-LLM][INFO] Refreshed the MPI local session

[TensorRT-LLM][INFO] Refreshed the MPI local session

[TensorRT-LLM][INFO] MPI size: 2, MPI local size: 2, rank: 0

[TensorRT-LLM][INFO] MPI size: 2, MPI local size: 2, rank: 1

[TensorRT-LLM][INFO] Rank 0 is using GPU 0

[TensorRT-LLM][INFO] Rank 1 is using GPU 1

[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 8

[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 8

[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1

[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 1048576

[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0

[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1048576

[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0

[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1

[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192

[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8192 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled

[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).

[TensorRT-LLM][INFO] Capacity Scheduler Policy: MAX_UTILIZATION

[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None

[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 8

[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 8

[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1

[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 1048576

[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0

[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 1048576

[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0

[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1

[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192

[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8192 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled

[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).

[TensorRT-LLM][INFO] Capacity Scheduler Policy: MAX_UTILIZATION

[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None

[TensorRT-LLM][INFO] Loaded engine size: 8231 MiB

[TensorRT-LLM][INFO] Loaded engine size: 8231 MiB

[TensorRT-LLM][INFO] Detecting local TP group for rank 1

[TensorRT-LLM][INFO] Detecting local TP group for rank 0

[TensorRT-LLM][INFO] TP group is intra-node for rank 0

[TensorRT-LLM][INFO] TP group is intra-node for rank 1

[TensorRT-LLM][INFO] [MemUsageChange] Allocated 480.04 MiB for execution context memory.

[TensorRT-LLM][INFO] [MemUsageChange] Allocated 480.04 MiB for execution context memory.

[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 8223 (MiB)

[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 8223 (MiB)

[TensorRT-LLM][INFO] [MemUsageChange] Allocated 195.95 MB GPU memory for runtime buffers.

[TensorRT-LLM][INFO] [MemUsageChange] Allocated 195.95 MB GPU memory for runtime buffers.

[TensorRT-LLM][INFO] [MemUsageChange] Allocated 148.07 MB GPU memory for decoder.

[TensorRT-LLM][INFO] [MemUsageChange] Allocated 148.07 MB GPU memory for decoder.

[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.14 GiB, available: 68.94 GiB

[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 79.14 GiB, available: 68.94 GiB

[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 15885

[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 10728, onboard blocks to primary memory before reuse: true

[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 15885

[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 10728, onboard blocks to primary memory before reuse: true

[TensorRT-LLM][WARNING] maxAttentionWindow and maxSequenceLen are too large for at least one sequence to fit in kvCache. they are reduced to 1016640

[TensorRT-LLM][WARNING] maxAttentionWindow and maxSequenceLen are too large for at least one sequence to fit in kvCache. they are reduced to 1016640

[TensorRT-LLM][INFO] Max KV cache pages per sequence: 15885

[TensorRT-LLM][INFO] Max KV cache pages per sequence: 15885

[TensorRT-LLM][INFO] Number of tokens per block: 64.

[TensorRT-LLM][INFO] [MemUsageChange] Allocated 62.05 GiB for max tokens in paged KV cache (1016640).

[TensorRT-LLM][INFO] Number of tokens per block: 64.

[TensorRT-LLM][INFO] [MemUsageChange] Allocated 62.05 GiB for max tokens in paged KV cache (1016640).

[TensorRT-LLM][WARNING] cancellation_check_period_ms is not specified, will be set to 100 (ms)

[TensorRT-LLM][WARNING] stats_check_period_ms is not specified, will be set to 100 (ms)

[TensorRT-LLM][WARNING] cancellation_check_period_ms is not specified, will be set to 100 (ms)

[TensorRT-LLM][WARNING] stats_check_period_ms is not specified, will be set to 100 (ms)

[TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000

[TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0

[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value

[TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true

[TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)

[TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value

[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64

[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8

[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05

[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB

[TensorRT-LLM][WARNING] multiBlockMode is not specified, will be set to false

[TensorRT-LLM][WARNING] enableContextFMHAFP32Acc is not specified, will be set to false

[TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0

[TensorRT-LLM][WARNING] encoder_model_path is not specified, will be left empty

[TensorRT-LLM][INFO] Engine version 0.12.0.dev2024080600 found in the config file, assuming engine(s) built by new builder API.

Way to recreate:

  1. Enter the image nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3 (a sketch of this is included after step 4 below)
  2. Use the same build commands as in my earlier comment on this issue (#577 (comment))
  3. Use the config file below:
Config


name: "meta_llama_3_1_8B_instruct_vanilla_trt"
backend: "tensorrtllm"
max_batch_size: 8

model_transaction_policy {
  decoupled: False
}

input [
  {
    name: "input_ids"
    data_type: TYPE_INT32
    dims: [ -1 ]
    allow_ragged_batch: true
  },
  {
    name: "input_lengths"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
  },
  {
    name: "request_output_len"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
  },
  {
    name: "draft_input_ids"
    data_type: TYPE_INT32
    dims: [ -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "decoder_input_ids"
    data_type: TYPE_INT32
    dims: [ -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "decoder_input_lengths"
    data_type: TYPE_INT32
    dims: [ 1 ]
    optional: true
    reshape: { shape: [ ] }
  },
  {
    name: "draft_logits"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "draft_acceptance_threshold"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "end_id"
    data_type: TYPE_INT32
    dims: [ 1 ] # TRTLLM only supports a single end id
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "pad_id"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "stop_words_list"
    data_type: TYPE_INT32
    dims: [ 2, -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "bad_words_list"
    data_type: TYPE_INT32
    dims: [ 2, -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "embedding_bias"
    data_type: TYPE_FP32
    dims: [ -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "beam_width"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "temperature"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "runtime_top_k"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "runtime_top_p"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "runtime_top_p_min"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "runtime_top_p_decay"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "runtime_top_p_reset_ids"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "len_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "early_stopping"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "repetition_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "min_length"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "beam_search_diversity_rate"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "presence_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "frequency_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "random_seed"
    data_type: TYPE_UINT64
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "return_log_probs"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "return_context_logits"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "return_generation_logits"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "stop"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "streaming"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "prompt_embedding_table"
    data_type: TYPE_FP16
    dims: [ -1, -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "prompt_vocab_size"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  # the unique task ID for the given LoRA.
  # To perform inference with a specific LoRA for the first time `lora_task_id` `lora_weights` and `lora_config` must all be given.
  # The LoRA will be cached, so that subsequent requests for the same task only require `lora_task_id`.
  # If the cache is full the oldest LoRA will be evicted to make space for new ones.  An error is returned if `lora_task_id` is not cached.
  {
    name: "lora_task_id"
    data_type: TYPE_UINT64
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  # weights for a lora adapter shape [ num_lora_modules_layers, D x Hi + Ho x D ]
  # where the last dimension holds the in / out adapter weights for the associated module (e.g. attn_qkv) and model layer
  # each of the in / out tensors are first flattened and then concatenated together in the format above.
  # D=adapter_size (R value), Hi=hidden_size_in, Ho=hidden_size_out.
  {
    name: "lora_weights"
    data_type: TYPE_FP16
    dims: [ -1, -1 ]
    optional: true
    allow_ragged_batch: true
  },
  # module identifier (same size as first dimension of lora_weights)
  # See LoraModule::ModuleType for module id mapping
  #
  # "attn_qkv": 0     # combined qkv adapter
  # "attn_q": 1       # q adapter
  # "attn_k": 2       # k adapter
  # "attn_v": 3       # v adapter
  # "attn_dense": 4   # adapter for the dense layer in attention
  # "mlp_h_to_4h": 5  # for llama2 adapter for gated mlp layer after attention / RMSNorm: up projection
  # "mlp_4h_to_h": 6  # for llama2 adapter for gated mlp layer after attention / RMSNorm: down projection
  # "mlp_gate": 7     # for llama2 adapter for gated mlp layer after attention / RMSNorm: gate
  #
  # last dim holds [ module_id, layer_idx, adapter_size (D aka R value) ]
  {
    name: "lora_config"
    data_type: TYPE_INT32
    dims: [ -1, 3 ]
    optional: true
    allow_ragged_batch: true
  }
]
output [
  {
    name: "output_ids"
    data_type: TYPE_INT32
    dims: [ -1, -1 ]
  },
  {
    name: "sequence_length"
    data_type: TYPE_INT32
    dims: [ -1 ]
  },
  {
    name: "cum_log_probs"
    data_type: TYPE_FP32
    dims: [ -1 ]
  },
  {
    name: "output_log_probs"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  },
  {
    name: "context_logits"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  },
  {
    name: "generation_logits"
    data_type: TYPE_FP32
    dims: [ -1, -1, -1 ]
  },
  {
    name: "batch_index"
    data_type: TYPE_INT32
    dims: [ 1 ]
  }
]
instance_group [
  {
    count: 1
    kind : KIND_GPU
    gpus: [ 0, 1 ]
  }
]
parameters: {
  key: "max_beam_width"
  value: {
    string_value: "1"
  }
}
parameters: {
  key: "FORCE_CPU_ONLY_INPUT_TENSORS"
  value: {
    string_value: "no"
  }
}
parameters: {
  key: "gpt_model_type"
  value: {
    string_value: "inflight_fused_batching"
  }
}
parameters: {
  key: "gpt_model_path"
  value: {
    string_value: "data/trt_models/meta_llama_3_1_8B_instruct/vanilla_06_08_24_4bit"
  }
}
parameters: {
  key: "encoder_model_path"
  value: {
  }
}
parameters: {
  key: "max_tokens_in_paged_kv_cache"
  value: {
  }
}
parameters: {
  key: "max_attention_window_size"
  value: {
    string_value: "max_sequence_length"
  }
}
parameters: {
  key: "sink_token_length"
  value: {
  }
}
parameters: {
  key: "batch_scheduler_policy"
  value: {
    string_value: "max_utilization"
  }
}
parameters: {
  key: "kv_cache_free_gpu_mem_fraction"
  value: {
    string_value: "0.9"
  }
}
parameters: {
  key: "kv_cache_host_memory_bytes"
  value: {
    string_value: "45000000000"
  }
}
parameters: {
  key: "exclude_input_in_output"
  value: {
    string_value: "true"
  }
}
parameters: {
  key: "cancellation_check_period_ms"
  value: {
  }
}
parameters: {
  key: "stats_check_period_ms"
  value: {
  }
}
parameters: {
  key: "iter_stats_max_iterations"
  value: {
  }
}
parameters: {
  key: "request_stats_max_iterations"
  value: {
  }
}
parameters: {
  key: "enable_kv_cache_reuse"
  value: {
    string_value: "false"
  }
}
parameters: {
  key: "normalize_log_probs"
  value: {
    string_value: "true"
  }
}
parameters: {
  key: "enable_chunked_context"
  value: {
    string_value: "false"
  }
}
parameters: {
  key: "gpu_device_ids"
  value: {
    string_value: "0, 1"
  }
}
parameters: {
  key: "lora_cache_optimal_adapter_size"
  value: {
  }
}
parameters: {
  key: "lora_cache_max_adapter_size"
  value: {
  }
}
parameters: {
  key: "lora_cache_gpu_memory_fraction"
  value: {
  }
}
parameters: {
  key: "lora_cache_host_memory_bytes"
  value: {
  }
}
parameters: {
  key: "decoding_mode"
  value: {
    string_value: "top_k"
  }
}
parameters: {
  key: "executor_worker_path"
  value: {
    string_value: "/opt/tritonserver/backends/tensorrtllm/trtllmExecutorWorker"
  }
}
parameters: {
  key: "medusa_choices"
  value: {
  }
}
parameters: {
  key: "gpu_weights_percent"
  value: {
  }
}
parameters: {
  key: "enable_context_fmha_fp32_acc"
  value: {
    string_value: "false"
  }
}
parameters: {
  key: "multi_block_mode"
  value: {
    string_value: "false"
  }
}
parameters: {
  key: "max_num_tokens"
  value: {
    string_value: "16384"
  }
}

  4. Spawn the Triton server using:

python3 launch_triton_server.py --world_size=2 --model_repo=models_dhruv --log --tensorrt_llm_model_name=meta_llama_3_1_8B_instruct_vanilla_trt
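
For completeness, this is roughly how the image in step 1 can be entered (the mounts and flags here are illustrative, not copied verbatim from my setup):

# flags and mount are illustrative, not copied from my exact setup
docker run --rm -it --gpus all --shm-size=2g \
    -v $(pwd):/workspace -w /workspace \
    nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3 bash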

imihic commented Oct 11, 2024

@dhruvmullick I'm facing the same problem on my multi-GPU server with 4x L40S. Have you managed to solve it?


dhruvmullick commented Oct 11, 2024

@imihic, after spending a week on this, I pivoted to vLLM.
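
For reference, a minimal sketch of serving the same model with tensor parallelism in vLLM (the model id and flags are illustrative):

# model id and flags are illustrative
python3 -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --tensor-parallel-size 2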

jasonngap1 commented:

Are there any updates on this issue, please?
