Your current environment
The output of python collect_env.py
==============================
System Info
==============================
OS : Ubuntu 22.04.5 LTS (x86_64)
GCC version : (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version : Could not collect
CMake version : version 3.22.1
Libc version : glibc-2.35
==============================
PyTorch Info
==============================
PyTorch version : 2.6.0+cu124
Is debug build : False
CUDA used to build PyTorch : 12.4
ROCM used to build PyTorch : N/A
==============================
Python Environment
==============================
Python version : 3.10.12 (main, Feb 4 2025, 14:57:36) [GCC 11.4.0] (64-bit runtime)
Python platform : Linux-6.2.0-1018-aws-x86_64-with-glibc2.35
==============================
CUDA / GPU Info
==============================
Is CUDA available : False
CUDA runtime version : No CUDA
CUDA_MODULE_LOADING set to : N/A
GPU models and configuration : No CUDA
Nvidia driver version : No CUDA
cuDNN version : No CUDA
HIP runtime version : N/A
MIOpen runtime version : N/A
Is XNNPACK available : True
==============================
CPU Info
==============================
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 192
On-line CPU(s) list: 0-191
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Platinum 8488C
CPU family: 6
Model: 143
Thread(s) per core: 2
Core(s) per socket: 48
Socket(s): 2
Stepping: 8
BogoMIPS: 4800.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx_vnni avx512_bf16 wbnoinvd ida arat avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq rdpid cldemote movdiri movdir64b md_clear serialize amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 4.5 MiB (96 instances)
L1i cache: 3 MiB (96 instances)
L2 cache: 192 MiB (96 instances)
L3 cache: 210 MiB (2 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-47,96-143
NUMA node1 CPU(s): 48-95,144-191
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
==============================
Versions of relevant libraries
==============================
[pip3] neuron-torch-tools==1.0.0.32411+2acfe92b3
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-cusparselt-cu12==0.6.2
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] pyzmq==27.0.0
[pip3] torch==2.6.0
[pip3] torch-neuronx==2.6.0.2.8.7619+7447eaa1
[pip3] torch-xla==2.6.1
[pip3] torchmetrics==1.2.0
[pip3] torchvision==0.21.0
[pip3] transformers==4.51.3
[pip3] triton==3.2.0
[conda] Could not collect
==============================
vLLM Info
==============================
ROCM Version : Could not collect
Neuron SDK Version :
instance-type: trn2.48xlarge
instance-id: i-0b5d7330db629f74d
logical-neuroncore-config: 2
+--------+--------+----------+--------+---------------+--------------+---------------+------+
| NEURON | NEURON | NEURON   | NEURON | CONNECTED     | PCI          | CPU           | NUMA |
| DEVICE | CORES  | CORE IDS | MEMORY | DEVICES       | BDF          | AFFINITY      | NODE |
+--------+--------+----------+--------+---------------+--------------+---------------+------+
| 0      | 4      | 0-3      | 96 GB  | 12, 3, 4, 1   | 0000:cc:00.0 | 48-95,144-191 | 1    |
| 1      | 4      | 4-7      | 96 GB  | 13, 0, 5, 2   | 0000:b5:00.0 | 48-95,144-191 | 1    |
| 2      | 4      | 8-11     | 96 GB  | 14, 1, 6, 3   | 0000:b6:00.0 | 48-95,144-191 | 1    |
| 3      | 4      | 12-15    | 96 GB  | 15, 2, 7, 0   | 0000:cb:00.0 | 48-95,144-191 | 1    |
| 4      | 4      | 16-19    | 96 GB  | 0, 7, 8, 5    | 0000:6f:00.0 | 0-47,96-143   | 0    |
| 5      | 4      | 20-23    | 96 GB  | 1, 4, 9, 6    | 0000:58:00.0 | 0-47,96-143   | 0    |
| 6      | 4      | 24-27    | 96 GB  | 2, 5, 10, 7   | 0000:59:00.0 | 0-47,96-143   | 0    |
| 7      | 4      | 28-31    | 96 GB  | 3, 6, 11, 4   | 0000:6e:00.0 | 0-47,96-143   | 0    |
| 8      | 4      | 32-35    | 96 GB  | 4, 11, 12, 9  | 0000:9b:00.0 | 0-47,96-143   | 0    |
| 9      | 4      | 36-39    | 96 GB  | 5, 8, 13, 10  | 0000:84:00.0 | 0-47,96-143   | 0    |
| 10     | 4      | 40-43    | 96 GB  | 6, 9, 14, 11  | 0000:85:00.0 | 0-47,96-143   | 0    |
| 11     | 4      | 44-47    | 96 GB  | 7, 10, 15, 8  | 0000:9a:00.0 | 0-47,96-143   | 0    |
| 12     | 4      | 48-51    | 96 GB  | 8, 15, 0, 13  | 0000:f8:00.0 | 48-95,144-191 | 1    |
| 13     | 4      | 52-55    | 96 GB  | 9, 12, 1, 14  | 0000:e1:00.0 | 48-95,144-191 | 1    |
| 14     | 4      | 56-59    | 96 GB  | 10, 13, 2, 15 | 0000:e2:00.0 | 48-95,144-191 | 1    |
| 15     | 4      | 60-63    | 96 GB  | 11, 14, 3, 12 | 0000:f7:00.0 | 48-95,144-191 | 1    |
+--------+--------+----------+--------+---------------+--------------+---------------+------+
vLLM Version : 0.9.0.dev
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect
==============================
Environment Variables
==============================
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
🐛 Describe the bug
`get_current_vllm_config` is called too early in `Fp8LinearOp`, before `set_current_vllm_config` has even been invoked. As a result, `get_current_vllm_config` falls back to initializing an empty `VllmConfig`, causing its `__post_init__` to fail in the `check_and_update_config` step when it accesses fields of a `None` object, e.g., `vllm_config.model_config.max_model_len`.
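For context, the snippet below is a minimal, self-contained sketch of the pattern involved. The names mirror `vllm/config.py`, but the bodies are simplified stand-ins (the `ModelConfig`/`VllmConfig` fields here are assumptions, not the real definitions). It shows why a `get_current_vllm_config` call that lands before any `set_current_vllm_config` constructs a fresh default `VllmConfig`, whose `__post_init__` then dereferences a `None` `model_config`:

```python
# Minimal sketch of the failure mechanism; simplified assumptions,
# not the actual vLLM source.
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelConfig:
    max_model_len: int = 4096


@dataclass
class VllmConfig:
    # In vLLM, model_config stays None until the engine populates it.
    model_config: Optional[ModelConfig] = None

    def __post_init__(self):
        # Stand-in for the check_and_update_config step, which reads
        # fields of model_config and therefore crashes when it is None.
        _ = self.model_config.max_model_len


_current_vllm_config: Optional[VllmConfig] = None


@contextmanager
def set_current_vllm_config(config: VllmConfig):
    global _current_vllm_config
    prev, _current_vllm_config = _current_vllm_config, config
    try:
        yield
    finally:
        _current_vllm_config = prev


def get_current_vllm_config() -> VllmConfig:
    if _current_vllm_config is None:
        # Fallback hit by Fp8LinearOp.__init__: constructing the default
        # config runs __post_init__, which raises AttributeError.
        return VllmConfig()
    return _current_vllm_config


if __name__ == "__main__":
    try:
        get_current_vllm_config()  # called "too early", as in this report
    except AttributeError as e:
        print(f"fails as reported: {e}")
```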
Stack trace of the call chain:

```text
  File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/root/ktest/vllm/entrypoints/openai/api_server.py", line 1391, in <module>
    uvloop.run(run_server(args))
  File "/opt/conda/lib/python3.10/site-packages/uvloop/__init__.py", line 82, in run
    return loop.run_until_complete(wrapper())
  File "/opt/conda/lib/python3.10/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
  File "/root/ktest/vllm/entrypoints/openai/api_server.py", line 1327, in run_server
    await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
  File "/root/ktest/vllm/entrypoints/openai/api_server.py", line 1347, in run_server_worker
    async with build_async_engine_client(args, client_config) as engine_client:
  File "/opt/conda/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/root/ktest/vllm/entrypoints/openai/api_server.py", line 156, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/opt/conda/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/root/ktest/vllm/entrypoints/openai/api_server.py", line 178, in build_async_engine_client_from_engine_args
    vllm_config = engine_args.create_engine_config(usage_context=usage_context)
  File "/root/ktest/vllm/engine/arg_utils.py", line 1192, in create_engine_config
    config = VllmConfig(
  File "/opt/conda/lib/python3.10/site-packages/pydantic/_internal/_dataclasses.py", line 123, in __init__
    s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
  File "/root/ktest/vllm/config.py", line 4397, in __post_init__
    self.quant_config = VllmConfig._get_quantization_config(
  File "/root/ktest/vllm/config.py", line 4330, in _get_quantization_config
    quant_config = get_quant_config(model_config, load_config)
  File "/root/ktest/vllm/model_executor/model_loader/weight_utils.py", line 165, in get_quant_config
    return quant_cls.from_config(hf_quant_config)
  File "/root/ktest/vllm/model_executor/layers/quantization/fbgemm_fp8.py", line 61, in from_config
    return cls(ignore_list=ignore_list, input_scale_ub=input_scale_ub)
  File "/root/ktest/vllm/model_executor/layers/quantization/fbgemm_fp8.py", line 39, in __init__
    self.fp8_linear = Fp8LinearOp()
  File "/root/ktest/vllm/model_executor/layers/quantization/utils/w8a8_utils.py", line 313, in __init__
    config = get_current_vllm_config().compilation_config
  File "/root/ktest/vllm/config.py", line 4666, in get_current_vllm_config
    traceback.print_stack()
```
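One possible direction for a fix, sketched here purely as an illustration (this `Fp8LinearOp` body is hypothetical, not the actual vLLM code or an accepted patch): defer the config lookup from `__init__` to first use, so it only runs once the engine has entered `set_current_vllm_config`:

```python
# Hypothetical sketch only, not the actual vLLM fix: move the
# get_current_vllm_config() call out of __init__ so that it runs after
# set_current_vllm_config() has installed the real config.
from vllm.config import get_current_vllm_config


class Fp8LinearOp:
    def __init__(self):
        self._compilation_config = None  # resolved lazily on first use

    @property
    def compilation_config(self):
        if self._compilation_config is None:
            # By the time the op is first applied, the engine should have
            # entered set_current_vllm_config(), so this lookup no longer
            # falls back to constructing an empty VllmConfig.
            self._compilation_config = (
                get_current_vllm_config().compilation_config)
        return self._compilation_config
```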
To reproduce, run any model that relies on `Fp8LinearOp`:
```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="<fp8 quantized model>")  # e.g., an FBGEMM-FP8 checkpoint
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
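Note that the crash fires while `LLM(...)` is still being constructed (the `create_engine_config` frame in the trace above), so `generate` is never reached.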