Description
⚙️ Your current environment
The output of `python collect_env.py`:
### Environment Information ###
Operating System: `Linux-6.17.8-arch1-1-x86_64-with-glibc2.42`
Python Version: `3.12.9 (main, Mar 17 2025, 21:01:58) [Clang 20.1.0 ]`
llm-compressor Version: `0.8.2.dev61+ga270f33a`
compressed-tensors Version: `0.12.3a20251114`
transformers Version: `4.56.2`
torch Version: `2.8.0+cu129`
CUDA Devices: `['NVIDIA RTX PRO 6000 Blackwell Workstation Edition', 'NVIDIA RTX PRO 6000 Blackwell Workstation Edition']`
AMD Devices: `None`
🐛 Describe the bug
I tried the new `model_free_ptq` pipeline, but the compressed models now fail to load in vLLM:
```
(APIServer pid=1) INFO 11-20 08:27:22 [model.py:630] Resolved architecture: Glm4MoeForCausalLM
(APIServer pid=1) INFO 11-20 08:27:22 [model.py:1728] Using max model len 131072
(APIServer pid=1) INFO 11-20 08:27:22 [scheduler.py:254] Chunked prefill is enabled with max_num_batched_tokens=16384.
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1)   File "/usr/local/bin/vllm", line 10, in <module>
(APIServer pid=1)     sys.exit(main())
(APIServer pid=1)     ^^^^^^
(APIServer pid=1)   File "/workspace/vllm/vllm/entrypoints/cli/main.py", line 73, in main
(APIServer pid=1)     args.dispatch_function(args)
(APIServer pid=1)   File "/workspace/vllm/vllm/entrypoints/cli/serve.py", line 59, in cmd
(APIServer pid=1)     uvloop.run(run_server(args))
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=1)     return __asyncio.run(
(APIServer pid=1)     ^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
(APIServer pid=1)     return runner.run(main)
(APIServer pid=1)     ^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1)     return self._loop.run_until_complete(task)
(APIServer pid=1)     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1)     return await main
(APIServer pid=1)     ^^^^^^^^^^
(APIServer pid=1)   File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 2006, in run_server
(APIServer pid=1)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1)   File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 2025, in run_server_worker
(APIServer pid=1)     async with build_async_engine_client(
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)     ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 195, in build_async_engine_client
(APIServer pid=1)     async with build_async_engine_client_from_engine_args(
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)     ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 221, in build_async_engine_client_from_engine_args
(APIServer pid=1)     vllm_config = engine_args.create_engine_config(usage_context=usage_context)
(APIServer pid=1)     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/workspace/vllm/vllm/engine/arg_utils.py", line 1645, in create_engine_config
(APIServer pid=1)     config = VllmConfig(
(APIServer pid=1)     ^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/pydantic/_internal/_dataclasses.py", line 121, in __init__
(APIServer pid=1)     s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
(APIServer pid=1) pydantic_core._pydantic_core.ValidationError: 2 validation errors for VllmConfig
(APIServer pid=1) scale_dtype
(APIServer pid=1)   Extra inputs are not permitted [type=extra_forbidden, input_value=None, input_type=NoneType]
(APIServer pid=1)     For further information visit https://errors.pydantic.dev/2.12/v/extra_forbidden
(APIServer pid=1) zp_dtype
(APIServer pid=1)   Extra inputs are not permitted [type=extra_forbidden, input_value=None, input_type=NoneType]
(APIServer pid=1)     For further information visit https://errors.pydantic.dev/2.12/v/extra_forbidden
```
This is likely linked to this compressed-tensors update: vllm-project/compressed-tensors#508

I'm not sure whether this should be fixed on the LLM Compressor side, by ensuring `config.json` does not include backward-incompatible fields, or on the vLLM side, by ignoring unknown fields and issuing a warning. As it stands, though, all new weights quantized by LLM Compressor are incompatible with vLLM builds from just a couple of months ago. For companies with slow upgrade cycles (say, once every six months), that could block them from using these weights.
For my own weights the fix is straightforward: I just need to remove the offending fields from `config.json`, as marked in the snippet below.
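A small one-off sketch of that removal (the two key names come from the error above; everything else, including operating on an in-memory dict rather than a file path, is illustrative):

```python
import json

# Keys that older vLLM builds reject; taken from the ValidationError above.
REMOVED_KEYS = {"scale_dtype", "zp_dtype"}

def strip_keys(obj):
    """Recursively remove the offending keys from nested dicts/lists."""
    if isinstance(obj, dict):
        return {k: strip_keys(v) for k, v in obj.items() if k not in REMOVED_KEYS}
    if isinstance(obj, list):
        return [strip_keys(v) for v in obj]
    return obj

# Applied to a model directory it would look something like (path is an assumption):
#   cfg = json.loads(open("config.json").read())
#   open("config.json", "w").write(json.dumps(strip_keys(cfg), indent=2))
sample = {"weights": {"num_bits": 8, "scale_dtype": None, "zp_dtype": None}}
print(json.dumps(strip_keys(sample)))  # -> {"weights": {"num_bits": 8}}
```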
```
"quantization_config": {
    "config_groups": {
        "config_group_0": {
            "format": "float-quantized",
            "input_activations": {
                "actorder": null,
                "block_structure": null,
                "dynamic": true,
                "group_size": 128,
                "num_bits": 8,
                "observer": null,
                "observer_kwargs": {},
                "scale_dtype": null,
                "strategy": "group",
                "symmetric": true,
                "type": "float",
                "zp_dtype": null
            },
            "output_activations": null,
            "targets": [
                "Linear"
            ],
            "weights": {
                "actorder": null,
                "block_structure": [
                    32,
                    32
                ],
                "dynamic": false,
                "group_size": null,
                "num_bits": 8,
                "observer": "static_minmax",
                "observer_kwargs": {},
                "scale_dtype": null,    <----------
                "strategy": "block",
                "symmetric": true,
                "type": "float",
                "zp_dtype": null        <----------
            }
        }
    },
    ...
```

🛠️ Steps to reproduce
No response