
[Bug]: The new scale_dtype and zp_dtype are not backward compatible with released vLLM #2057

@mratsim

Description


⚙️ Your current environment

The output of python collect_env.py
### Environment Information ###
Operating System: `Linux-6.17.8-arch1-1-x86_64-with-glibc2.42`
Python Version: `3.12.9 (main, Mar 17 2025, 21:01:58) [Clang 20.1.0 ]`
llm-compressor Version: `0.8.2.dev61+ga270f33a`
compressed-tensors Version: `0.12.3a20251114`
transformers Version: `4.56.2`
torch Version: `2.8.0+cu129`
CUDA Devices: `['NVIDIA RTX PRO 6000 Blackwell Workstation Edition', 'NVIDIA RTX PRO 6000 Blackwell Workstation Edition']`
AMD Devices: `None`

🐛 Describe the bug

I tried the new `model_free_ptq` pipeline, but the compressed models now fail to load in vLLM:

(APIServer pid=1) INFO 11-20 08:27:22 [model.py:630] Resolved architecture: Glm4MoeForCausalLM
(APIServer pid=1) INFO 11-20 08:27:22 [model.py:1728] Using max model len 131072
(APIServer pid=1) INFO 11-20 08:27:22 [scheduler.py:254] Chunked prefill is enabled with max_num_batched_tokens=16384.
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1)   File "/usr/local/bin/vllm", line 10, in <module>
(APIServer pid=1)     sys.exit(main())
(APIServer pid=1)              ^^^^^^
(APIServer pid=1)   File "/workspace/vllm/vllm/entrypoints/cli/main.py", line 73, in main
(APIServer pid=1)     args.dispatch_function(args)
(APIServer pid=1)   File "/workspace/vllm/vllm/entrypoints/cli/serve.py", line 59, in cmd
(APIServer pid=1)     uvloop.run(run_server(args))
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=1)     return __asyncio.run(
(APIServer pid=1)            ^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
(APIServer pid=1)     return runner.run(main)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1)     return self._loop.run_until_complete(task)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1)     return await main
(APIServer pid=1)            ^^^^^^^^^^
(APIServer pid=1)   File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 2006, in run_server
(APIServer pid=1)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1)   File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 2025, in run_server_worker
(APIServer pid=1)     async with build_async_engine_client(
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 195, in build_async_engine_client
(APIServer pid=1)     async with build_async_engine_client_from_engine_args(
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/workspace/vllm/vllm/entrypoints/openai/api_server.py", line 221, in build_async_engine_client_from_engine_args
(APIServer pid=1)     vllm_config = engine_args.create_engine_config(usage_context=usage_context)
(APIServer pid=1)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/workspace/vllm/vllm/engine/arg_utils.py", line 1645, in create_engine_config
(APIServer pid=1)     config = VllmConfig(
(APIServer pid=1)              ^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/pydantic/_internal/_dataclasses.py", line 121, in __init__
(APIServer pid=1)     s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
(APIServer pid=1) pydantic_core._pydantic_core.ValidationError: 2 validation errors for VllmConfig
(APIServer pid=1) scale_dtype
(APIServer pid=1)   Extra inputs are not permitted [type=extra_forbidden, input_value=None, input_type=NoneType]
(APIServer pid=1)     For further information visit https://errors.pydantic.dev/2.12/v/extra_forbidden
(APIServer pid=1) zp_dtype
(APIServer pid=1)   Extra inputs are not permitted [type=extra_forbidden, input_value=None, input_type=NoneType]
(APIServer pid=1)     For further information visit https://errors.pydantic.dev/2.12/v/extra_forbidden

This is likely linked to this compressed-tensors update: vllm-project/compressed-tensors#508

I'm not sure whether this should be fixed on the LLM Compressor side (ensuring config.json does not include non-backward-compatible fields) or on the vLLM side (ignoring unknown fields and issuing a warning). Either way, the current situation makes all new weights quantized by LLM Compressor incompatible with vLLM releases from just a couple of months ago. In companies where upgrade procedures are slow (say, once every 6 months), that could block them from using these weights entirely.

For my own weights the fix is straightforward: I just need to remove the offending fields from config.json:

"quantization_config": {
    "config_groups": {
      "config_group_0": {
        "format": "float-quantized",
        "input_activations": {
          "actorder": null,
          "block_structure": null,
          "dynamic": true,
          "group_size": 128,
          "num_bits": 8,
          "observer": null,
          "observer_kwargs": {},
          "scale_dtype": null,
          "strategy": "group",
          "symmetric": true,
          "type": "float",
          "zp_dtype": null
        },
        "output_activations": null,
        "targets": [
          "Linear"
        ],
        "weights": {
          "actorder": null,
          "block_structure": [
            32,
            32
          ],
          "dynamic": false,
          "group_size": null,
          "num_bits": 8,
          "observer": "static_minmax",
          "observer_kwargs": {},
          "scale_dtype": null, <----------
          "strategy": "block",
          "symmetric": true,
          "type": "float",
          "zp_dtype": null <----------
        }
      }
    },
    ...
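Since only the two new keys need to go, the manual edit above can be automated with a small Python sketch that strips them recursively from the `quantization_config` dict (the key names come from the config shown above; the sample dict below is illustrative, so point the same function at your model's actual config.json):

```python
import json

# Keys introduced by the compressed-tensors update that older vLLM
# releases reject with "Extra inputs are not permitted".
NEW_FIELDS = {"scale_dtype", "zp_dtype"}

def strip_new_fields(obj):
    """Recursively drop scale_dtype / zp_dtype from nested dicts and lists."""
    if isinstance(obj, dict):
        return {k: strip_new_fields(v) for k, v in obj.items() if k not in NEW_FIELDS}
    if isinstance(obj, list):
        return [strip_new_fields(v) for v in obj]
    return obj

# Minimal illustrative config fragment (mirrors the structure above).
sample = {
    "quantization_config": {
        "config_groups": {
            "config_group_0": {
                "weights": {"num_bits": 8, "scale_dtype": None, "zp_dtype": None}
            }
        }
    }
}

cleaned = strip_new_fields(sample)
print(json.dumps(cleaned))
```

To fix a real checkpoint, load its config.json with `json.load`, pass it through `strip_new_fields`, and write it back with `json.dump(..., indent=2)`.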

🛠️ Steps to reproduce

No response
