
Conversation


@kylesayrs commented Feb 3, 2025

Purpose

  • Fix a TypeError in use_mla that caused models with a quantization config but without config_groups to fail to load
  • Support loading models that do not have config_groups in their compressed-tensors config. While these models aren't accelerated, they can still be generated from valid recipes in LLM Compressor and are useful for testing (see the config sketch below).
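For context, here is a minimal sketch of the two config shapes involved. The field names are illustrative approximations of an LLM Compressor `quantization_config`, not copied from a real checkpoint:

```python
# Illustrative sketch only: rough shape of the quantization_config section
# emitted by LLM Compressor. Keys and values here are assumptions for
# demonstration, not taken from an actual model.

quantized_config = {
    "quant_method": "compressed-tensors",
    "config_groups": {
        "group_0": {
            "weights": {"type": "int", "num_bits": 4},
            "input_activations": None,
        },
    },
}

pure_sparsity_config = {
    "quant_method": "compressed-tensors",
    # No "config_groups" key: the model is sparse but not quantized.
    "sparsity_config": {"sparsity_structure": "2:4"},
}

# Code that assumes config_groups is always present breaks on the second case.
print("config_groups" in quantized_config)      # True
print("config_groups" in pure_sparsity_config)  # False
```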

Changes

  • Fix the TypeError in use_mla
  • Add a warning and return no quantization scheme when a model without config_groups is loaded (see the sketch after this list)
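A rough sketch of the intended fallback behavior, assuming a hypothetical can_use_mla helper; the names, structure, and the exact fp8 check below are illustrative, not vLLM's actual implementation:

```python
from typing import Optional
import logging

logger = logging.getLogger(__name__)


def can_use_mla(config_groups: Optional[dict]) -> bool:
    """Illustrative sketch only (not vLLM's actual code): decide whether MLA
    can stay enabled for a compressed-tensors model, warning and falling back
    instead of raising when no quantization groups are present."""
    if not config_groups:
        # Pure-sparsity models carry no config_groups: warn and load without
        # a quantization scheme rather than crashing with a TypeError.
        logger.warning(
            "compressed-tensors config has no config_groups; loading without "
            "a quantization scheme and disabling MLA.")
        return False

    group = config_groups.get("group_0", {})
    weights_type = (group.get("weights") or {}).get("type")
    acts_type = (group.get("input_activations") or {}).get("type")
    if weights_type != "float" or acts_type != "float":
        # Mirrors the fallback warning shown in the Testing section below.
        logger.warning(
            "compressed-tensors MLA support requires fp8 activations and "
            "weights in group 'group_0', but got activations type %r and "
            "weights type %r.", acts_type, weights_type)
        return False
    return True


# Example: a sparse, int4-weight model falls back instead of failing.
print(can_use_mla(None))                                        # False
print(can_use_mla({"group_0": {"weights": {"type": "int"}}}))   # False
```

The point of the change is that the no-config_groups path now logs a warning and loads without a scheme instead of raising a TypeError.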

Testing

  • Tested nm-testing/Qwen2-VL-2B-Instruct-Sparse-0.6
    • Confirmed that MLA is indeed disabled in the fallback case
      WARNING 02-03 16:37:36 config.py:1008] compressed-tensors MLA support requires fp8 activations and weights in group 'group_0', but got activations type 'None' and weights type 'int'.
  • Tested neuralmagic/Sparse-Llama-3.1-8B-ultrachat_200k-2of4-quantized.w4a16


github-actions bot commented Feb 3, 2025

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀

@kylesayrs force-pushed the kylesayrs/fix-use_mla-type_error branch from 8a63916 to 92f4d0f on February 3, 2025 21:52
Signed-off-by: Kyle Sayers <[email protected]>
This reverts commit d5ccf6233c3afaf967566d664681e140e1be32fe.

Signed-off-by: Kyle Sayers <[email protected]>
@kylesayrs force-pushed the kylesayrs/fix-use_mla-type_error branch from a864fd0 to d3c6349 on February 3, 2025 22:00
@kylesayrs changed the title from "[Quant] Support loading pure-sparsity Compressed Tensors configs" to "[Quant] Fix use_mla default and support loading pure-sparsity Compressed Tensors configs" on Feb 3, 2025
@kylesayrs changed the title from "[Quant] Fix use_mla default and support loading pure-sparsity Compressed Tensors configs" to "[Quant] Fix use_mla TypeError and support loading pure-sparsity Compressed Tensors configs" on Feb 3, 2025
@mgoin added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Feb 4, 2025
@simon-mo merged commit 4896d0c into vllm-project:main on Feb 4, 2025
58 of 65 checks passed
@kylesayrs deleted the kylesayrs/fix-use_mla-type_error branch on February 5, 2025 01:17
shreyankg pushed a commit to shreyankg/vllm that referenced this pull request May 3, 2025
