
Adds vLLM as Option for Local App #693


Merged

merged 13 commits into huggingface:main from EliMCosta:patch-4 on Sep 4, 2024

Conversation

EliMCosta
Contributor

Adds vLLM as option for "Local apps" in Hugging Face
@EliMCosta EliMCosta changed the title Update local-apps.ts Adds vLLM as Option for Local Apps in Hugging Face May 21, 2024
@EliMCosta EliMCosta changed the title Adds vLLM as Option for Local Apps in Hugging Face Adds vLLM as Option for Local App May 21, 2024
```python
# Completed for context; assumes a local vLLM OpenAI-compatible server on the default port.
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
)
completion = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.1", messages=[{"role": "user", "content": "Hello!"}]
)
```
Member

same here

```ts
prettyLabel: "vLLM",
docsUrl: "https://docs.vllm.ai",
mainTask: "text-generation",
displayOnModelPage: isGptqModel && isAwqModel,
```
Member

how would you define those methods?

Contributor Author

> how would you define those methods?

In fact, the suggested vLLM method deploys the non-quantized version from the Hugging Face repository. All examples of type "text-generation" in the code are GGUF. Any suggestions?


Thank you for the PR! Concretely, we support a set of architectures, which is readable from the model data:

architectures?: string[];

https://github.com/vllm-project/vllm/blob/757b62c49560baa6f294310a53032348a0d95939/vllm/model_executor/models/__init__.py#L13-L63

And for the quantization method, we can read config.quantization_config.quant_method; the supported methods are awq, gptq, aqlm, and marlin:

https://huggingface.co/TheBloke/zephyr-7B-alpha-AWQ/blob/main/config.json#L28
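
For illustration, here is a minimal TypeScript sketch of how a check following those two rules could look. The `ModelData` shape is simplified, the architecture list is an incomplete illustrative subset, and `isVllmCompatible` is a hypothetical helper, not the actual local-apps.ts implementation:

```ts
// Simplified sketch only, not the real Hub types.
interface ModelData {
	config?: {
		architectures?: string[];
		quantization_config?: { quant_method?: string };
	};
}

// Illustrative, incomplete subset; the full list lives in vLLM itself.
const VLLM_SUPPORTED_ARCHS = ["LlamaForCausalLM", "MistralForCausalLM", "MixtralForCausalLM"];

// Quantization methods vLLM can load, per the comment above.
const VLLM_SUPPORTED_QUANT_METHODS = ["awq", "gptq", "aqlm", "marlin"];

function isVllmCompatible(model: ModelData): boolean {
	const archOk = model.config?.architectures?.some((arch) => VLLM_SUPPORTED_ARCHS.includes(arch)) ?? false;
	const quantMethod = model.config?.quantization_config?.quant_method;
	// Unquantized models pass; quantized ones must use a supported method.
	const quantOk = quantMethod === undefined || VLLM_SUPPORTED_QUANT_METHODS.includes(quantMethod);
	return archOk && quantOk;
}
```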

Member

awesome @simon-mo, super clear!

Member

@julien-c julien-c May 22, 2024

I've pushed 2123430 on this PR to type config.quantization_config.quant_method, which we now parse & pass from the Hub

Contributor Author

> I've pushed 2123430 on this PR to type config.quantization_config.quant_method, which we now parse & pass from the Hub

I made some changes; I need your help reviewing them.

EliMCosta and others added 5 commits May 22, 2024 11:14
fix dynamic model ids
Adds functions in order to check types of quantization
@krampstudio
Collaborator

@EliMCosta do you have the vLLM icon as a square SVG?

@EliMCosta
Contributor Author

> @EliMCosta do you have the vLLM icon as a square SVG?

[image: vllm-logo]

@julien-c
Member

julien-c commented Jun 3, 2024

this is mostly ready to merge no? wdyt?

Comment on lines 65 to 67
```ts
function isFullModel(model: ModelData): boolean {
	// Assuming a full model is identified by not having a quant_method
	return !model.config?.quantization_config?.quant_method;
}
```
Collaborator

isFullModel creates a lot of false positives.

Instead, we can maybe check against the supported architectures, as suggested by @simon-mo.

Something like this:

```ts
const VLLM_SUPPORTED_ARCHS = [
	"AquilaForCausalLM", "ArcticForCausalLM", "BaiChuanForCausalLM", "BloomForCausalLM", ...
];
model.config?.architectures?.some((arch) => VLLM_SUPPORTED_ARCHS.includes(arch));
```

cc @julien-c
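
To make that concrete, here is a rough sketch of how the check could plug into the registry entry from the diff above. The `LocalApp` shape and the architecture list shown here are assumptions for illustration, not the actual implementation:

```ts
// Assumed, simplified shapes for illustration only.
interface ModelData {
	config?: { architectures?: string[] };
}
interface LocalApp {
	prettyLabel: string;
	docsUrl: string;
	mainTask: string;
	displayOnModelPage: (model: ModelData) => boolean;
}

// Illustrative subset; the full list would come from vLLM's ModelRegistry.
const VLLM_SUPPORTED_ARCHS = ["LlamaForCausalLM", "MistralForCausalLM", "MixtralForCausalLM"];

const vllm: LocalApp = {
	prettyLabel: "vLLM",
	docsUrl: "https://docs.vllm.ai",
	mainTask: "text-generation",
	displayOnModelPage: (model) =>
		model.config?.architectures?.some((arch) => VLLM_SUPPORTED_ARCHS.includes(arch)) ?? false,
};
```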

Contributor

You can query the vLLM package for this list once you have it installed:

```python
>>> from vllm import ModelRegistry
>>> ModelRegistry.get_supported_archs()
['AquilaModel', 'AquilaForCausalLM', 'BaiChuanForCausalLM', 'BaichuanForCausalLM', 'BloomForCausalLM', 'ChatGLMModel', 'ChatGLMForConditionalGeneration', 'CohereForCausalLM', 'DbrxForCausalLM', 'DeciLMForCausalLM', 'DeepseekForCausalLM', 'FalconForCausalLM', 'GemmaForCausalLM', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTJForCausalLM', 'GPTNeoXForCausalLM', 'InternLMForCausalLM', 'InternLM2ForCausalLM', 'JAISLMHeadModel', 'LlamaForCausalLM', 'LlavaForConditionalGeneration', 'LlavaNextForConditionalGeneration', 'LLaMAForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'QuantMixtralForCausalLM', 'MptForCausalLM', 'MPTForCausalLM', 'MiniCPMForCausalLM', 'OlmoForCausalLM', 'OPTForCausalLM', 'OrionForCausalLM', 'PhiForCausalLM', 'Phi3ForCausalLM', 'QWenLMHeadModel', 'Qwen2ForCausalLM', 'Qwen2MoeForCausalLM', 'RWForCausalLM', 'StableLMEpochForCausalLM', 'StableLmForCausalLM', 'Starcoder2ForCausalLM', 'ArcticForCausalLM', 'XverseForCausalLM', 'Phi3SmallForCausalLM', 'MistralModel']
```

@youkaichao

any progress here? I think we can add a vLLM tag, so that when a model is supported by vLLM it can have that tag, like https://huggingface.co/meta-llama/Llama-2-7b-hf.

@krampstudio krampstudio requested a review from pcuenca as a code owner September 2, 2024 14:58
@krampstudio
Collaborator

krampstudio commented Sep 2, 2024

@simon-mo @EliMCosta @mgoin we're back on the integration. I've merged it with #744

Let me know if the snippets are ok for you.
It renders like this:

Screenshot 2024-09-03 at 14 24 56

Screenshot 2024-09-03 at 14 26 08

@youkaichao

@krampstudio I would recommend using the new CLI vllm serve instead of the long python -m vllm.entrypoints.xxx

@krampstudio
Collaborator

@youkaichao I've updated the snippets (and the screenshots in the comment above)
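
For reference, a hypothetical sketch of what a vllm serve based snippet builder could look like; the function name, `ModelData` shape, and exact command layout are assumptions for illustration, not the merged code:

```ts
interface ModelData {
	id: string;
}

// Returns the shell commands shown on the model page: start the server, then call it.
const snippetVllm = (model: ModelData): string[] => [
	// OpenAI-compatible server via the new CLI entry point (default port 8000).
	`vllm serve "${model.id}"`,
	[
		`# Call the server using curl:`,
		`curl -X POST "http://localhost:8000/v1/chat/completions" \\`,
		`\t-H "Content-Type: application/json" \\`,
		`\t--data '{`,
		`\t\t"model": "${model.id}",`,
		`\t\t"messages": [{"role": "user", "content": "Hello!"}]`,
		`\t}'`,
	].join("\n"),
];

// Example: snippetVllm({ id: "meta-llama/Llama-3.2-3B-Instruct" });
```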

```diff
-function isGgufModel(model: ModelData) {
-	return model.tags.includes("gguf");
+function isGgufModel(model: ModelData): boolean {
+	return model.config?.quantization_config?.quant_method === "gguf" || model.tags.includes("gguf");
 }
```
Member

are we sure about the first part?

Collaborator

@krampstudio krampstudio Sep 4, 2024

no, let's rely only on the tag. see d63b7cb

Member

@julien-c julien-c left a comment

looks in good shape!

Member

@pcuenca pcuenca left a comment

Looks good!

@krampstudio krampstudio merged commit b15ff77 into huggingface:main Sep 4, 2024
4 checks passed
@mishig25 mishig25 mentioned this pull request Oct 4, 2024
mishig25 added a commit that referenced this pull request Oct 4, 2024
Follow up to #693

1. Add missing comma
2. Add line splitter between two commands

Screenshot 2024-10-04 at 14 52 09: https://github.com/user-attachments/assets/99a30553-4081-47fe-b89b-ca0c2a7d9fca
mishig25 added a commit that referenced this pull request Oct 7, 2024
Follow up to #693

There was a trailing space in the vLLM snippet which made the snippet unusable. When you tried the snippet, the vLLM server would respond with this error:
```
mishig@machine:~$ # Call the server using curl:
curl -X POST "http://localhost:8000/v1/chat/completions" \ 
        -H "Content-Type: application/json" \ 
        --data '{
                "model": "meta-llama/Llama-3.2-3B-Instruct",
                "messages": [
                        {"role": "user", "content": "Hello!"}
                ]
        }'

{"object":"error","message":"[{'type': 'missing', 'loc': ('body',), 'msg': 'Field required', 'input': None}]","type":"BadRequestError","param":null,"code":400}curl: (3) URL using bad/illegal format or missing URL
-H: command not found
--data: command not found
mishig@machine:~$ 
```

Explanation: the trailing space after the backslash was breaking the shell line-continuation escaping.
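
As an illustration of how this class of bug could be guarded against in a snippet generator (a sketch only; the actual follow-up commit simply removed the trailing spaces):

```ts
// Strip trailing whitespace from each line so "\" stays the last character
// and shell line continuation keeps working.
function stripTrailingWhitespace(snippet: string): string {
	return snippet
		.split("\n")
		.map((line) => line.replace(/[ \t]+$/, ""))
		.join("\n");
}

// A trailing space after the backslash would break the continuation without the cleanup:
const broken = 'curl -X POST "http://localhost:8000/v1/chat/completions" \\ \n\t-H "Content-Type: application/json"';
console.log(stripTrailingWhitespace(broken));
```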
@EliMCosta EliMCosta deleted the patch-4 branch October 22, 2024 15:48
@julien-c julien-c mentioned this pull request Feb 5, 2025