Add arcee model #21296
Conversation
- …ment: remove deprecated supported_lora_modules attribute; add ArceeForCausalLM to test registry
- Set is_available_online=False in test registry for CI compatibility
- Inherit from LlamaForCausalLM for most functionality; set is_available_online=False in test registry for CI compatibility
Generated Outputs: Prompt: 'Hello, my name is'
Code Review
The code changes introduce the Arcee model. Refactor the weight loading mechanism to align better with vLLM's standard practices, and correct the test registry entry for the Arcee model.
vllm/model_executor/models/arcee.py
```python
# Use AutoWeightsLoader for consistency with vLLM's loading mechanism
from vllm.model_executor.models.utils import AutoWeightsLoader
loader = AutoWeightsLoader(
    self,
    skip_prefixes=(["lm_head."]
                   if self.config.tie_word_embeddings else None))
# AutoWeightsLoader handles weight name remapping, including fusing
# separate q_proj, k_proj, v_proj into qkv_proj
return loader.load_weights(weights)
```
This `load_weights` method can be improved to fully leverage `AutoWeightsLoader` and simplify the overall implementation. The custom `load_weights` method in `ArceeModel` (lines 275-368) duplicates logic from `AutoWeightsLoader` and is less maintainable. By adding `skip_prefixes` and `skip_substrs` to the `AutoWeightsLoader` initialization here, you can handle the Arcee-specific weight skipping. This allows for the complete removal of the `ArceeModel.load_weights` method, resulting in cleaner code that follows vLLM's standard practices.
```python
from vllm.model_executor.models.utils import AutoWeightsLoader
loader = AutoWeightsLoader(
    self,
    skip_prefixes=(['lm_head.']
                   if self.config.tie_word_embeddings else None),
    skip_substrs=["gate_proj"])
return loader.load_weights(weights)
```
jeejeelee left a comment
overall LGTM, please fix the pre-commit failure
@alyosha-swamy please stop making new PRs for this model. It's making discussion about the PR unnecessarily hard to follow:
In #21267 it was pointed out that
Repeatedly making new PRs will hide this from reviewers.
Understood; however, to enable native vLLM support for better performance, it would be better if we could add this.
Have you tested the performance with ...? The performance should be close to or the same as a natively supported model.
Model outputs are nearly identical; however, the model loads in less than half the time when using it directly versus with the model impl HF flag.
AFM is in this release: https://github.com/huggingface/transformers/releases/tag/v4.53.0
How is the performance of the model once loaded with ...?
Thank you for the feedback about model loading speed; this is very useful to know. Some effort could be put into improving this, which would affect all models loaded with ...
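For illustration, a sketch of the two loading paths being compared in this exchange, assuming vLLM's `model_impl="transformers"` engine argument and the `arcee-ai/AFM-4.5B-Base` checkpoint named in this PR; treat the flag usage as an assumption rather than a quote from this thread:

```python
from vllm import LLM

# Assumed usage (not from the PR diff): load via the Transformers backend...
# llm = LLM("arcee-ai/AFM-4.5B-Base", model_impl="transformers")

# ...or via the native implementation added in this PR:
llm = LLM("arcee-ai/AFM-4.5B-Base")

outputs = llm.generate("Hello, my name is")
print(outputs[0].outputs[0].text)
```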
Oh I see, thank you for the additional context. When you said that the checkpoint hadn't been released, I assumed the implementation hadn't been released on Transformers either. In that case you wouldn't need ...
I have verified the logprobs for these and can confirm they match the native impl. Since all the other fully supported models still need to be included in `registry.py` as well as have their own file, we would like to have this PR implemented.
This isn't actually necessary. Models can be officially supported using the Transformers backend with an entry like: `"ArceeForCausalLM": ("transformers", "TransformersForCausalLM"),`
If it's necessary to add this model explicitly now, I won't block it. But in future we'd prefer not to maintain copies of models from Transformers.
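For context, a sketch of how the two options compare in `_TEXT_GENERATION_MODELS` in `vllm/model_executor/models/registry.py`, which the PR description names as the registration point; the module name `"arcee"` for the native entry is an assumption, not copied from the diff:

```python
# Sketch only: registry choice discussed in this review.
_TEXT_GENERATION_MODELS = {
    # ...
    # Option taken by this PR: a native implementation in arcee.py
    "ArceeForCausalLM": ("arcee", "ArceeForCausalLM"),
    # Alternative suggested above: route through the Transformers backend
    # "ArceeForCausalLM": ("transformers", "TransformersForCausalLM"),
    # ...
}
```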
I'd prefer to merge it for now; understood for future reference.
[New Model] Support Arcee (Arcee Foundational Models)
1. Purpose (Why this PR?)
Add inference support for Arcee Foundational Model (AFM) so that users can serve it with vLLM in both Python and API-server workflows. AFM uses a unique ReLU² activation in its MLP layers, differentiating it from standard Llama-based models.
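A two-line illustration of the activation difference (a sketch for orientation, not code from this PR's `arcee.py`):

```python
import torch
import torch.nn.functional as F

x = torch.randn(4, 8)
silu_act = F.silu(x)               # SiLU used by standard Llama MLPs
relu2_act = torch.relu(x).pow(2)   # ReLU² (squared ReLU) used by AFM
```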
2. Model details
3. Implementation overview
- New `ArceeForCausalLM` class in `vllm/model_executor/models/arcee.py` with custom `ArceeMLP` using ReLU² activation
- Registered in `_TEXT_GENERATION_MODELS` in `vllm/model_executor/models/registry.py`
- Updated `docs/models/supported_models.md` with Arcee entry in text generation table
- Reuses `LlamaAttention` from existing Llama implementation for attention layers

4. Performance / sanity check
Expected: Coherent completion about life's meaning
Observed: " a question that has been asked throughout the history of mankind. The search for an answer to this question has inspired countless works of art, literature, and philosophy. Whether we consider the existentialist ideas of Albert Camus or the religious perspectives of spiritual leaders"
5. Test plan ✔️
- `pytest tests/models/test_arcee.py`
- `python -c "from vllm import LLM; llm = LLM('arcee-ai/AFM-4.5B-Base')"`
- `vllm serve arcee-ai/AFM-4.5B-Base --trust-remote-code`
- `curl localhost:8000/v1/completions`

6. Documentation
- `docs/models/supported_models.md` updated under Text Generation models
- `ArceeForCausalLM` documented with example model `arcee-ai/AFM-4.5B-Base`

Checklist
- `pre-commit run --all-files` (ruff formatting)
- Tests pass (`pytest -q`)

Notes for reviewers
The key architectural difference from standard Llama models is the MLP activation function. Arcee uses ReLU² (squared ReLU) instead of SiLU:
- `ArceeMLP` implements: `x = torch.pow(torch.relu(x), 2)`
- No gating (`gate_proj`); only `up_proj` and `down_proj`

The model has been tested with an internal HF repo during development, but the official model is `arcee-ai/AFM-4.5B-Base`.

Test result
All outputs are coherent and contextually appropriate.
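To make the reviewer notes above concrete, a minimal sketch of an MLP block with ReLU² and no gate projection, written against plain PyTorch rather than vLLM's tensor-parallel layers; the names mirror the description (`ArceeMLP`, `up_proj`, `down_proj`), but the actual implementation in `arcee.py` will differ.

```python
import torch
from torch import nn


class ArceeMLPSketch(nn.Module):
    """Illustrative only: ReLU² MLP with up_proj/down_proj and no gate_proj."""

    def __init__(self, hidden_size: int, intermediate_size: int) -> None:
        super().__init__()
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # ReLU² (squared ReLU) in place of the SiLU gating used by Llama
        x = self.up_proj(x)
        x = torch.pow(torch.relu(x), 2)
        return self.down_proj(x)
```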