Add Arcee (AFM) model support to vLLM #21263
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a reduced subset of checks runs. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
Code Review
The code introduces a new model, ArceeForCausalLM, with a unique ReLU² activation. The implementation is well-structured, but some import statements are misplaced and a minor performance improvement can be made.
```python
if hidden_act != "relu2":
    raise ValueError(f"Unsupported activation: {hidden_act}. "
                     "Only 'relu2' is supported for AFM.")
# Define ReLU^2 activation: (ReLU(x))^2 elementwise
self.act_fn = lambda x: torch.pow(torch.relu(x), 2)
```
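The review summary above mentions a minor performance improvement. One way that could look, assuming the intent is to replace the generic `torch.pow` call (and the lambda, which some serialization and compilation paths handle poorly) with a named function built on `torch.square`; this is an illustrative sketch, not the merged code:

```python
import torch


def relu_squared(x: torch.Tensor) -> torch.Tensor:
    """Squared ReLU: equivalent to torch.pow(torch.relu(x), 2)."""
    # torch.square avoids dispatching through the generic pow kernel.
    return torch.square(torch.relu(x))
```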
```python
attention_bias = config.qkv_bias

# Self-Attention (using LLaMA's attention structure)
from vllm.model_executor.models.llama import LlamaAttention  # import here to avoid circular import
```
```python
self.aux_hidden_state_layers: Tuple[int, ...] = tuple()

# Prepare factory for empty intermediate tensors (for pipeline scheduling)
from vllm.model_executor.models.utils import make_empty_intermediate_tensors_factory
```
```python
# Handle quantization KV cache scales if present
if hasattr(self, "quant_config") and self.quant_config is not None:
    # If name corresponds to a quantization scale parameter, remap and load it
    from vllm.model_executor.model_loader.weight_utils import default_weight_loader, maybe_remap_kv_scale_name
```
Move the import statements `from vllm.model_executor.model_loader.weight_utils import default_weight_loader, maybe_remap_kv_scale_name` and `from vllm.model_executor.models.utils import is_pp_missing_parameter` to the top of the file to adhere to PEP 8 guidelines. This improves code readability and makes dependencies explicit at the beginning of the file.
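Concretely, the suggestion amounts to hoisting the two statements to module scope, roughly as sketched below (import paths copied from the review comment; surrounding imports omitted):

```python
# At the top of vllm/model_executor/models/arcee.py (module level)
from vllm.model_executor.model_loader.weight_utils import (
    default_weight_loader, maybe_remap_kv_scale_name)
from vllm.model_executor.models.utils import is_pp_missing_parameter
```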
```python
self.unpadded_vocab_size += lora_config.lora_extra_vocab_size

# Import DEFAULT_VOCAB_PADDING_SIZE
from vllm.model_executor.layers.vocab_parallel_embedding import DEFAULT_VOCAB_PADDING_SIZE
```
```python
def load_weights(self, weights: Iterable[Tuple[str, torch.Tensor]]) -> set[str]:
    """Load weights into the model (delegates to inner model and handles tied embeddings)."""
    # Use AutoWeightsLoader for consistency with vLLM's loading mechanism
    from vllm.model_executor.models.utils import AutoWeightsLoader
```
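For context, the delegation this docstring describes typically follows the pattern used by other vLLM decoder-only models such as Llama; a rough sketch is shown below (the `skip_prefixes` handling for tied embeddings is an assumption about the usual pattern, not a quote of this PR's code):

```python
def load_weights(self, weights: Iterable[Tuple[str, torch.Tensor]]) -> set[str]:
    # Skip the separate lm_head weight when input/output embeddings are tied.
    loader = AutoWeightsLoader(
        self,
        skip_prefixes=(["lm_head."]
                       if self.config.tie_word_embeddings else None),
    )
    return loader.load_weights(weights)
```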
[New Model] Support Arcee (Arcee Foundational Models)
1. Purpose (Why this PR?)
Add inference support for Arcee Foundational Model (AFM) so that users can serve it with vLLM in both Python and API-server workflows. AFM uses a unique ReLU² activation in its MLP layers, differentiating it from standard Llama-based models.
2. Model details
3. Implementation overview
- `ArceeForCausalLM` class in `vllm/model_executor/models/arcee.py` with custom `ArceeMLP` using ReLU² activation
- Registered in `_TEXT_GENERATION_MODELS` in `vllm/model_executor/models/registry.py` (see the sketch below)
- Updated `docs/models/supported_models.md` with Arcee entry in text generation table
- Reuses `LlamaAttention` from existing Llama implementation for attention layers
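For reference, a registry entry of this kind typically looks like the following; the exact tuple format is assumed to follow the neighbouring entries in `registry.py` and is not copied from this PR's diff:

```python
# In vllm/model_executor/models/registry.py
_TEXT_GENERATION_MODELS = {
    # ... existing entries ...
    # Maps the architecture name to (module name, class name).
    "ArceeForCausalLM": ("arcee", "ArceeForCausalLM"),
    # ... existing entries ...
}
```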
4. Performance / sanity check

Expected: Coherent completion about AI
Observed: "The future of artificial intelligence is bright and full of possibilities. As AI continues to evolve, we can expect to see significant advancements in areas such as natural language processing, computer vision, and machine learning..."
5. Test plan ✔️
- `pytest tests/models/test_arcee.py`
- `python -c "from vllm import LLM; llm = LLM('arcee-ai/AFM-4.5B-Base')"` (expanded in the sketch below)
- `vllm serve arcee-ai/AFM-4.5B-Base --trust-remote-code`
- `curl localhost:8000/v1/completions`
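As a slightly fuller version of the one-liner in the list above, an offline sanity check could look like this; the prompt and sampling settings are illustrative, not taken from the PR's test suite:

```python
from vllm import LLM, SamplingParams

# Load the model named in the PR and generate a short completion.
llm = LLM(model="arcee-ai/AFM-4.5B-Base", trust_remote_code=True)
params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["The future of artificial intelligence is"], params)
print(outputs[0].outputs[0].text)
```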
6. Documentation

- Added entry to `docs/models/supported_models.md` under Text Generation models
- Listed `ArceeForCausalLM` with example model `arcee-ai/AFM-4.5B-Base`
Checklist

- Ran `pre-commit run --all-files` (ruff formatting)
- All tests pass (`pytest -q`)

Notes for reviewers
The key architectural difference from standard Llama models is the MLP activation function. Arcee uses ReLU² (squared ReLU) instead of SiLU:
- `ArceeMLP` implements: `x = torch.pow(torch.relu(x), 2)`
- No gate projection (`gate_proj`), only `up_proj` and `down_proj` (see the sketch after this list)
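To make the shape of the MLP concrete, here is a minimal plain-PyTorch sketch of the structure described in these notes; it is illustrative only and does not use vLLM's parallel linear layers the way the actual `ArceeMLP` does:

```python
import torch
from torch import nn


class ArceeMLPSketch(nn.Module):
    """Illustrative ReLU^2 MLP: no gate_proj, only up_proj and down_proj."""

    def __init__(self, hidden_size: int, intermediate_size: int) -> None:
        super().__init__()
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # ReLU^2 activation applied elementwise between the two projections.
        x = torch.pow(torch.relu(self.up_proj(x)), 2)
        return self.down_proj(x)
```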
The model has been tested with an internal HF repo during development, but the official model is `arcee-ai/AFM-4.5B-Base`.

Test result
All outputs are coherent and contextually appropriate.