
Conversation

@alyosha-swamy (Contributor) commented on Jul 20, 2025

[New Model] Support Arcee (Arcee Foundational Models)

1. Purpose (Why this PR?)

Add inference support for Arcee Foundational Model (AFM) so that users can serve it with vLLM in both Python and API-server workflows. AFM uses a unique ReLU² activation in its MLP layers, differentiating it from standard Llama-based models.

2. Model details

| Field | Value / Reference |
| --- | --- |
| Source repo / HF id | huggingface.co/arcee-ai/AFM-4.5B-Base |
| Architecture | Llama-style decoder-only transformer with ReLU² MLP activation |
| Context length | 64k tokens |
| Hidden size / #layers | 4096 / 32 |
| License | CC BY-NC 4.0 |
| Special quirks | Uses ReLU² (squared ReLU) activation instead of SiLU in MLP layers |

3. Implementation overview

  • Added ArceeForCausalLM class in vllm/model_executor/models/arcee.py with custom ArceeMLP using ReLU² activation
  • Registered model in _TEXT_GENERATION_MODELS in vllm/model_executor/models/registry.py (entry sketched below)
  • Updated docs/models/supported_models.md with Arcee entry in text generation table
  • Reused LlamaAttention from existing Llama implementation for attention layers
  • Implemented proper LoRA and Pipeline Parallelism support
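
For reference, the registry change amounts to a single new entry. A minimal sketch, assuming vLLM's convention of mapping the HF architecture name to a (module name, class name) pair in _TEXT_GENERATION_MODELS (surrounding entries elided):

```python
# vllm/model_executor/models/registry.py (sketch; only the new entry is shown,
# and the exact shape of neighboring entries is assumed from other models)
_TEXT_GENERATION_MODELS = {
    # ... existing architectures ...
    "ArceeForCausalLM": ("arcee", "ArceeForCausalLM"),
    # ... existing architectures ...
}
```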

4. Performance / sanity check

$ python -m vllm.entrypoints.openai.api_server --model arcee-ai/AFM-4.5B-Base --trust-remote-code
$ curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
    "model": "arcee-ai/AFM-4.5B-Base",
    "prompt": "The future of artificial intelligence is",
    "max_tokens": 50
}'

Expected: Coherent completion about life's meaning

Observed: " a question that has been asked throughout the history of mankind. The search for an answer to this question has inspired countless works of art, literature, and philosophy. Whether we consider the existentialist ideas of Albert Camus or the religious perspectives of spiritual leaders"

5. Test plan ✔️

| Test | Command | Expected |
| --- | --- | --- |
| Unit | `pytest tests/models/test_arcee.py` | All tests pass |
| Model loading | `python -c "from vllm import LLM; llm = LLM('arcee-ai/AFM-4.5B-Base')"` | Model loads without errors |
| Integration | `vllm serve arcee-ai/AFM-4.5B-Base --trust-remote-code` | Server starts and responds to requests |
| Generation | `curl localhost:8000/v1/completions` | 200 OK + valid completions |
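
For a quick end-to-end check outside the API server, a minimal offline sketch using vLLM's Python API is shown below; the prompt and sampling values are illustrative and not part of the PR's test suite:

```python
# Offline smoke test (illustrative values; not taken from the PR itself).
from vllm import LLM, SamplingParams

llm = LLM(model="arcee-ai/AFM-4.5B-Base", trust_remote_code=True)
sampling = SamplingParams(temperature=0.7, max_tokens=50)

for output in llm.generate(["The meaning of life is"], sampling):
    print(output.outputs[0].text)
```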

6. Documentation

  • Added row to docs/models/supported_models.md under Text Generation models
  • Model listed as ArceeForCausalLM with example model arcee-ai/AFM-4.5B-Base
  • Marked as supporting LoRA (✅), Pipeline Parallel (✅), and V1 (✅)

Checklist

  • I ran pre-commit run --all-files (ruff formatting)
  • All CI tests pass locally (pytest -q)
  • The PR description follows vLLM's "Essential Elements" template
  • No breaking changes for existing model classes

Notes for reviewers

The key architectural difference from standard Llama models is the MLP activation function. Arcee uses ReLU² (squared ReLU) instead of SiLU:

  • ArceeMLP implements: x = torch.pow(torch.relu(x), 2) (see the sketch after this list)
  • No gating mechanism (no gate_proj), only up_proj and down_proj
  • All other components (attention, layer norm, etc.) reuse existing Llama implementations
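
A minimal sketch of such an ungated ReLU² MLP, using plain torch.nn.Linear in place of vLLM's tensor-parallel linear layers; names and signatures here are illustrative, not the exact ArceeMLP added in this PR:

```python
# Illustrative sketch only; the actual ArceeMLP in the PR may use vLLM's
# parallel linear layers and quantization hooks instead of nn.Linear.
import torch
from torch import nn


class ReluSquaredMLP(nn.Module):
    """Ungated MLP: down_proj(relu(up_proj(x)) ** 2); no gate_proj."""

    def __init__(self, hidden_size: int, intermediate_size: int) -> None:
        super().__init__()
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.up_proj(x)
        x = torch.pow(torch.relu(x), 2)  # ReLU² (squared ReLU) instead of SiLU gating
        return self.down_proj(x)
```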

The model has been tested with an internal HF repo during development, but the official model is arcee-ai/AFM-4.5B-Base.

Test result

| seq | Prompt | vLLM Output |
| --- | --- | --- |
| 0 | "The meaning of life is" | " a question that has been asked throughout the history of mankind. The search for an answer to this question has inspired countless works of art, literature, and philosophy. Whether we consider the existentialist ideas of Albert Camus or the religious perspectives of spiritual leaders" |
| 1 | "Climate change is primarily caused by" | " human activity, specifically the emission of greenhouse gases such as carbon dioxide (CO2) and methane (CH4). It leads to changes in average temperatures and weather patterns, impacting both nature and human society." |
| 2 | "Machine learning algorithms work by" | " training a predictive model using labeled training data: the model detects patterns in the training data and learns from it. That model is then tested using a test set, which it must predict to achieve a good accuracy rate." |

All outputs are coherent and contextually appropriate.

@alyosha-swamy requested a review from hmellor as a code owner on July 20, 2025 at 20:57
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to be added to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify bot added the documentation (Improvements or additions to documentation) and new-model (Requests to new models) labels on Jul 20, 2025

@gemini-code-assist bot left a comment


Code Review

This pull request adds support for the Arcee (AFM) model to vLLM. The implementation correctly identifies the key architectural difference—the ReLU² activation in the MLP—and reuses existing components like LlamaAttention effectively. The changes are well-structured and include necessary updates to documentation and model registration.

I've identified one area for improvement in the ArceeModel.load_weights method concerning inefficient imports and incomplete support for quantization scale loading. Addressing this will improve model loading performance and ensure correctness for features like AWQ quantization. Overall, this is a solid contribution.

Comment on lines +213 to +273
    def load_weights(self, weights: Iterable[Tuple[str, torch.Tensor]]) -> set[str]:
        """Load pre-trained weights from HuggingFace format into the model."""
        # Mapping for merging or renaming weight parameters from HF into our model
        stacked_params_mapping = [
            # Each tuple: (combined_param_name, hf_subparam_name, index_or_key)
            (".qkv_proj", ".q_proj", "q"),
            (".qkv_proj", ".k_proj", "k"),
            (".qkv_proj", ".v_proj", "v"),
            # Note: No gate_proj since AFM has no gated MLP
        ]
        params_dict = dict(self.named_parameters())
        loaded_params: set[str] = set()
        for name, loaded_weight in weights:
            # Skip rotary cache parameters if present (not actual model weights)
            if "rotary_emb.inv_freq" in name or "rotary_emb.cos_cached" in name or "rotary_emb.sin_cached" in name:
                continue
            # Handle quantization KV cache scales if present
            if hasattr(self, "quant_config") and self.quant_config is not None:
                # If name corresponds to a quantization scale parameter, remap and load it
                from vllm.model_executor.model_loader.weight_utils import default_weight_loader, maybe_remap_kv_scale_name
                if "scale" in name:
                    maybe_name = maybe_remap_kv_scale_name(name, params_dict)
                    if maybe_name is None:
                        continue
                    name = maybe_name
            # Pipeline parallel: skip parameters not on this rank
            from vllm.model_executor.models.utils import is_pp_missing_parameter
            from vllm.model_executor.model_loader.weight_utils import default_weight_loader
            if is_pp_missing_parameter(name, self):
                continue

            # Attempt to map and load merged parameters
            for param_name, weight_name, shard_id in stacked_params_mapping:
                if weight_name not in name:
                    continue
                mapped_name = name.replace(weight_name, param_name)
                if mapped_name.endswith(".bias") and mapped_name not in params_dict:
                    # Skip any unexpected biases (e.g., from certain quantization or GPTQ checkpoints)
                    break
                if mapped_name in params_dict:
                    param = params_dict[mapped_name]
                    weight_loader = getattr(param, "weight_loader", default_weight_loader)
                    weight_loader(param, loaded_weight, shard_id)  # load the shard into the combined param
                    loaded_params.add(mapped_name)
                else:
                    logging.warning(f"Unexpected parameter in checkpoint: {name}")
                break
            else:
                # No special mapping, try direct load
                if name in params_dict:
                    # For tied embeddings, skip loading lm_head if it will be tied
                    if name.startswith("lm_head.") and getattr(self.config, "tie_word_embeddings", False):
                        continue
                    param = params_dict[name]
                    weight_loader = getattr(param, "weight_loader", default_weight_loader)
                    weight_loader(param, loaded_weight)
                    loaded_params.add(name)
                else:
                    # Silently skip any unmatched parameters (e.g., vision tower weights in multimodal models)
                    logging.debug(f"Ignoring unmatched checkpoint parameter: {name}")
        return loaded_params
Severity: high

The load_weights method in ArceeModel has a couple of issues:

  1. Inefficient Imports: Imports are performed inside the main loop, which is inefficient as they will be executed for every weight parameter. These should be moved to the top of the method.
  2. Incomplete Quantization Support: The logic for handling quantization scales is incomplete. It's missing the handling for AWQ KV cache scales, which is present in other models like Llama. This will cause issues when using AWQ-quantized versions of this model.

I've provided a refactored version of the method that addresses these points by moving imports out of the loop and adding the correct logic for handling quantization scales, aligning it with the implementation in LlamaModel.

    def load_weights(self, weights: Iterable[Tuple[str, torch.Tensor]]) -> set[str]:
        """Load pre-trained weights from HuggingFace format into the model."""
        from vllm.model_executor.model_loader.weight_utils import (
            default_weight_loader, maybe_remap_kv_scale_name)
        from vllm.model_executor.models.utils import is_pp_missing_parameter

        # Mapping for merging or renaming weight parameters from HF into our model
        stacked_params_mapping = [
            # Each tuple: (combined_param_name, hf_subparam_name, index_or_key)
            (".qkv_proj", ".q_proj", "q"),
            (".qkv_proj", ".k_proj", "k"),
            (".qkv_proj", ".v_proj", "v"),
            # Note: No gate_proj since AFM has no gated MLP
        ]
        params_dict = dict(self.named_parameters())
        loaded_params: set[str] = set()
        for name, loaded_weight in weights:
            # Skip rotary cache parameters if present (not actual model weights)
            if "rotary_emb.inv_freq" in name or "rotary_emb.cos_cached" in name or "rotary_emb.sin_cached" in name:
                continue

            # Handle quantization KV cache scales
            if self.quant_config is not None and (
                    scale_name := self.quant_config.get_cache_scale(name)):
                param = params_dict[scale_name]
                weight_loader = getattr(param, "weight_loader",
                                        default_weight_loader)
                loaded_weight = (loaded_weight if loaded_weight.dim() == 0 else
                                 loaded_weight[0])
                weight_loader(param, loaded_weight)
                loaded_params.add(scale_name)
                continue

            if "scale" in name:
                # Remapping the name of FP8 kv-scale.
                maybe_name = maybe_remap_kv_scale_name(name, params_dict)
                if maybe_name is None:
                    continue
                name = maybe_name

            # Pipeline parallel: skip parameters not on this rank
            if is_pp_missing_parameter(name, self):
                continue

            # Attempt to map and load merged parameters
            for param_name, weight_name, shard_id in stacked_params_mapping:
                if weight_name not in name:
                    continue
                mapped_name = name.replace(weight_name, param_name)
                if mapped_name.endswith(".bias") and mapped_name not in params_dict:
                    # Skip any unexpected biases (e.g., from certain quantization or GPTQ checkpoints)
                    break
                if mapped_name in params_dict:
                    param = params_dict[mapped_name]
                    weight_loader = getattr(param, "weight_loader", default_weight_loader)
                    weight_loader(param, loaded_weight, shard_id)  # load the shard into the combined param
                    loaded_params.add(mapped_name)
                else:
                    logging.warning(f"Unexpected parameter in checkpoint: {name}")
                break
            else:
                # No special mapping, try direct load
                if name in params_dict:
                    # For tied embeddings, skip loading lm_head if it will be tied
                    if name.startswith("lm_head.") and getattr(self.config, "tie_word_embeddings", False):
                        continue
                    param = params_dict[name]
                    weight_loader = getattr(param, "weight_loader", default_weight_loader)
                    weight_loader(param, loaded_weight)
                    loaded_params.add(name)
                else:
                    # Silently skip any unmatched parameters (e.g., vision tower weights in multimodal models)
                    logging.debug(f"Ignoring unmatched checkpoint parameter: {name}")
        return loaded_params

@hmellor mentioned this pull request on Jul 21, 2025