Add: Support for multiple hidden layers in Eagle3 #26164
Conversation
Code Review
This pull request adds support for multiple hidden layers in Eagle3 speculative decoding. The changes correctly modify the model initialization to create multiple decoder layers and update the forward pass to iterate through them. A new test case is also added to verify the functionality.
My main concern is a critical issue in `LlamaDecoderLayer.forward`, where the residual connection is broken between layers for `layer_idx > 0`: the residual from the previous layer is ignored, which will likely lead to incorrect model outputs. I've provided a detailed comment and a partial code suggestion to address this.
/gemini re-review please!
Code Review
This pull request adds support for multiple hidden layers in Eagle3, which is a significant enhancement. The changes are logical and well-structured. I've identified one high-severity issue related to memory efficiency that should be addressed.
LGTM, although I am not super familiar with this speculator family atm.
Would you say this is expected behavior (num_layers>2 for eagle speculators) or is this a behavior we should be mentioning in the docs?
```python
pytest.param(
    "nm-testing/random-weights-llama3.1.8b-2layer-eagle3",
    id="llama3-eagl3-multiple-layers",
),
```
nit: this is a mock model, right? I think we can comment that.
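One way that comment could look in the parametrized case; the surrounding list name is hypothetical, and only the inline comment is the suggested addition.

```python
import pytest

# Hypothetical name for the surrounding parametrize list.
EAGLE3_MODEL_CASES = [
    pytest.param(
        # Randomly initialized ("mock") 2-layer Eagle3 checkpoint, used only to
        # smoke-test multi-layer initialization; not a trained speculator.
        "nm-testing/random-weights-llama3.1.8b-2layer-eagle3",
        id="llama3-eagl3-multiple-layers",
    ),
]
```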
@codex review
💡 Codex Review
vllm/vllm/model_executor/models/llama_eagle3.py
Lines 191 to 205 in a039040
```python
def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
    stacked_params_mapping = [
        # (param_name, shard_name, shard_id)
        (".qkv_proj", ".q_proj", "q"),
        (".qkv_proj", ".k_proj", "k"),
        (".qkv_proj", ".v_proj", "v"),
        (".gate_up_proj", ".gate_proj", 0),
        (".gate_up_proj", ".up_proj", 1),
    ]
    params_dict = dict(self.named_parameters())
    loaded_params: set[str] = set()
    for name, loaded_weight in weights:
        if "midlayer." in name:
            name = name.replace("midlayer.", "layers.0.")
        for param_name, weight_name, shard_id in stacked_params_mapping:
```
The commit now instantiates `num_hidden_layers` drafter layers, but `load_weights` still rewrites every checkpoint tensor whose name contains `midlayer.` to `layers.0.`. Multi-layer Eagle3 checkpoints necessarily distinguish layers (e.g. `midlayer.layers.1.*`), and this unconditional replacement collapses all such tensors onto the first module. As a result, only layer 0 receives weights while later layers either fail to load (`KeyError` in `params_dict[name]`) or remain randomly initialized. The mapping should retain the original layer index (e.g. `midlayer.layers.{i}` → `layers.{i}`) instead of hardcoding `layers.0.`.
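A minimal sketch of the index-preserving rename the review asks for; the helper name and the single-layer fallback layout are assumptions, not the file's actual structure.

```python
import re

# Multi-layer checkpoints name tensors like "midlayer.layers.<i>.<suffix>".
_MIDLAYER_RE = re.compile(r"midlayer\.layers\.(\d+)\.")


def remap_eagle3_name(name: str) -> str:
    """Map Eagle3 checkpoint tensor names onto the drafter's parameter names."""
    # Multi-layer layout: keep the original layer index instead of collapsing
    # everything onto "layers.0.".
    if _MIDLAYER_RE.search(name):
        return _MIDLAYER_RE.sub(lambda m: f"layers.{m.group(1)}.", name)
    # Assumed legacy single-layer layout ("midlayer.<suffix>"), which still
    # maps onto the first (and only) layer.
    return name.replace("midlayer.", "layers.0.")


if __name__ == "__main__":
    print(remap_eagle3_name("midlayer.layers.1.self_attn.q_proj.weight"))
    # -> layers.1.self_attn.q_proj.weight
    print(remap_eagle3_name("midlayer.self_attn.q_proj.weight"))
    # -> layers.0.self_attn.q_proj.weight
```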
LGTM, thanks!
As of now, most Eagle3 checkpoints have just a single layer by default. However, there has been interest in trying multiple layers in the Eagle3 drafter to see whether that improves acceptance rates; our team is training one such speculator already!
Please also fix the pre-commit issue so that we can land
Summary
Support multiple layers in Eagle3 checkpoints. The input dimension of the first layer should be `2 * hidden_size`, and that of the second (and subsequent) layers should be `hidden_size`, because no input embeddings are present to concatenate for layers after the first.
Changes
- `LlamaDecoderLayer`: accepts a `layer_idx` parameter and dynamically sets the QKV input size based on layer position (see the sketch below)
- `LlamaModel`: supports multiple Eagle3 layers via the `num_hidden_layers` config parameter
- `test_eagle3.py`: added a test case
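To make the shape rule above concrete, here is a minimal, hypothetical sketch; the class and field names are illustrative, not the actual `LlamaDecoderLayer` changes.

```python
import torch
import torch.nn as nn


class Eagle3DraftLayer(nn.Module):
    def __init__(self, hidden_size: int, layer_idx: int) -> None:
        super().__init__()
        # Layer 0 consumes [input_embeds ; target_hidden], i.e. 2 * hidden_size;
        # every later layer only sees the previous layer's hidden_size output.
        in_size = 2 * hidden_size if layer_idx == 0 else hidden_size
        self.qkv_proj = nn.Linear(in_size, 3 * hidden_size, bias=False)
        self.o_proj = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.qkv_proj(x).chunk(3, dim=-1)
        # Attention itself is omitted; this sketch only demonstrates the shapes.
        return self.o_proj(v)


if __name__ == "__main__":
    hidden = 64
    layers = nn.ModuleList(Eagle3DraftLayer(hidden, i) for i in range(2))
    embeds = torch.randn(1, 4, hidden)
    target_hidden = torch.randn(1, 4, hidden)
    x = torch.cat([embeds, target_hidden], dim=-1)  # 2 * hidden_size into layer 0
    x = layers[0](x)
    x = layers[1](x)  # plain hidden_size into subsequent layers
    print(x.shape)  # torch.Size([1, 4, 64])
```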
Testing
Smoke Test Added
Added parametrized test case `llama3-eagl3-multiple-layers` using `nm-testing/random-weights-llama3.1.8b-2layer-eagle3` to verify multi-layer Eagle3 initialization and execution.
Manual Verification Script
Run command:
Verification Output
Checkpoint for verification: https://huggingface.co/nm-testing/random-weights-llama3.1.8b-2layer-eagle3