[Model] Add FlexOlmo model implementation #24923
base: main
Conversation
Signed-off-by: Shane A <[email protected]>
Signed-off-by: Shane A <[email protected]>
Signed-off-by: Shane A <[email protected]>
next_tokens = self.sampler(logits, sampling_metadata)
return next_tokens

def load_weights(self, weights: Iterable[tuple[str,
We can use `AutoWeightsLoader` and move this function into `FlexOlmoModel`.
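For reference, a minimal sketch of what that would look like (assuming vLLM's `AutoWeightsLoader` helper from `model_executor/models/utils.py`; class names follow this PR and exact details may differ):

```python
from collections.abc import Iterable

import torch
from torch import nn

from vllm.model_executor.models.interfaces import SupportsPP
from vllm.model_executor.models.utils import AutoWeightsLoader


class FlexOlmoForCausalLM(nn.Module, SupportsPP):
    # ... __init__ builds self.model = FlexOlmoModel(...) as in this PR ...

    def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
        # Delegate to AutoWeightsLoader; the per-parameter logic (stacked
        # params, expert mapping, fp8 scales) moves into FlexOlmoModel.load_weights.
        loader = AutoWeightsLoader(self)
        return loader.load_weights(weights)
```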
d1adc72 Done. I had written the FlexOlmo implementation a while ago so I missed improvements like this one.
# Params for weights, fp8 weight scales, fp8 activation scales
# (param_name, weight_name, expert_id, shard_id)
expert_params_mapping = FusedMoE.make_expert_params_mapping(
We can implement `expert_params_mapping` like https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/qwen2_moe.py#L557, so this model can support BNB quantization directly.
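Roughly along these lines (a sketch of the Qwen2-MoE-style mapping, not the exact code; the checkpoint projection names and the `num_experts` config attribute are assumptions about FlexOlmo):

```python
from torch import nn

from vllm.model_executor.layers.fused_moe import FusedMoE


class FlexOlmoModel(nn.Module):
    # ... rest of the model as in this PR ...

    def get_expert_mapping(self) -> list[tuple[str, str, int, str]]:
        # (param_name, weight_name, expert_id, shard_id) entries covering
        # weights, fp8 weight scales, and fp8 activation scales, so loaders
        # such as the BitsAndBytes one can locate the expert weights.
        return FusedMoE.make_expert_params_mapping(
            ckpt_gate_proj_name="gate_proj",
            ckpt_down_proj_name="down_proj",
            ckpt_up_proj_name="up_proj",
            num_experts=self.config.num_experts,
        )
```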
56b3cdf Done
Signed-off-by: Shane A <[email protected]>
Signed-off-by: Shane A <[email protected]>
"Fairseq2LlamaForCausalLM": _HfExamplesInfo("mgleize/fairseq2-dummy-Llama-3.2-1B"), # noqa: E501 | ||
"FalconForCausalLM": _HfExamplesInfo("tiiuae/falcon-7b"), | ||
"FalconH1ForCausalLM":_HfExamplesInfo("tiiuae/Falcon-H1-0.5B-Base"), | ||
"FlexOlmoForCausalLM": _HfExamplesInfo("shanearora/Flex-reddit-2x7B-1T"), |
"FlexOlmoForCausalLM": _HfExamplesInfo("shanearora/Flex-reddit-2x7B-1T"), | |
"FlexOlmoForCausalLM": _HfExamplesInfo(allenai/Flex-reddit-2x7B-1T"), |
Ignore this, I misunderstood
Is the Transformers impl compatible with the Transformers backend? If yes, then the vLLM PR can be reduced to 3 lines, similar to #22665

It probably is compatible (I just put out the Transformers PR), but the perf is quite poor compared to a native implementation like this. I see that there's ongoing work to address MoE perf, so if you'd prefer the 3-line approach then I can try to do that instead.

Ah, I didn't see this was an MoE model. You're right that the performance of MoEs on the Transformers backend is not good today. In my PoC PR #22650 I managed to get performance to within 1% of the native vLLM implementation 🚀

So let's move forward
class FlexOlmoForCausalLM(nn.Module, SupportsPP):
I haven't had the bandwidth to review this PR in detail yet; I will take a look ASAP (probably tomorrow). But at a glance, it seems FlexOlmo's implementation is quite similar to the existing OLMoE?
The main difference is that RMS norm is applied after attention/feedforward rather than before.
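To make that ordering difference concrete, a tiny illustration (stand-in modules, not the actual layer code; `nn.RMSNorm` needs a recent PyTorch, >= 2.4):

```python
import torch
from torch import nn

hidden = 64
attn_norm, ffn_norm = nn.RMSNorm(hidden), nn.RMSNorm(hidden)
attn = nn.Linear(hidden, hidden)  # stand-in for self-attention
ffn = nn.Linear(hidden, hidden)   # stand-in for the MoE feed-forward
x = torch.randn(2, 5, hidden)

# OLMoE-style pre-norm: normalize the input of each sublayer.
h = x + attn(attn_norm(x))
pre_norm_out = h + ffn(ffn_norm(h))

# FlexOlmo-style post-norm: normalize the output of each sublayer instead.
h = x + attn_norm(attn(x))
post_norm_out = h + ffn_norm(ffn(h))
```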
If that's the case, we can integrate with OLMoE to avoid redundant code, and we can refer to Motif's implementation approach.
I'm not seeing how Motif's implementation relates to our situation, but it sounds like you want FlexOlmo's implementation to be done within olmoe.py (which sounds doable enough).
Motif inherits from Llama and only implements its own decoder layer. Your model can integrate OLMoE the same way: implement your own decoder layer and modify only the parts of OLMoE's code that need to change.
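Just to check my understanding, something roughly like the sketch below? (Illustrative only; the OLMoE constructor/forward signatures and attribute names here are assumptions about olmoe.py, and the real layer also threads residuals and attention metadata.)

```python
from vllm.model_executor.layers.layernorm import RMSNorm
from vllm.model_executor.models.olmoe import OlmoeDecoderLayer, OlmoeForCausalLM


class FlexOlmoDecoderLayer(OlmoeDecoderLayer):
    """OLMoE decoder layer with RMSNorm applied after attention/feed-forward."""

    def __init__(self, config, *args, **kwargs):
        super().__init__(config, *args, **kwargs)
        # Post-norms replacing the pre-norm placement (names are illustrative).
        self.post_attention_layernorm = RMSNorm(config.hidden_size,
                                                eps=config.rms_norm_eps)
        self.post_feedforward_layernorm = RMSNorm(config.hidden_size,
                                                  eps=config.rms_norm_eps)

    def forward(self, positions, hidden_states):
        # Simplified: attention/MoE first, then norm, then the residual add.
        residual = hidden_states
        hidden_states = self.self_attn(positions=positions,
                                       hidden_states=hidden_states)
        hidden_states = residual + self.post_attention_layernorm(hidden_states)

        residual = hidden_states
        hidden_states = self.mlp(hidden_states)
        return residual + self.post_feedforward_layernorm(hidden_states)


class FlexOlmoForCausalLM(OlmoeForCausalLM):
    # The config/model plumbing would swap FlexOlmoDecoderLayer in here.
    pass
```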
That's it! You got it right
Signed-off-by: Shane A <[email protected]>
Signed-off-by: Shane A <[email protected]>
Gentle ping
| `FalconForCausalLM` | Falcon | `tiiuae/falcon-7b`, `tiiuae/falcon-40b`, `tiiuae/falcon-rw-7b`, etc. | | ✅︎ | ✅︎ |
| `FalconMambaForCausalLM` | FalconMamba | `tiiuae/falcon-mamba-7b`, `tiiuae/falcon-mamba-7b-instruct`, etc. | | ✅︎ | ✅︎ |
| `FalconH1ForCausalLM` | Falcon-H1 | `tiiuae/Falcon-H1-34B-Base`, `tiiuae/Falcon-H1-34B-Instruct`, etc. | ✅︎ | ✅︎ | ✅︎ |
| `FlexOlmoForCausalLM` | FlexOlmo | TBA | | ✅︎ | ✅︎ |
QQ: Is this model still not announced?
Sorry for the delay. LGTM!
Purpose
This PR adds the implementation for the FlexOlmo models. The HF implementation is being added concurrently, so the PR includes the config too.
Test Plan
The test plan is to see that basic generation (via `examples/offline_inference/basic/generate.py`) produces sensible output. I cannot run an HF vs. vLLM comparison in a shareable manner because the HF implementation is being added concurrently. Nevertheless, I used a custom script to compare HF and vLLM and saw only minor numerical differences (which would eventually propagate and grow larger) with identical output.

Test Result
Result of running `examples/offline_inference/basic/generate.py`:
Excerpt of diff between HF and vLLM activations (using a custom script).
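For reference, a minimal sketch of the kind of offline generation check described in the test plan (the checkpoint id is the test one used above; the prompt and sampling settings are illustrative, not the script's exact arguments):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="shanearora/Flex-reddit-2x7B-1T", trust_remote_code=True)
params = SamplingParams(temperature=0.0, max_tokens=64)

for output in llm.generate(["The capital of France is"], params):
    print(output.outputs[0].text)
```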
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.