
Commit fb4ee9c

garg-amit authored and BernardZach committed
* onboard phimoe model
* removed debug code
* added unit tests
* updated docs
* formatted
* fixed unit tests
* fixed test case
* fixed format
* refactored code
* fixed expected outputs in the integration tests
* Added a warning msg
* Addressed comments
* Addressed comments
* fixed test cases
* added paper link
* Addressed comments
* Refactored PhimoeForCausalLM forward fn
* Refactored PhimoeRotaryEmbedding class
* fixed test cases
* fixed testcase
* fixed test case
* Addressed comments
* fixed test cases
* fixed testcases
* Used cache position instead to get the seq len
1 parent b41978a commit fb4ee9c

File tree

16 files changed, +2682 -2 lines changed

docs/source/en/_toctree.yml

Lines changed: 2 additions & 0 deletions
@@ -522,6 +522,8 @@
       title: Phi
     - local: model_doc/phi3
       title: Phi-3
+    - local: model_doc/phimoe
+      title: PhiMoE
     - local: model_doc/phobert
       title: PhoBERT
     - local: model_doc/plbart

docs/source/en/index.md

Lines changed: 1 addition & 0 deletions
@@ -257,6 +257,7 @@ Flax), PyTorch, and/or TensorFlow.
 | [Persimmon](model_doc/persimmon) ||||
 | [Phi](model_doc/phi) ||||
 | [Phi3](model_doc/phi3) ||||
+| [Phimoe](model_doc/phimoe) ||||
 | [PhoBERT](model_doc/phobert) ||||
 | [Pix2Struct](model_doc/pix2struct) ||||
 | [Pixtral](model_doc/pixtral) ||||

docs/source/en/model_doc/phimoe.md

Lines changed: 118 additions & 0 deletions
@@ -0,0 +1,118 @@
+<!--Copyright 2024 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# PhiMoE
+
+## Overview
+
+The PhiMoE model was proposed in [Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone](https://arxiv.org/abs/2404.14219) by Microsoft.
+
+### Summary
+
+The abstract from the Phi-3 paper is the following:
+
+We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. Our training dataset is a scaled-up version of the one used for phi-2, composed of heavily filtered publicly available web data and synthetic data. The model is also further aligned for robustness, safety, and chat format. We also provide parameter-scaling results with a 7B, 14B models trained for 4.8T tokens, called phi-3-small, phi-3-medium, both significantly more capable than phi-3-mini (e.g., respectively 75%, 78% on MMLU, and 8.7, 8.9 on MT-bench). To enhance multilingual, multimodal, and long-context capabilities, we introduce three models in the phi-3.5 series: phi-3.5-mini, phi-3.5-MoE, and phi-3.5-Vision. The phi-3.5-MoE, a 16 x 3.8B MoE model with 6.6 billion active parameters, achieves superior performance in language reasoning, math, and code tasks compared to other open-source models of similar scale, such as Llama 3.1 and the Mixtral series, and on par with Gemini-1.5-Flash and GPT-4o-mini. Meanwhile, phi-3.5-Vision, a 4.2 billion parameter model derived from phi-3.5-mini, excels in reasoning tasks and is adept at handling both single-image and text prompts, as well as multi-image and text prompts.
+
+The original code for PhiMoE can be found [here](https://huggingface.co/microsoft/Phi-3.5-MoE-instruct).
+
+## Usage tips
+
+- This model is very similar to `Mixtral`, the main difference being [`Phi3LongRoPEScaledRotaryEmbedding`], which is used to extend the context of the rotary embeddings. The query, key and value projections are fused, and the MLP's up and gate projection layers are also fused.
+- The tokenizer used for this model is identical to [`LlamaTokenizer`], apart from some additional tokens.
+
+## How to use PhiMoE
+
+<Tip warning={true}>
+
+Phi-3.5-MoE-instruct has been integrated into the development version (4.44.2.dev) of `transformers`. Until the official version is released through `pip`, ensure that you are doing the following:
+* When loading the model, ensure that `trust_remote_code=True` is passed as an argument of the `from_pretrained()` function.
+
+The current `transformers` version can be verified with: `pip list | grep transformers`.
+
+Examples of required packages:
+```
+flash_attn==2.5.8
+torch==2.3.1
+accelerate==0.31.0
+transformers==4.43.0
+```
+
+</Tip>
+
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
+
+torch.random.manual_seed(0)
+
+model = AutoModelForCausalLM.from_pretrained(
+    "microsoft/Phi-3.5-MoE-instruct",
+    device_map="cuda",
+    torch_dtype="auto",
+    trust_remote_code=True,
+)
+
+tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3.5-MoE-instruct")
+
+messages = [
+    {"role": "system", "content": "You are a helpful AI assistant."},
+    {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"},
+    {"role": "assistant", "content": "Sure! Here are some ways to eat bananas and dragonfruits together: 1. Banana and dragonfruit smoothie: Blend bananas and dragonfruits together with some milk and honey. 2. Banana and dragonfruit salad: Mix sliced bananas and dragonfruits together with some lemon juice and honey."},
+    {"role": "user", "content": "What about solving an 2x + 3 = 7 equation?"},
+]
+
+pipe = pipeline(
+    "text-generation",
+    model=model,
+    tokenizer=tokenizer,
+)
+
+generation_args = {
+    "max_new_tokens": 500,
+    "return_full_text": False,
+    "temperature": 0.0,
+    "do_sample": False,
+}
+
+output = pipe(messages, **generation_args)
+print(output[0]['generated_text'])
+```
+
+## PhimoeConfig
+
+[[autodoc]] PhimoeConfig
+
+<frameworkcontent>
+<pt>
+
+## PhimoeModel
+
+[[autodoc]] PhimoeModel
+    - forward
+
+## PhimoeForCausalLM
+
+[[autodoc]] PhimoeForCausalLM
+    - forward
+    - generate
+
+## PhimoeForSequenceClassification
+
+[[autodoc]] PhimoeForSequenceClassification
+    - forward
+
+</pt>
+</frameworkcontent>

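The usage tips in the new model page state that the attention's query/key/value projections and the MLP's up and gate projections are stored fused. As a rough illustration of what that layout implies at runtime, here is a minimal, hypothetical sketch of splitting such fused projections; the toy dimensions and attribute names are assumptions for illustration, not PhiMoE's actual modules:

```python
# Illustration only: how fused qkv / gate-up projections are typically split.
# Toy sizes and names are hypothetical, not taken from the PhiMoE code.
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_size = 64
x = torch.randn(2, 10, hidden_size)  # (batch, seq, hidden)

# One matmul produces q, k and v, which are separated afterwards. With
# grouped-query attention the k/v slices would be narrower than q; equal
# sizes keep the sketch simple.
qkv_proj = nn.Linear(hidden_size, 3 * hidden_size, bias=False)
q, k, v = qkv_proj(x).chunk(3, dim=-1)

# Same idea for the MLP: gate and up projections share one weight matrix,
# followed by SwiGLU-style gating.
intermediate_size = 128
gate_up_proj = nn.Linear(hidden_size, 2 * intermediate_size, bias=False)
gate, up = gate_up_proj(x).chunk(2, dim=-1)
hidden = F.silu(gate) * up
```
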
docs/source/en/perf_infer_gpu_one.md

Lines changed: 2 additions & 0 deletions
@@ -79,6 +79,7 @@ FlashAttention-2 is currently supported for the following architectures:
 * [OPT](https://huggingface.co/docs/transformers/model_doc/opt#transformers.OPTModel)
 * [Phi](https://huggingface.co/docs/transformers/model_doc/phi#transformers.PhiModel)
 * [Phi3](https://huggingface.co/docs/transformers/model_doc/phi3#transformers.Phi3Model)
+* [PhiMoE](https://huggingface.co/docs/transformers/model_doc/phimoe#transformers.PhimoeModel)
 * [StableLm](https://huggingface.co/docs/transformers/model_doc/stablelm#transformers.StableLmModel)
 * [Starcoder2](https://huggingface.co/docs/transformers/model_doc/starcoder2#transformers.Starcoder2Model)
 * [Qwen2](https://huggingface.co/docs/transformers/model_doc/qwen2#transformers.Qwen2Model)

@@ -248,6 +249,7 @@ For now, Transformers supports SDPA inference and training for the following architectures:
 * [PaliGemma](https://huggingface.co/docs/transformers/model_doc/paligemma#transformers.PaliGemmaForConditionalGeneration)
 * [Phi](https://huggingface.co/docs/transformers/model_doc/phi#transformers.PhiModel)
 * [Phi3](https://huggingface.co/docs/transformers/model_doc/phi3#transformers.Phi3Model)
+* [PhiMoE](https://huggingface.co/docs/transformers/model_doc/phimoe#transformers.PhimoeModel)
 * [Idefics](https://huggingface.co/docs/transformers/model_doc/idefics#transformers.IdeficsModel)
 * [Whisper](https://huggingface.co/docs/transformers/model_doc/whisper#transformers.WhisperModel)
 * [mBart](https://huggingface.co/docs/transformers/model_doc/mbart#transformers.MBartModel)

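With PhiMoE on both support lists, the standard `attn_implementation` argument to `from_pretrained` should select the backend. A minimal sketch, assuming a CUDA GPU with enough memory for the checkpoint and, for the first variant, that the `flash-attn` package is installed:

```python
# Sketch: choosing the attention backend for PhiMoE (assumes a transformers
# build that includes this commit).
import torch
from transformers import AutoModelForCausalLM

# FlashAttention-2 backend (requires the flash-attn package)
model_fa2 = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3.5-MoE-instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

# PyTorch SDPA backend (no extra dependency)
model_sdpa = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3.5-MoE-instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    device_map="auto",
)
```
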
src/transformers/__init__.py

Lines changed: 16 additions & 0 deletions
@@ -655,6 +655,7 @@
     "models.persimmon": ["PersimmonConfig"],
     "models.phi": ["PhiConfig"],
     "models.phi3": ["Phi3Config"],
+    "models.phimoe": ["PhimoeConfig"],
     "models.phobert": ["PhobertTokenizer"],
     "models.pix2struct": [
         "Pix2StructConfig",

@@ -3031,6 +3032,14 @@
             "Phi3PreTrainedModel",
         ]
     )
+    _import_structure["models.phimoe"].extend(
+        [
+            "PhimoeForCausalLM",
+            "PhimoeForSequenceClassification",
+            "PhimoeModel",
+            "PhimoePreTrainedModel",
+        ]
+    )
     _import_structure["models.pix2struct"].extend(
         [
             "Pix2StructForConditionalGeneration",

@@ -5505,6 +5514,7 @@
     )
     from .models.phi import PhiConfig
     from .models.phi3 import Phi3Config
+    from .models.phimoe import PhimoeConfig
     from .models.phobert import PhobertTokenizer
     from .models.pix2struct import (
         Pix2StructConfig,

@@ -7561,6 +7571,12 @@
         Phi3Model,
         Phi3PreTrainedModel,
     )
+    from .models.phimoe import (
+        PhimoeForCausalLM,
+        PhimoeForSequenceClassification,
+        PhimoeModel,
+        PhimoePreTrainedModel,
+    )
     from .models.pix2struct import (
         Pix2StructForConditionalGeneration,
         Pix2StructPreTrainedModel,

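With the lazy import structure extended as above, the new classes become importable from the top-level `transformers` package. A quick, hedged smoke test (class names taken from the diff; assumes a build containing this commit):

```python
# Sketch: the public symbols these __init__.py entries are meant to expose.
from transformers import (
    PhimoeConfig,
    PhimoeForCausalLM,
    PhimoeForSequenceClassification,
    PhimoeModel,
)

# A default config is enough to confirm registration without downloading weights.
config = PhimoeConfig()
print(config.model_type)  # expected: "phimoe"
```
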
src/transformers/modeling_rope_utils.py

Lines changed: 5 additions & 2 deletions
@@ -251,7 +251,7 @@ def _compute_longrope_parameters(
         device (`torch.device`):
             The device to use for initialization of the inverse frequencies.
         seq_len (`int`, *optional*):
-            The current sequence length. Unused for this type of RoPE.
+            The current sequence length.
         rope_kwargs (`Dict`, *optional*):
             BC compatibility with the previous RoPE class instantiation, will be removed in v4.45.
     Returns:

@@ -279,8 +279,11 @@
     # `original_max_position_embeddings` field containing the pretrained value. They use the ratio between these two
     # values to compute the default attention scaling factor, instead of using `factor`.
     if hasattr(config, "original_max_position_embeddings"):
+        if seq_len and seq_len < config.original_max_position_embeddings:
+            expanded_max_position_embeddings = config.original_max_position_embeddings
+        else:
+            expanded_max_position_embeddings = config.max_position_embeddings
         max_position_embeddings = config.original_max_position_embeddings
-        expanded_max_position_embeddings = config.max_position_embeddings
         factor = expanded_max_position_embeddings / max_position_embeddings
     else:
         max_position_embeddings = config.max_position_embeddings

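The new branch makes the scaling factor depend on the current sequence length: prompts that fit inside the pretrained context keep a factor of 1.0, while longer prompts use the extended/original ratio. A standalone sketch of that arithmetic, assuming the surrounding function derives the attention factor as `sqrt(1 + log(factor) / log(original_max_position_embeddings))`, as the longrope code in this file does:

```python
# Sketch of the seq_len-aware scaling introduced in this diff (standalone,
# not the actual _compute_longrope_parameters function).
import math
from typing import Optional


def longrope_attention_factor(
    seq_len: Optional[int],
    max_position_embeddings: int,
    original_max_position_embeddings: int,
) -> float:
    # New behaviour: short prompts are treated as if no context extension applies.
    if seq_len and seq_len < original_max_position_embeddings:
        expanded = original_max_position_embeddings
    else:
        expanded = max_position_embeddings
    factor = expanded / original_max_position_embeddings
    if factor <= 1.0:
        return 1.0
    return math.sqrt(1 + math.log(factor) / math.log(original_max_position_embeddings))


# Phi-3.5-MoE-style numbers: 4k pretrained context extended to 128k.
print(longrope_attention_factor(1024, 131072, 4096))   # short prompt -> 1.0
print(longrope_attention_factor(32768, 131072, 4096))  # long prompt  -> ~1.19
```
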
src/transformers/models/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -191,6 +191,7 @@
     persimmon,
     phi,
     phi3,
+    phimoe,
     phobert,
     pix2struct,
     pixtral,

src/transformers/models/auto/configuration_auto.py

Lines changed: 2 additions & 0 deletions
@@ -211,6 +211,7 @@
         ("persimmon", "PersimmonConfig"),
         ("phi", "PhiConfig"),
         ("phi3", "Phi3Config"),
+        ("phimoe", "PhimoeConfig"),
         ("pix2struct", "Pix2StructConfig"),
         ("pixtral", "PixtralVisionConfig"),
         ("plbart", "PLBartConfig"),

@@ -522,6 +523,7 @@
         ("persimmon", "Persimmon"),
         ("phi", "Phi"),
         ("phi3", "Phi3"),
+        ("phimoe", "Phimoe"),
         ("phobert", "PhoBERT"),
         ("pix2struct", "Pix2Struct"),
         ("pixtral", "Pixtral"),

src/transformers/models/auto/modeling_auto.py

Lines changed: 3 additions & 0 deletions
@@ -199,6 +199,7 @@
         ("persimmon", "PersimmonModel"),
         ("phi", "PhiModel"),
         ("phi3", "Phi3Model"),
+        ("phimoe", "PhimoeModel"),
         ("pixtral", "PixtralVisionModel"),
         ("plbart", "PLBartModel"),
         ("poolformer", "PoolFormerModel"),

@@ -519,6 +520,7 @@
         ("persimmon", "PersimmonForCausalLM"),
         ("phi", "PhiForCausalLM"),
         ("phi3", "Phi3ForCausalLM"),
+        ("phimoe", "PhimoeForCausalLM"),
         ("plbart", "PLBartForCausalLM"),
         ("prophetnet", "ProphetNetForCausalLM"),
         ("qdqbert", "QDQBertLMHeadModel"),

@@ -951,6 +953,7 @@
         ("persimmon", "PersimmonForSequenceClassification"),
         ("phi", "PhiForSequenceClassification"),
         ("phi3", "Phi3ForSequenceClassification"),
+        ("phimoe", "PhimoeForSequenceClassification"),
         ("plbart", "PLBartForSequenceClassification"),
         ("qdqbert", "QDQBertForSequenceClassification"),
         ("qwen2", "Qwen2ForSequenceClassification"),

src/transformers/models/auto/tokenization_auto.py

Lines changed: 1 addition & 0 deletions
@@ -389,6 +389,7 @@
         ),
         ("phi", ("CodeGenTokenizer", "CodeGenTokenizerFast" if is_tokenizers_available() else None)),
         ("phi3", ("LlamaTokenizer", "LlamaTokenizerFast" if is_tokenizers_available() else None)),
+        ("phimoe", ("LlamaTokenizer", "LlamaTokenizerFast" if is_tokenizers_available() else None)),
         ("phobert", ("PhobertTokenizer", None)),
         ("pix2struct", ("T5Tokenizer", "T5TokenizerFast" if is_tokenizers_available() else None)),
         ("pixtral", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),

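Together with the config and model mappings above, these registrations let the generic Auto classes resolve the `phimoe` model type without model-specific imports. A hedged sketch of that flow (requires Hub access; loading the full checkpoint needs substantial memory, so `AutoConfig` alone suffices as a registry check):

```python
# Sketch: Auto* classes dispatching on the "phimoe" model type.
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

config = AutoConfig.from_pretrained("microsoft/Phi-3.5-MoE-instruct")
print(type(config).__name__)  # expected: "PhimoeConfig"

# LlamaTokenizerFast per the mapping above (when tokenizers is installed)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3.5-MoE-instruct")

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3.5-MoE-instruct",
    torch_dtype="auto",
    device_map="auto",
)  # dispatches to PhimoeForCausalLM
```
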