A variety of speculative models of this type are available on HF hub:

- [granite-7b-instruct-accelerator](https://huggingface.co/ibm-granite/granite-7b-instruct-accelerator)
- [granite-20b-code-instruct-accelerator](https://huggingface.co/ibm-granite/granite-20b-code-instruct-accelerator)
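
These accelerators plug into the same `speculative_model` argument as other draft models. Below is a minimal sketch, assuming `ibm-granite/granite-7b-instruct` is the matching base model for the first accelerator above; the base-model name is an assumption for illustration, so check the accelerator's model card before using it.

```python
from vllm import LLM, SamplingParams

# Sketch: pair a base model with its MLP-speculator accelerator.
# Assumption: ibm-granite/granite-7b-instruct is the matching base model
# for granite-7b-instruct-accelerator; verify against the model card.
llm = LLM(
    model="ibm-granite/granite-7b-instruct",
    speculative_model="ibm-granite/granite-7b-instruct-accelerator",
)

outputs = llm.generate(["The future of AI is"], SamplingParams(temperature=0.8, top_p=0.95))
print(outputs[0].outputs[0].text)
```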

## Speculating using EAGLE-based draft models

The following code configures vLLM to use speculative decoding where proposals are generated by
an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077) based draft model.

```python
from vllm import LLM, SamplingParams

prompts = [
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=4,
    # Path to an EAGLE checkpoint converted for vLLM (see notes below).
    speculative_model="path/to/modified/eagle/model",
    # EAGLE draft models must run without tensor parallelism.
    speculative_draft_tensor_parallel_size=1,
)

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

A few important things to consider when using EAGLE-based draft models:

1. The EAGLE draft models available in the [HF repository for EAGLE models](https://huggingface.co/yuhuili) cannot be
   used directly with vLLM due to differences in the expected layer names and model definition.
   To use these models with vLLM, use the [following script](https://gist.github.com/abhigoyal1997/1e7a4109ccb7704fbc67f625e86b2d6d)
   to convert them. Note that this script does not modify the model's weights.

   In the above example, use the script to first convert
   the [yuhuili/EAGLE-LLaMA3-Instruct-8B](https://huggingface.co/yuhuili/EAGLE-LLaMA3-Instruct-8B) model
   and then use the converted checkpoint as the draft model in vLLM; a sketch of fetching the
   original checkpoint locally is shown after this list.

2. EAGLE-based draft models need to be run without tensor parallelism
   (i.e. `speculative_draft_tensor_parallel_size` must be set to 1), although
   it is possible to run the main model using tensor parallelism (see the example above).

3. When using EAGLE-based speculators with vLLM, the observed speedup is lower than what is
   reported in the [reference implementation](https://github.com/SafeAILab/EAGLE). This issue is under
   investigation and is tracked in [vllm-project/vllm#9565](https://github.com/vllm-project/vllm/issues/9565).
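
As a concrete companion to step 1, here is a minimal sketch of fetching the original checkpoint locally before running the conversion script. It assumes the `huggingface_hub` package is installed; the local directory name is a placeholder, and the conversion invocation itself is documented in the gist linked above.

```python
from huggingface_hub import snapshot_download

# Download the original (unconverted) EAGLE checkpoint from the HF hub.
# The local directory is a placeholder: point the conversion script at it,
# then pass the converted output directory to vLLM as speculative_model.
eagle_dir = snapshot_download(
    repo_id="yuhuili/EAGLE-LLaMA3-Instruct-8B",
    local_dir="./eagle-llama3-instruct-8b-original",
)
print(f"Downloaded original EAGLE checkpoint to: {eagle_dir}")
```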

A variety of EAGLE draft models are available on the Hugging Face hub:

| Base Model                 | EAGLE on Hugging Face               | # EAGLE Parameters |
|----------------------------|-------------------------------------|--------------------|
| Vicuna-7B-v1.3             | yuhuili/EAGLE-Vicuna-7B-v1.3        | 0.24B              |
| Vicuna-13B-v1.3            | yuhuili/EAGLE-Vicuna-13B-v1.3       | 0.37B              |
| Vicuna-33B-v1.3            | yuhuili/EAGLE-Vicuna-33B-v1.3       | 0.56B              |
| LLaMA2-Chat 7B             | yuhuili/EAGLE-llama2-chat-7B        | 0.24B              |
| LLaMA2-Chat 13B            | yuhuili/EAGLE-llama2-chat-13B       | 0.37B              |
| LLaMA2-Chat 70B            | yuhuili/EAGLE-llama2-chat-70B       | 0.99B              |
| Mixtral-8x7B-Instruct-v0.1 | yuhuili/EAGLE-mixtral-instruct-8x7B | 0.28B              |
| LLaMA3-Instruct 8B         | yuhuili/EAGLE-LLaMA3-Instruct-8B    | 0.25B              |
| LLaMA3-Instruct 70B        | yuhuili/EAGLE-LLaMA3-Instruct-70B   | 0.99B              |
| Qwen2-7B-Instruct          | yuhuili/EAGLE-Qwen2-7B-Instruct     | 0.26B              |
| Qwen2-72B-Instruct         | yuhuili/EAGLE-Qwen2-72B-Instruct    | 1.05B              |
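
For instance, to pair one of the rows above with its base model, convert the EAGLE checkpoint as described in the notes and point `speculative_model` at the result. A minimal sketch, assuming `Qwen/Qwen2-7B-Instruct` as the base model and a placeholder path for the converted draft:

```python
from vllm import LLM, SamplingParams

# Sketch: Qwen2-7B-Instruct as the target model with its converted EAGLE
# draft from the table above. The draft path is a placeholder for the
# output directory of the conversion script.
llm = LLM(
    model="Qwen/Qwen2-7B-Instruct",
    speculative_model="path/to/converted/EAGLE-Qwen2-7B-Instruct",
    speculative_draft_tensor_parallel_size=1,
)

outputs = llm.generate(["The future of AI is"], SamplingParams(temperature=0.8, top_p=0.95))
print(outputs[0].outputs[0].text)
```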

## Lossless guarantees of Speculative Decoding

In vLLM, speculative decoding aims to enhance inference efficiency while maintaining accuracy. This section addresses the lossless guarantees of