Lossless Large Language Model Acceleration via Self-Speculative Decoding #3435
KerfuffleV2 started this conversation in Ideas
Replies: 3 comments · 1 reply
- :( :( couldn't we just ask the paper authors?
- Repo with code (not yet available): https://github.com/dilab-zju/self-speculative-decoding
- I asked the authors and got a very helpful reply. See: #3565 (comment)

Original post (KerfuffleV2):
Link: https://arxiv.org/abs/2309.08168
Basically this is like the existing speculative decoding support, except it doesn't use a separate speculation model: it runs only some of the main model's layers to generate the draft. The big advantage is that the draft output is guaranteed to stay in sync with the main model, and you don't need to load a whole separate model; the existing model can be reused.
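To make the draft/verify flow concrete, here is a minimal sketch of one greedy-decoding step, assuming two hypothetical callables: `forward_partial` (a cheap pass that evaluates only a subset of layers and returns next-token logits) and `forward_full` (a normal pass returning next-token logits at every position). Neither is an existing llama.cpp function, and this is only the general shape of the loop, not the paper's exact algorithm.

```python
import numpy as np

def greedy(logits):
    return int(np.argmax(logits))

def self_speculative_step(forward_full, forward_partial, tokens, n_draft=4):
    # 1. Draft n_draft tokens autoregressively with the cheap partial pass
    #    (only a subset of the transformer layers is evaluated).
    draft, ctx = [], list(tokens)
    for _ in range(n_draft):
        tok = greedy(forward_partial(ctx))
        draft.append(tok)
        ctx.append(tok)

    # 2. Verify with ONE full-model pass over the extended sequence:
    #    forward_full returns next-token logits for every position.
    logits = forward_full(tokens + draft)

    # 3. Accept the longest prefix of the draft that the full model also
    #    picks greedily; on the first mismatch, take the full model's token.
    accepted = []
    for i, tok in enumerate(draft):
        full_tok = greedy(logits[len(tokens) - 1 + i])
        if full_tok != tok:
            accepted.append(full_tok)
            break
        accepted.append(tok)
    else:
        # Every draft token matched, so the final logits give one extra
        # token for free.
        accepted.append(greedy(logits[-1]))
    return accepted  # output is identical to plain greedy decoding
```

Because verification is a single batched pass and any mismatch is replaced with the full model's own choice, the final output matches ordinary greedy decoding exactly, which is what makes this "lossless".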
Unfortunately, they don't really include specific information about which layers are best to skip, so that's something we'd have to find out ourselves. The first step might be extending the inference API to allow passing a list of the layers to run, plus an example program that measures perplexity over various combinations of skipped layers.
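As a rough idea of what that experiment could look like, here is a sketch of a layer-subset sweep. `eval_perplexity` is a hypothetical hook that would only exist once the inference API accepts a list of layers to run; the search is capped because enumerating every combination quickly becomes infeasible (C(32, 8) is already about 10.5 million subsets).

```python
import itertools

N_LAYERS = 32   # e.g. a 7B LLaMA-style model
N_SKIP = 8      # number of layers the draft pass would skip

def search_skip_sets(eval_perplexity, max_trials=1000):
    """Try candidate skip sets and keep the one with the lowest perplexity.

    eval_perplexity(skip_layers) -> perplexity with those layer indices
    skipped; a hypothetical hook, not an existing llama.cpp API.
    """
    best_ppl, best_skip = float("inf"), None
    candidates = itertools.combinations(range(N_LAYERS), N_SKIP)
    for skip in itertools.islice(candidates, max_trials):
        ppl = eval_perplexity(list(skip))
        if ppl < best_ppl:
            best_ppl, best_skip = ppl, list(skip)
    return best_skip, best_ppl
```

In practice a greedy search (drop one layer at a time and keep whichever drop hurts perplexity least) or random sampling of subsets would probably be a more realistic starting point than enumerating combinations.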