Speculative sampling is explained here: https://arxiv.org/abs/2302.01318
In simpler terms, here (a short sketch of the accept/reject loop follows the links below):
- Combine large LLM with small LLM for faster inference #630 (comment)
- Combine large LLM with small LLM for faster inference #630 (comment)
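For reference, here is a minimal, self-contained sketch of the accept/reject scheme from the paper. The `Model` type and the `target` / `draft` callables are hypothetical stand-ins (not llama.cpp API): each maps a token prefix to a full next-token distribution.

```cpp
// Minimal sketch of speculative sampling (arXiv:2302.01318).
// `Model` is a hypothetical stand-in for the "draft" (small) and "main"
// (large) models; both map a prefix to a distribution over the vocab.
#include <algorithm>
#include <functional>
#include <random>
#include <vector>

using Dist  = std::vector<float>;                          // prob. over vocab
using Model = std::function<Dist(const std::vector<int>&)>;

static int sample(const Dist & p, std::mt19937 & rng) {
    std::discrete_distribution<int> d(p.begin(), p.end());
    return d(rng);
}

// One speculative step: draft K tokens with the small model, then verify
// them with the large model, accepting token x with prob. min(1, p(x)/q(x)).
std::vector<int> speculative_step(const Model & target, const Model & draft,
                                  std::vector<int> prefix, int K,
                                  std::mt19937 & rng) {
    std::vector<int>  drafted;
    std::vector<Dist> q;                 // draft distributions q(. | prefix)
    auto ctx = prefix;
    for (int i = 0; i < K; ++i) {
        q.push_back(draft(ctx));
        const int x = sample(q.back(), rng);
        drafted.push_back(x);
        ctx.push_back(x);
    }

    // In a real implementation, all K+1 target distributions below come from
    // a single batched forward pass over the drafted tokens.
    std::uniform_real_distribution<float> u(0.f, 1.f);
    for (int i = 0; i < K; ++i) {
        const Dist p = target(prefix);   // p(. | prefix)
        const int  x = drafted[i];
        if (u(rng) <= std::min(1.f, p[x] / q[i][x])) {
            prefix.push_back(x);         // accept the drafted token
        } else {
            // reject: resample from the residual max(0, p - q), renormalized
            Dist  r(p.size());
            float norm = 0.f;
            for (size_t t = 0; t < p.size(); ++t) {
                r[t]  = std::max(0.f, p[t] - q[i][t]);
                norm += r[t];
            }
            for (auto & v : r) v /= norm;
            prefix.push_back(sample(r, rng));
            return prefix;               // stop at the first rejection
        }
    }

    // all K drafted tokens accepted: sample one extra token from the target
    prefix.push_back(sample(target(prefix), rng));
    return prefix;
}
```

This is why the method never degrades quality: the accept/reject rule makes the output distribution identical to sampling from the "main" model alone, while accepted drafts come almost for free.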
To start, the "draft" model can be generated with the train-text-from-scratch example, using the same vocab as LLaMA. Later, we can try to utilize better models.
We also assume that batching multiple tokens with the "main" model is significantly faster than processing the tokens one by one. This may not yet be the case, but it will be once we close ggml-org/ggml#293.
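To illustrate the batching assumption, a rough sketch against the current llama.cpp C API (`llama_eval`); the exact call pattern for speculative sampling is still to be designed:

```cpp
// Rough sketch of the batching assumption, using the llama.cpp C API
// (llama_eval). Assumes logits_all is enabled in the context params so that
// logits for every drafted position are available for the accept/reject test.
#include "llama.h"

#include <vector>

// Verify K drafted tokens one by one: K separate forward passes of the
// "main" model, each paying the full per-call overhead.
static void eval_one_by_one(llama_context * ctx, const std::vector<llama_token> & drafted,
                            int n_past, int n_threads) {
    for (size_t i = 0; i < drafted.size(); ++i) {
        llama_eval(ctx, &drafted[i], 1, n_past + (int) i, n_threads);
    }
}

// Verify all K drafted tokens in a single batched forward pass. This is the
// call that needs to be fast for speculative sampling to pay off
// (cf. ggml-org/ggml#293).
static void eval_batched(llama_context * ctx, const std::vector<llama_token> & drafted,
                         int n_past, int n_threads) {
    llama_eval(ctx, drafted.data(), (int) drafted.size(), n_past, n_threads);
}
```

If the batched call costs roughly the same as a single-token call, each speculative step verifies K drafted tokens for about the price of one "main" model evaluation.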