Description
I'm seeing a strange issue where batches created via `llama_batch_get_one` give better results than batches populated with `llama_batch_add`.

I was trying to convert my code to use `llama_batch_add` because `llama_batch_get_one` carries a deprecation note, but after making this conversion, the quality of the responses I was getting went down. This appears to be the case whether or not layers are offloaded to the GPU.

I may not understand the batch API correctly, so it seems plausible that there is a mistake in my code rather than this being a true bug. However, if I am using the API correctly, it seemed worth raising, since removing `llama_batch_get_one` as the comment suggests would leave my project with either a speed or a quality regression.
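For context, the difference between the two paths as I understand it: `llama_batch_get_one` leaves the per-token arrays of the batch unset and relies on the legacy `all_pos_0`/`all_pos_1`/`all_seq_id` fields, while `llama_batch_add` (the helper in common/common.h) fills the per-token arrays explicitly. Abridged and paraphrased from llama.h at this revision (check the header for the exact definition):

```cpp
typedef struct llama_batch {
    int32_t         n_tokens;

    llama_token  *  token;     // per-token ids
    float        *  embd;      // or raw embeddings instead of ids
    llama_pos    *  pos;       // per-token positions
    int32_t      *  n_seq_id;  // per-token sequence-id counts
    llama_seq_id ** seq_id;    // per-token sequence ids
    int8_t       *  logits;    // per-token "compute logits" flags

    // legacy fields used by llama_batch_get_one:
    llama_pos    all_pos_0;  // used if pos == NULL: pos[i] = all_pos_0 + i*all_pos_1
    llama_pos    all_pos_1;  // used if pos == NULL
    llama_seq_id all_seq_id; // used if seq_id == NULL
} llama_batch;
```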
System Information
- llama.cpp hash: f87f7b8
- llama.cpp backend: Vulkan
- OS: Windows 10 Pro 64-bit
- GPU: Nvidia GeForce RTX 3080
- CPU: AMD Ryzen 9 3950X
- Model: mistral-7b-instruct-v0.2.Q6_K.gguf (https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF)
Repro Demonstration Code
main.cpp.txt
This cpp file, when compiled, creates a program that can be called with two arguments:
- The first argument is one of `new` | `old` | `single`, to swap between methods of filling a `llama_batch`.
- The second argument is a path to the model to load for testing.
Bad Result
`main.exe new "C:\\Dev\\SDK\\models\\gguf\\mistral-7b-instruct-v0.2.Q6_K.gguf"`
- This uses `llama_batch_add` to parse the prompt, similar to the `simple` example (a sketch of this pattern follows the sample output below).
- Results always begin with "Qu"-like tokens, usually resulting in the first English word being something like "Question:" or "Questioner,".
- Changing the last instruction usually still yields things like "Questioner" or "User" as the first word.
"""
Questioner, allow me to paint a vivid tableau of the three most distinguished realms within the intricately woven tapestry of my fantastical universe:
"""
Good Result A
`main.exe old "C:\\Dev\\SDK\\models\\gguf\\mistral-7b-instruct-v0.2.Q6_K.gguf"`
- This uses `llama_batch_get_one` to parse the prompt, similar to the `main` example (a sketch of this pattern follows the sample output below).
- The first non-prompt word is highly varied and leads into a logical response.
- Changing the last instruction yields logical changes, such as "Who is the most famous person in your books?" yielding "Once," and other such first words.
"""
In the heart of my fantastical realm, where towering mountains meet vast emerald forests and azure seas stretch as far as the eye can see, lie the three grand kingdoms: Valoria, Elidor, and Thundertop.
"""
Good Result B
`main.exe single "C:\\Dev\\SDK\\models\\gguf\\mistral-7b-instruct-v0.2.Q6_K.gguf"`
- This uses `llama_batch_get_one` to parse the prompt, but dispatches a batch with only a single token each time (a sketch of this pattern follows the sample output below).
- The first non-prompt word is highly varied and leads into a logical response.
- Changing the last instruction yields logical changes, such as "Who is the most famous person in your books?" yielding "Once," and other such first words.
- This method takes longer to evaluate than `old`.
"""
In the vast expanse of Eldoria, the realm of magic and wonder, three distinct kingdoms rose like proud pillars against the ever-changing tapestry of the land. Each unique in its history, culture, and people, they stood as beacons of hope and prosperity for their inhabitants.
"""
Thank you!