
batch_add gives lower quality results than batch_get_one #6475

@TheFlipbook

Description


I'm seeing a strange issue where batches created via llama_batch_get_one give better results than batches populated with llama_batch_add.

I was trying to convert my code to use llama_batch_add because llama_batch_get_one carries a deprecation note, but after the conversion the quality of the responses went down. This appears to be the case whether or not layers are offloaded to the GPU.

I may not understand the batch API correctly, so it is plausible that there is a mistake in my code rather than a true bug. However, if I am using the API correctly, it seemed worth raising: removing llama_batch_get_one, as its comment indicates will happen, would leave my project with either a speed or a quality regression.
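For reference, here is a sketch of the two entry points involved, approximately as they appear in llama.h and common/common.h around this commit (paraphrased, so treat the comments as a summary rather than a verbatim copy):

    // llama.h: builds a batch for one contiguous run of tokens in a single
    // sequence; its comment advises against new uses and warns of removal.
    struct llama_batch llama_batch_get_one(
        llama_token  * tokens,
        int32_t        n_tokens,
        llama_pos      pos_0,
        llama_seq_id   seq_id);

    // common/common.h: appends one token, with an explicit position, sequence
    // ids, and a per-token logits flag, to a caller-managed batch.
    void llama_batch_add(
        struct llama_batch              & batch,
        llama_token                       id,
        llama_pos                         pos,
        const std::vector<llama_seq_id> & seq_ids,
        bool                              logits);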

System Information

llama.cpp hash: f87f7b8
llama.cpp backend: Vulkan
OS: Windows 10 Pro 64-bit
GPU: NVIDIA GeForce RTX 3080
CPU: AMD Ryzen 9 3950X
Model: mistral-7b-instruct-v0.2.Q6_K.gguf (https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF)

Repro Demonstration Code

main.cpp.txt
When compiled, this .cpp file produces a program that takes two arguments:

  • The first argument is one of new|old|single to swap between methods of filling a llama_batch.
  • The second argument is a path to the model to load for testing

Bad Result

main.exe new "C:\\Dev\\SDK\\models\\gguf\\mistral-7b-instruct-v0.2.Q6_K.gguf"

  • This uses llama_batch_add to parse the prompt, similar to the simple example (see the sketch after this quote).
  • Results always begin with "Qu"-like tokens, so the first English word is usually something like "Question:" or "Questioner,"
  • Changing the last instruction usually still yields things like "Questioner" or "User" as the first word.
    """
    Questioner, allow me to paint a vivid tableau of the three most distinguished realms within the intricately woven tapestry of my fantastical universe:
    """

Good Result A

main.exe old "C:\\Dev\\SDK\\models\\gguf\\mistral-7b-instruct-v0.2.Q6_K.gguf"

  • This uses llama_batch_get_one to parse the prompt, similar to the main example (see the sketch after this quote).
  • First non-prompt word is highly varied and leads into a logical response.
  • Changing the last instruction yields logical changes, such as "Who is the most famous person in your books?" yielding "Once," and other such first words.
    """
    In the heart of my fantastical realm, where towering mountains meet vast emerald forests and azure seas stretch as far as the eye can see, lie the three grand kingdoms: Valoria, Elidor, and Thundertop.
    """

Good Result B

main.exe single "C:\\Dev\\SDK\\models\\gguf\\mistral-7b-instruct-v0.2.Q6_K.gguf"

  • This uses llama_batch_get_one to parse the prompt, but dispatches a batch containing only a single token each time (see the sketch after this quote).
  • First non-prompt word is highly varied and leads into a logical response.
  • Changing the last instruction yields logical changes, such as "Who is the most famous person in your books?" yielding "Once," and other such first words.
  • This method takes longer to evaluate than old.
    """
    In the vast expanse of Eldoria, the realm of magic and wonder, three distinct kingdoms rose like proud pillars against the ever-changing tapestry of the land. Each unique in its history, culture, and people, they stood as beacons of hope and prosperity for their inhabitants.
    """

Thank you!
