
Vulkan backend regression: gibberish output when layers offloaded to GPU #8092


Closed
Adriankhl opened this issue Jun 24, 2024 · 13 comments · Fixed by #8479
Labels
bug-unconfirmed · high severity (used to report high severity bugs in llama.cpp: malfunctioning hinders an important workflow)

Comments

@Adriankhl
Contributor

Adriankhl commented Jun 24, 2024

What happened?

OS: Windows
Compiler: cl or clang-cl
Build command: cmake .. -GNinja -DCMAKE_C_COMPILER=clang-cl -DCMAKE_CXX_COMPILER=clang-cl -DCMAKE_EXPORT_COMPILE_COMMANDS=1 -DLLAMA_NATIVE=OFF -DLLAMA_VULKAN=ON -DCMAKE_BUILD_TYPE=Debug
APU: AMD 780M
Vulkan Instance Version: 1.3.261
Vulkan SDK version: 1.3.283
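
For reference, a full configure-and-build sequence implied by the command above might look like this (a sketch, bash-style; paths and the clang-cl toolchain are assumed as quoted, and the wrapped lines should be joined on Windows cmd):

# Sketch of the configure/build steps implied by the command quoted above.
mkdir build && cd build
cmake .. -GNinja -DCMAKE_C_COMPILER=clang-cl -DCMAKE_CXX_COMPILER=clang-cl \
  -DCMAKE_EXPORT_COMPILE_COMMANDS=1 -DLLAMA_NATIVE=OFF -DLLAMA_VULKAN=ON \
  -DCMAKE_BUILD_TYPE=Debug
cmake --build .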

PR #7947 causes gibberish output when running

.\bin\llama-cli.exe -m "C:\Users\adriankhl\git\models\Meta-Llama-3-8B-Instruct.Q5_K_M.gguf" --prompt "Hello world. " -ngl 33

while setting -ngl 0 produces normal output.
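
A quick way to see the difference is to run the same prompt with and without offload; a minimal sketch (bash-style; the model path, seed, and token count are placeholders):

# Sketch: run the same prompt with and without GPU offload and compare outputs.
for NGL in 0 33; do
  echo "=== -ngl $NGL ==="
  ./bin/llama-cli -m ./models/Meta-Llama-3-8B-Instruct.Q5_K_M.gguf \
    --prompt "Hello world. " -n 32 --seed 1 -ngl $NGL
done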

Name and Version

version: 3213 (52fc870)
built with Clang 18.1.6 for

What operating system are you seeing the problem on?

Windows

Relevant log output

No response

@Adriankhl added the bug-unconfirmed and high severity labels Jun 24, 2024
@Adriankhl changed the title from "Vulkan backend regression: produce gibberish output when there are layers offloaded to GPU" to "Vulkan backend regression: gibberish output when layers offloaded to GPU" Jun 24, 2024
@stduhpf
Contributor

stduhpf commented Jun 24, 2024

Not happening on my end. Maybe you should give more details about the setup you're using (GPU model, driver version...)

@Adriankhl
Contributor Author

Not happening on my end. Maybe you should give more details about the setup you're using (GPU model, driver version...)

Added that information. It is an AMD APU; problems seem to be much more frequent on integrated GPUs, probably because @0cc4m is testing on a different setup.

@dspasyuk
Contributor

dspasyuk commented Jun 29, 2024

@Adriankhl Your issue might be with the prompt; you need to use the Llama prompt format. Try this: https://github.com/dspasyuk/llama.cui

or:

../llama.cpp/llama-cli --model ../../models/Meta-Llama-3-8B-Instruct_Q4_K_S.gguf --n-gpu-layers 35 -cnv --interactive --interactive-first --simple-io -b 2048 -n -1 -e --ctx_size 0 --temp 0.3 --top_k 10 --multiline-input --repeat_penalty 1.12 -t 8 -ptc 10 -r <|start_header_id|>user --no-display-prompt

and then prompt the model like so, and it should work (a scripted sketch of this format follows the example below): <|begin_of_text|><|start_header_id|>system<|end_header_id|>

Role and Purpose: You are Alice, a large language model. Your purpose is to assist users by providing information, answering questions, and engaging in meaningful conversations based on the data you were trained on. Behavior and Tone: Be informative, engaging, and respectful. Maintain a neutral and unbiased tone. Ensure that responses are clear and concise. Capabilities: Use your training data to provide accurate and relevant information. Explain complex concepts in an easy-to-understand manner. Provide sources when referencing specific information or data. Output Formatting: Use this formatting for code: language /n<|eot_id|><|start_header_id|>user<|end_header_id|>

Answer the following questions:

  1. The day before two days after the day before tomorrow is Saturday. What day is it today?
  2. What is the square root of 169?
  3. Solve the equation 3y = 6y + 11 and find y.
  4. There are two ducks in front of a duck, two ducks behind a duck, and a duck in the middle. How many ducks are there?
  5. How many days does it take to travel from New York City to London by plane, assuming non-stop flights and average speeds?
  6. What are the products of the chemical reaction between salicylic acid and acetic anhydride?
  7. If five cats can catch five mice in five minutes, how long will it take one cat to catch one mouse?
  8. Create a JS program that prints the first 100 Fibonacci numbers. <|eot_id|><|start_header_id|>assistant<|end_header_id|>
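
A minimal sketch of assembling that prompt format in a script (bash-style; the system/user text and the model path are placeholders, and -e makes llama-cli interpret the \n escapes):

# Sketch: build the Llama 3 Instruct prompt described above and pass it to llama-cli.
SYSTEM="You are Alice, a large language model."
USER="What is the square root of 169?"
PROMPT="<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n${SYSTEM}<|eot_id|>"
PROMPT="${PROMPT}<|start_header_id|>user<|end_header_id|>\n\n${USER}<|eot_id|>"
PROMPT="${PROMPT}<|start_header_id|>assistant<|end_header_id|>\n\n"
./llama-cli --model ../../models/Meta-Llama-3-8B-Instruct_Q4_K_S.gguf \
  --n-gpu-layers 35 -e -p "$PROMPT" -n 128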

@stduhpf
Contributor

stduhpf commented Jun 29, 2024

@dspasyuk If it were an issue with the prompt format, it wouldn't matter whether it was run with -ngl 33 or -ngl 0.

@dspasyuk
Contributor

@stduhpf Perhaps, but in my hands with Llama Instruct, output is erratic if you do not use the proper prompt format: one prompt is fine and the next is not. It becomes more apparent in conversation mode (-cnv).

@giladgd
Contributor

giladgd commented Jun 29, 2024

I also encountered this issue where it generates gibberish, and it only happens with Vulkan.
Using the same code compiled with CUDA works just fine.

Running this command:

./llama-cli --model Meta-Llama-3-8B-Instruct.Q4_K_M.gguf --prompt 'Hi there!' --n-predict 15 --ctx-size 4096 -ngl 33

Generates this output:

Hi there!riereolle301-wahunjar301vangvangvang Ruby Schemejs Colei

Running this command (with no GPU layers offloading):

./llama-cli --model Meta-Llama-3-8B-Instruct.Q4_K_M.gguf --prompt 'Hi there!' --n-predict 15 --ctx-size 4096 -ngl 0

Generates this output:

Hi there! I'm excited to share with you the 5th part of my

I used this model in these examples, and the code is compiled from the latest release (b3265).
Running on Ubuntu 22.04.2 LTS with NVIDIA RTX A6000.

@definitelyuncertain

definitelyuncertain commented Jul 3, 2024

I can confirm this too. I tried many releases before and after the aforementioned commit.

Gibberish output with Mistral 7B Instruct, Meta Llama 3 8B Instruct and several other models when using Vulkan, but CPU works fine.

Arch Linux ALHP repos + AMD RX 6600 XT

Update: Following #7056 I downloaded a very recent Mistral 7B Q4_K_M and it works as expected with the Vulkan backend on GPU. So the issue I was facing might have been entirely due to model format incompatibility in the old models I was using.

@LostRuins
Collaborator

Hello, I can confirm that PR #7947 definitely breaks Vulkan for me, producing incoherent responses when GPU layers are offloaded (it works fine if offload is disabled). Reverting that PR solves the issue.

Tagging @0cc4m to see if they are able to repro it.

I am using a Nvidia RTX 2060 (laptop), and running the model https://huggingface.co/TheBloke/airoboros-mistral2.2-7B-GGUF/blob/main/airoboros-mistral2.2-7b.Q4_K_S.gguf

@LostRuins
Collaborator

@0cc4m additional logs and exact llama.cpp reproduction steps as requested:

  1. Obtain llama-b3153-bin-win-vulkan-x64 (before PR 7947) and llama-b3154-bin-win-vulkan-x64 (after PR 7947)
  2. Obtain the model airoboros-mistral2.2-7b.Q4_K_S.gguf with SHA256 of ea9a7c81...37
  3. Run the CLI llama-cli.exe -m E:\LLaMA\models\airoboros-mistral2.2-7b.Q4_K_S.gguf -c 512 -ngl 20 -n 20 -p "Hello, my name is" and observe output. Repeat for both builds.
  4. Attached are my execution logs for both attempts.

Note: You may wish to repeat the test a few times. The output is not complete rubbish; it is still an English sentence, but it is an incoherent continuation. (A scripted version of this comparison is sketched below.)
3153.txt
3154.txt
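
A scripted version of steps 1-3 might look like this (a sketch, bash-style; adapt paths and the .exe suffix for the Windows builds named in step 1):

# Sketch: run the identical command against both builds and keep the logs.
for BUILD in llama-b3153-bin-win-vulkan-x64 llama-b3154-bin-win-vulkan-x64; do
  "./$BUILD/llama-cli" -m ./models/airoboros-mistral2.2-7b.Q4_K_S.gguf \
    -c 512 -ngl 20 -n 20 -p "Hello, my name is" > "$BUILD.log" 2>&1
done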

@definitelyuncertain

@LostRuins I observed that very recent models worked on my setup, e.g. https://huggingface.co/rubra-ai/Mistral-7B-Instruct-v0.2-GGUF (Q4_K_M), whereas anything I tried from 2023 gave gibberish.

The model you linked is from Oct 2023, so I recommend trying the above or other similarly recent models on your setup to see if it might be a format issue.

@0cc4m
Collaborator

0cc4m commented Jul 14, 2024

@LostRuins Thank you for the report, I can reproduce the problem that way. Some issue with q4_k matrix multiplication. I'll try to find and fix the bug.
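
One way to narrow a suspected matmul bug down (a sketch; it assumes the test-backend-ops tool that builds alongside llama.cpp and its -o/-b filters, which may not match the actual flag names) is to compare the Vulkan backend against the CPU reference for MUL_MAT only:

# Assumed invocation of llama.cpp's backend op tests, filtered to matrix
# multiplication on the Vulkan backend; the flag names are an assumption.
./bin/test-backend-ops test -o MUL_MAT -b Vulkan0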

@0cc4m mentioned this issue Jul 14, 2024
@0cc4m
Collaborator

0cc4m commented Jul 14, 2024

@LostRuins @Adriankhl @definitelyuncertain Can you check whether your issues are resolved with #8479 ?

LostRuins added a commit to LostRuins/koboldcpp that referenced this issue Jul 14, 2024
@LostRuins
Collaborator

@0cc4m this seems to fix the issue for me. Thanks.
