Prerequisites
Please answer the following questions for yourself before submitting an issue.
- I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new bug or useful enhancement to share.
Expected Behavior
I am running several large language models on my small GPU cluster using the latest version of llama.cpp. The cluster consists of multiple NVIDIA RTX 3070 GPUs. Inference on a single GPU, enforced by CUDA_VISIBLE_DEVICES=0, with different flavors of LLMs (Llama, Mistral, Mistral German) works as expected, i.e. the model answers my prompt in the appropriate language (German/English).
CUDA_VISIBLE_DEVICES=0 ./main -ngl 99 -m ../LLM_stack/models/llama-2-7b.Q5_K_M.gguf --color -c 1500 --temp 0.01 -p "Why is the sky blue? Answer for a 5 year old child." -n 100
[...]
Why is the sky blue? Answer for a 5 year old child.
The sky is blue because of the scattering of light by molecules in the atmosphere. The sunlight that reaches us from space has all colors mixed together, but when it passes through our atmosphere, some of its color is scattered away. Blue light scatters more than other colors, so we see a blue sky.
Current Behavior
However, the model simply returns stray characters and hash signs (#) once I run inference on multiple GPUs:
CUDA_VISIBLE_DEVICES=0,1 ./main -ngl 99 -m ../LLM_stack/models/llama-2-7b.Q5_K_M.gguf --color -c 1500 --temp 0.01 -p "Why is the sky blue? Answer for a 5 year old child." -n 100
Why is the sky blue? Answer for a 5 year old child. dispos###################################################################################################
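For what it's worth, a quick way to isolate the failure is to keep both GPUs visible but force all tensors onto a single device with main's --tensor-split/-ts option (a sketch, not verified on this setup; same model path as above):

# Both GPUs visible, but all layers pinned to GPU 0 via -ts 1,0.
# If this prints clean output, the corruption comes from splitting
# work across devices, not from mere device visibility.
CUDA_VISIBLE_DEVICES=0,1 ./main -ngl 99 -ts 1,0 -m ../LLM_stack/models/llama-2-7b.Q5_K_M.gguf --color -c 1500 --temp 0.01 -p "Why is the sky blue? Answer for a 5 year old child." -n 100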
Environment and Context
Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except under certain specific conditions.
- Physical (or virtual) hardware you are using, e.g. for Linux:
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) i5-7600K CPU @ 3.80GHz
CPU family: 6
Model: 158
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Stepping: 9
CPU(s) scaling MHz: 19%
CPU max MHz: 4200.0000
CPU min MHz: 800.0000
BogoMIPS: 7599.80
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
Virtualization: VT-x
L1d cache: 128 KiB (4 instances)
L1i cache: 128 KiB (4 instances)
L2 cache: 1 MiB (4 instances)
L3 cache: 6 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-3
Vulnerability Gather data sampling: Vulnerable: No microcode
Vulnerability Itlb multihit: KVM: Mitigation: VMX disabled
Vulnerability L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT disabled
Vulnerability Mds: Vulnerable: Clear CPU buffers attempted, no microcode; SMT disabled
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT disabled
Vulnerability Retbleed: Mitigation; IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; IBRS, IBPB conditional, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Vulnerable: No microcode
Vulnerability Tsx async abort: Vulnerable: Clear CPU buffers attempted, no microcode; SMT disabled
- Operating System, e.g. for Linux:
$ uname -a
Linux ml 6.1.0-13-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.55-1 (2023-09-29) x86_64 GNU/Linux
- SDK version, e.g. for Linux:
$ python3 --version
Python 3.9.18
$ make --version
GNU Make 4.3
Built for x86_64-pc-linux-gnu
Copyright (C) 1988-2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
$ g++ --version
g++ (Debian 12.2.0-14) 12.2.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
$ nvidia-smi
Wed Oct 25 05:57:04 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.06 Driver Version: 545.23.06 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3070 On | 00000000:02:00.0 Off | N/A |
| 0% 43C P8 22W / 220W | 2MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce RTX 3070 On | 00000000:03:00.0 Off | N/A |
| 0% 45C P8 15W / 220W | 2MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA GeForce RTX 3070 On | 00000000:04:00.0 Off | N/A |
| 0% 40C P8 20W / 220W | 2MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
[...]
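Since the failure only appears when more than one GPU is involved, the interconnect topology between the cards may be relevant as well; nvidia-smi can print it (output omitted here):

# Shows how each GPU pair is connected (PIX, PHB, NODE, SYS, ...),
# which affects peer-to-peer copies during multi-GPU inference.
nvidia-smi topo -m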
Failure Information (for bugs)
The issue seems to be unrelated to the actual model as well as to its size; I'm observing it with Llama models ranging from 7B to 70B parameters.
It also barely depends on the choice of -ngl, as the model produces broken output for any value larger than 0. Changing the context size (-c) or the number of generated tokens (-n), and passing --no-mmap or -nommq, do not resolve the issue either. A sketch of the sweep follows below.
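For reference, a sketch of such a sweep over -ngl (same binary and model path as above; -n reduced to keep runs short):

# Try a range of offload levels; with this setup everything above
# -ngl 0 (pure CPU inference) comes back garbled.
for ngl in 0 1 8 16 32 99; do
  echo "=== -ngl $ngl ==="
  CUDA_VISIBLE_DEVICES=0,1 ./main -ngl $ngl -m ../LLM_stack/models/llama-2-7b.Q5_K_M.gguf --temp 0.01 -p "Why is the sky blue? Answer for a 5 year old child." -n 32
done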
Steps to Reproduce
Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.
- Get code
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
- Build with CUDA support
make LLAMA_CUBLAS=1
- Get model in GGUF format (e.g. https://huggingface.co/TheBloke/Llama-2-7B-GGUF)
- Query model
CUDA_VISIBLE_DEVICES=0,1 ./main -ngl 99 -m ../LLM_stack/models/llama-2-7b.Q5_K_M.gguf --color -c 1500 --temp 0.01 -p "Why is the sky blue? Answer for a 5 year old child." -n 100
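As an additional sanity check that each card works in isolation (a sketch; same binary and model as in the steps above):

# Run the same prompt on each GPU individually; if any single device
# already produces broken output, the problem is not specific to
# splitting work across GPUs.
for dev in 0 1 2; do
  echo "=== GPU $dev ==="
  CUDA_VISIBLE_DEVICES=$dev ./main -ngl 99 -m ../LLM_stack/models/llama-2-7b.Q5_K_M.gguf --temp 0.01 -p "Why is the sky blue? Answer for a 5 year old child." -n 32
done

In my case the GPU 0 run is the clean single-GPU example shown under Expected Behavior.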
Failure Logs
Verbose console output for inference of llama-2 7B: output.log
Make log: make.log