Closed
Labels: performance (Speed related topics)
Description
Prerequisites
Please answer the following questions for yourself before submitting an issue.
- I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new bug or useful enhancement to share.
Expected Behavior
There is a performance regression in context (prompt) processing introduced in commit 2b4ea35.
This specifically affects Pascal (compute capability 6.1) cards, which have 1/64th-rate fp16 performance. The problem gets worse with longer context, reaching up to 6x slower by 8k context.
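For reference, a llama-bench invocation along these lines (model path hypothetical) matches the parameters shown in the tables:

```shell
# Sketch of the benchmark run behind the tables (model path is hypothetical).
# -ngl 99 offloads all layers to the GPU, -t 1 uses one CPU thread,
# -p 512 / -n 128 correspond to the "pp 512" and "tg 128" tests.
./llama-bench -m models/llama-13b-q8_0.gguf -ngl 99 -t 1 -p 512 -n 128
```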
Device 0: Tesla P40, compute capability 6.1
| model | size | params | backend | ngl | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ---------- | ---------------: |
| llama 13B mostly Q8_0 | 12.88 GiB | 13.02 B | CUDA | 99 | 1 | pp 512 | 485.03 ± 0.34 |
| llama 13B mostly Q8_0 | 12.88 GiB | 13.02 B | CUDA | 99 | 1 | tg 128 | 18.30 ± 0.00 |
build: daab3d7 (1421)
Current Behavior
ggml_init_cublas: found 1 CUDA devices:
Device 0: Tesla P40, compute capability 6.1
| model | size | params | backend | ngl | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ---------- | ---------------: |
| llama 13B mostly Q8_0 | 12.88 GiB | 13.02 B | CUDA | 99 | 1 | pp 512 | 207.34 ± 0.28 |
| llama 13B mostly Q8_0 | 12.88 GiB | 13.02 B | CUDA | 99 | 1 | tg 128 | 18.28 ± 0.01 |
build: 2b4ea35 (1422)
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: yes
ggml_init_cublas: CUDA_USE_TENSOR_CORES: no
ggml_init_cublas: found 1 CUDA devices:
Device 0: Tesla P40, compute capability 6.1
| model | size | params | backend | ngl | threads | main_gpu | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ---------: | ---------- | ---------------: |
warning: cannot set main_device=1 because there are only 1 devices. Using device 0 instead.
| llama 13B mostly Q8_0 | 12.88 GiB | 13.02 B | CUDA | 99 | 1 | 1 | pp 512 | 208.54 ± 0.58 |
| llama 13B mostly Q8_0 | 12.88 GiB | 13.02 B | CUDA | 99 | 1 | 1 | tg 128 | 18.29 ± 0.00 |
build: 207b519 (1446)
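The third run forces the quantized mat-mul (MMQ) kernels back on instead of the cuBLAS fp16 path. A build sketch, assuming the `LLAMA_CUBLAS` / `LLAMA_CUDA_FORCE_MMQ` CMake options available in builds of this vintage (verify the flag names against the current CMakeLists):

```shell
# Sketch: rebuild with the MMQ kernels forced on (flag names as of ~b1446).
cmake -B build -DLLAMA_CUBLAS=ON -DLLAMA_CUDA_FORCE_MMQ=ON
cmake --build build --config Release
# Note: as the run above shows (208.54 vs. 207.34 t/s), forcing MMQ alone
# did not restore the pre-2b4ea35 prompt-processing speed on this P40.
```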
Environment and Context
Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.
- Physical hardware: Ryzen 7 5800X + 64GB DDR4-3733
- GPUs: RTX 3060 Ti (8GB) + Tesla P40 (24GB)
- Operating System: Windows 11
- SDK version: MSVC 2022
- $ python3 --version: 3.10.11
- $ cmake --version: 3.27.4