CTX Processing regression for Pascal - Commit 2b4ea35 #3869

@askmyteapot

Description

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

There is a regression in context processing introduced in commit 2b4ea35.

It specifically affects Pascal (compute capability 6.1), which has 1/64th-rate FP16 performance. The problem gets worse with longer context, reaching up to 6x slower by 8k context.

  Device 0: Tesla P40, compute capability 6.1
| model                          |       size |     params | backend    | ngl |    threads | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ---------- | ---------------: |
| llama 13B mostly Q8_0          |  12.88 GiB |    13.02 B | CUDA       |  99 |          1 | pp 512     |    485.03 ± 0.34 |
| llama 13B mostly Q8_0          |  12.88 GiB |    13.02 B | CUDA       |  99 |          1 | tg 128     |     18.30 ± 0.00 |

build: daab3d7 (1421)
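The numbers above and below come from llama-bench; a minimal sketch of the invocation that would produce these columns, assuming the flag names of this build era (the model path and device selection are assumptions, not from the report):

```shell
# Sketch: reproduce the pp 512 / tg 128 benchmark on the Tesla P40.
# Model path and CUDA_VISIBLE_DEVICES value are hypothetical; the flags
# mirror the table columns (ngl = 99, threads = 1, pp 512, tg 128).
CUDA_VISIBLE_DEVICES=1 ./llama-bench \
  -m models/llama-13b-q8_0.gguf \
  -ngl 99 -t 1 -p 512 -n 128
```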

Current Behavior

ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla P40, compute capability 6.1
| model                          |       size |     params | backend    | ngl |    threads | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ---------- | ---------------: |
| llama 13B mostly Q8_0          |  12.88 GiB |    13.02 B | CUDA       |  99 |          1 | pp 512     |    207.34 ± 0.28 |
| llama 13B mostly Q8_0          |  12.88 GiB |    13.02 B | CUDA       |  99 |          1 | tg 128     |     18.28 ± 0.01 |

build: 2b4ea35 (1422)

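For reference, comparing the two pp 512 figures above puts the slowdown at 512 tokens of context at roughly 2.3x (the ~6x figure applies at 8k context). A quick check:

```python
# Ratio of pre-regression to post-regression prompt-processing speed,
# using the pp 512 t/s values from the two tables above.
before = 485.03  # build daab3d7 (1421)
after = 207.34   # build 2b4ea35 (1422)
slowdown = before / after
print(f"pp 512 slowdown: {slowdown:.2f}x")
```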
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   yes
ggml_init_cublas: CUDA_USE_TENSOR_CORES: no
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla P40, compute capability 6.1
| model                          |       size |     params | backend    | ngl |    threads |   main_gpu | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ---------: | ---------- | ---------------: |
warning: cannot set main_device=1 because there are only 1 devices. Using device 0 instead.
| llama 13B mostly Q8_0          |  12.88 GiB |    13.02 B | CUDA       |  99 |          1 |          1 | pp 512     |    208.54 ± 0.58 |
| llama 13B mostly Q8_0          |  12.88 GiB |    13.02 B | CUDA       |  99 |          1 |          1 | tg 128     |     18.29 ± 0.00 |

build: 207b519 (1446)
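The third run above has MMQ forced on (`GGML_CUDA_FORCE_MMQ: yes`). A sketch of the build step that would produce that configuration, assuming `LLAMA_CUDA_FORCE_MMQ` is the CMake option mapping to the `GGML_CUDA_FORCE_MMQ` define in builds of this era (verify against the CMakeLists of the actual commit):

```shell
# Sketch: build with cuBLAS enabled and MMQ forced, as in the third run.
# LLAMA_CUDA_FORCE_MMQ is an assumed option name, not taken from the report.
cmake -B build -DLLAMA_CUBLAS=ON -DLLAMA_CUDA_FORCE_MMQ=ON
cmake --build build --config Release
```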

Environment and Context

Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.

  • 5800X + 64GB DDR 3733

  • 3060ti (8GB) + TESLA P40 (24GB)

  • Operating System: Windows 11

  • SDK version: MSVC 2022

$ python3 --version   # 3.10.11
$ cmake --version     # 3.27.4
