Bug: Flash Attention performs worse under ROCM

### What happened?

Turning on flash attention degrades the performance when used under ROCM (at least it does with a 7900 xtx). Using batched bench, the degradation is quite minor at a batchsize of 1.

prompt processing: 461 -> 434
token generation: 24.26 -> 23.84

However, when running multiple batches of requests at the same time, the effect is MUCH more pronounced. Especially with batch sizes of 16 the difference is massive:

prompt processing: 678 -> 375
token generation: 169.65 -> 86.87

Flash Attention is needed to be able to use quantization for the KV-cache, but the performance hit is drastic. Can this be fixed?

### Name and Version

build: 4123 (2eb76b2a) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

### What operating system are you seeing the problem on?

Linux

### Relevant log output

_No response_

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Bug: Flash Attention performs worse under ROCM #10439

What happened?

Name and Version

What operating system are you seeing the problem on?

Relevant log output

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Bug: Flash Attention performs worse under ROCM #10439

Description

What happened?

Name and Version

What operating system are you seeing the problem on?

Relevant log output

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions