[whisper] static kv cache #31166

sanchit-gandhi · 2024-05-31T14:15:36Z

What does this PR do?

Supersedes #28931 and extends it by adding static k/v cache support for Whisper. Also improves the performance of the eager attention implementation by removing un-necessary reshapes (inspired by LlamaAttention).

Similar to #28931, we use a separate cache for the self-attention and cross-attention layers. We define a lightweight EncoderDecoderCache wrapper that holds these two cache classes and implements common base methods (e.g. to_legacy_cache()) by calling the corresponding methods for each cache class.

However, there is one hurdle in enabling compatibility with torch.compile. Namely, we have to determine whether we're in the first decoding step, or second step onwards:

In the first decoding step, we compute the cross-attention k/v states and update the cache accordingly
In the second step onwards, we re-use the k/v states directly from the cache. There’s no further update to the cross-attention cache, since the k/v states are derived entirely from the encoder hidden-states (which stay fixed)

=> the difficulty is in detecting whether we’re in the first decoding step (1), or second step onwards (2). With eager mode, we can condition on past_key_values.get_seq_length() to determine the decoding step. However, for torch.compile this introduces a graph break. Consequently, we add a boolean flag is_updated to the StaticCache class, which informs us whether the cache has been updated or not. The alternative would be to employ the same logic we do in the Flax code, where we re-compute the cross-attention k/v states each time. Benchmarks show this approach is 1.4x slower than adding the CPU flag.

Using the .generate API with Whisper medium, we get approximately 5x speed-up when generating 64 tokens using sdpa attention. Note here that we compile the forward pass only:

bsz	dynamic tok/s	compiled tok/s	Speed-up
1	55.6	270.7	4.9
2	111.4	541.3	4.9
4	222.3	1078.8	4.9
8	446.3	2167.4	4.9

Extended results:

Whisper large-v3

bsz	dynamic tok/s	compiled tok/s	Speed-up
1	41.1	190.4	4.6
2	82.1	381.2	4.6
4	162.9	761.2	4.7
8	331.3	1522.5	4.6

Distil-Whisper distil-large-v3

bsz	dynamic tok/s	compiled tok/s	Speed-up
1	278.7	449.1	1.6
2	560.5	900.3	1.6
4	1113.2	1798.7	1.6
8	2225.0	3592.8	1.6

As expected, the speed-ups for Distil-Whisper are less pronounced:

With only 2 decoder layers, the decoder forward pass is already >6x faster than Whisper, and we have a very small decoder graph that can be compiled
The overhead from the logits post-processing now occupies a greater proportion of the generation time. Compiling the logits processors is a good next step for speeding-up generation further.

Code example:

from datasets import load_dataset
from transformers import WhisperForConditionalGeneration, AutoProcessor
import torch
import logging
import time

torch._logging.set_logs(graph_breaks=True, recompiles=True)

torch_device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

processor = AutoProcessor.from_pretrained("openai/whisper-tiny.en")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en", attn_implementation="sdpa")
model.to(torch_device, dtype=torch_dtype)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]
inputs = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").to(torch_device)
input_features = inputs.input_features.to(torch_dtype)

model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)
model.generation_config.cache_implementation = "static"

# compile
for i in range(2):
    model.generate(input_features)

# inference
pred_ids = model.generate(input_features)

In refactoring the eager attention implementation for the cache abstraction, I managed to remove a lot of wasteful .view operations, generally aligning it with LLaMA and giving a performance boost even without compile (TODO: quantify speed-up).

The only regression comes when using FA2 and compile, where we have to introduce a bunch of new .transpose operations for compatibility with the shape of our k/v cache (TODO: quantify regression). This is also a known problem in LLaMA.

There are a few tidy-up points left TODO. Once we're happy with the design, I'll complete the PR with the final checklist items:

Fix failing fast tests
Tidy docstrings for new arguments (past_key_values, cache_position)
Update model doc with FA2 usage
Run all Whisper slow tests
Run all ASR pipeline slow tests
Check gradients propagate correctly when training with output_attentions=True

HuggingFaceDocBuilderDev · 2024-05-31T14:39:34Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

ArthurZucker

Very nice overall.cc @zhenglongjiepheonix I reviewed this one instead of #30949 because it had less changes, sorry that work got duplicated here!

src/transformers/models/whisper/modeling_whisper.py

zhenglongjiepheonix · 2024-06-03T15:44:15Z

You can reference my PR #30949 for tests failing part, it passes all the tests that the current main branch passes and will save you a lot of time debugging @sanchit-gandhi

Co-authored-by: Arthur Zucker <[email protected]>

Co-authored-by: Arthur <[email protected]>

…static-kv

gante

Happy with the PR 🔥🔥 Let's goooo

ArthurZucker

Would maybe just run the slow tests?

src/transformers/models/whisper/modeling_whisper.py

ArthurZucker · 2024-06-28T16:17:15Z

src/transformers/models/whisper/modeling_whisper.py

+                logger.warning_once(
+                    "Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.43.0. "
+                    "You should pass an instance of `EncoderDecoderCache` instead, e.g. "
+                    "`past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`."
+                )


sanchit-gandhi · 2024-07-02T12:24:03Z

Thanks both for the reviews! Confirming that the slow tests pass on the DGX A100.

Going to merge this one to enable static kv cache for:

Short-form generation
Long-form generation without fallback (i.e. sequential generation without temperature fallback)

We'll need a follow-up PR to enable:

Long-form generation with fallback: remember that we dynamically reduce the batch size when we do temperature fallback. We'll need to change this to fixed batch sizes for compile
Long-form chunked generation with pipeline: again, the batch size is set dynamically in the pipeline class, depending on the length of the inputs

fxmarty · 2024-07-02T13:20:50Z

src/transformers/models/whisper/modeling_whisper.py

-            and past_key_value is not None
-            and past_key_value[0].shape[2] == key_value_states.shape[1]
-        ):
+        query_states = self._shape(self.q_proj(hidden_states), tgt_len, bsz)


_shape and _reshape are not the same op, is it fine to replace?

We do the transpose later to get it into the original format: https://github.com/huggingface/transformers/pull/31166/files#r1652420859

But this is a good point - we don't need to _shape then .transpose the q-states, we can directly get them into the correct format

SaeedNajafi · 2024-07-02T13:31:22Z

Hi, I am getting some cache errors while doing generation with llama3 and fsdp.
I am using flash_attention_2, and the use_cache=True in the generate function.
Latest transformer from the repo including your recent PR.

[rank1]: Traceback (most recent call last):
[rank1]:   File "/fs01/home/snajafi/codes/llm-research/src/squadv2_finetuning.py", line 129, in <module>
[rank1]:     app.run(main)
[rank1]:   File "/fs01/home/snajafi/codes/llm-research/llm-env/lib/python3.10/site-packages/absl/app.py", line 308, in run
[rank1]:     _run_main(main, args)
[rank1]:   File "/fs01/home/snajafi/codes/llm-research/llm-env/lib/python3.10/site-packages/absl/app.py", line 254, in _run_main
[rank1]:     sys.exit(main(argv))
[rank1]:   File "/fs01/home/snajafi/codes/llm-research/src/squadv2_finetuning.py", line 91, in main
[rank1]:     results = train(
[rank1]:   File "/fs01/home/snajafi/codes/llm-research/src/utils/train_utils.py", line 108, in train
[rank1]:     eval_ppl, eval_epoch_loss, temp_val_loss, temp_step_perplexity, eval_scores = evaluation(
[rank1]:   File "/fs01/home/snajafi/codes/llm-research/src/utils/train_utils.py", line 405, in evaluation
[rank1]:     for ret_row, ret_loss in model.predict(batch):
[rank1]:   File "/fs01/home/snajafi/codes/llm-research/src/llm.py", line 245, in predict
[rank1]:     answers, log_ps = self.generation_pass(batch)
[rank1]:   File "/fs01/home/snajafi/codes/llm-research/src/llm.py", line 216, in generation_pass
[rank1]:     predictions_output = self.model.generate(
[rank1]:   File "/fs01/home/snajafi/codes/llm-research/llm-env/lib/python3.10/site-packages/peft/peft_model.py", line 1491, in generate
[rank1]:     outputs = self.base_model.generate(*args, **kwargs)
[rank1]:   File "/fs01/home/snajafi/codes/llm-research/llm-env/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank1]:     return func(*args, **kwargs)
[rank1]:   File "/fs01/home/snajafi/codes/llm-research/transformers/src/transformers/generation/utils.py", line 1945, in generate
[rank1]:     result = self._sample(
[rank1]:   File "/fs01/home/snajafi/codes/llm-research/transformers/src/transformers/generation/utils.py", line 2693, in _sample
[rank1]:     outputs = self(
[rank1]:   File "/fs01/home/snajafi/codes/llm-research/llm-env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/fs01/home/snajafi/codes/llm-research/llm-env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/fs01/home/snajafi/codes/llm-research/transformers/src/transformers/models/llama/modeling_llama.py", line 1174, in forward
[rank1]:     outputs = self.model(
[rank1]:   File "/fs01/home/snajafi/codes/llm-research/llm-env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/fs01/home/snajafi/codes/llm-research/llm-env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/fs01/home/snajafi/codes/llm-research/transformers/src/transformers/models/llama/modeling_llama.py", line 978, in forward
[rank1]:     layer_outputs = decoder_layer(
[rank1]:   File "/fs01/home/snajafi/codes/llm-research/llm-env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/fs01/home/snajafi/codes/llm-research/llm-env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/fs01/home/snajafi/codes/llm-research/llm-env/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 857, in forward
[rank1]:     output = self._fsdp_wrapped_module(*args, **kwargs)
[rank1]:   File "/fs01/home/snajafi/codes/llm-research/llm-env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/fs01/home/snajafi/codes/llm-research/llm-env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/fs01/home/snajafi/codes/llm-research/llm-env/lib/python3.10/site-packages/torch/distributed/algorithms/_checkpoint/checkpoint_wrapper.py", line 168, in forward
[rank1]:     return self.checkpoint_fn(  # type: ignore[misc]
[rank1]:   File "/fs01/home/snajafi/codes/llm-research/llm-env/lib/python3.10/site-packages/torch/_compile.py", line 24, in inner
[rank1]:     return torch._dynamo.disable(fn, recursive)(*args, **kwargs)
[rank1]:   File "/fs01/home/snajafi/codes/llm-research/llm-env/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 451, in _fn
[rank1]:     return fn(*args, **kwargs)
[rank1]:   File "/fs01/home/snajafi/codes/llm-research/llm-env/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 36, in inner
[rank1]:     return fn(*args, **kwargs)
[rank1]:   File "/fs01/home/snajafi/codes/llm-research/llm-env/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 494, in checkpoint
[rank1]:     ret = function(*args, **kwargs)
[rank1]:   File "/fs01/home/snajafi/codes/llm-research/llm-env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/fs01/home/snajafi/codes/llm-research/llm-env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/fs01/home/snajafi/codes/llm-research/transformers/src/transformers/models/llama/modeling_llama.py", line 718, in forward
[rank1]:     hidden_states, self_attn_weights, present_key_value = self.self_attn(
[rank1]:   File "/fs01/home/snajafi/codes/llm-research/llm-env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/fs01/home/snajafi/codes/llm-research/llm-env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/fs01/home/snajafi/codes/llm-research/transformers/src/transformers/models/llama/modeling_llama.py", line 431, in forward
[rank1]:     key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
[rank1]:   File "/fs01/home/snajafi/codes/llm-research/transformers/src/transformers/cache_utils.py", line 366, in update
[rank1]:     return self.key_cache[layer_idx], self.value_cache[layer_idx]
[rank1]: IndexError: list index out of range

sanchit-gandhi · 2024-07-02T13:50:48Z

Hey @SaeedNajafi - do you have a minimal reproducer you could use to open a new issue on the repo? Thanks!

ArthurZucker · 2024-07-22T13:10:51Z

The pipeline needs more work, specifically for longer audios + the merging solution.
Your controbution is welcome, especially for 1) if you have a wroking snippet feel free to add it to the doc

Jiltseb · 2024-07-22T13:27:33Z

The pipeline needs more work, specifically for longer audios + the merging solution. Your controbution is welcome, especially for 1) if you have a wroking snippet feel free to add it to the doc

Thanks. I deleted the comment once I saw the PR already in progress #31772 for this exact thing. I think it's better to wait for the merge.

Support the `cache_position` input that was added to Hugging Face Whisper models as part of a revision of how it handles KV-caching. This is like `position_ids`, but there is no batch dimension. See huggingface/optimum#1971 and huggingface/transformers#31166.

kadirnar · 2025-08-03T15:44:38Z

@ArthurZucker @sanchit-gandhi

It works with SDPA optimization. However, flash_attention_3 did not work with flash_attention_2 and kernels.

code:

from transformers import WhisperForConditionalGeneration, AutoProcessor
import torch
import logging
import time
import librosa

audio_path = "test.mp3"
torch._logging.set_logs(graph_breaks=True, recompiles=True)

torch_device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

processor = AutoProcessor.from_pretrained("distil-whisper/distil-large-v3.5")
model = WhisperForConditionalGeneration.from_pretrained("distil-whisper/distil-large-v3.5", attn_implementation="kernels-community/flash-attn3")
model.to(torch_device, dtype=torch_dtype)

audio_array, sampling_rate = librosa.load(audio_path, sr=16000)
inputs = processor(audio_array, sampling_rate=sampling_rate, return_tensors="pt").to(torch_device)
input_features = inputs.input_features.to(torch_dtype)

model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)
model.generation_config.cache_implementation = "static"

# compile
for i in range(2):
    model.generate(input_features)

# inference
pred_ids = model.generate(input_features)

Error Messages:

  File "/mnt/whisper-plus/.venv/lib/python3.10/site-packages/transformers/integrations/flash_attention.py", line 64, in flash_attention_forward
    attn_output = _flash_attention_forward(
  File "/mnt/whisper-plus/.venv/lib/python3.10/site-packages/transformers/modeling_flash_attention_utils.py", line 363, in _flash_attention_forward
    if not all(k in globals() for k in ("_flash_fn", "_flash_varlen_fn", "_pad_fn", "_unpad_fn", "_is_fa3")):

Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"

sanchit-gandhi added 2 commits May 31, 2024 15:05

make work with cache abstraction

738ed90

correct for static cache

624fa74

sanchit-gandhi added 13 commits May 31, 2024 15:56

hacks for compile

f2124f8

make fast

9f02f7d

fix

2d7102e

fix pos ids

cd9ce9b

generate

abad0b9

fix sdpa

248be4d

fix sdpa cache pos

9ba0da9

fix fa2

4ea437a

clean fa2

92f94f8

integrate cache into generate

7ea0d16

make style

b4478c1

copies

b6cb739

more copies

57a219b

ArthurZucker reviewed Jun 3, 2024

View reviewed changes

sanchit-gandhi added 12 commits June 5, 2024 14:50

update eager

2d91708

update sdpa

11e79a9

update fa2

27d520b

simplify

f72224d

use cache pos

fcf024a

always compute cross-cache for debug

3f48947

avoid recompiles

7a5a5eb

Co-authored-by: Arthur Zucker <[email protected]>

fix fix

2eba447

fix fix fix

0bb8cb6

more fix

bfac769

try encoder-decoder cache (too messy)

93c97c1

revert encoder-decoder cache

05f12a3

sanchit-gandhi and others added 7 commits June 26, 2024 17:10

Update src/transformers/cache_utils.py

2d4a2a8

Co-authored-by: Arthur <[email protected]>

deprecate

1860c31

Merge remote-tracking branch 'origin/whisper-static-kv' into whisper-…

d8e738f

…static-kv

updates

24183cb

final updates

2bad47c

Merge branch 'main' into whisper-static-kv

89823f3

style

f0f8130

gante approved these changes Jun 27, 2024

View reviewed changes

ArthurZucker approved these changes Jun 28, 2024

View reviewed changes

sanchit-gandhi and others added 2 commits July 2, 2024 11:18

Merge branch 'main' into whisper-static-kv

d8e8d64

style

e25c8e1

sanchit-gandhi merged commit a970195 into huggingface:main Jul 2, 2024

fxmarty reviewed Jul 2, 2024

View reviewed changes

This was referenced Jul 2, 2024

[modelling] remove un-necessary transpose for fa2 attention #31749

Merged

[whisper] compile compatibility with long-form decoding #31772

Merged

This was referenced Jul 15, 2024

Whisper Static Cache #30760

Closed

tracker: generate compatibility with torch.compile #28981

Closed

kadirnar mentioned this pull request Jul 15, 2024

Whisper + Torch.Compile: torch._dynamo.exc.Unsupported: reconstruct: UserDefinedObjectVariable(EncoderDecoderCache) #31987

Closed

4 tasks

eustlb mentioned this pull request Jul 22, 2024

Add static cache huggingface/parler-tts#89

Merged

This was referenced Sep 6, 2024

Add static cache support for Whisper #30707

Closed

What's going on with T5 x torch.compile ? #33221

Closed

robertknight mentioned this pull request Oct 27, 2024

Support cache_position inputs in Hugging Face models robertknight/rten#395

Merged

[whisper] static kv cache #31166

[whisper] static kv cache #31166

Uh oh!

Conversation

sanchit-gandhi commented May 31, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Uh oh!

HuggingFaceDocBuilderDev commented May 31, 2024

Uh oh!

ArthurZucker left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

zhenglongjiepheonix commented Jun 3, 2024 • edited by sanchit-gandhi Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gante left a comment

Choose a reason for hiding this comment

Uh oh!

ArthurZucker left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ArthurZucker Jun 28, 2024

Choose a reason for hiding this comment

Uh oh!

sanchit-gandhi commented Jul 2, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

fxmarty Jul 2, 2024

Choose a reason for hiding this comment

Uh oh!

sanchit-gandhi Jul 2, 2024

Choose a reason for hiding this comment

Uh oh!

SaeedNajafi commented Jul 2, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sanchit-gandhi commented Jul 2, 2024

Uh oh!

ArthurZucker commented Jul 22, 2024

Uh oh!

Jiltseb commented Jul 22, 2024

Uh oh!

kadirnar commented Aug 3, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

10 participants

sanchit-gandhi commented May 31, 2024 •

edited

Loading

zhenglongjiepheonix commented Jun 3, 2024 •

edited by sanchit-gandhi

Loading

sanchit-gandhi commented Jul 2, 2024 •

edited

Loading

SaeedNajafi commented Jul 2, 2024 •

edited

Loading