
Conversation

@gante
Contributor

@gante gante commented Apr 9, 2025

What does this PR do?

⚠️ this PR needs to be rebased, don't review/merge

Supersedes #37389
Partially solves #35444

This PR finally makes the `max_batch_size` argument of our compilable caches a true maximum: we can now run inputs with a batch size smaller than the one defined in the cache. Compile once and run with multiple input shapes -- particularly useful for export, as mentioned in #35444.

Adds other minor related fixes (see PR comments).


We can see in the following test script that this does not degrade compiled performance:

from transformers import AutoModelForCausalLM, AutoTokenizer, StaticCache
import torch
import time

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B", device_map="auto", torch_dtype=torch.float16)

input_ids = tokenizer(["The quick brown"], return_tensors="pt").input_ids.to(model.device)
cache_position = torch.arange(input_ids.shape[1]).to(model.device)

with torch.no_grad():
    #------------------------------------------------------------------------------------------------
    # OLD, cache batch size = input batch size
    # Measured on an RTX 4090: `main` = 0.223ms; this PR = 0.223ms
    cache = StaticCache(
        config=model.config,
        max_batch_size=1,
        max_cache_len=100,
        device=model.device,
        dtype=model.dtype
    )
    model.forward = torch.compile(model.forward, fullgraph=True, mode="reduce-overhead")

    # warmup
    for _ in range(3):
        outputs = model(input_ids, cache_position=cache_position, past_key_values=cache)

    # measure
    start = time.time()
    for _ in range(100):
        outputs = model(input_ids, cache_position=cache_position, past_key_values=cache)
    end = time.time()
    print(f"[Old] Average time taken: {((end - start) / 100) * 1000} ms")

    #------------------------------------------------------------------------------------------------
    # clear torch compile cache
    torch._dynamo.reset()

    #------------------------------------------------------------------------------------------------
    # NEW, cache batch size > input batch size
    # Measured on an RTX 4090: `main` = Doesn't work; this PR = 0.224ms
    cache = StaticCache(
        config=model.config,
        max_batch_size=16,  # 16 >> 1
        max_cache_len=100,
        device=model.device,
        dtype=model.dtype
    )
    model.forward = torch.compile(model.forward, fullgraph=True, mode="reduce-overhead")

    # warmup
    for _ in range(3):
        outputs = model(input_ids, cache_position=cache_position, past_key_values=cache)

    # measure
    start = time.time()
    for _ in range(100):
        outputs = model(input_ids, cache_position=cache_position, past_key_values=cache)
    end = time.time()
    print(f"[New] Average time taken: {((end - start) / 100) * 1000} ms")

@github-actions
Contributor

github-actions bot commented Apr 9, 2025

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the Ready for review button (at the bottom of the PR page). This will assign reviewers and trigger CI.

@github-actions github-actions bot marked this pull request as draft April 9, 2025 16:35
@gante gante marked this pull request as ready for review April 9, 2025 16:35
Cache for mamba model which does not have attention mechanism and key value states.
Cache for mamba model which does not have attention mechanism and key value states. At initialization, the cache
is preallocated to its maximum possible shape. Unlike other caches, `max_batch_size` must match the
batch size used at inference time.
Contributor Author

@gante gante Apr 9, 2025


Note: adding the feature to MambaCache would require model-level changes, and it wouldn't work with the fast kernels.
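
For readers less familiar with the cache internals, a rough sketch of the general idea (not this PR's actual diff): a buffer pre-allocated for `max_batch_size` sequences can serve a smaller input batch by slicing along the batch dimension when writing and reading, whereas MambaCache's conv/ssm states are updated in place by the fast kernels mentioned above, so slicing would not be transparent. All shapes below are made up for illustration.

import torch

# Rough sketch only -- not the PR's implementation.
max_batch_size, num_heads, max_cache_len, head_dim = 16, 8, 100, 64
key_cache = torch.zeros(max_batch_size, num_heads, max_cache_len, head_dim)

def update(key_states: torch.Tensor, cache_position: torch.Tensor) -> torch.Tensor:
    # key_states: [input_batch, num_heads, seq_len, head_dim], with input_batch <= max_batch_size
    input_batch = key_states.shape[0]
    key_cache[:input_batch, :, cache_position] = key_states  # write into a batch slice
    return key_cache[:input_batch]                           # read back only that slice

new_keys = torch.randn(1, num_heads, 3, head_dim)  # input batch of 1, smaller than 16
out = update(new_keys, cache_position=torch.arange(3))
print(out.shape)  # torch.Size([1, 8, 100, 64])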

self._cache.reset()
return self._cache

def _supports_default_dynamic_cache(self) -> bool:
Contributor Author

@gante gante Apr 9, 2025


The rest of the diff in this file is to make mamba + compile work again (MambaIntegrationTests::test_compile_mamba_cache was red).

In general, models with unique caches are messy to use with generate and need some work. A model should be able to tell generate "hey, I can only use this cache class".
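
Purely as an illustration of that last point (nothing like this exists in the PR or in transformers today; the attribute name `_required_cache_class` and the toy classes are invented for the example): a model could advertise the only cache class it supports, and generation code could consult that instead of special-casing model types.

# Hypothetical sketch only -- not real transformers API.
class ToyDynamicCache:
    pass

class ToyMambaCache:
    pass

class ToyMambaModel:
    # the model tells generation code "I can only use this cache class"
    _required_cache_class = ToyMambaCache

def pick_cache_class(model):
    return getattr(model, "_required_cache_class", ToyDynamicCache)

print(pick_cache_class(ToyMambaModel()).__name__)  # ToyMambaCache
print(pick_cache_class(object()).__name__)         # ToyDynamicCache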



@require_torch_accelerator
@slow
Contributor Author


Not all tests here are @slow anymore; only the tests that take >1s kept the decorator.

("sdpa", "static"),
]
)
def test_static_cache_greedy_decoding_pad_right(self, attn_implementation, cache_implementation):
Contributor Author


Generation is not meant to work well with right-padding, so there is no need to spend resources testing it.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Member

@zucchini-nlp zucchini-nlp left a comment


Great, thanks for adding support for this, and happy to see it can be done with minimal changes!

"layer_device_map": layer_device_map,
}
cache_signature = inspect.signature(cache_cls.__init__)
cache_kwargs = {k: v for k, v in all_possible_cache_kwargs.items() if k in cache_signature.parameters}
Member


We didn't change the signature -- why is this needed?

Contributor Author


MambaCache + mamba was broken ☠️ This is needed to fix it
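
To make the motivation concrete, here is a small self-contained illustration of the filtering pattern in the snippet above (the two toy cache classes are invented for the example; the point is that a cache class with a narrower __init__ signature no longer receives kwargs it doesn't accept):

import inspect

class WideCache:
    def __init__(self, max_batch_size, max_cache_len, layer_device_map=None):
        pass

class NarrowCache:
    # e.g. a cache whose __init__ does not accept layer_device_map
    def __init__(self, max_batch_size, max_cache_len):
        pass

all_possible_cache_kwargs = {"max_batch_size": 1, "max_cache_len": 100, "layer_device_map": None}

for cache_cls in (WideCache, NarrowCache):
    cache_signature = inspect.signature(cache_cls.__init__)
    cache_kwargs = {k: v for k, v in all_possible_cache_kwargs.items() if k in cache_signature.parameters}
    cache_cls(**cache_kwargs)  # without the filtering, NarrowCache(**all_possible_cache_kwargs) raises TypeError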

@gante gante force-pushed the cache_support_smaller_bs branch from cc635a6 to 883ac39 on April 17, 2025 13:15
@gante
Contributor Author

gante commented Apr 22, 2025

(PR on hold: some slow cache tests are failing for reasons unrelated to this PR; fixing them first before re-requesting a review)

@gante
Contributor Author

gante commented Aug 12, 2025

(caches have been refactored, better start from scratch)
