Feature request
Add a cache_limit argument to generate that limits the size of the cache (past_key_values).
Motivation
In some contexts one might want to generate long sequences, and when doing so the system can easily run out of memory. Capping the cache at a maximum size would give users more control and let them tweak other parameters, such as batch size or number of beams, to generate faster and get the most out of their hardware.
Your contribution
I implemented it in GPT2 (PyTorch & TF, PR is ready), but I guess this could be implemented more broadly in generate so that every model could benefit from it.
It might relate to #17574.
I'm waiting for your opinion on this; I can probably add it to generate. A rough sketch of the idea is below.
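
A minimal sketch of what the trimming could look like, assuming the proposed cache_limit argument and a hypothetical helper name (trim_past_key_values is not an existing transformers function), and assuming the legacy cache layout where each layer's key/value tensors have shape (batch, num_heads, seq_len, head_dim):

```python
from typing import Optional, Tuple
import torch


def trim_past_key_values(
    past_key_values: Tuple[Tuple[torch.Tensor, ...], ...],
    cache_limit: Optional[int],
) -> Tuple[Tuple[torch.Tensor, ...], ...]:
    """Keep only the most recent `cache_limit` positions of each layer's cache.

    Assumes key/value tensors of shape (batch, num_heads, seq_len, head_dim),
    as returned by GPT2-style models. If cache_limit is None, the cache is
    returned unchanged.
    """
    if cache_limit is None:
        return past_key_values
    return tuple(
        # Slice along the sequence dimension, dropping the oldest positions.
        tuple(t[:, :, -cache_limit:, :] for t in layer)
        for layer in past_key_values
    )
```

In the generation loop, this would be applied to past_key_values after each forward pass, so that memory usage stays bounded regardless of how long the generated sequence gets.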