
LlavaProcessor replaces <image> with 576 <image> tokens. Is this normal? #34934

@npnkhoi

Description


System Info

  • transformers version: 4.46.3
  • Platform: Linux-5.8.0-36-generic-x86_64-with-glibc2.31
  • Python version: 3.11.4
  • Huggingface_hub version: 0.24.6
  • Safetensors version: 0.4.5
  • Accelerate version: 0.34.2
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.4.1+cu118 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: no
  • Using GPU in script?: yes
  • GPU type: Quadro RTX 6000

Who can help?

@ArthurZucker @itazap @amyeroberts @qubvel

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import LlavaForConditionalGeneration, AutoProcessor
from PIL import Image
import requests

PROMPT = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\n What is the content of the image? ASSISTANT:"
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg"
image = Image.open(requests.get(url, stream=True).raw)

batch = processor(text=[PROMPT], images=[image], padding=True, truncation=True, return_tensors="pt")
decoded_input = processor.decode(batch['input_ids'][0])
print(decoded_input)

This code decodes the input such that the single <image> token in the prompt is expanded into 576 <image> tokens. This seems strange to me for two reasons (see the sketch after this list):

  • At some point in October 2024, I was running this same kind of code with the then-current version of transformers, and according to my logs the <image> token was not expanded at all.
  • When I compare my current inference results against my old runs (where <image> was not expanded), the output is now quite different. About half the time the generated text matches what I got before, but other times it is just a series of numbers, or is different and noticeably worse.
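
For context, 576 matches the vision tower's patch grid for llava-1.5 (a 336-pixel CLIP backbone with 14-pixel patches gives a 24x24 grid). Below is a rough sketch of how I would verify that, reusing the processor, model, and batch objects from the reproduction above; the vision-config attribute names (image_size, patch_size) are my assumption about this checkpoint's CLIP config.

# Sketch: compare the number of <image> placeholders produced by the processor
# with the vision tower's patch count. Attribute names are assumed from the
# CLIP vision config bundled with llava-hf/llava-1.5-7b-hf.
image_token_id = processor.tokenizer.convert_tokens_to_ids("<image>")
num_image_tokens = (batch["input_ids"][0] == image_token_id).sum().item()

vision_cfg = model.config.vision_config
patches_per_side = vision_cfg.image_size // vision_cfg.patch_size  # 336 // 14 = 24
print(num_image_tokens, patches_per_side ** 2)  # expected to both be 576 if the processor expands the token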

I guessed my previous transformers version was either the commit at git+https://github.com/huggingface/transformers@454a0f2efdf9f0d94b77ef08efabbdc6418c868d or 4.46.1. With the first, AutoProcessor cannot load llava-hf/llava-1.5-7b-hf; with 4.46.1, I see the same behavior as in my current environment.

Expected behavior

Is it intended that <image> is replaced with 576 identical tokens? Is there a way to keep it unexpanded, as I saw in some earlier version? If someone recognizes this change and can point me to how to restore the previous behavior (a single <image> token), that would be greatly appreciated.
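
For what it's worth, this is the end-to-end generation run I use to sanity-check the expanded input, reusing the objects from the reproduction above; it is only a sketch of my check, not a suggested fix.

import torch

# Sketch of an end-to-end sanity check: regardless of whether the decoded
# prompt shows one <image> or 576, generation should produce a coherent answer.
with torch.no_grad():
    output_ids = model.generate(**batch, max_new_tokens=50)
print(processor.decode(output_ids[0], skip_special_tokens=True))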

Thanks!
