
LlavaProcessor replaces <image> with 576 <image> tokens. Is this normal? #34934

@npnkhoi

Description


System Info

  • transformers version: 4.46.3
  • Platform: Linux-5.8.0-36-generic-x86_64-with-glibc2.31
  • Python version: 3.11.4
  • Huggingface_hub version: 0.24.6
  • Safetensors version: 0.4.5
  • Accelerate version: 0.34.2
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.4.1+cu118 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: no
  • Using GPU in script?: yes
  • GPU type: Quadro RTX 6000

Who can help?

@ArthurZucker @itazap @amyeroberts @qubvel

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import LlavaForConditionalGeneration, AutoProcessor
from PIL import Image
import requests

PROMPT = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\n What is the content of the image? ASSISTANT:"
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg"
image = Image.open(requests.get(url, stream=True).raw)

batch = processor(text=[PROMPT], images=[image], padding=True, truncation=True, return_tensors="pt")
decoded_input = processor.decode(batch['input_ids'][0])
print(decoded_input)

This code decodes the input such that the single <image> token in the prompt is expanded into 576 <image> tokens. This seems strange to me for two reasons (see the sketch after this list):

  • At some point in October 2024, I was running this same kind of code with the then-current version of transformers, and according to my logs the <image> token was not expanded at all.
  • When I compare my current inference results against my old runs (where <image> was not expanded), the output is now quite different. About half the time the generated text matches what I got before, but other times it is just a series of numbers, or is different and noticeably worse.
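
For context, 576 matches the vision tower's patch grid for llava-1.5 (a 336-pixel CLIP backbone with 14-pixel patches gives a 24x24 grid). Below is a rough sketch of how I would verify that, reusing the processor, model, and batch objects from the reproduction above; the vision-config attribute names (image_size, patch_size) are my assumption about this checkpoint's CLIP config.

# Sketch: compare the number of <image> placeholders produced by the processor
# with the vision tower's patch count. Attribute names are assumed from the
# CLIP vision config bundled with llava-hf/llava-1.5-7b-hf.
image_token_id = processor.tokenizer.convert_tokens_to_ids("<image>")
num_image_tokens = (batch["input_ids"][0] == image_token_id).sum().item()

vision_cfg = model.config.vision_config
patches_per_side = vision_cfg.image_size // vision_cfg.patch_size  # 336 // 14 = 24
print(num_image_tokens, patches_per_side ** 2)  # expected to both be 576 if the processor expands the token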

I guessed my previous transformers version was either the commit at git+https://github.com/huggingface/transformers@454a0f2efdf9f0d94b77ef08efabbdc6418c868d or 4.46.1. With the first, AutoProcessor cannot load llava-hf/llava-1.5-7b-hf; with 4.46.1, I see the same behavior as in my current environment.

Expected behavior

Is it intended that <image> is replaced with 576 identical tokens? Is there a way to keep it unexpanded, as I saw in some earlier version? If someone recognizes this change and can point me to how to restore the previous behavior (a single <image> token), that would be greatly appreciated.
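
For what it's worth, this is the end-to-end generation run I use to sanity-check the expanded input, reusing the objects from the reproduction above; it is only a sketch of my check, not a suggested fix.

import torch

# Sketch of an end-to-end sanity check: regardless of whether the decoded
# prompt shows one <image> or 576, generation should produce a coherent answer.
with torch.no_grad():
    output_ids = model.generate(**batch, max_new_tokens=50)
print(processor.decode(output_ids[0], skip_special_tokens=True))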

Thanks!
