Description
System Info
- transformers version: 4.46.3
- Platform: Linux-5.8.0-36-generic-x86_64-with-glibc2.31
- Python version: 3.11.4
- Huggingface_hub version: 0.24.6
- Safetensors version: 0.4.5
- Accelerate version: 0.34.2
- Accelerate config: not found
- PyTorch version (GPU?): 2.4.1+cu118 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: no
- Using GPU in script?: yes
- GPU type: Quadro RTX 6000
Who can help?
@ArthurZucker @itazap @amyeroberts @qubvel
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
from transformers import LlavaForConditionalGeneration, AutoProcessor
from PIL import Image
import requests

PROMPT = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\n What is the content of the image? ASSISTANT:"

# Load the LLaVA-1.5 processor and model.
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")

# Fetch a sample image and encode the prompt together with it.
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg"
image = Image.open(requests.get(url, stream=True).raw)
batch = processor(text=[PROMPT], images=[image], padding=True, truncation=True, return_tensors="pt")

# Decode the encoded prompt back to text to inspect what happened to <image>.
decoded_input = processor.decode(batch['input_ids'][0])
print(decoded_input)
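To make the token multiplication concrete, I also counted the image token ids in the encoded batch (a quick check; I am assuming the image token id can be looked up from the processor's tokenizer):

# Count how many <image> tokens end up in the encoded prompt.
image_token_id = processor.tokenizer.convert_tokens_to_ids("<image>")
num_image_tokens = (batch['input_ids'][0] == image_token_id).sum().item()
print(num_image_tokens)  # prints 576 in my current environment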
The reproduction above decodes the input in such a way that the single <image> token is expanded into 576 identical image tokens. This seems strange to me for two reasons:
- At some point in October 2024, I was running this same kind of code with the then-current version of transformers. Based on my logs from that time, the <image> token was not multiplied at all.
- When I compare my inference results now against my old code (where <image> was not multiplied), the output is often strange. About 50% of the time the generated text is the same as before, but other times it is just a series of numbers, and sometimes it is simply different and worse (see the sketch of the generation call after this list).
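For reference, the comparison above was done with a generation call roughly along these lines (a sketch, not my exact script; the actual generation arguments may have differed):

# Generate a short answer from the encoded batch and decode it for comparison.
output_ids = model.generate(**batch, max_new_tokens=64)
print(processor.decode(output_ids[0], skip_special_tokens=True))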
My guess is that my previous transformers version was either the one at git+https://github.com/huggingface/transformers@454a0f2efdf9f0d94b77ef08efabbdc6418c868d or 4.46.1. When I tried the first one, AutoProcessor could not load llava-hf/llava-1.5-7b-hf. Meanwhile, 4.46.1 reproduces the same phenomenon as my current environment.
Expected behavior
Is it intended that <image> is replaced with 576 identical tokens? Is there a way to keep it unexpanded, as I saw in some earlier version? If someone recognizes this change and can point me to how to bring my code back to the previous behavior (a single <image> token), that would be greatly appreciated.
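For context, my guess (which I have not verified) is that the 576 corresponds to the vision tower's patch grid, assuming the default CLIP ViT-L/14 backbone at 336x336 resolution:

# Assumed origin of the 576 image tokens (not verified):
# a 336x336 input split into 14x14 patches gives a 24x24 grid.
patch_size = 14
image_size = 336
num_patches = (image_size // patch_size) ** 2
print(num_patches)  # 576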
Thanks!