System Info
- transformers version: 4.47.1
- Platform: Linux-5.15.0-113-generic-x86_64-with-glibc2.35
- Python version: 3.12.5
- Huggingface_hub version: 0.26.1
- Safetensors version: 0.4.5
- Accelerate version: 0.34.2
- Accelerate config:
- compute_environment: LOCAL_MACHINE
- distributed_type: MULTI_GPU
- mixed_precision: bf16
- use_cpu: False
- debug: False
- num_processes: 4
- machine_rank: 0
- num_machines: 1
- gpu_ids: 0,1,2,3
- rdzv_backend: static
- same_network: True
- main_training_function: main
- enable_cpu_affinity: False
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
- PyTorch version (GPU?): 2.4.1 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: NO
- Using GPU in script?: YES
- GPU type: NVIDIA A40
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
When a batch contains text-only inputs (no image) with the LLaVA 1.5 and 1.6 model families, we get an error. I think this has already been brought up in a previous bug report, but the error here is different. There is also a related discussion on the topic. Simple code to reproduce:
import torch
import requests
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

MODEL_ID = "llava-hf/llava-v1.6-vicuna-7b-hf"  # same error with "llava-hf/llava-1.5-7b-hf"

model = AutoModelForVision2Seq.from_pretrained(MODEL_ID).to(0, torch.bfloat16)
processor = AutoProcessor.from_pretrained(MODEL_ID)

image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"
raw_image = Image.open(requests.get(image_file, stream=True).raw)

# Batch of two samples: the first has an image, the second is text-only (None).
inputs = processor(
    images=[raw_image, None],
    text=["<image> what do you see in the image?", "Do you think that 2+2 is equal to 4?"],
    padding=True,
    return_tensors="pt",
).to(0, torch.bfloat16)

output = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=False))
The error we get:
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:02<00:00, 1.31it/s]
Traceback (most recent call last):
File "/mnt/llmdata/home/gbonetta/progetti/kimera/test_kimera_checkpoint.py", line 25, in <module>
inputs = processor(images=[raw_image, None], text=["<image> what do you see in the image?", "Do you think that 2+2 is equal to 4?"], padding=True, return_tensors='pt').to(0, torch.bfloat16)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/llmdata/home/gbonetta/miniconda3/miniconda/envs/llava_env/lib/python3.12/site-packages/transformers/models/llava_next/processing_llava_next.py", line 133, in __call__
images, text = _validate_images_text_input_order(images, text)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/llmdata/home/gbonetta/miniconda3/miniconda/envs/llava_env/lib/python3.12/site-packages/transformers/processing_utils.py", line 1205, in _validate_images_text_input_order
raise ValueError("Invalid input type. Check that `images` and/or `text` are valid inputs.")
ValueError: Invalid input type. Check that `images` and/or `text` are valid inputs.
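The failure seems to happen in the processor's input validation, before any model code runs. As a minimal sketch (the private helper name and its (images, text) signature are taken from the traceback above, so treat this as an assumption about internals), the same error can be reproduced without loading the model:

import requests
from PIL import Image
from transformers.processing_utils import _validate_images_text_input_order

raw_image = Image.open(
    requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw
)

# The None entry in the images list is rejected by the validation helper,
# which raises the same ValueError as in the traceback above.
_validate_images_text_input_order(
    [raw_image, None],
    ["<image> what do you see in the image?", "Do you think that 2+2 is equal to 4?"],
)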
Expected behavior
The processor and model should handle such a batch, using the image features only for the samples where an image is provided.
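For reference, a possible workaround until mixed batches are supported (a sketch, assuming per-sample generation is acceptable and that text-only generation itself works in this version): process the samples with and without images in separate processor calls, so the images argument never contains a None entry.

# Sketch of a workaround: split the batch so `images` never contains None.
image_inputs = processor(
    images=raw_image,
    text="<image> what do you see in the image?",
    return_tensors="pt",
).to(0, torch.bfloat16)
text_only_inputs = processor(
    text="Do you think that 2+2 is equal to 4?",
    return_tensors="pt",
).to(0, torch.bfloat16)

for batch in (image_inputs, text_only_inputs):
    out = model.generate(**batch, max_new_tokens=20, do_sample=False)
    print(processor.decode(out[0], skip_special_tokens=True))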