
LlavaForConditionalGeneration._merge_input_ids_with_image_features throws error #35169

@NicolasDrapier

Description

System Info

  • transformers version: 4.43.1
  • Platform: Linux-6.8.5-1-default-x86_64-with-glibc2.39
  • Python version: 3.11.9
  • Huggingface_hub version: 0.23.5
  • Safetensors version: 0.4.3
  • Accelerate version: 0.29.3
  • Accelerate config: - compute_environment: LOCAL_MACHINE
    - distributed_type: MULTI_GPU
    - mixed_precision: bf16
    - use_cpu: False
    - debug: False
    - num_processes: 8
    - machine_rank: 0
    - num_machines: 1
    - gpu_ids: all
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - enable_cpu_affinity: False
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []
    - dynamo_config: {'dynamo_backend': 'INDUCTOR'}
  • PyTorch version (GPU?): 2.4.0+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: True
  • Using GPU in script?: True
  • GPU type: NVIDIA L40S

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Description

I am trying to use the AutoAWQ library to quantize a Pixtral model (mistral-community/Pixtral-Large-Instruct-2411). However, I am encountering the following error:

File "/quantization/quant/lib64/python3.11/site-packages/transformers/models/llava/modeling_llava.py", line 303, in _merge_input_ids_with_image_features
    num_images, num_image_patches, embed_dim = image_features.shape
                                               ^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'shape'

Code

Here is the code I am using:

import os
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = r'/data/models/mistral/pixtral-large-instruct-2411' # from https://huggingface.co/mistral-community/Pixtral-Large-Instruct-2411
quant_path = r'/data/models/mistral/pixtral-large-instruct-2411-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }
os.makedirs(quant_path, exist_ok=True)

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
print(f'Model is quantized and saved at "{quant_path}"')
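For context, AutoAWQ calibrates on text-only samples, so the Llava forward pass never receives pixel_values. If that is the trigger, the failure should be reproducible in Transformers alone, without AutoAWQ. Below is a minimal, untested sketch: it reuses the same local checkpoint path as above and assumes the checkpoint's processor config lacks patch_size, which is what routes execution through the legacy merging branch.

# Hypothetical minimal reproduction without AutoAWQ (assumes the
# processor config lacks `patch_size`, selecting the legacy
# input-merging path).
import torch
from transformers import AutoTokenizer, LlavaForConditionalGeneration

model_path = '/data/models/mistral/pixtral-large-instruct-2411'  # same local checkpoint as above

model = LlavaForConditionalGeneration.from_pretrained(model_path, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Text-only input: no pixel_values, so get_image_features is never
# called and image_features stays None.
inputs = tokenizer("The quick brown fox", return_tensors="pt")

# Under legacy processing, the prefill branch still calls
# _merge_input_ids_with_image_features(None, ...) -> AttributeError.
model(**inputs)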

Analysis

The model I am using is Pixtral-Large-Instruct-2411, whose config declares the LlavaForConditionalGeneration architecture. The bug is in the Transformers source: image_features is initialized to None and is only populated when pixel_values is not None, yet the legacy-processing branch calls _merge_input_ids_with_image_features unconditionally. Its first line, num_images, num_image_patches, embed_dim = image_features.shape, then dereferences None and raises an AttributeError.

image_features = None
if pixel_values is not None:
    image_features = self.get_image_features(
        pixel_values=pixel_values,
        vision_feature_layer=vision_feature_layer,
        vision_feature_select_strategy=vision_feature_select_strategy,
    )

if legacy_processing:
    logger.warning_once(
        "Expanding inputs for image tokens in LLaVa should be done in processing. "
        "Please add `patch_size` and `vision_feature_select_strategy` to the model's processing config or set directly "
        "with `processor.patch_size = {{patch_size}}` and `processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. "
        "Using processors without these attributes in the config is deprecated and will throw an error in v4.50."
    )
    # prefill stage vs decoding stage (legacy behavior copied)
    if input_ids.shape[1] != 1:
        inputs_embeds, attention_mask, labels, position_ids = self._merge_input_ids_with_image_features(
            image_features, inputs_embeds, input_ids, attention_mask, labels # <-- image_features is still None here
        )
        cache_position = torch.arange(attention_mask.shape[1], device=attention_mask.device)
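One possible mitigation (a hypothetical sketch, not the upstream fix) would be to guard the legacy branch so the merge only runs when image features were actually computed, leaving text-only batches on the plain embedding path:

# Hypothetical guard, not the merged upstream fix: only enter the
# legacy merge when image features exist; text-only batches keep
# their original inputs_embeds and attention_mask untouched.
if legacy_processing and image_features is not None:
    if input_ids.shape[1] != 1:  # prefill stage
        inputs_embeds, attention_mask, labels, position_ids = self._merge_input_ids_with_image_features(
            image_features, inputs_embeds, input_ids, attention_mask, labels
        )
        cache_position = torch.arange(attention_mask.shape[1], device=attention_mask.device)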

Steps to Reproduce

  1. Ensure the Pixtral-Large-Instruct-2411 model is available at the specified path.
  2. Run the provided code snippet.

Actual Behavior

An AttributeError is raised due to image_features being None.

Expected behavior

The model should be loaded, quantized, and saved without any errors.

Labels

Multimodal, WIP, bug