
LlavaForConditionalGeneration._merge_input_ids_with_image_features throws error #35169

@NicolasDrapier

Description

System Info

  • transformers version: 4.43.1
  • Platform: Linux-6.8.5-1-default-x86_64-with-glibc2.39
  • Python version: 3.11.9
  • Huggingface_hub version: 0.23.5
  • Safetensors version: 0.4.3
  • Accelerate version: 0.29.3
  • Accelerate config: - compute_environment: LOCAL_MACHINE
    - distributed_type: MULTI_GPU
    - mixed_precision: bf16
    - use_cpu: False
    - debug: False
    - num_processes: 8
    - machine_rank: 0
    - num_machines: 1
    - gpu_ids: all
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - enable_cpu_affinity: False
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []
    - dynamo_config: {'dynamo_backend': 'INDUCTOR'}
  • PyTorch version (GPU?): 2.4.0+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: True
  • Using GPU in script?: True
  • GPU type: NVIDIA L40S

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Description

I am trying to use the AutoAWQ library to quantize a Pixtral model (mistral-community/Pixtral-Large-Instruct-2411). However, I am encountering the following error:

File "/quantization/quant/lib64/python3.11/site-packages/transformers/models/llava/modeling_llava.py", line 303, in _merge_input_ids_with_image_features
    num_images, num_image_patches, embed_dim = image_features.shape
                                               ^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'shape'

Code

Here is the code I am using:

import os
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = r'/data/models/mistral/pixtral-large-instruct-2411' # from https://huggingface.co/mistral-community/Pixtral-Large-Instruct-2411
quant_path = r'/data/models/mistral/pixtral-large-instruct-2411-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }
os.makedirs(quant_path, exist_ok=True)

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
print(f'Model is quantized and saved at "{quant_path}"')
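For context, AutoAWQ calibrates on text-only samples, so the Llava forward pass never receives pixel_values. If that is the trigger, the failure should be reproducible in Transformers alone, without AutoAWQ. Below is a minimal, untested sketch: it reuses the same local checkpoint path as above and assumes the checkpoint's processor config lacks patch_size, which is what routes execution through the legacy merging branch.

# Hypothetical minimal reproduction without AutoAWQ (assumes the
# processor config lacks `patch_size`, selecting the legacy
# input-merging path).
import torch
from transformers import AutoTokenizer, LlavaForConditionalGeneration

model_path = '/data/models/mistral/pixtral-large-instruct-2411'  # same local checkpoint as above

model = LlavaForConditionalGeneration.from_pretrained(model_path, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Text-only input: no pixel_values, so get_image_features is never
# called and image_features stays None.
inputs = tokenizer("The quick brown fox", return_tensors="pt")

# Under legacy processing, the prefill branch still calls
# _merge_input_ids_with_image_features(None, ...) -> AttributeError.
model(**inputs)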

Analysis

The model I am using is Pixtral-Large-Instruct-2411, whose config declares the LlavaForConditionalGeneration architecture. The bug is in the Transformers source: image_features is initialized to None and is only populated when pixel_values is not None, yet the legacy-processing branch calls _merge_input_ids_with_image_features unconditionally. Its first line, num_images, num_image_patches, embed_dim = image_features.shape, then dereferences None and raises an AttributeError.

image_features = None
if pixel_values is not None:
    image_features = self.get_image_features(
        pixel_values=pixel_values,
        vision_feature_layer=vision_feature_layer,
        vision_feature_select_strategy=vision_feature_select_strategy,
    )

if legacy_processing:
    logger.warning_once(
        "Expanding inputs for image tokens in LLaVa should be done in processing. "
        "Please add `patch_size` and `vision_feature_select_strategy` to the model's processing config or set directly "
        "with `processor.patch_size = {{patch_size}}` and `processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. "
        "Using processors without these attributes in the config is deprecated and will throw an error in v4.50."
    )
    # prefill stage vs decoding stage (legacy behavior copied)
    if input_ids.shape[1] != 1:
        inputs_embeds, attention_mask, labels, position_ids = self._merge_input_ids_with_image_features(
            image_features, inputs_embeds, input_ids, attention_mask, labels # <-- image_features is still None here
        )
        cache_position = torch.arange(attention_mask.shape[1], device=attention_mask.device)
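One possible mitigation (a hypothetical sketch, not the upstream fix) would be to guard the legacy branch so the merge only runs when image features were actually computed, leaving text-only batches on the plain embedding path:

# Hypothetical guard, not the merged upstream fix: only enter the
# legacy merge when image features exist; text-only batches keep
# their original inputs_embeds and attention_mask untouched.
if legacy_processing and image_features is not None:
    if input_ids.shape[1] != 1:  # prefill stage
        inputs_embeds, attention_mask, labels, position_ids = self._merge_input_ids_with_image_features(
            image_features, inputs_embeds, input_ids, attention_mask, labels
        )
        cache_position = torch.arange(attention_mask.shape[1], device=attention_mask.device)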

Steps to Reproduce

  1. Ensure the Pixtral-Large-Instruct-2411 model is available at the specified path.
  2. Run the provided code snippet.

Actual Behavior

An AttributeError is raised due to image_features being None.

Expected behavior

The model should be loaded, quantized, and saved without any errors.

Labels

Multimodal, WIP, bug