SDXL LoRA training, cannot resume from checkpoint #4566

@xiankgx

Description

Describe the bug

I am able to train an SDXL LoRA without any problem. However, when I try to resume training from an existing checkpoint, I get the following error:

  File "/opt/conda/lib/python3.10/site-packages/accelerate/optimizer.py", line 133, in step
    self.scaler.step(self.optimizer, closure)
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 372, in step
    assert len(optimizer_state["found_inf_per_device"]) > 0, "No inf checks were recorded for this optimizer."
AssertionError: No inf checks were recorded for this optimizer.

Looking at the traceback, this seems to be AMP-related: the failure happens in GradScaler.step() inside accelerate's optimizer wrapper.
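
For context, torch.cuda.amp.GradScaler raises this exact assertion when step() runs for an optimizer none of whose parameters received a gradient in that iteration, so the internal unscale_() records no inf/NaN checks. A minimal standalone sketch of my own (not from the training script; assumes a CUDA device):

import torch

model = torch.nn.Linear(4, 4).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

loss = model(torch.randn(2, 4, device="cuda")).sum()
scaler.scale(loss)  # initializes the scaler, but .backward() is never called

# No parameter in the optimizer has a .grad, so unscale_() records no
# inf/NaN checks and step() fails with the same assertion as above:
# AssertionError: No inf checks were recorded for this optimizer.
scaler.step(optimizer)

If that reading is right, resuming somehow leaves the prepared optimizer without any parameters that actually receive gradients on the first step after restoring.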

Reproduction

export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
export DATASET_NAME="lambdalabs/pokemon-blip-captions"

# First, do an initial run to produce at least one checkpoint
accelerate launch train_text_to_image_lora_sdxl.py \
    --pretrained_model_name_or_path=$MODEL_NAME \
    --pretrained_vae_model_name_or_path="madebyollin/sdxl-vae-fp16-fix" \
    --dataset_name=$DATASET_NAME \
    --caption_column="text" \
    --resolution=512 \
    --random_flip \
    --train_batch_size=16 \
    --num_train_epochs=30 \
    --checkpointing_steps=500 \
    --learning_rate=1e-05 \
    --lr_scheduler="constant" \
    --lr_warmup_steps=0 \
    --mixed_precision="fp16" \
    --gradient_checkpointing \
    --use_8bit_adam \
    --seed=42  \
    --output_dir="sd-pokemon-model-lora-sdxl-txt" \
    --validation_prompt="cute dragon creature" \
    --checkpoints_total_limit=1 \
    --report_to="wandb"

# Then, resume from the checkpoint by adding --resume_from_checkpoint=latest
accelerate launch train_text_to_image_lora_sdxl.py \
    --pretrained_model_name_or_path=$MODEL_NAME \
    --pretrained_vae_model_name_or_path="madebyollin/sdxl-vae-fp16-fix" \
    --dataset_name=$DATASET_NAME \
    --caption_column="text" \
    --resolution=512 \
    --random_flip \
    --train_batch_size=16 \
    --num_train_epochs=30 \
    --checkpointing_steps=500 \
    --learning_rate=1e-05 \
    --lr_scheduler="constant" \
    --lr_warmup_steps=0 \
    --mixed_precision="fp16" \
    --gradient_checkpointing \
    --use_8bit_adam \
    --seed=42  \
    --output_dir="sd-pokemon-model-lora-sdxl-txt" \
    --validation_prompt="cute dragon creature" \
    --checkpoints_total_limit=1 \
    --report_to="wandb" \
    --resume="latest"

Logs

File "/opt/conda/lib/python3.10/site-packages/accelerate/optimizer.py", line 133, in step
    self.scaler.step(self.optimizer, closure)
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 372, in step
    assert len(optimizer_state["found_inf_per_device"]) > 0, "No inf checks were recorded for this optimizer."
AssertionError: No inf checks were recorded for this optimizer.
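
In case it helps with debugging, here is a hypothetical check one could add right after accelerator.prepare(...) and load_state(...) in the script (`optimizer` here is the prepared optimizer; names are from memory, not the exact source):

# If a group reports zero trainable params, or grads stay None after the
# first backward pass, the optimizer is holding stale parameter objects,
# which would match the GradScaler assertion above.
for i, group in enumerate(optimizer.param_groups):
    n_trainable = sum(p.requires_grad for p in group["params"])
    print(f"param group {i}: {n_trainable}/{len(group['params'])} params require grad")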


### System Info

- `diffusers` version: 0.19.3
- Platform: Linux-4.14.285-215.501.amzn2.x86_64-x86_64-with-glibc2.31
- Python version: 3.10.11
- PyTorch version (GPU?): 2.0.1 (True)
- Huggingface_hub version: 0.16.4
- Transformers version: 4.31.0
- Accelerate version: 0.21.0
- xFormers version: 0.0.20
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

### Who can help?

@sayakpaul 
