Closed
Labels
bug: Something isn't working
Description
Describe the bug
I can train an SDXL LoRA without any problem. However, when I try to resume training from an existing checkpoint, the run fails with this error:
  File "/opt/conda/lib/python3.10/site-packages/accelerate/optimizer.py", line 133, in step
    self.scaler.step(self.optimizer, closure)
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 372, in step
    assert len(optimizer_state["found_inf_per_device"]) > 0, "No inf checks were recorded for this optimizer."
AssertionError: No inf checks were recorded for this optimizer.
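The assertion is raised inside PyTorch's GradScaler.step, so the failure appears to be related to AMP (automatic mixed precision); the run uses --mixed_precision="fp16".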
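For anyone debugging this, here is a minimal diagnostic sketch. As far as I can tell from grad_scaler.py, the scaler only records an inf check for parameters whose `.grad` is not None, so this assertion would fire if none of the optimizer's parameters received a gradient. The helper name `count_params_without_grad` is hypothetical, and the stand-in objects exist only so the snippet runs without torch; in the training script you would pass the real optimizer right before `optimizer.step()`.

```python
from types import SimpleNamespace


def count_params_without_grad(optimizer):
    """Return (missing, total): parameters whose .grad is None vs. all optimizer parameters."""
    missing = total = 0
    # param_groups is the standard list of dicts exposed by torch optimizers.
    for group in optimizer.param_groups:
        for param in group["params"]:
            total += 1
            if param.grad is None:
                missing += 1
    return missing, total


# Stand-in objects (hypothetical) that mimic an optimizer whose parameters got no gradients.
fake_optimizer = SimpleNamespace(
    param_groups=[{"params": [SimpleNamespace(grad=None), SimpleNamespace(grad=None)]}]
)
missing, total = count_params_without_grad(fake_optimizer)
print(f"{missing}/{total} optimizer parameters have no gradient")
```

If every parameter is reported as missing a gradient after resuming, that would suggest the optimizer is holding parameters that are no longer the ones being trained, though this is only a guess at the cause.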
Reproduction
export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
export DATASET_NAME="lambdalabs/pokemon-blip-captions"
# First, run training long enough to save at least one checkpoint
accelerate launch train_text_to_image_lora_sdxl.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--pretrained_vae_model_name_or_path="madebyollin/sdxl-vae-fp16-fix" \
--dataset_name=$DATASET_NAME \
--caption_column="text" \
--resolution=512 \
--random_flip \
--train_batch_size=16 \
--num_train_epochs=30 \
--checkpointing_steps=500 \
--learning_rate=1e-05 \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--mixed_precision="fp16" \
--gradient_checkpointing \
--use_8bit_adam \
--seed=42 \
--output_dir="sd-pokemon-model-lora-sdxl-txt" \
--validation_prompt="cute dragon creature" \
--checkpoints_total_limit=1 \
--report_to="wandb"
# Then resume from that checkpoint by adding --resume="latest"
accelerate launch train_text_to_image_lora_sdxl.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--pretrained_vae_model_name_or_path="madebyollin/sdxl-vae-fp16-fix" \
--dataset_name=$DATASET_NAME \
--caption_column="text" \
--resolution=512 \
--random_flip \
--train_batch_size=16 \
--num_train_epochs=30 \
--checkpointing_steps=500 \
--learning_rate=1e-05 \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--mixed_precision="fp16" \
--gradient_checkpointing \
--use_8bit_adam \
--seed=42 \
--output_dir="sd-pokemon-model-lora-sdxl-txt" \
--validation_prompt="cute dragon creature" \
--checkpoints_total_limit=1 \
--report_to="wandb" \
--resume="latest"
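To make the second step easier to follow, here is a rough sketch of how "latest" could be resolved to a checkpoint directory. It assumes checkpoints are saved as `checkpoint-<step>` folders under `--output_dir`, which is the layout I see from the first run; the function name `find_latest_checkpoint` is hypothetical and this is not the script's actual code. Note that `--resume` is presumably accepted as an argparse prefix abbreviation of the script's `--resume_from_checkpoint` option.

```python
import os
import tempfile


def find_latest_checkpoint(output_dir):
    """Return the name of the highest-numbered 'checkpoint-<step>' directory, or None."""
    candidates = []
    for name in os.listdir(output_dir):
        prefix, _, step = name.partition("-")
        is_dir = os.path.isdir(os.path.join(output_dir, name))
        if prefix == "checkpoint" and step.isdigit() and is_dir:
            # Compare steps numerically so that 1500 ranks above 500.
            candidates.append((int(step), name))
    return max(candidates)[1] if candidates else None


# Demo on a temporary directory standing in for --output_dir.
with tempfile.TemporaryDirectory() as tmp:
    for step in (500, 1000, 1500):
        os.makedirs(os.path.join(tmp, f"checkpoint-{step}"))
    latest = find_latest_checkpoint(tmp)
    print(latest)
```

Sorting by the integer step rather than by name matters here, because a plain string sort would place "checkpoint-500" after "checkpoint-1500".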
Logs
  File "/opt/conda/lib/python3.10/site-packages/accelerate/optimizer.py", line 133, in step
    self.scaler.step(self.optimizer, closure)
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 372, in step
    assert len(optimizer_state["found_inf_per_device"]) > 0, "No inf checks were recorded for this optimizer."
AssertionError: No inf checks were recorded for this optimizer.
System Info
- `diffusers` version: 0.19.3
- Platform: Linux-4.14.285-215.501.amzn2.x86_64-x86_64-with-glibc2.31
- Python version: 3.10.11
- PyTorch version (GPU?): 2.0.1 (True)
- Huggingface_hub version: 0.16.4
- Transformers version: 4.31.0
- Accelerate version: 0.21.0
- xFormers version: 0.0.20
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>
Who can help?
@sayakpaul