
CUDA out of memory and invalid value encountered in cast with train_text_to_image_lora_sdxl.py #4736

@mnslarcher

Description


Describe the bug

I encountered two distinct issues while running the lambdalabs/pokemon-blip-captions example of train_text_to_image_lora_sdxl.py on an RTX 4090 with bf16 mixed precision.

Problem 1: RuntimeWarning during image processing

During the validation runs, diffusers/image_processor.py emits the following warning, which on recent NumPy versions means the array being cast contains NaN or infinite values, i.e. the decoded validation images are likely invalid:

RuntimeWarning: invalid value encountered in cast
images = (images * 255).round().astype("uint8")
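
For reference, a minimal, standalone reproduction of just the warning (not of the training issue), assuming NumPy >= 1.24, looks like this:

import numpy as np

# Casting NaN/inf floats to uint8 triggers the same warning that
# diffusers/image_processor.py emits, which is why I suspect the decoded
# validation images contain invalid values.
images = np.array([[0.5, np.nan, np.inf]], dtype=np.float32)
images = (images * 255).round().astype("uint8")  # RuntimeWarning: invalid value encountered in cast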

Problem 2: CUDA out-of-memory error after training

Although GPU memory usage stays at roughly 67% throughout training, I hit a CUDA out-of-memory error during the final inference step, right after training finishes:

[W&B chart from the run, 8/23/2023, 12:23:59 PM]

The error message is as follows:

hidden_states = hidden_states.to(dtype)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 23.64 GiB total capacity; 20.89 GiB already allocated; 497.75 MiB free; 22.14 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
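
As a possible mitigation (it only addresses allocator fragmentation, not whatever is actually holding the memory), the error message's own suggestion can be tried by setting PYTORCH_CUDA_ALLOC_CONF before the first CUDA allocation, for example at the very top of the training script; the 512 MB value below is just an illustrative guess:

import os

# Allocator tweak suggested by the error message itself; must run before the
# first CUDA allocation. The value here is only an example, not a recommendation.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:512")

Equivalently, the variable can be exported in the shell before running accelerate launch.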

Hypothesis:

I suspect that the training-time models are not fully released from GPU memory before the final test-inference step, so building the inference pipeline pushes usage over the limit. Could that be it?

I intend to investigate this further on my own and will post updates here. If anyone else finds a solution before I do, please share it here as well.
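
To make the hypothesis concrete, here is a minimal sketch of the kind of change I have in mind, assuming the script keeps the training-time models in locals named unet, text_encoder_one and text_encoder_two and builds a fresh StableDiffusionXLPipeline for the final inference (those names and the placement are my assumptions, not a verified fix):

import gc

import torch
from diffusers import StableDiffusionXLPipeline

# Hypothetical placement: just before the final-inference pipeline is built.
# unet / text_encoder_one / text_encoder_two are assumed to be the script's
# locals holding the (no longer needed) training-time models.
del unet, text_encoder_one, text_encoder_two
gc.collect()
torch.cuda.empty_cache()

pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.bfloat16
)
pipeline.load_lora_weights("sd-pokemon-model-lora-sdxl-txt")  # trained LoRA output dir
pipeline.enable_vae_slicing()  # decode the 1024x1024 validation images in slices
pipeline.to("cuda")

enable_vae_slicing() (or enable_vae_tiling()) only shrinks the VAE decode footprint at 1024x1024; it would not explain the cast warning above, but it might be enough to avoid the OOM even if the training copies turn out not to be the culprit.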

Reproduction

export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
export DATASET_NAME="lambdalabs/pokemon-blip-captions"

accelerate launch train_text_to_image_lora_sdxl.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$DATASET_NAME \
  --caption_column="text" \
  --resolution=1024 \
  --random_flip \
  --train_batch_size=1 \
  --num_train_epochs=2 \
  --gradient_accumulation_steps=1 \
  --checkpointing_steps=500 \
  --learning_rate=1e-04 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --dataloader_num_workers=0 \
  --seed=42 \
  --output_dir="sd-pokemon-model-lora-sdxl-txt" \
  --train_text_encoder \
  --validation_prompt="cute dragon creature" \
  --report_to="wandb" \
  --mixed_precision="bf16" \
  --rank=4

Logs

08/23/2023 11:18:30 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: bf16

You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'clip_sample_range', 'dynamic_thresholding_ratio', 'variance_type', 'thresholding'} was not found in config. Values will be initialized to default values.
{'attention_type'} was not found in config. Values will be initialized to default values.
wandb: Currently logged in as: mnslarcher. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.15.8
wandb: Run data is saved locally in /home/mnslarcher/ai/hands/wandb/run-20230823_111845-ngknp8t5
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run bumbling-brook-7
wandb: ⭐️ View project at https://wandb.ai/mnslarcher/text2image-fine-tune
wandb: 🚀 View run at https://wandb.ai/mnslarcher/text2image-fine-tune/runs/ngknp8t5
08/23/2023 11:18:49 - INFO - __main__ - ***** Running training *****
08/23/2023 11:18:49 - INFO - __main__ -   Num examples = 833
08/23/2023 11:18:49 - INFO - __main__ -   Num Epochs = 2
08/23/2023 11:18:49 - INFO - __main__ -   Instantaneous batch size per device = 1
08/23/2023 11:18:49 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 1
08/23/2023 11:18:49 - INFO - __main__ -   Gradient Accumulation steps = 1
08/23/2023 11:18:49 - INFO - __main__ -   Total optimization steps = 1666
Steps:  30%|████████████████████████████████████████▌                                                                                              | 500/1666 [08:05<19:20,  1.00it/s, lr=0.0001, step_loss=0.0274]08/23/2023 11:26:55 - INFO - accelerate.accelerator - Saving current state to sd-pokemon-model-lora-sdxl-txt/checkpoint-500
Model weights saved in sd-pokemon-model-lora-sdxl-txt/checkpoint-500/pytorch_lora_weights.safetensors
08/23/2023 11:26:55 - INFO - accelerate.checkpointing - Optimizer state saved in sd-pokemon-model-lora-sdxl-txt/checkpoint-500/optimizer.bin
08/23/2023 11:26:55 - INFO - accelerate.checkpointing - Scheduler state saved in sd-pokemon-model-lora-sdxl-txt/checkpoint-500/scheduler.bin
08/23/2023 11:26:55 - INFO - accelerate.checkpointing - Random states saved in sd-pokemon-model-lora-sdxl-txt/checkpoint-500/random_states_0.pkl
08/23/2023 11:26:55 - INFO - __main__ - Saved state to sd-pokemon-model-lora-sdxl-txt/checkpoint-500
Steps:  50%|████████████████████████████████████████████████████████████████████                                                                    | 833/1666 [13:28<13:20,  1.04it/s, lr=0.0001, step_loss=0.134]08/23/2023 11:32:18 - INFO - __main__ - Running validation... 
 Generating 4 images with prompt: cute dragon creature.
Loaded tokenizer as CLIPTokenizer from `tokenizer` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loaded tokenizer_2 as CLIPTokenizer from `tokenizer_2` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loaded scheduler as EulerDiscreteScheduler from `scheduler` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loading pipeline components...: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 62.86it/s]
/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/diffusers/image_processor.py:65: RuntimeWarning: invalid value encountered in cast
  images = (images * 255).round().astype("uint8")
Steps:  60%|████████████████████████████████████████████████████████████████████████████████▍                                                     | 1000/1666 [17:06<10:46,  1.03it/s, lr=0.0001, step_loss=0.0518]08/23/2023 11:35:55 - INFO - accelerate.accelerator - Saving current state to sd-pokemon-model-lora-sdxl-txt/checkpoint-1000
Model weights saved in sd-pokemon-model-lora-sdxl-txt/checkpoint-1000/pytorch_lora_weights.safetensors
08/23/2023 11:35:56 - INFO - accelerate.checkpointing - Optimizer state saved in sd-pokemon-model-lora-sdxl-txt/checkpoint-1000/optimizer.bin
08/23/2023 11:35:56 - INFO - accelerate.checkpointing - Scheduler state saved in sd-pokemon-model-lora-sdxl-txt/checkpoint-1000/scheduler.bin
08/23/2023 11:35:56 - INFO - accelerate.checkpointing - Random states saved in sd-pokemon-model-lora-sdxl-txt/checkpoint-1000/random_states_0.pkl
08/23/2023 11:35:56 - INFO - __main__ - Saved state to sd-pokemon-model-lora-sdxl-txt/checkpoint-1000
Steps:  90%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋             | 1500/1666 [25:13<02:39,  1.04it/s, lr=0.0001, step_loss=0.0561]08/23/2023 11:44:02 - INFO - accelerate.accelerator - Saving current state to sd-pokemon-model-lora-sdxl-txt/checkpoint-1500
Model weights saved in sd-pokemon-model-lora-sdxl-txt/checkpoint-1500/pytorch_lora_weights.safetensors
08/23/2023 11:44:03 - INFO - accelerate.checkpointing - Optimizer state saved in sd-pokemon-model-lora-sdxl-txt/checkpoint-1500/optimizer.bin
08/23/2023 11:44:03 - INFO - accelerate.checkpointing - Scheduler state saved in sd-pokemon-model-lora-sdxl-txt/checkpoint-1500/scheduler.bin
08/23/2023 11:44:03 - INFO - accelerate.checkpointing - Random states saved in sd-pokemon-model-lora-sdxl-txt/checkpoint-1500/random_states_0.pkl
08/23/2023 11:44:03 - INFO - __main__ - Saved state to sd-pokemon-model-lora-sdxl-txt/checkpoint-1500
Steps: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1666/1666 [27:55<00:00,  1.04it/s, lr=0.0001, step_loss=0.0427]08/23/2023 11:46:45 - INFO - __main__ - Running validation... 
 Generating 4 images with prompt: cute dragon creature.
Loaded tokenizer as CLIPTokenizer from `tokenizer` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loaded tokenizer_2 as CLIPTokenizer from `tokenizer_2` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loaded scheduler as EulerDiscreteScheduler from `scheduler` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loading pipeline components...: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 63.49it/s]
/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/diffusers/image_processor.py:65: RuntimeWarning: invalid value encountered in cast
  images = (images * 255).round().astype("uint8")
Model weights saved in sd-pokemon-model-lora-sdxl-txt/pytorch_lora_weights.safetensors
Loaded text_encoder_2 as CLIPTextModelWithProjection from `text_encoder_2` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loaded tokenizer as CLIPTokenizer from `tokenizer` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loaded tokenizer_2 as CLIPTokenizer from `tokenizer_2` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
{'attention_type'} was not found in config. Values will be initialized to default values.
Loaded unet as UNet2DConditionModel from `unet` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loaded scheduler as EulerDiscreteScheduler from `scheduler` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loaded text_encoder as CLIPTextModel from `text_encoder` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loading pipeline components...: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:04<00:00,  1.71it/s]
Loading unet.
Loading text_encoder.
Loading text_encoder_2.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [00:05<00:00,  4.31it/s]
Traceback (most recent call last):
  File "/home/mnslarcher/ai/hands/train_text_to_image_lora_sdxl.py", line 1505, in <module>
    main(args)
  File "/home/mnslarcher/ai/hands/train_text_to_image_lora_sdxl.py", line 1458, in main
    images = [
  File "/home/mnslarcher/ai/hands/train_text_to_image_lora_sdxl.py", line 1459, in <listcomp>
    pipeline(
  File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py", line 845, in __call__
    image = self.vae.decode(latents / self.vae.config.scaling_factor, return_dict=False)[0]
  File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/diffusers/utils/accelerate_utils.py", line 46, in wrapper
    return method(self, *args, **kwargs)
  File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/diffusers/models/autoencoder_kl.py", line 270, in decode
    decoded = self._decode(z).sample
  File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/diffusers/models/autoencoder_kl.py", line 257, in _decode
    dec = self.decoder(z)
  File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/diffusers/models/vae.py", line 271, in forward
    sample = up_block(sample, latent_embeds)
  File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/diffusers/models/unet_2d_blocks.py", line 2334, in forward
    hidden_states = upsampler(hidden_states)
  File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/diffusers/models/resnet.py", line 164, in forward
    hidden_states = hidden_states.to(dtype)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 23.64 GiB total capacity; 20.89 GiB already allocated; 497.75 MiB free; 22.14 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
wandb: | 0.042 MB of 0.042 MB uploaded (0.000 MB deduped)
wandb: Run history:
wandb: train_loss ▂▆▂▁▄▃▅▄▂▃▁▂▁▁▄▂▁▄▁▃▅▁▂▆▁▁▅▄▃▁▄▆▄█▅▁▇▂▅▁
wandb: 
wandb: Run summary:
wandb: train_loss 0.04268
wandb: 
wandb: 🚀 View run bumbling-brook-7 at: https://wandb.ai/mnslarcher/text2image-fine-tune/runs/ngknp8t5
wandb: Synced 6 W&B file(s), 2 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20230823_111845-ngknp8t5/logs
Traceback (most recent call last):
  File "/home/mnslarcher/anaconda3/envs/hands/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/accelerate/commands/launch.py", line 979, in launch_command
    simple_launcher(args)
  File "/home/mnslarcher/anaconda3/envs/hands/lib/python3.10/site-packages/accelerate/commands/launch.py", line 628, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/mnslarcher/anaconda3/envs/hands/bin/python', 'train_text_to_image_lora_sdxl.py', '--pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0', '--dataset_name=lambdalabs/pokemon-blip-captions', '--caption_column=text', '--resolution=1024', '--random_flip', '--train_batch_size=1', '--num_train_epochs=2', '--gradient_accumulation_steps=1', '--checkpointing_steps=500', '--learning_rate=1e-04', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--dataloader_num_workers=0', '--seed=42', '--output_dir=sd-pokemon-model-lora-sdxl-txt', '--train_text_encoder', '--validation_prompt=cute dragon creature', '--report_to=wandb', '--mixed_precision=bf16', '--rank=4']' returned non-zero exit status 1.

System Info

OS Name: Ubuntu 22.04.3 LTS
GPU: NVIDIA GeForce RTX 4090

diffusers-cli env:

  • diffusers version: 0.21.0.dev0
  • Platform: Linux-6.2.0-26-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • PyTorch version (GPU?): 2.0.1+cu117 (True)
  • Huggingface_hub version: 0.16.4
  • Transformers version: 4.31.0
  • Accelerate version: 0.21.0
  • xFormers version: not installed
  • Using GPU in script?: YES
  • Using distributed or parallel set-up in script?: NO

environment.yml (conda):

name: myenv
channels:
  - defaults
dependencies:
  - nb_conda_kernels
  - ipykernel
  - jupyter
  - pip
  - python=3.10
  - pip:
    - accelerate==0.21.0
    - "black[jupyter]==23.7.0"
    - datasets==2.14.4
    - git+https://github.com/huggingface/diffusers
    - ftfy==6.1.1
    - gradio==3.40.1
    - isort==5.12.0
    - Jinja2==3.1.2
    - tensorboard==2.14.0
    - torch==2.0.1
    - torchvision==0.15.2
    - transformers==4.31.0
    - wandb==0.15.8

Who can help?

@sayakpaul
