
Conversation

@sayakpaul
Member

What does this PR do?

Fixes: #6086

pipeline = pipeline.to(accelerator.device)
# Final inference
# Load previous pipeline
if args.validation_prompt is not None:
Member Author


If no validation_prompt was passed, we must not run this step.
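A hedged sketch of the guarded final-inference step this check produces (simplified; args, accelerator, and pipeline are the training script's own variables, and the generation call is only illustrative):

# Only run the final inference if a validation prompt was actually provided.
if args.validation_prompt is not None:
    pipeline = pipeline.to(accelerator.device)
    # Generate a few images with the trained LoRA weights loaded into the pipeline.
    images = pipeline(args.validation_prompt, num_inference_steps=25).images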

-        # load attention processors
-        pipeline.unet.load_attn_procs(args.output_dir)
+        # load attention processors
+        pipeline.load_lora_weights(args.output_dir)
Member Author


Make sure to use load_lora_weights() instead of load_attn_procs().
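A minimal sketch of the intended usage (the base checkpoint and output path are placeholders, not taken from this PR):

import torch
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
)
# load_lora_weights() restores the LoRA layers for the UNet (and the text encoders,
# if they were trained), whereas the older pipeline.unet.load_attn_procs() only
# handled the UNet attention processors.
pipeline.load_lora_weights("path/to/output_dir")  # placeholder path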

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@sr5434

sr5434 commented Dec 14, 2023

@sayakpaul I am getting this error for a regular LoRA fine-tune:

Steps:   0% 0/1500 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/content/diffusers/examples/text_to_image/train_text_to_image_lora_sdxl.py", line 967, in <module>
    main()
  File "/content/diffusers/examples/text_to_image/train_text_to_image_lora_sdxl.py", line 774, in main
    model_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 680, in forward
    return model_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 668, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/models/unet_2d_condition.py", line 1004, in forward
    if "text_embeds" not in added_cond_kwargs:
TypeError: argument of type 'NoneType' is not iterable

This is in a free, GPU-enabled Google Colab.

@sayakpaul
Member Author

I am not sure what script you're using here.

Contributor

@younesbelkada left a comment


Makes sense, thanks! In the future we could also expose a method in PEFT to upcast trainable params to fp32 (cc @BenjaminBossan @pacman100), similar to prepare_model_for_kbit_training.
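A rough sketch of what such a helper could look like (the name cast_trainable_params_to_fp32 is hypothetical, not an existing PEFT API):

import torch

def cast_trainable_params_to_fp32(model):
    # Only the injected adapter (LoRA) parameters have requires_grad=True,
    # so the frozen base weights stay in their reduced precision.
    for param in model.parameters():
        if param.requires_grad:
            param.data = param.data.to(torch.float32)
    return model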

@BenjaminBossan
Member

Makes sense, thanks! In the future we could also expose a method in PEFT to upcast trainable params to fp32 (cc @BenjaminBossan @pacman100), similar to prepare_model_for_kbit_training.

Yes, for sure, this isn't the first time this came up. Do we know exactly when this condition appears? Is it only when the user explicitly loads a model in float16? If yes, we may want to add a corresponding check to this PR.

@younesbelkada
Contributor

Is it only when the user explicitly loads a model in float16?

@sayakpaul can confirm, but I think that's the case, right?

@sayakpaul
Member Author

Is it only when the user explicitly loads a model in float16?

@sayakpaul can confirm, but I think that's the case, right?

Indeed, that's the case. Only for reduced precisions.
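For context, a hedged illustration of that failure mode (the rank and target modules are arbitrary): having the model in fp16 before injecting LoRA leaves the trainable LoRA parameters in fp16 too, which torch's GradScaler later rejects with "Attempting to unscale FP16 gradients".

import torch
from diffusers import UNet2DConditionModel
from peft import LoraConfig

# Loading (or casting) the UNet in fp16 before adding adapters, e.g. to save memory...
unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet", torch_dtype=torch.float16
)
# ...means the injected LoRA parameters typically come out in fp16 as well,
# matching the base weights.
unet.add_adapter(LoraConfig(r=4, lora_alpha=4, target_modules=["to_q", "to_k", "to_v", "to_out.0"]))
print({p.dtype for p in unet.parameters() if p.requires_grad})  # {torch.float16}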

@patrickvonplaten
Contributor

@patil-suraj @williamberman can you please also take a look here?

@sr5434

sr5434 commented Dec 16, 2023

I am not sure what script you're using here.

I am using the train_text_to_image_lora_sdxl.py script

@BenjaminBossan
Member

Indeed, that's the case. Only for reduced precisions.

Does this also apply to bf16? If not, I think the dtype conversion should be conditional, i.e. if args.mixed_precision == "fp16".

@sayakpaul
Member Author

@BenjaminBossan done in 8ac462b.

@patrickvonplaten
Contributor

Hmm, but now we're just silently disabling fp16 training. Didn't this work before, i.e. the whole UNet was kept in fp16 while the LoRA was trained? Why doesn't it work anymore?

@patrickvonplaten
Contributor

The problem here is the following IMO:

  • We move both the LoRA weights and the non-LoRA weights to fp16 before training; but in mixed-precision training the trainable LoRA weights should not be in fp16, so an error is thrown.
  • IMO, the solution should not be to move all weights (including the non-trainable ones) to full fp32; instead, we should only move the trainable LoRA weights to fp32 and keep the rest in fp16 so memory doesn't blow up.

@younesbelkada
Contributor

younesbelkada commented Dec 18, 2023

The proposed changes only upcast the LoRA weights to fp32 (via the requires_grad check; when you inject adapters, the non-LoRA weights have requires_grad set to False). Also, before the PEFT integration, all LoRA layers were already in fp32 because the dtype argument was never used in the example scripts:

self.down = nn.Linear(in_features, rank, bias=False, device=device, dtype=dtype)

    for attn_processor_name, attn_processor in unet.attn_processors.items():
        # Parse the attention module.
        attn_module = unet
        for n in attn_processor_name.split(".")[:-1]:
            attn_module = getattr(attn_module, n)

        # Set the `lora_layer` attribute of the attention-related matrices.
        attn_module.to_q.set_lora_layer(
            LoRALinearLayer(
                in_features=attn_module.to_q.in_features, out_features=attn_module.to_q.out_features, rank=args.rank
            )
        )
        attn_module.to_k.set_lora_layer(
            LoRALinearLayer(
                in_features=attn_module.to_k.in_features, out_features=attn_module.to_k.out_features, rank=args.rank
            )
        )
        attn_module.to_v.set_lora_layer(
            LoRALinearLayer(
                in_features=attn_module.to_v.in_features, out_features=attn_module.to_v.out_features, rank=args.rank
            )
        )
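A small check of that point, using the LoRALinearLayer class as it existed in diffusers around the time of this PR (the import path has since been deprecated):

import torch
from diffusers.models.lora import LoRALinearLayer

# dtype is never passed in the example scripts, so the LoRA matrices are created
# with the default (fp32) parameters, regardless of the UNet's precision.
layer = LoRALinearLayer(in_features=320, out_features=320, rank=4)
print(layer.down.weight.dtype)  # torch.float32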

@younesbelkada
Contributor

younesbelkada commented Dec 18, 2023

One cleaner check could be to test whether the module is an instance of BaseTunerLayer and upcast it only in that case.
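A rough sketch of that suggestion, combined with the requires_grad filter from this PR (assuming a PEFT version that exposes BaseTunerLayer under peft.tuners.tuners_utils):

import torch
from peft.tuners.tuners_utils import BaseTunerLayer

def upcast_tuner_layers(model):
    for module in model.modules():
        if isinstance(module, BaseTunerLayer):
            # Upcast only the adapter's trainable parameters; the wrapped
            # base layer keeps its reduced precision.
            for param in module.parameters():
                if param.requires_grad:
                    param.data = param.data.to(torch.float32)
    return model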

@patrickvonplaten
Contributor

The proposed changes only upcast the LoRA weights to fp32 (via the requires_grad check; when you inject adapters, the non-LoRA weights have requires_grad set to False). Also, before the PEFT integration, all LoRA layers were already in fp32 because the dtype argument was never used in the example scripts:

self.down = nn.Linear(in_features, rank, bias=False, device=device, dtype=dtype)

I see, that makes sense! Thanks for the explanation.

@sayakpaul
Member Author

I pulled in the changes from this PR and added them to #6225.

         text_encoder_one.add_adapter(text_lora_config)
         text_encoder_two.add_adapter(text_lora_config)

+    # Make sure the trainable params are in float32.
+    if args.mixed_precision == "fp16":
+        models = [unet]
+        if args.train_text_encoder:
+            models.extend([text_encoder_one, text_encoder_two])
+        for model in models:
+            for param in model.parameters():
+                # only upcast trainable parameters (LoRA) into fp32
+                if param.requires_grad:
+                    param.data = param.to(torch.float32)
+
     # create custom saving & loading hooks so that `accelerator.save_state(...)` serializes in a nice format
     def save_model_hook(models, weights, output_dir):
         if accelerator.is_main_process:

I can confirm that things are working well: https://wandb.ai/sayakpaul/dreambooth-lora-sd-xl/runs/ow1vrez8. See the "test" media pictures.

Command I ran:

CUDA_VISIBLE_DEVICES=1 accelerate launch train_with_fixes.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
  --pretrained_vae_model_name_or_path="madebyollin/sdxl-vae-fp16-fix" \
  --instance_data_dir="dog" \
  --output_dir="corgy_dog_LoRA" \
  --mixed_precision="fp16" \
  --instance_prompt="a photo of TOK dog" \
  --resolution=1024 \
  --train_batch_size=4 \
  --gradient_accumulation_steps=4 \
  --learning_rate=1e-4 \
  --snr_gamma=5.0 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --use_8bit_adam \
  --max_train_steps=500 \
  --checkpointing_steps=100 \
  --push_to_hub \
  --validation_prompt="a photo of TOK dog in a bucket at the beach" \
  --report_to="wandb" \
  --seed="0"

Trained model: https://huggingface.co/sayakpaul/corgy_dog_LoRA. I am going to try running it on the free Colab tier too and report back here.

@sayakpaul
Member Author

For anyone wondering whether this would run on a free-tier Colab notebook, https://colab.research.google.com/gist/sayakpaul/9615b89369f3ef23cc29d0dac58253dd/scratchpad.ipynb should clear all the doubts once and for all 💪

@sayakpaul sayakpaul merged commit 288ceeb into main Dec 19, 2023
@sayakpaul sayakpaul deleted the fix/lora-training branch December 19, 2023 04:24
@yashveer08

For anyone wondering whether this would run on a free-tier Colab notebook, https://colab.research.google.com/gist/sayakpaul/9615b89369f3ef23cc29d0dac58253dd/scratchpad.ipynb should clear all the doubts once and for all 💪

This seems to be working, but when I added the metadata.jsonl file, the datasets library is causing an issue.
It shows the error below.
Can you cross-check your Colab with the caption file, @sayakpaul? It would be a great help.
[Screenshot of the error, 2023-12-19 at 12:13 PM]

@sayakpaul
Member Author

That is an unrelated problem and you should instead file this in the datasets repo.

@yashveer08

That is an unrelated problem and you should instead file this in the datasets repo.

Sure, will do. But is there an alternative way to train an SDXL model with captions for each image?
@sayakpaul

@sayakpaul
Member Author

sayakpaul commented Dec 21, 2023

You will have to debug your way through this one, since it's not exactly the same code that you're using.

Merging this pull request may close the following issue:

In training the script train_text_to_image_lora.py on Colab with a V100 GPU, the error ValueError: Attempting to unscale FP16 gradients occurred.
