
Conversation

@jiqing-feng (Contributor) commented Jan 2, 2024

Hi @sayakpaul @patrickvonplaten. Since Stable-Diffusion-XL is getting increasingly popular, users may want to see how it performs with textual inversion.

I enabled Stable-Diffusion-XL textual inversion and trained for 2000 steps with bfloat16 on an Intel SPR node; the result is as follows:
Text: A cat-toy backpack

(image: cat-backpack)

Would you please help review my changes? Thx!

@sayakpaul (Member) left a comment

Thanks for your contributions. However, it might be better to have this in a separate script and add a proper test for it.

See how we did it for SDXL LoRA DreamBooth here: https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth_lora_sdxl.py

Tests:

class DreamBoothLoRASDXL(ExamplesTestsAccelerate):

@patrickvonplaten (Contributor) left a comment

Hey @jiqing-feng,

I think such a new textual inversion script would be very helpful. Can we maybe add it as a new _sdxl script, like we've done for LoRA: https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth_lora_sdxl.py ?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@jiqing-feng (Contributor Author)

Hi @sayakpaul @patrickvonplaten. Thanks for your review. I have added textual inversion for SDXL in a new script, along with a new test script. Would you please review it? Thx!

BTW, from my experiments, fine-tuning only text_encoder_1 gives a better result than fine-tuning both text encoders.

return args


imagenet_templates_small = [
Member:

Where is this coming from?

Contributor Author:

It is copied from textual_inversion

Comment on lines 565 to 574
if self.center_crop:
    crop = min(img.shape[0], img.shape[1])
    (
        h,
        w,
    ) = (
        img.shape[0],
        img.shape[1],
    )
    img = img[(h - crop) // 2 : (h + crop) // 2, (w - crop) // 2 : (w + crop) // 2]
Member:

Can't we use CenterCrop from torchvision for this?
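For reference, a minimal sketch of what the torchvision-based version could look like (the stand-in image and the HWC/CHW conversion are assumptions, not the script's actual code):

```python
import numpy as np
import torch
from torchvision import transforms

img = np.random.randint(0, 255, (512, 768, 3), dtype=np.uint8)  # stand-in HWC image

# CenterCrop with an int size produces a square crop, matching the manual
# min(h, w) slicing above; it accepts CHW tensors, so convert and restore HWC.
center_crop = transforms.CenterCrop(min(img.shape[0], img.shape[1]))
img = center_crop(torch.from_numpy(img).permute(2, 0, 1)).permute(1, 2, 0).numpy()
```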

Contributor Author:

It is copied from textual_inversion

# Move vae and unet and text_encoder_2 to device and cast to weight_dtype
unet.to(accelerator.device, dtype=weight_dtype)
vae.to(accelerator.device, dtype=weight_dtype)
text_encoder_2 = text_encoder_2.to(accelerator.device, dtype=weight_dtype)
Member:

Why not move the text_encoder too?

Contributor Author:

text_encoder_1 will be moved to the device by the accelerator.prepare function.
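A minimal sketch of the pattern being described (the object names follow common diffusers examples and are assumptions here):

```python
from accelerate import Accelerator

accelerator = Accelerator()

# prepare() wraps each object for (possibly distributed) training and moves
# modules onto accelerator.device, so text_encoder_1 needs no explicit .to().
text_encoder_1, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    text_encoder_1, optimizer, train_dataloader, lr_scheduler
)
```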

# The dropout cannot be != 0 so it doesn't matter if we are in eval or train mode.
unet.train()
text_encoder_1.gradient_checkpointing_enable()
text_encoder_2.gradient_checkpointing_enable()
Member:

If text_encoder_2 is not trained then why enable gradient checkpointing here?

Contributor Author:

Yes, I will remove it if we don't want to train text_encoder_2.

unet.train()
text_encoder_1.gradient_checkpointing_enable()
text_encoder_2.gradient_checkpointing_enable()
unet.enable_gradient_checkpointing()
Member:

Same for this.

Contributor Author:

I think it should be kept, as the comment at line 698 explains.

Member:

This is weird. None of the other training scripts enable gradient checkpointing on the models that are not being trained.
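For context, a hedged sketch of the pattern at stake (the gradient_checkpointing flag name is an assumption):

```python
if args.gradient_checkpointing:
    # Checkpointing the trained text encoder trades compute for activation memory.
    text_encoder_1.gradient_checkpointing_enable()
    # The UNet is frozen, but gradients still flow through its activations back
    # to the token embeddings, so checkpointing it can also reduce memory; whether
    # that justifies enabling it on an untrained model is the point debated here.
    unet.enable_gradient_checkpointing()
```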

sample_size = unet.config.sample_size * (2 ** (len(vae.config.block_out_channels) - 1))
original_size = (sample_size, sample_size)
add_time_ids = torch.tensor(
    [list(original_size + (0, 0) + original_size)], dtype=weight_dtype, device=accelerator.device
)
Member:

The sample_size calculation seems wrong to me, as it should be the original size of the input images. Also, we're not supplying the crop coordinates here.

Could you refer to https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_lora_sdxl.py and incorporate the changes here w.r.t how these micro-conditions are implemented?
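For reference, a sketch of the micro-conditioning pattern in the referenced LoRA script (paraphrased, so treat the exact names as approximate): the time ids concatenate the per-image original size, the crop's top-left coordinates, and the target resolution:

```python
def compute_time_ids(original_size, crops_coords_top_left):
    # SDXL micro-conditioning packs (original_size, crop_top_left, target_size)
    # into six scalars per image, fed to the UNet as added time embeddings.
    target_size = (args.resolution, args.resolution)
    add_time_ids = list(original_size + crops_coords_top_left + target_size)
    return torch.tensor([add_time_ids], device=accelerator.device, dtype=weight_dtype)
```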

    accelerator,
    args,
    save_path,
    safe_serialization=not args.no_safe_serialization,
Member:

Let's default to safetensors and not make it configurable IMO.

@jiqing-feng (Contributor Author) commented Jan 5, 2024

It is also copied from textual_inversion, and it defaults to safetensors in this script. I can make it unconfigurable if you want to.

Member:

safetensors is the default format in diffusers. So, it makes sense to not make it configurable here.

Also, the script you keep referring to is a bit old, so we don't have to follow it note for note :)

    commit_message="End of training",
    ignore_patterns=["step_*", "epoch_*"],
)

Member:

Let's also run validation here? We can use log_validation() and log the images under the "test" key instead.

Contributor Author:

log_validation() is at line 961. I think it would be better to call log_validation() after training and before saving, instead of after pushing to the Hub. WDYT?

Member:

No, what I mean is that inside log_validation we're using the "validation" key to log the media. If we use the same key to cover both the intermediate logging and the logging done after training, it might be inconvenient. To give you a better idea, refer to this:

tracker_key = "test" if is_final_validation else "validation"
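A minimal sketch of how that key would then be used inside log_validation (the tracker loop is modeled on other diffusers examples and is an assumption here):

```python
import numpy as np

tracker_key = "test" if is_final_validation else "validation"
np_images = np.stack([np.asarray(img) for img in images])

for tracker in accelerator.trackers:
    # Final post-training images land under "test", intermediate ones under
    # "validation", so the two runs stay separate in the dashboard.
    if tracker.name == "tensorboard":
        tracker.writer.add_images(tracker_key, np_images, global_step, dataformats="NHWC")
```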

@sayakpaul (Member) left a comment

Great start! Thank you.

Left some initial comments related to the implementation. I'd prefer having the ability to train two text encoders, though.

@jiqing-feng (Contributor Author) commented Jan 8, 2024

Hi @sayakpaul. I think I have addressed all your comments except training two text encoders.

This is the result from training only one text encoder for 500 steps on an A100, and it looks great.
(image)

Unfortunately, I didn't get an acceptable result when training both text encoders. Could we merge this example first? I will investigate what's wrong with two-text-encoder training. WDYT @patrickvonplaten

Thanks!

BTW, I ran the following commands and they didn't report any code style issues in my script.

  ruff check examples tests src utils scripts
  ruff format examples tests src utils scripts --check

@sayakpaul (Member)

You need to run make style && make quality to get the code styling issues fixed.

f"Running validation... \n Generating {args.num_validation_images} images with prompt:"
f" {args.validation_prompt}."
)
# create pipeline (note: unet and vae are loaded again in float32)
Member:

Why is this the case?

Comment on lines 748 to 756
if args.validation_epochs is not None:
    warnings.warn(
        f"FutureWarning: You are doing logging with validation_epochs={args.validation_epochs}."
        " Deprecated validation_epochs in favor of `validation_steps`"
        f"Setting `args.validation_steps` to {args.validation_epochs * len(train_dataset)}",
        FutureWarning,
        stacklevel=2,
    )
    args.validation_steps = args.validation_epochs * len(train_dataset)
Member:

Let's simplify this. Let's expose only one argument from the command-line to control the interval of running validation.
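That is, something like this sketch (the flag name and default value are assumptions):

```python
import argparse

parser = argparse.ArgumentParser()
# A single knob controls validation frequency, counted in optimization steps,
# instead of exposing both validation_epochs and validation_steps.
parser.add_argument(
    "--validation_steps",
    type=int,
    default=100,
    help="Run validation every X optimization steps.",
)
args = parser.parse_args()
```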

Comment on lines 694 to 696
text_encoder_2.text_model.encoder.requires_grad_(False)
text_encoder_2.text_model.final_layer_norm.requires_grad_(False)
text_encoder_2.text_model.embeddings.position_embedding.requires_grad_(False)
Member:

We can just call text_encoder_2.requires_grad_(False) here, no? Since we're not training it.

# Move vae and unet and text_encoder_2 to device and cast to weight_dtype
unet.to(accelerator.device, dtype=weight_dtype)
vae.to(accelerator.device, dtype=weight_dtype)
text_encoder_2 = text_encoder_2.to(accelerator.device, dtype=weight_dtype)
Member:

nit: no need to assign back to the text_encoder_2 variable after device placement; nn.Module.to() already modifies the module in place.

@sayakpaul (Member) left a comment

Thanks for the changes; however, the open comments still aren't resolved:

Apart from these, I think we'd need to add a separate README_sdxl.md for this example, like the other SDXL scripts, so that users know what training commands to use and that training text encoder 2 isn't supported for specific reasons.

@jiqing-feng (Contributor Author)

Hi @sayakpaul
Sorry for the mix-up between the training script and the test script. I have fixed it; would you please review it again? Thx!

@jiqing-feng (Contributor Author)

> make style && make quality

Unfortunately, it doesn't work.

@@ -0,0 +1,27 @@
## Textual Inversion fine-tuning example for SDXL

The `textual_inversion.py` do not support training stable-diffusion-XL as it has two text encoders, you can training SDXL by the following command:
Member:

Suggested change:
- The `textual_inversion.py` do not support training stable-diffusion-XL as it has two text encoders, you can training SDXL by the following command:

I don't think we need to mention this. We can just add a note about the SDXL variant in the README.md file.

--output_dir="./textual_inversion_cat_sdxl"
```

We only enabled training the first text encoder because of the precision issue, we will enable training the second text encoder once we fixed the problem.
Member:

Suggested change:
- We only enabled training the first text encoder because of the precision issue, we will enable training the second text encoder once we fixed the problem.
+ For now, only training of the first text encoder is supported.

Comment on lines 710 to 716
optimizer_1 = torch.optim.AdamW(
    text_encoder_1.get_input_embeddings().parameters(),  # only optimize the embeddings
    lr=args.learning_rate,
    betas=(args.adam_beta1, args.adam_beta2),
    weight_decay=args.adam_weight_decay,
    eps=args.adam_epsilon,
)
Member:

We usually also support 8-bit Adam too:

Since SDXL is quite a bit heavier than SD, could we add it here as well?
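A minimal sketch of that pattern as it appears across diffusers examples (the --use_8bit_adam flag name is an assumption):

```python
import torch

if args.use_8bit_adam:
    try:
        import bitsandbytes as bnb
    except ImportError:
        raise ImportError("To use 8-bit Adam, please install bitsandbytes: `pip install bitsandbytes`.")
    optimizer_class = bnb.optim.AdamW8bit
else:
    optimizer_class = torch.optim.AdamW

optimizer = optimizer_class(
    text_encoder_1.get_input_embeddings().parameters(),  # only optimize the embeddings
    lr=args.learning_rate,
    betas=(args.adam_beta1, args.adam_beta2),
    weight_decay=args.adam_weight_decay,
    eps=args.adam_epsilon,
)
```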

Member:

Also, let's keep the name optimizer for now. No need to use optimizer_1.

    tokenizer_1=tokenizer_1,
    tokenizer_2=tokenizer_2,
    size=args.resolution,
    placeholder_token=(" ".join(tokenizer_1.convert_ids_to_tokens(placeholder_token_ids))),
Member:

Let's assign " ".join(tokenizer_1.convert_ids_to_tokens(placeholder_token_ids)) to a separate variable.
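For example (the dataset class name below is a placeholder, not necessarily the script's):

```python
# Join the placeholder sub-tokens back into a single prompt string once.
placeholder_token = " ".join(tokenizer_1.convert_ids_to_tokens(placeholder_token_ids))

train_dataset = TextualInversionDataset(
    tokenizer_1=tokenizer_1,
    tokenizer_2=tokenizer_2,
    size=args.resolution,
    placeholder_token=placeholder_token,
)
```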

@sayakpaul (Member) left a comment

Thanks a lot for the changes. Just some final set of comments.

@jiqing-feng (Contributor Author)

Hi @sayakpaul. Thanks for your review. I have addressed all the comments; would you please take a look? Thx!


**___Note: Change the `resolution` to 768 if you are using the [stable-diffusion-2](https://huggingface.co/stabilityai/stable-diffusion-2) 768x768 model.___**

**___Note: Please follow the README_sdxl.md if you are using the [stable-diffusion-xl](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0).___**
Member:

Suggested change:
- **___Note: Please follow the README_sdxl.md if you are using the [stable-diffusion-xl](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0).___**
+ **___Note: Please follow the [README_sdxl.md](./README_sdxl.md) if you are using the [stable-diffusion-xl](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0).___**

@sayakpaul (Member) left a comment

I think there's this open comment still: https://github.com/huggingface/diffusers/pull/6421/files#r1442509162.

After everything is resolved, I will fix the quality issues.

@jiqing-feng (Contributor Author)

> I think there's this open comment still: https://github.com/huggingface/diffusers/pull/6421/files#r1442509162.
>
> After everything is resolved, I will fix the quality issues.

Hi @sayakpaul. Thanks for your clarification; I have fixed it now, please take a look. Thx!

@sayakpaul (Member) left a comment

Thank you so much for bearing with my requests!

@sayakpaul (Member)

Will merge after the CI is green :)

@jiqing-feng (Contributor Author)

> Thank you so much for bearing with my requests!

Also thanks for your patience :)

@sayakpaul merged commit aa1797e into huggingface:main Jan 9, 2024
@sayakpaul (Member)

Thanks for your great contribution!

AmericanPresidentJimmyCarter pushed a commit to AmericanPresidentJimmyCarter/diffusers that referenced this pull request Apr 26, 2024
* enable stable-xl textual inversion

* check if optimizer_2 exists

* check text_encoder_2 before using

* add textual inversion for sdxl in a single file

* fix style

* fix example style

* reset for error changes

* add readme for sdxl

* fix style

* disable autocast as it will cause cast error when weight_dtype=bf16

* fix spelling error

* fix style and readme and 8bit optimizer

* add README_sdxl.md link

* add tracker key on log_validation

* run style

* rm the second center crop

---------

Co-authored-by: Sayak Paul <[email protected]>