
Commit 5911a3a

williamberman authored, with patrickvonplaten and sayakpaul as co-authors

dreambooth if docs - stage II, more info (#3628)

* dreambooth if docs - stage II, more info
* Update docs/source/en/training/dreambooth.mdx (Co-authored-by: Patrick von Platen <[email protected]>)
* Update docs/source/en/training/dreambooth.mdx (Co-authored-by: Patrick von Platen <[email protected]>)
* Update docs/source/en/training/dreambooth.mdx (Co-authored-by: Sayak Paul <[email protected]>)
* download instructions for downsized images
* update source README to match docs

Co-authored-by: Patrick von Platen <[email protected]>
Co-authored-by: Sayak Paul <[email protected]>

1 parent b7af946 commit 5911a3a

File tree

2 files changed (+295, -31 lines)


docs/source/en/training/dreambooth.mdx

Lines changed: 148 additions & 16 deletions
````diff
@@ -502,9 +502,65 @@ You may also run inference from any of the [saved training checkpoints](#inferen
 
 ## IF
 
-You can use the lora and full dreambooth scripts to also train the text to image [IF model](https://huggingface.co/DeepFloyd/IF-I-XL-v1.0). A few alternative cli flags are needed due to the model size, the expected input resolution, and the text encoder conventions.
+You can use the LoRA and full dreambooth scripts to train the text-to-image [IF model](https://huggingface.co/DeepFloyd/IF-I-XL-v1.0) and the stage II upscaler
+[IF model](https://huggingface.co/DeepFloyd/IF-II-L-v1.0).
 
-### LoRA Dreambooth
+Note that IF has a predicted variance, and our finetuning scripts only train the model's predicted error, so for finetuned IF models we switch to a fixed
+variance schedule. The full finetuning scripts will update the scheduler config for the full saved model. However, when loading saved LoRA weights, you
+must also update the pipeline's scheduler config.
+
+```py
+from diffusers import DiffusionPipeline
+
+pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0")
+
+pipe.load_lora_weights("<lora weights path>")
+
+# Update scheduler config to fixed variance schedule
+pipe.scheduler = pipe.scheduler.__class__.from_config(pipe.scheduler.config, variance_type="fixed_small")
+```
````
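The same scheduler override applies when loading stage II LoRA weights. A minimal sketch, assuming LoRA weights saved to the `dreambooth_dog_upscale` output directory used by the stage II commands below:

```py
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-II-L-v1.0")
pipe.load_lora_weights("dreambooth_dog_upscale")  # assumed stage II LoRA output dir

# Stage II also needs the fixed variance schedule after loading LoRA weights
pipe.scheduler = pipe.scheduler.__class__.from_config(pipe.scheduler.config, variance_type="fixed_small")
```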
```diff
+
+Additionally, a few alternative CLI flags are needed for IF.
+
+`--resolution=64`: IF is a pixel-space diffusion model. In order to operate on uncompressed pixels, the input images are of a much smaller resolution.
+
+`--pre_compute_text_embeddings`: IF uses [T5](https://huggingface.co/docs/transformers/model_doc/t5) for its text encoder. To save GPU memory, we pre-compute all text embeddings and then deallocate
+T5.
+
+`--tokenizer_max_length=77`: T5 has a longer default text length, but the default IF encoding procedure uses a smaller number.
+
+`--text_encoder_use_attention_mask`: Pass the tokenizer's attention mask to the text encoder, as T5 expects.
+
```
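For intuition, here is a rough sketch of what these three flags amount to at the pipeline level. This is an illustration of the idea, not the training script's actual code, and the prompt is only an example:

```py
import gc

import torch
from diffusers import DiffusionPipeline

# Load only what is needed to run T5; skip loading the unet.
pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", unet=None)

# The IF pipelines tokenize to 77 tokens and pass the attention mask to T5,
# mirroring --tokenizer_max_length=77 and --text_encoder_use_attention_mask.
prompt_embeds, negative_embeds = pipe.encode_prompt("a sks dog")

# Deallocate T5, mirroring --pre_compute_text_embeddings.
del pipe
gc.collect()
torch.cuda.empty_cache()
```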
```diff
+### Tips and Tricks
+We find LoRA to be sufficient for finetuning the stage I model, as its low resolution makes representing fine-grained detail hard regardless.
+
+For common and/or not visually complex object concepts, you can get away with not finetuning the upscaler. Just be sure to adjust the prompt passed to the
+upscaler to remove the new token from the instance prompt, i.e. if your stage I prompt is "a sks dog", use "a dog" for your stage II prompt (see the sketch below).
+
+For fine-grained detail like faces that aren't present in the original training set, we find that full finetuning of the stage II upscaler is better than
+LoRA finetuning of stage II.
+
+For fine-grained detail like faces, we find that lower learning rates work best.
+
+For stage II, we find that lower learning rates are also needed.
```
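To make the stage I/stage II prompt adjustment concrete, here is a hedged inference sketch. It assumes a stage I LoRA saved to a hypothetical `dreambooth_dog_lora` directory and uses the stock stage II upscaler:

```py
import torch
from diffusers import DiffusionPipeline

stage_1 = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", torch_dtype=torch.float16)
stage_1.load_lora_weights("dreambooth_dog_lora")  # hypothetical stage I LoRA output dir
stage_1.scheduler = stage_1.scheduler.__class__.from_config(stage_1.scheduler.config, variance_type="fixed_small")
stage_1.enable_model_cpu_offload()

# The upscaler is not finetuned, so reuse stage I's T5 instead of loading it twice.
stage_2 = DiffusionPipeline.from_pretrained("DeepFloyd/IF-II-L-v1.0", text_encoder=None, torch_dtype=torch.float16)
stage_2.enable_model_cpu_offload()

# Stage I sees the new token...
prompt_embeds, negative_embeds = stage_1.encode_prompt("a sks dog")
image = stage_1(prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, output_type="pt").images

# ...stage II does not, since the upscaler was not finetuned on it.
prompt_embeds, negative_embeds = stage_1.encode_prompt("a dog")
upscaled = stage_2(image=image, prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds).images[0]
upscaled.save("sks_dog_upscaled.png")
```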
````diff
+
+### Stage II additional validation images
+
+The stage II validation requires images to upscale; we can download a downsized version of the training set:
+
+```py
+from huggingface_hub import snapshot_download
+
+local_dir = "./dog_downsized"
+snapshot_download(
+    "diffusers/dog-example-downsized",
+    local_dir=local_dir,
+    repo_type="dataset",
+    ignore_patterns=".gitattributes",
+)
+```
````
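A small assumed follow-up for assembling the downloaded files into the `--validation_images` argument used below (the file names follow the `VALIDATION_IMAGES` export in the later sections):

```py
from pathlib import Path

# Collect the downsized images, e.g. image_1.png ... image_4.png.
validation_images = sorted(str(p) for p in Path("./dog_downsized").glob("*.png"))
print(" ".join(validation_images))
```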
````diff
+
+### IF stage I LoRA Dreambooth
 This training configuration requires ~28 GB VRAM.
 
 ```sh
@@ -518,7 +574,7 @@ accelerate launch train_dreambooth_lora.py \
   --instance_data_dir=$INSTANCE_DIR \
   --output_dir=$OUTPUT_DIR \
   --instance_prompt="a sks dog" \
-  --resolution=64 \ # The input resolution of the IF unet is 64x64
+  --resolution=64 \
   --train_batch_size=4 \
   --gradient_accumulation_steps=1 \
   --learning_rate=5e-6 \
````
````diff
@@ -527,16 +583,57 @@ accelerate launch train_dreambooth_lora.py \
   --validation_prompt="a sks dog" \
   --validation_epochs=25 \
   --checkpointing_steps=100 \
-  --pre_compute_text_embeddings \ # Pre compute text embeddings to that T5 doesn't have to be kept in memory
-  --tokenizer_max_length=77 \ # IF expects an override of the max token length
-  --text_encoder_use_attention_mask # IF expects attention mask for text embeddings
+  --pre_compute_text_embeddings \
+  --tokenizer_max_length=77 \
+  --text_encoder_use_attention_mask
 ```
 
-### Full Dreambooth
-Due to the size of the optimizer states, we recommend training the full XL IF model with 8bit adam.
-Using 8bit adam and the rest of the following config, the model can be trained in ~48 GB VRAM.
+### IF stage II LoRA Dreambooth
 
-For full dreambooth, IF requires very low learning rates. With higher learning rates model quality will degrade.
+`--validation_images`: These images are upscaled during validation steps.
+
+`--class_labels_conditioning=timesteps`: Pass additional conditioning to the UNet needed for stage II (see the note after this block).
+
+`--learning_rate=1e-6`: Lower learning rate than stage I.
+
+`--resolution=256`: The upscaler expects higher-resolution inputs.
+
+```sh
+export MODEL_NAME="DeepFloyd/IF-II-L-v1.0"
+export INSTANCE_DIR="dog"
+export OUTPUT_DIR="dreambooth_dog_upscale"
+export VALIDATION_IMAGES="dog_downsized/image_1.png dog_downsized/image_2.png dog_downsized/image_3.png dog_downsized/image_4.png"
+
+python train_dreambooth_lora.py \
+  --report_to wandb \
+  --pretrained_model_name_or_path=$MODEL_NAME \
+  --instance_data_dir=$INSTANCE_DIR \
+  --output_dir=$OUTPUT_DIR \
+  --instance_prompt="a sks dog" \
+  --resolution=256 \
+  --train_batch_size=4 \
+  --gradient_accumulation_steps=1 \
+  --learning_rate=1e-6 \
+  --max_train_steps=2000 \
+  --validation_prompt="a sks dog" \
+  --validation_epochs=100 \
+  --checkpointing_steps=500 \
+  --pre_compute_text_embeddings \
+  --tokenizer_max_length=77 \
+  --text_encoder_use_attention_mask \
+  --validation_images $VALIDATION_IMAGES \
+  --class_labels_conditioning=timesteps
+```
````
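For context on `--class_labels_conditioning=timesteps`: inside the training loop, the stage II UNet receives the timestep a second time as `class_labels`. A minimal, assumed illustration with random stand-in tensors (the config attribute names are from `diffusers`' `UNet2DConditionModel`):

```py
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained("DeepFloyd/IF-II-L-v1.0", subfolder="unet")

# Random stand-ins for a real batch of noisy pixel inputs and T5 embeddings.
sample = torch.randn(1, unet.config.in_channels, 256, 256)
timesteps = torch.randint(0, 1000, (1,))
encoder_hidden_states = torch.randn(1, 77, unet.config.encoder_hid_dim)

# The upscaler's UNet is configured with a timestep class embedding, so the
# timestep must also be passed as class_labels; this is what the flag enables.
pred = unet(sample, timesteps, encoder_hidden_states=encoder_hidden_states, class_labels=timesteps).sample
```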
```diff
+
+### IF Stage I Full Dreambooth
+`--skip_save_text_encoder`: When training the full model, this will skip saving the entire T5 with the finetuned model. You can still load the pipeline
+with a T5 loaded from the original model (see the sketch after this list).
+
+`--use_8bit_adam`: Due to the size of the optimizer states, we recommend training the full XL IF model with 8bit adam.
+
+`--learning_rate=1e-7`: For full dreambooth, IF requires very low learning rates. With higher learning rates, model quality will degrade.
+
+Using 8bit adam and a batch size of 4, the model can be trained in ~48 GB VRAM.
```
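To illustrate loading a model saved with `--skip_save_text_encoder`, a hedged sketch; `<saved model dir>` is a placeholder for the training output directory:

```py
from diffusers import DiffusionPipeline
from transformers import T5EncoderModel

# T5 was not saved with the finetuned model, so load it from the original checkpoint.
text_encoder = T5EncoderModel.from_pretrained("DeepFloyd/IF-I-XL-v1.0", subfolder="text_encoder")

pipe = DiffusionPipeline.from_pretrained("<saved model dir>", text_encoder=text_encoder)
```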
````diff
 
 ```sh
 export MODEL_NAME="DeepFloyd/IF-I-XL-v1.0"
@@ -549,17 +646,52 @@ accelerate launch train_dreambooth.py \
   --instance_data_dir=$INSTANCE_DIR \
   --output_dir=$OUTPUT_DIR \
   --instance_prompt="a photo of sks dog" \
-  --resolution=64 \ # The input resolution of the IF unet is 64x64
+  --resolution=64 \
   --train_batch_size=4 \
   --gradient_accumulation_steps=1 \
   --learning_rate=1e-7 \
   --max_train_steps=150 \
   --validation_prompt "a photo of sks dog" \
   --validation_steps 25 \
-  --text_encoder_use_attention_mask \ # IF expects attention mask for text embeddings
-  --tokenizer_max_length 77 \ # IF expects an override of the max token length
-  --pre_compute_text_embeddings \ # Pre compute text embeddings to that T5 doesn't have to be kept in memory
+  --text_encoder_use_attention_mask \
+  --tokenizer_max_length 77 \
+  --pre_compute_text_embeddings \
   --use_8bit_adam \
   --set_grads_to_none \
-  --skip_save_text_encoder # do not save the full T5 text encoder with the model
-```
+  --skip_save_text_encoder \
+  --push_to_hub
+```
+
+### IF Stage II Full Dreambooth
+
+`--learning_rate=1e-8`: An even lower learning rate than stage I.
+
+`--resolution=256`: The upscaler expects higher-resolution inputs.
+
+```sh
+export MODEL_NAME="DeepFloyd/IF-II-L-v1.0"
+export INSTANCE_DIR="dog"
+export OUTPUT_DIR="dreambooth_dog_upscale"
+export VALIDATION_IMAGES="dog_downsized/image_1.png dog_downsized/image_2.png dog_downsized/image_3.png dog_downsized/image_4.png"
+
+accelerate launch train_dreambooth.py \
+  --report_to wandb \
+  --pretrained_model_name_or_path=$MODEL_NAME \
+  --instance_data_dir=$INSTANCE_DIR \
+  --output_dir=$OUTPUT_DIR \
+  --instance_prompt="a sks dog" \
+  --resolution=256 \
+  --train_batch_size=2 \
+  --gradient_accumulation_steps=2 \
+  --learning_rate=1e-8 \
+  --max_train_steps=2000 \
+  --validation_prompt="a sks dog" \
+  --validation_steps=150 \
+  --checkpointing_steps=500 \
+  --pre_compute_text_embeddings \
+  --tokenizer_max_length=77 \
+  --text_encoder_use_attention_mask \
+  --validation_images $VALIDATION_IMAGES \
+  --class_labels_conditioning timesteps \
+  --push_to_hub
+```
````
