* dreambooth if docs - stage II, more info
* Update docs/source/en/training/dreambooth.mdx
Co-authored-by: Patrick von Platen <[email protected]>
* Update docs/source/en/training/dreambooth.mdx
Co-authored-by: Patrick von Platen <[email protected]>
* Update docs/source/en/training/dreambooth.mdx
Co-authored-by: Sayak Paul <[email protected]>
* download instructions for downsized images
* update source README to match docs
---------
Co-authored-by: Patrick von Platen <[email protected]>
Co-authored-by: Sayak Paul <[email protected]>
@@ -502,9 +502,65 @@ You may also run inference from any of the [saved training checkpoints](#inferen
## IF
You can use the LoRA and full DreamBooth scripts to train the text-to-image [IF model](https://huggingface.co/DeepFloyd/IF-I-XL-v1.0) and the stage II upscaler.

Note that IF has a predicted variance, and our finetuning scripts only train the model's predicted error, so for finetuned IF models we switch to a fixed variance schedule. The full finetuning scripts will update the scheduler config for the full saved model. However, when loading saved LoRA weights, you must also update the pipeline's scheduler config yourself.
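As a minimal sketch, the scheduler update when loading LoRA weights looks like the following (the weights path is a placeholder):

```py
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0")
pipe.load_lora_weights("<lora weights path>")

# Recreate the scheduler from its config with a fixed variance schedule, since
# the LoRA weights alone do not carry the updated scheduler config.
pipe.scheduler = pipe.scheduler.__class__.from_config(
    pipe.scheduler.config, variance_type="fixed_small"
)
```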
Additionally, a few alternative CLI flags are needed for IF, due to the model size, the expected input resolution, and the text encoder conventions; a code sketch of what the text encoder flags correspond to follows the list.
`--resolution=64`: IF is a pixel-space diffusion model. In order to operate on uncompressed pixels, the input images must be of a much smaller resolution.

`--pre_compute_text_embeddings`: IF uses [T5](https://huggingface.co/docs/transformers/model_doc/t5) for its text encoder. To save GPU memory, we precompute all text embeddings and then deallocate T5.

`--tokenizer_max_length=77`: T5 has a longer default sequence length, but the default IF encoding procedure uses a smaller one.

`--text_encoder_use_attention_mask`: T5 passes the attention mask to the text encoder.
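The sketch below is illustrative of what these text encoder flags correspond to; it is not code from the training script:

```py
import torch
from transformers import T5EncoderModel, T5Tokenizer

# The IF pipeline repo stores the T5 tokenizer and encoder in subfolders.
tokenizer = T5Tokenizer.from_pretrained("DeepFloyd/IF-I-XL-v1.0", subfolder="tokenizer")
text_encoder = T5EncoderModel.from_pretrained("DeepFloyd/IF-I-XL-v1.0", subfolder="text_encoder")

# --tokenizer_max_length=77: pad/truncate to IF's expected length, not T5's default.
inputs = tokenizer(
    "a sks dog", padding="max_length", max_length=77, truncation=True, return_tensors="pt"
)

# --text_encoder_use_attention_mask: pass the attention mask to the encoder.
with torch.no_grad():
    embeddings = text_encoder(inputs.input_ids, attention_mask=inputs.attention_mask)[0]

# --pre_compute_text_embeddings: once the embeddings are cached, T5 can be freed.
del text_encoder
```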
### Tips and Tricks
We find LoRA to be sufficient for finetuning the stage I model, as the low resolution of the model makes representing finegrained detail hard regardless.

For common and/or visually simple object concepts, you can get away with not finetuning the upscaler. Just be sure to adjust the prompt passed to the upscaler to remove the new token from the instance prompt, i.e. if your stage I prompt is "a sks dog", use "a dog" for your stage II prompt.

For finegrained detail like faces that aren't present in the original training set, we find that full finetuning of the stage II upscaler is better than LoRA finetuning of stage II.

For finegrained detail like faces, we find that lower learning rates work best.

For stage II, we find that lower learning rates are also needed.
### Stage II additional validation images
The stage II validation requires images to upscale; we can download a downsized version of the training set:
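As a sketch, this can be done with `huggingface_hub` (the `diffusers/dog-example-downsized` dataset id and the local directory are assumptions for illustration):

```py
from huggingface_hub import snapshot_download

# Download a downsized copy of the example training images to use as
# stage II validation inputs.
local_dir = "./dog_downsized"
snapshot_download(
    "diffusers/dog-example-downsized",
    local_dir=local_dir,
    repo_type="dataset",
    ignore_patterns=".gitattributes",
)
```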
For full DreamBooth finetuning of the stage I model (as opposed to LoRA), a few further flags apply.

`--skip_save_text_encoder`: When training the full model, this will skip saving the entire T5 with the finetuned model. You can still load the pipeline with a T5 loaded from the original model.
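A minimal sketch of loading such a checkpoint (the finetuned model path is a placeholder):

```py
from diffusers import DiffusionPipeline
from transformers import T5EncoderModel

# --skip_save_text_encoder omitted T5 from the checkpoint, so load it from the
# original IF repo instead.
text_encoder = T5EncoderModel.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", subfolder="text_encoder"
)

# "path/to/finetuned/model" is a placeholder for your saved training output.
pipe = DiffusionPipeline.from_pretrained(
    "path/to/finetuned/model", text_encoder=text_encoder
)
```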
`--use_8bit_adam`: Due to the size of the optimizer states, we recommend training the full XL IF model with 8-bit Adam.
633
+
634
+
`--learning_rate=1e-7`: For full dreambooth, IF requires very low learning rates. With higher learning rates model quality will degrade.
Using 8-bit Adam and a batch size of 4, the model can be trained in ~48 GB of VRAM.