
Commit 9dc8444

[Examples] InstructPix2Pix instruct training script (#2478)
* add: initial implementation of the pix2pix instruct training script.
* shorten cli arg.
* fix: main process check.
* fix: dataset column names.
* simplify tokenization.
* proper placement of null conditions.
* apply styling.
* remove debugging message for conditioning do.
* complete license.
* add: requirements.tzt
* wandb column name order.
* fix: augmentation.
* change: dataset_id.
* fix: convert_to_np() call.
* fix: reshaping.
* fix: final ema copy.
* Apply suggestions from code review (Co-authored-by: Patrick von Platen <[email protected]>)
* address PR comments.
* add: readme details.
* config fix.
* downgrade version.
* reduce image width in the readme.
* note on hyperparameters during generation.
* add: output images.
* update readme.
* minor edits to readme.
* debugging statement.
* explicitly placement of the pipeline.
* bump minimum diffusers version.
* fix: device attribute error.
* weight dtype.
* debugging.
* add dtype inform.
* add seoarate te and vae.
* add: explicit casting/
* remove casting.
* up.
* up 2.
* up 3.
* autocast.
* disable mixed-precision in the final inference.
* debugging information.
* autocasting.
* add: instructpix2pix training section to the docs.
* Empty-Commit

---------

Co-authored-by: Patrick von Platen <[email protected]>
1 parent c681ad1 commit 9dc8444

File tree

5 files changed: +1371 −0 lines changed


docs/source/en/_toctree.yml

Lines changed: 16 additions & 0 deletions
    @@ -98,6 +98,22 @@
       - local: optimization/habana
         title: Habana Gaudi
       title: Optimization/Special Hardware
    +- sections:
    +  - local: training/overview
    +    title: Overview
    +  - local: training/unconditional_training
    +    title: Unconditional image generation
    +  - local: training/text_inversion
    +    title: Textual Inversion
    +  - local: training/dreambooth
    +    title: DreamBooth
    +  - local: training/text2image
    +    title: Text-to-image
    +  - local: training/lora
    +    title: Low-Rank Adaptation of Large Language Models (LoRA)
    +  - local: training/instructpix2pix
    +    title: InstructPix2Pix Training
    +  title: Training
     - sections:
       - local: conceptual/philosophy
         title: Philosophy
docs/source/en/training/instructpix2pix.mdx

Lines changed: 181 additions & 0 deletions
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# InstructPix2Pix

[InstructPix2Pix](https://arxiv.org/abs/2211.09800) is a method to fine-tune text-conditioned diffusion models such that they can follow an edit instruction for an input image. Models fine-tuned using this method take the following as inputs:

<p align="center">
    <img src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/evaluation_diffusion_models/edit-instruction.png" alt="instructpix2pix-inputs" width=600/>
</p>

The output is an "edited" image that reflects the edit instruction applied to the input image:

<p align="center">
    <img src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/output-gs%407-igs%401-steps%4050.png" alt="instructpix2pix-output" width=600/>
</p>

The `train_instruct_pix2pix.py` script shows how to implement the training procedure and adapt it for Stable Diffusion.

***Disclaimer: Even though `train_instruct_pix2pix.py` implements the InstructPix2Pix
training procedure while being faithful to the [original implementation](https://github.com/timothybrooks/instruct-pix2pix), we have only tested it on a [small-scale dataset](https://huggingface.co/datasets/fusing/instructpix2pix-1000-samples). This can impact the end results. For better results, we recommend longer training runs with a larger dataset. [Here](https://huggingface.co/datasets/timbrooks/instructpix2pix-clip-filtered) you can find a large dataset for InstructPix2Pix training.***

## Running locally with PyTorch

### Installing the dependencies

Before running the scripts, make sure to install the library's training dependencies.

**Important**

To make sure you can successfully run the latest versions of the example scripts, we highly recommend **installing from source** and keeping the install up to date, as we update the example scripts frequently and install some example-specific requirements. To do this, execute the following steps in a new virtual environment:

```bash
git clone https://github.com/huggingface/diffusers
cd diffusers
pip install -e .
```
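
If you don't already have a virtual environment, you can create and activate one first. A minimal sketch, assuming Python 3 and a bash-like shell (the environment name `.env` is arbitrary):

```bash
# Create and activate a fresh virtual environment for the example scripts.
python3 -m venv .env
source .env/bin/activate
```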

Then cd into the example folder and run:

```bash
pip install -r requirements.txt
```

And initialize an [🤗Accelerate](https://github.com/huggingface/accelerate/) environment with:

```bash
accelerate config
```

Or, for a default accelerate configuration without answering questions about your environment:

```bash
accelerate config default
```

Or, if your environment doesn't support an interactive shell (e.g., a notebook):

```python
from accelerate.utils import write_basic_config

write_basic_config()
```

### Toy example

As mentioned before, we'll use a [small toy dataset](https://huggingface.co/datasets/fusing/instructpix2pix-1000-samples) for training. The dataset
is a smaller version of the [original dataset](https://huggingface.co/datasets/timbrooks/instructpix2pix-clip-filtered) used in the InstructPix2Pix paper.

Configure environment variables such as the dataset identifier and the Stable Diffusion
checkpoint:

```bash
export MODEL_NAME="runwayml/stable-diffusion-v1-5"
export DATASET_ID="fusing/instructpix2pix-1000-samples"
```

Now, we can launch training:

```bash
accelerate launch --mixed_precision="fp16" train_instruct_pix2pix.py \
    --pretrained_model_name_or_path=$MODEL_NAME \
    --dataset_name=$DATASET_ID \
    --enable_xformers_memory_efficient_attention \
    --resolution=256 --random_flip \
    --train_batch_size=4 --gradient_accumulation_steps=4 --gradient_checkpointing \
    --max_train_steps=15000 \
    --checkpointing_steps=5000 --checkpoints_total_limit=1 \
    --learning_rate=5e-05 --max_grad_norm=1 --lr_warmup_steps=0 \
    --conditioning_dropout_prob=0.05 \
    --mixed_precision=fp16 \
    --seed=42
```
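
The `--conditioning_dropout_prob` flag corresponds to the conditioning dropout proposed in the InstructPix2Pix paper: during training, the text conditioning and/or the original-image conditioning are randomly dropped so the model also learns the partially-unconditional predictions needed for the two guidance scales at inference time. Below is a rough sketch of the idea only, not the script's exact code; the tensor names and the simple independent-dropout scheme are illustrative:

```python
import torch


def conditioning_dropout(text_embeds, image_latents, null_text_embeds, p=0.05):
    """Illustrative only: independently drop each conditioning signal with probability p."""
    bsz = text_embeds.shape[0]

    # Replace the prompt embedding with the "null" (empty-prompt) embedding for dropped samples.
    drop_text = torch.rand(bsz, device=text_embeds.device) < p
    text_embeds = torch.where(drop_text[:, None, None], null_text_embeds, text_embeds)

    # Zero out the encoded original image for dropped samples.
    drop_image = torch.rand(bsz, device=image_latents.device) < p
    image_latents = image_latents * (~drop_image[:, None, None, None]).to(image_latents.dtype)

    return text_embeds, image_latents
```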

Additionally, we support performing validation inference to monitor training progress
with Weights and Biases. You can enable this feature with `report_to="wandb"`:

```bash
accelerate launch --mixed_precision="fp16" train_instruct_pix2pix.py \
    --pretrained_model_name_or_path=$MODEL_NAME \
    --dataset_name=$DATASET_ID \
    --enable_xformers_memory_efficient_attention \
    --resolution=256 --random_flip \
    --train_batch_size=4 --gradient_accumulation_steps=4 --gradient_checkpointing \
    --max_train_steps=15000 \
    --checkpointing_steps=5000 --checkpoints_total_limit=1 \
    --learning_rate=5e-05 --max_grad_norm=1 --lr_warmup_steps=0 \
    --conditioning_dropout_prob=0.05 \
    --mixed_precision=fp16 \
    --val_image_url="https://hf.co/datasets/diffusers/diffusers-images-docs/resolve/main/mountain.png" \
    --validation_prompt="make the mountains snowy" \
    --seed=42 \
    --report_to=wandb
```

We recommend this type of validation as it can be useful for model debugging. Note that you need `wandb` installed to use this. You can install `wandb` by running `pip install wandb`.
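
If you have not used Weights and Biases on the machine before, you will typically also need to authenticate once, for example:

```bash
pip install wandb
wandb login  # paste your API key when prompted
```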

[Here](https://wandb.ai/sayakpaul/instruct-pix2pix/runs/ctr3kovq), you can find an example training run that includes some validation samples and the training hyperparameters.

***Note: In the original paper, the authors observed that even when the model is trained with an image resolution of 256x256, it generalizes well to bigger resolutions such as 512x512. This is likely because of the larger dataset they used during training.***

## Inference

Once training is complete, we can perform inference:

```python
import PIL.Image
import PIL.ImageOps
import requests
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline

model_id = "your_model_id"  # <- replace this
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
generator = torch.Generator("cuda").manual_seed(0)

url = "https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/test_pix2pix_4.png"


def download_image(url):
    # Download the image, fix its orientation from EXIF metadata, and convert it to RGB.
    image = PIL.Image.open(requests.get(url, stream=True).raw)
    image = PIL.ImageOps.exif_transpose(image)
    image = image.convert("RGB")
    return image


image = download_image(url)
prompt = "wipe out the lake"
num_inference_steps = 20
image_guidance_scale = 1.5
guidance_scale = 10

edited_image = pipe(
    prompt,
    image=image,
    num_inference_steps=num_inference_steps,
    image_guidance_scale=image_guidance_scale,
    guidance_scale=guidance_scale,
    generator=generator,
).images[0]
edited_image.save("edited_image.png")
```

An example model repo obtained using this training script can be found
here: [sayakpaul/instruct-pix2pix](https://huggingface.co/sayakpaul/instruct-pix2pix).

We encourage you to play with the following three parameters to control
speed and quality during inference:

* `num_inference_steps`
* `image_guidance_scale`
* `guidance_scale`

Particularly, `image_guidance_scale` and `guidance_scale` can have a profound impact
on the generated ("edited") image (see [here](https://twitter.com/RisingSayak/status/1628392199196151808?s=20) for an example).
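
As a starting point for that exploration, here is a minimal sketch that reuses `pipe`, `image`, `prompt`, and `torch` from the snippet above and sweeps the two guidance scales (the value grids are arbitrary, not recommendations):

```python
# Sweep the two guidance scales to see how strongly the edit is applied.
for image_guidance_scale in (1.0, 1.5, 2.0):
    for guidance_scale in (5.0, 7.5, 10.0):
        # Re-seed so every setting starts from the same noise and stays comparable.
        generator = torch.Generator("cuda").manual_seed(0)
        edited = pipe(
            prompt,
            image=image,
            num_inference_steps=20,
            image_guidance_scale=image_guidance_scale,
            guidance_scale=guidance_scale,
            generator=generator,
        ).images[0]
        edited.save(f"edited_igs{image_guidance_scale}_gs{guidance_scale}.png")
```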

examples/instruct_pix2pix/README.md

Lines changed: 166 additions & 0 deletions
# InstructPix2Pix training example

[InstructPix2Pix](https://arxiv.org/abs/2211.09800) is a method to fine-tune text-conditioned diffusion models such that they can follow an edit instruction for an input image. Models fine-tuned using this method take the following as inputs:

<p align="center">
    <img src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/evaluation_diffusion_models/edit-instruction.png" alt="instructpix2pix-inputs" width=600/>
</p>

The output is an "edited" image that reflects the edit instruction applied to the input image:

<p align="center">
    <img src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/output-gs%407-igs%401-steps%4050.png" alt="instructpix2pix-output" width=600/>
</p>

The `train_instruct_pix2pix.py` script shows how to implement the training procedure and adapt it for Stable Diffusion.

***Disclaimer: Even though `train_instruct_pix2pix.py` implements the InstructPix2Pix
training procedure while being faithful to the [original implementation](https://github.com/timothybrooks/instruct-pix2pix), we have only tested it on a [small-scale dataset](https://huggingface.co/datasets/fusing/instructpix2pix-1000-samples). This can impact the end results. For better results, we recommend longer training runs with a larger dataset. [Here](https://huggingface.co/datasets/timbrooks/instructpix2pix-clip-filtered) you can find a large dataset for InstructPix2Pix training.***

## Running locally with PyTorch

### Installing the dependencies

Before running the scripts, make sure to install the library's training dependencies.

**Important**

To make sure you can successfully run the latest versions of the example scripts, we highly recommend **installing from source** and keeping the install up to date, as we update the example scripts frequently and install some example-specific requirements. To do this, execute the following steps in a new virtual environment:

```bash
git clone https://github.com/huggingface/diffusers
cd diffusers
pip install -e .
```

Then cd into the example folder and run:

```bash
pip install -r requirements.txt
```
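
Note that the training commands below pass `--enable_xformers_memory_efficient_attention`, and `xformers` is not listed in `requirements.txt`. If you want to use that flag, install it separately (this is optional and assumes a compatible CUDA setup):

```bash
pip install xformers
```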
39+
40+
And initialize an [🤗Accelerate](https://github.com/huggingface/accelerate/) environment with:
41+
42+
```bash
43+
accelerate config
44+
```
45+
46+
Or for a default accelerate configuration without answering questions about your environment
47+
48+
```bash
49+
accelerate config default
50+
```
51+
52+
Or if your environment doesn't support an interactive shell e.g. a notebook
53+
54+
```python
55+
from accelerate.utils import write_basic_config
56+
write_basic_config()
57+
```
58+
59+
### Toy example
60+
61+
As mentioned before, we'll use a [small toy dataset](https://huggingface.co/datasets/fusing/instructpix2pix-1000-samples) for training. The dataset
62+
is a smaller version of the [original dataset](https://huggingface.co/datasets/timbrooks/instructpix2pix-clip-filtered) used in the InstructPix2Pix paper.
63+
64+
Configure environment variables such as the dataset identifier and the Stable Diffusion
65+
checkpoint:
66+
67+
```bash
68+
export MODEL_NAME="runwayml/stable-diffusion-v1-5"
69+
export DATASET_ID="fusing/instructpix2pix-1000-samples"
70+
```
71+
72+
Now, we can launch training:
73+
74+
```bash
75+
accelerate launch --mixed_precision="fp16" train_instruct_pix2pix.py \
76+
--pretrained_model_name_or_path=$MODEL_NAME \
77+
--dataset_name=$DATASET_ID \
78+
--enable_xformers_memory_efficient_attention \
79+
--resolution=256 --random_flip \
80+
--train_batch_size=4 --gradient_accumulation_steps=4 --gradient_checkpointing \
81+
--max_train_steps=15000 \
82+
--checkpointing_steps=5000 --checkpoints_total_limit=1 \
83+
--learning_rate=5e-05 --max_grad_norm=1 --lr_warmup_steps=0 \
84+
--conditioning_dropout_prob=0.05 \
85+
--mixed_precision=fp16 \
86+
--seed=42
87+
```
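
If you run into out-of-memory errors with these settings, a common workaround is to trade `--train_batch_size` for more `--gradient_accumulation_steps`, keeping the effective batch size (per-GPU batch size × accumulation steps × number of GPUs) roughly the same. For example (illustrative values, not tuned):

```bash
accelerate launch --mixed_precision="fp16" train_instruct_pix2pix.py \
    --pretrained_model_name_or_path=$MODEL_NAME \
    --dataset_name=$DATASET_ID \
    --resolution=256 --random_flip \
    --train_batch_size=1 --gradient_accumulation_steps=16 --gradient_checkpointing \
    --max_train_steps=15000 \
    --learning_rate=5e-05 --max_grad_norm=1 --lr_warmup_steps=0 \
    --conditioning_dropout_prob=0.05 \
    --mixed_precision=fp16 \
    --seed=42
```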

Additionally, we support performing validation inference to monitor training progress
with Weights and Biases. You can enable this feature with `report_to="wandb"`:

```bash
accelerate launch --mixed_precision="fp16" train_instruct_pix2pix.py \
    --pretrained_model_name_or_path=$MODEL_NAME \
    --dataset_name=$DATASET_ID \
    --enable_xformers_memory_efficient_attention \
    --resolution=256 --random_flip \
    --train_batch_size=4 --gradient_accumulation_steps=4 --gradient_checkpointing \
    --max_train_steps=15000 \
    --checkpointing_steps=5000 --checkpoints_total_limit=1 \
    --learning_rate=5e-05 --max_grad_norm=1 --lr_warmup_steps=0 \
    --conditioning_dropout_prob=0.05 \
    --mixed_precision=fp16 \
    --val_image_url="https://hf.co/datasets/diffusers/diffusers-images-docs/resolve/main/mountain.png" \
    --validation_prompt="make the mountains snowy" \
    --seed=42 \
    --report_to=wandb
```

We recommend this type of validation as it can be useful for model debugging. Note that you need `wandb` installed to use this. You can install `wandb` by running `pip install wandb`.

[Here](https://wandb.ai/sayakpaul/instruct-pix2pix/runs/ctr3kovq), you can find an example training run that includes some validation samples and the training hyperparameters.

***Note: In the original paper, the authors observed that even when the model is trained with an image resolution of 256x256, it generalizes well to bigger resolutions such as 512x512. This is likely because of the larger dataset they used during training.***

## Inference

Once training is complete, we can perform inference:

```python
import PIL.Image
import PIL.ImageOps
import requests
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline

model_id = "your_model_id"  # <- replace this
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
generator = torch.Generator("cuda").manual_seed(0)

url = "https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/test_pix2pix_4.png"


def download_image(url):
    # Download the image, fix its orientation from EXIF metadata, and convert it to RGB.
    image = PIL.Image.open(requests.get(url, stream=True).raw)
    image = PIL.ImageOps.exif_transpose(image)
    image = image.convert("RGB")
    return image


image = download_image(url)
prompt = "wipe out the lake"
num_inference_steps = 20
image_guidance_scale = 1.5
guidance_scale = 10

edited_image = pipe(
    prompt,
    image=image,
    num_inference_steps=num_inference_steps,
    image_guidance_scale=image_guidance_scale,
    guidance_scale=guidance_scale,
    generator=generator,
).images[0]
edited_image.save("edited_image.png")
```

An example model repo obtained using this training script can be found
here: [sayakpaul/instruct-pix2pix](https://huggingface.co/sayakpaul/instruct-pix2pix).

We encourage you to play with the following three parameters to control
speed and quality during inference:

* `num_inference_steps`
* `image_guidance_scale`
* `guidance_scale`

Particularly, `image_guidance_scale` and `guidance_scale` can have a profound impact
on the generated ("edited") image (see [here](https://twitter.com/RisingSayak/status/1628392199196151808?s=20) for an example).
examples/instruct_pix2pix/requirements.txt

Lines changed: 6 additions & 0 deletions
accelerate
torchvision
transformers>=4.25.1
datasets
ftfy
tensorboard
