From befdb3614e3cb18bfe6050af3005ecd31515f4ea Mon Sep 17 00:00:00 2001
From: Sayak Paul
Date: Tue, 9 May 2023 10:46:46 +0530
Subject: [PATCH 01/10] add: benchmarking stats for A100 and V100.

---
 docs/source/en/optimization/torch2.0.mdx | 216 +++++++++--------------
 1 file changed, 88 insertions(+), 128 deletions(-)

diff --git a/docs/source/en/optimization/torch2.0.mdx b/docs/source/en/optimization/torch2.0.mdx
index 206ac4e447cc..0a9c3cb4dc9f 100644
--- a/docs/source/en/optimization/torch2.0.mdx
+++ b/docs/source/en/optimization/torch2.0.mdx
@@ -12,19 +12,19 @@ specific language governing permissions and limitations under the License.
 
 # Accelerated PyTorch 2.0 support in Diffusers
 
-Starting from version `0.13.0`, Diffusers supports the latest optimization from the upcoming [PyTorch 2.0](https://pytorch.org/get-started/pytorch-2.0/) release. These include:
-1. Support for accelerated transformers implementation with memory-efficient attention – no extra dependencies required.
+Starting from version `0.13.0`, Diffusers supports the latest optimization from [PyTorch 2.0](https://pytorch.org/get-started/pytorch-2.0/) release. These include:
+1. Support for accelerated transformers implementation with memory-efficient attention – no extra dependencies (such as `xformers`) required.
 2. [torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) support for extra performance boost when individual models are compiled.
 
 ## Installation
 
-To benefit from the accelerated attention implementation and `torch.compile`, you just need to install the latest versions of PyTorch 2.0 from `pip`, and make sure you are on diffusers 0.13.0 or later. As explained below, `diffusers` automatically uses the attention optimizations (but not `torch.compile`) when available.
+To benefit from the accelerated attention implementation and `torch.compile`, you just need to install the latest versions of PyTorch 2.0 from `pip`, and make sure you are on diffusers 0.13.0 or later. As explained below, `diffusers` automatically uses the optimized attention processor ([`~diffusers.models.attention_processor.AttnProcessor2_0`]) (but not `torch.compile`) when PyTorch 2.0 is available.
 
 ```bash
 pip install --upgrade torch torchvision diffusers
 ```
 
-## Using accelerated transformers and torch.compile.
+## Using accelerated transformers and `torch.compile`.
 
 1. **Accelerated Transformers implementation**
 
@@ -60,6 +60,19 @@ pip install --upgrade torch torchvision diffusers
 
    This should be as fast and memory efficient as `xFormers`. More details [in our benchmark](#benchmark).
 
+   If you want to use the vanilla attention processor (([`~diffusers.models.attention_processor.AttnProcessor`])) as shown below:
+
+   ```Python
+   import torch
+   from diffusers import DiffusionPipeline
+   from diffusers.models.attention_processor import AttnProcessor
+
+   pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
+   pipe.unet.set_attn_processor(AttnProcessor())
+
+   prompt = "a photo of an astronaut riding a horse on mars"
+   image = pipe(prompt).images[0]
+   ```
 
 2. **torch.compile**
 
   To get an additional speedup, we can use the new `torch.compile` feature. To do so, we simply wrap our `unet` with `torch.compile`. For more information and different options, refer to the
   [torch compile docs](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html).
 
   ```python
   import torch
   from diffusers import DiffusionPipeline
 
   pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
   pipe.unet = torch.compile(pipe.unet)
   batch_size = 10
   prompt = "A photo of an astronaut riding a horse on marse."
   images = pipe(prompt, num_inference_steps=steps, num_images_per_prompt=batch_size).images
   ```
 
   Depending on the type of GPU, `compile()` can yield between 2-9% of _additional speed-up_ over the accelerated transformer optimizations. Note, however, that compilation is able to squeeze more performance improvements in more recent GPU architectures such as Ampere (A100, 3090), Ada (4090) and Hopper (H100).
 
-   Compilation takes some time to complete, so it is best suited for situations where you need to prepare your pipeline once and then perform the same type of inference operations multiple times.
+   Compilation takes some time to complete, so it is best suited for situations where you need to prepare your pipeline once and then perform the same type of inference operations multiple times. Calling the compiled pipeline on a different type of input will retrigger compilation which can be expensive.
 
 ## Benchmark
 
-We conducted a simple benchmark on different GPUs to compare vanilla attention, xFormers, `torch.nn.functional.scaled_dot_product_attention` and `torch.compile+torch.nn.functional.scaled_dot_product_attention`.
-For the benchmark we used the [stable-diffusion-v1-4](https://huggingface.co/CompVis/stable-diffusion-v1-4) model with 50 steps. The `xFormers` benchmark is done using the `torch==1.13.1` version, while the accelerated transformers optimizations are tested using nightly versions of PyTorch 2.0. The tables below summarize the results we got.
-
-Please refer to [our featured blog post in the PyTorch site](https://pytorch.org/blog/accelerated-diffusers-pt-20/) for more details.
-
-### FP16 benchmark
-
-The table below shows the benchmark results for inference using `fp16`. As we can see, `torch.nn.functional.scaled_dot_product_attention` is as fast as `xFormers` (sometimes slightly faster/slower) on all the GPUs we tested.
-And using `torch.compile` gives further speed-up of up of 10% over `xFormers`, but it's mostly noticeable on the A100 GPU.
-
-___The time reported is in seconds.___
-
-| GPU | Batch Size | Vanilla Attention | xFormers | PyTorch2.0 SDPA | SDPA + torch.compile | Speed over xformers (%) |
-| --- | --- | --- | --- | --- | --- | --- |
-| A100 | 1 | 2.69 | 2.7 | 1.98 | 2.47 | 8.52 |
-| A100 | 2 | 3.21 | 3.04 | 2.38 | 2.78 | 8.55 |
-| A100 | 4 | 5.27 | 3.91 | 3.89 | 3.53 | 9.72 |
-| A100 | 8 | 9.74 | 7.03 | 7.04 | 6.62 | 5.83 |
-| A100 | 10 | 12.02 | 8.7 | 8.67 | 8.45 | 2.87 |
-| A100 | 16 | 18.95 | 13.57 | 13.55 | 13.20 | 2.73 |
-| A100 | 32 (1) | OOM | 26.56 | 26.68 | 25.85 | 2.67 |
-| A100 | 64 | | 52.51 | 53.03 | 50.93 | 3.01 |
-| | | | | | | |
-| A10 | 4 | 13.94 | 9.81 | 10.01 | 9.35 | 4.69 |
-| A10 | 8 | 27.09 | 19 | 19.53 | 18.33 | 3.53 |
-| A10 | 10 | 33.69 | 23.53 | 24.19 | 22.52 | 4.29 |
-| A10 | 16 | OOM | 37.55 | 38.31 | 36.81 | 1.97 |
-| A10 | 32 (1) | | 77.19 | 78.43 | 76.64 | 0.71 |
-| A10 | 64 (1) | | 173.59 | 158.99 | 155.14 | 10.63 |
-| | | | | | | |
-| T4 | 4 | 38.81 | 30.09 | 29.74 | 27.55 | 8.44 |
-| T4 | 8 | OOM | 55.71 | 55.99 | 53.85 | 3.34 |
-| T4 | 10 | OOM | 68.96 | 69.86 | 65.35 | 5.23 |
-| T4 | 16 | OOM | 111.47 | 113.26 | 106.93 | 4.07 |
-| | | | | | | |
-| V100 | 4 | 9.84 | 8.16 | 8.09 | 7.65 | 6.25 |
-| V100 | 8 | OOM | 15.62 | 15.44 | 14.59 | 6.59 |
-| V100 | 10 | OOM | 19.52 | 19.28 | 18.18 | 6.86 |
-| V100 | 16 | OOM | 30.29 | 29.84 | 28.22 | 6.83 |
-| | | | | | | |
-| 3090 | 1 | 2.94 | 2.5 | 2.42 | 2.33 | 6.80 |
-| 3090 | 4 | 10.04 | 7.82 | 7.72 | 7.38 | 5.63 |
-| 3090 | 8 | 19.27 | 14.97 | 14.88 | 14.15 | 5.48 |
-| 3090 | 10| 24.08 | 18.7 | 18.62 | 18.12 | 3.10 |
-| 3090 | 16 | OOM | 29.06 | 28.88 | 28.2 | 2.96 |
-| 3090 | 32 (1) | | 58.05 | 57.42 | 56.28 | 3.05 |
-| 3090 | 64 (1) | | 126.54 | 114.27 | 112.21 | 11.32 |
-| | | | | | | |
-| 3090 Ti | 1 | 2.7 | 2.26 | 2.19 | 2.12 | 6.19 |
-| 3090 Ti | 4 | 9.07 | 7.14 | 7.00 | 6.71 | 6.02 |
-| 3090 Ti | 8 | 17.51 | 13.65 | 13.53 | 12.94 | 5.20 |
-| 3090 Ti | 10 (2) | 21.79 | 16.85 | 16.77 | 16.44 | 2.43 |
-| 3090 Ti | 16 | OOM | 26.1 | 26.04 | 25.53 | 2.18 |
-| 3090 Ti | 32 (1) | | 51.78 | 51.71 | 50.91 | 1.68 |
-| 3090 Ti | 64 (1) | | 112.02 | 102.78 | 100.89 | 9.94 |
-| | | | | | | |
-| 4090 | 1 | 4.47 | 3.98 | 1.28 | 1.21 | 69.60 |
-| 4090 | 4 | 10.48 | 8.37 | 3.76 | 3.56 | 57.47 |
-| 4090 | 8 | 14.33 | 10.22 | 7.43 | 6.99 | 31.60 |
-| 4090 | 16 | | 17.07 | 14.98 | 14.58 | 14.59 |
-| 4090 | 32 (1) | | 39.03 | 30.18 | 29.49 | 24.44 |
-| 4090 | 64 (1) | | 77.29 | 61.34 | 59.96 | 22.42 |
-
-
-### FP32 benchmark
-
-The table below shows the benchmark results for inference using `fp32`. In this case, `torch.nn.functional.scaled_dot_product_attention` is faster than `xFormers` on all the GPUs we tested.
-
-Using `torch.compile` in addition to the accelerated transformers implementation can yield up to 19% performance improvement over `xFormers` in Ampere and Ada cards, and up to 20% (Ampere) or 28% (Ada) over vanilla attention.
-
-| GPU | Batch Size | Vanilla Attention | xFormers | PyTorch2.0 SDPA | SDPA + torch.compile | Speed over xformers (%) | Speed over vanilla (%) |
-| --- | --- | --- | --- | --- | --- | --- | --- |
-| A100 | 1 | 4.97 | 3.86 | 2.6 | 2.86 | 25.91 | 42.45 |
-| A100 | 2 | 9.03 | 6.76 | 4.41 | 4.21 | 37.72 | 53.38 |
-| A100 | 4 | 16.70 | 12.42 | 7.94 | 7.54 | 39.29 | 54.85 |
-| A100 | 10 | OOM | 29.93 | 18.70 | 18.46 | 38.32 | |
-| A100 | 16 | | 47.08 | 29.41 | 29.04 | 38.32 | |
-| A100 | 32 | | 92.89 | 57.55 | 56.67 | 38.99 | |
-| A100 | 64 | | 185.3 | 114.8 | 112.98 | 39.03 | |
-| | | | | | | |
-| A10 | 1 | 10.59 | 8.81 | 7.51 | 7.35 | 16.57 | 30.59 |
-| A10 | 4 | 34.77 | 27.63 | 22.77 | 22.07 | 20.12 | 36.53 |
-| A10 | 8 | | 56.19 | 43.53 | 43.86 | 21.94 | |
-| A10 | 16 | | 116.49 | 88.56 | 86.64 | 25.62 | |
-| A10 | 32 | | 221.95 | 175.74 | 168.18 | 24.23 | |
-| A10 | 48 | | 333.23 | 264.84 | | 20.52 | |
-| | | | | | | |
-| T4 | 1 | 28.2 | 24.49 | 23.93 | 23.56 | 3.80 | 16.45 |
-| T4 | 2 | 52.77 | 45.7 | 45.88 | 45.06 | 1.40 | 14.61 |
-| T4 | 4 | OOM | 85.72 | 85.78 | 84.48 | 1.45 | |
-| T4 | 8 | | 149.64 | 150.75 | 148.4 | 0.83 | |
-| | | | | | | |
-| V100 | 1 | 7.4 | 6.84 | 6.8 | 6.66 | 2.63 | 10.00 |
-| V100 | 2 | 13.85 | 12.81 | 12.66 | 12.35 | 3.59 | 10.83 |
-| V100 | 4 | OOM | 25.73 | 25.31 | 24.78 | 3.69 | |
-| V100 | 8 | | 43.95 | 43.37 | 42.25 | 3.87 | |
-| V100 | 16 | | 84.99 | 84.73 | 82.55 | 2.87 | |
-| | | | | | | |
-| 3090 | 1 | 7.09 | 6.78 | 5.34 | 5.35 | 21.09 | 24.54 |
-| 3090 | 4 | 22.69 | 21.45 | 18.56 | 18.18 | 15.24 | 19.88 |
-| 3090 | 8 | | 42.59 | 36.68 | 35.61 | 16.39 | |
-| 3090 | 16 | | 85.35 | 72.93 | 70.18 | 17.77 | |
-| 3090 | 32 (1) | | 162.05 | 143.46 | 138.67 | 14.43 | |
-| | | | | | | |
-| 3090 Ti | 1 | 6.45 | 6.19 | 4.99 | 4.89 | 21.00 | 24.19 |
-| 3090 Ti | 4 | 20.32 | 19.31 | 17.02 | 16.48 | 14.66 | 18.90 |
-| 3090 Ti | 8 | | 37.93 | 33.21 | 32.24 | 15.00 | |
-| 3090 Ti | 16 | | 75.37 | 66.63 | 64.5 | 14.42 | |
-| 3090 Ti | 32 (1) | | 142.55 | 128.89 | 124.92 | 12.37 | |
-| | | | | | | |
-| 4090 | 1 | 5.54 | 4.99 | 2.66 | 2.58 | 48.30 | 53.43 |
-| 4090 | 4 | 13.67 | 11.4 | 8.81 | 8.46 | 25.79 | 38.11 |
-| 4090 | 8 | | 19.79 | 17.55 | 16.62 | 16.02 | |
-| 4090 | 16 | | 38.62 | 35.65 | 34.07 | 11.78 | |
-| 4090 | 32 (1) | | 76.57 | 69.48 | 65.35 | 14.65 | |
-| 4090 | 48 | | 114.44 | 106.3 | | 7.11 | |
-
-
-(1) Batch Size >= 32 requires enable_vae_slicing() because of https://github.com/pytorch/pytorch/issues/81665.
-This is required for PyTorch 1.13.1, and also for PyTorch 2.0 and large batch sizes.
-
-For more details about how this benchmark was run, please refer to [this PR](https://github.com/huggingface/diffusers/pull/2303) and to [the blog post](https://pytorch.org/blog/accelerated-diffusers-pt-20/).
+We conducted a comprehensive benchmark with PyTorch 2.0's efficient attention implementation and `torch.compile` across different GPUs and different batch sizes for five of our most used pipelines.
+In the following tables, we report our findings in terms of the number of iterations processed per second.
+
+### A100 (batch size: 1)
+
+| **Pipeline** | **torch 2.0 - <br>no compile** | **torch nightly - <br>no compile** | **torch 2.0 - <br>compile** | **torch nightly - <br>compile** |
+|:---:|:---:|:---:|:---:|:---:|
+| SD - txt2img | 21.66 | 23.13 | 44.03 | 49.74 |
+| SD - img2img | 21.81 | 22.40 | 43.92 | 46.32 |
+| SD - inpaint | 22.24 | 23.23 | 43.76 | 49.25 |
+| SD - controlnet | 15.02 | 15.82 | 32.13 | 36.08 |
+| IF | 20.21 / <br>13.84 / <br>24.00 | 20.12 / <br>13.70 / <br>24.03 | ❌ | 97.34 / <br>27.23 / <br>111.66 |
+
+### A100 (batch size: 4)
+
+| **Pipeline** | **torch 2.0 - <br>no compile** | **torch nightly - <br>no compile** | **torch 2.0 - <br>compile** | **torch nightly - <br>compile** |
+|:---:|:---:|:---:|:---:|:---:|
+| SD - txt2img | 11.6 | 13.12 | 14.62 | 17.27 |
+| SD - img2img | 11.47 | 13.06 | 14.66 | 17.25 |
+| SD - inpaint | 11.67 | 13.31 | 14.88 | 17.48 |
+| SD - controlnet | 8.28 | 9.38 | 10.51 | 12.41 |
+| IF | 25.02 | 18.04 | ❌ | 48.47 |
+
+### A100 (batch size: 16)
+
+| **Pipeline** | **torch 2.0 - <br>no compile** | **torch nightly - <br>no compile** | **torch 2.0 - <br>compile** | **torch nightly - <br>compile** |
+|:---:|:---:|:---:|:---:|:---:|
+| SD - txt2img | 3.04 | 3.6 | 3.83 | 4.68 |
+| SD - img2img | 2.98 | 3.58 | 3.83 | 4.67 |
+| SD - inpaint | 3.04 | 3.66 | 3.9 | 4.76 |
+| SD - controlnet | 2.15 | 2.58 | 2.74 | 3.35 |
+| IF | 8.78 | 9.82 | ❌ | 16.77 |
+
+### V100 (batch size: 1)
+
+| **Pipeline** | **torch 2.0 - <br>no compile** | **torch nightly - <br>no compile** | **torch 2.0 - <br>compile** | **torch nightly - <br>compile** |
+|:---:|:---:|:---:|:---:|:---:|
+| SD - txt2img | 18.99 | 19.14 | 20.95 | 22.17 |
+| SD - img2img | 18.56 | 19.18 | 20.95 | 22.11 |
+| SD - inpaint | 19.14 | 19.06 | 21.08 | 22.20 |
+| SD - controlnet | 13.48 | 13.93 | 15.18 | 15.88 |
+| IF | 20.01 | 19.79 | ❌ | 55.75 |
+
+### V100 (batch size: 4)
+
+| **Pipeline** | **torch 2.0 - <br>no compile** | **torch nightly - <br>no compile** | **torch 2.0 - <br>compile** | **torch nightly - <br>compile** |
+|:---:|:---:|:---:|:---:|:---:|
+| SD - txt2img | 5.96 | 5.89 | 6.83 | 6.86 |
+| SD - img2img | 5.90 | 5.91 | 6.81 | 6.82 |
+| SD - inpaint | 5.99 | 6.03 | 6.93 | 6.95 |
+| SD - controlnet | 4.26 | 4.29 | 4.92 | 4.93 |
+| IF | 15.41 | 14.76 | ❌ | 22.95 |
+
+### V100 (batch size: 16)
+
+| **Pipeline** | **torch 2.0 - <br>no compile** | **torch nightly - <br>no compile** | **torch 2.0 - <br>compile** | **torch nightly - <br>compile** |
+|:---:|:---:|:---:|:---:|:---:|
+| SD - txt2img | 1.66 | 1.66 | 1.92 | 1.90 |
+| SD - img2img | 1.65 | 1.65 | 1.91 | 1.89 |
+| SD - inpaint | 1.69 | 1.69 | 1.95 | 1.93 |
+| SD - controlnet | 1.19 | 1.19 | OOM after warmup | 1.36 |
+| IF * | 5.43 | 5.29 | ❌ | 7.06 |
+
+## Notes
+
+* Follow [this PR](https://github.com/huggingface/diffusers/pull/3313) for more details on the code and the environment used for conducting the benchmarks.
+* For the IF pipeline and batch sizes > 1, we only used a batch size > 1 in the first IF pipeline (for text-to-image generation) and NOT for upscaling, which means the two upscaling pipelines received a batch size of 1.
+
+
+*Thanks to [Horace He](https://github.com/Chillee) from the PyTorch team for their support in improving our support of `torch.compile` in Diffusers.*
\ No newline at end of file

From 4a4696fb49dc8f3c0f42fbc426fed966bbd6abaf Mon Sep 17 00:00:00 2001
From: Sayak Paul
Date: Tue, 9 May 2023 20:14:05 +0530
Subject: [PATCH 02/10] Apply suggestions from code review

Co-authored-by: Patrick von Platen
---
 docs/source/en/optimization/torch2.0.mdx | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/source/en/optimization/torch2.0.mdx b/docs/source/en/optimization/torch2.0.mdx
index 0a9c3cb4dc9f..59a26e2db577 100644
--- a/docs/source/en/optimization/torch2.0.mdx
+++ b/docs/source/en/optimization/torch2.0.mdx
@@ -60,7 +60,7 @@ pip install --upgrade torch torchvision diffusers
 
    This should be as fast and memory efficient as `xFormers`. More details [in our benchmark](#benchmark).
 
-   If you want to use the vanilla attention processor (([`~diffusers.models.attention_processor.AttnProcessor`])) as shown below:
+   If you want to revert to use the vanilla attention processor (([`~diffusers.models.attention_processor.AttnProcessor`])) which can help to make the pipeline more deterministic, you can use the [`~UNet2DModel.set_default_attn_processor`] function:
 
    ```Python
    import torch
@@ -68,7 +68,7 @@ pip install --upgrade torch torchvision diffusers
    from diffusers.models.attention_processor import AttnProcessor
 
    pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
-   pipe.unet.set_attn_processor(AttnProcessor())
+   pipe.unet.set_default_attn_processor()
 
    prompt = "a photo of an astronaut riding a horse on mars"
    image = pipe(prompt).images[0]
@@ -93,7 +93,7 @@ pip install --upgrade torch torchvision diffusers
 
   Depending on the type of GPU, `compile()` can yield between 2-9% of _additional speed-up_ over the accelerated transformer optimizations. Note, however, that compilation is able to squeeze more performance improvements in more recent GPU architectures such as Ampere (A100, 3090), Ada (4090) and Hopper (H100).
 
-   Compilation takes some time to complete, so it is best suited for situations where you need to prepare your pipeline once and then perform the same type of inference operations multiple times. Calling the compiled pipeline on a different type of input will retrigger compilation which can be expensive.
+   Compilation takes some time to complete, so it is best suited for situations where you need to prepare your pipeline once and then perform the same type of inference operations multiple times. Calling the compiled pipeline on a different image size will re-trigger compilation which can be expensive.
 
 ## Benchmark

From 89a941dd984d6456953ca0920674638ad6d786bc Mon Sep 17 00:00:00 2001
From: Sayak Paul
Date: Tue, 9 May 2023 20:40:02 +0530
Subject: [PATCH 03/10] address patrick's comments.

---
 docs/source/en/optimization/torch2.0.mdx | 232 +++++++++++++++++++++--
 1 file changed, 215 insertions(+), 17 deletions(-)

diff --git a/docs/source/en/optimization/torch2.0.mdx b/docs/source/en/optimization/torch2.0.mdx
index 59a26e2db577..2c56a24484cb 100644
--- a/docs/source/en/optimization/torch2.0.mdx
+++ b/docs/source/en/optimization/torch2.0.mdx
@@ -46,13 +46,13 @@ pip install --upgrade torch torchvision diffusers
 
    If you want to enable it explicitly (which is not required), you can do so as shown below.
 
-   ```Python
+   ```diff
    import torch
    from diffusers import DiffusionPipeline
-   from diffusers.models.attention_processor import AttnProcessor2_0
+   + from diffusers.models.attention_processor import AttnProcessor2_0
 
    pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
-   pipe.unet.set_attn_processor(AttnProcessor2_0())
+   + pipe.unet.set_attn_processor(AttnProcessor2_0())
 
    prompt = "a photo of an astronaut riding a horse on mars"
    image = pipe(prompt).images[0]
@@ -60,7 +60,7 @@ pip install --upgrade torch torchvision diffusers
 
    This should be as fast and memory efficient as `xFormers`. More details [in our benchmark](#benchmark).
 
-   If you want to revert to use the vanilla attention processor (([`~diffusers.models.attention_processor.AttnProcessor`])) which can help to make the pipeline more deterministic, you can use the [`~UNet2DModel.set_default_attn_processor`] function:
+   If you want to revert to use the vanilla attention processor ([`~AttnProcessor`]) which can help to make the pipeline more deterministic, you can use the [`~UNet2DModel.set_default_attn_processor`] function:
 
    ```Python
    import torch
@@ -76,22 +76,16 @@ pip install --upgrade torch torchvision diffusers
 
 2. **torch.compile**
 
-   To get an additional speedup, we can use the new `torch.compile` feature. To do so, we simply wrap our `unet` with `torch.compile`. For more information and different options, refer to the
+   To get an additional speedup, we can use the new `torch.compile` feature. Since the UNet of the pipeline is usually the most computationally expensive, we wrap the `unet` with `torch.compile`, leaving the rest of the sub-models (text encoder and VAE) as is. For more information and different options, refer to the
    [torch compile docs](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html).
 
    ```python
-   import torch
-   from diffusers import DiffusionPipeline
-
-   pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
-   pipe.unet = torch.compile(pipe.unet)
-   batch_size = 10
-   prompt = "A photo of an astronaut riding a horse on marse."
+   pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
    images = pipe(prompt, num_inference_steps=steps, num_images_per_prompt=batch_size).images
   ```
 
-   Depending on the type of GPU, `compile()` can yield between 2-9% of _additional speed-up_ over the accelerated transformer optimizations. Note, however, that compilation is able to squeeze more performance improvements in more recent GPU architectures such as Ampere (A100, 3090), Ada (4090) and Hopper (H100).
+   Depending on the type of GPU, `compile()` can yield between **3% - 56%** of _additional speed-up_ over the accelerated transformer optimizations. Note, however, that compilation is able to squeeze more performance improvements in more recent GPU architectures such as Ampere (A100, 3090), Ada (4090) and Hopper (H100).
 
   Compilation takes some time to complete, so it is best suited for situations where you need to prepare your pipeline once and then perform the same type of inference operations multiple times. Calling the compiled pipeline on a different image size will re-trigger compilation which can be expensive.
 
 ## Benchmark
 
 We conducted a comprehensive benchmark with PyTorch 2.0's efficient attention implementation and `torch.compile` across different GPUs and different batch sizes for five of our most used pipelines.
-In the following tables, we report our findings in terms of the number of iterations processed per second.
+
+### Benchmarking code
+
+#### Stable Diffusion text-to-image
+
+```python
+from diffusers import DiffusionPipeline
+import torch
+
+path = "runwayml/stable-diffusion-v1-5"
+
+run_compile = True  # Set True / False
+
+pipe = DiffusionPipeline.from_pretrained(path, torch_dtype=torch.float16)
+pipe = pipe.to("cuda")
+pipe.unet.to(memory_format=torch.channels_last)
+
+if run_compile:
+    print("Run torch compile")
+    pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
+
+prompt = "ghibli style, a fantasy landscape with castles"
+
+for _ in range(3):
+    images = pipe(prompt=prompt).images
+```
+
+#### Stable Diffusion image-to-image
+
+```python
+from diffusers import StableDiffusionImg2ImgPipeline
+import requests
+import torch
+from PIL import Image
+from io import BytesIO
+
+url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
+
+response = requests.get(url)
+init_image = Image.open(BytesIO(response.content)).convert("RGB")
+init_image = init_image.resize((512, 512))
+
+path = "runwayml/stable-diffusion-v1-5"
+
+run_compile = True  # Set True / False
+
+pipe = StableDiffusionImg2ImgPipeline.from_pretrained(path, torch_dtype=torch.float16)
+pipe = pipe.to("cuda")
+pipe.unet.to(memory_format=torch.channels_last)
+
+if run_compile:
+    print("Run torch compile")
+    pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
+
+prompt = "ghibli style, a fantasy landscape with castles"
+
+for _ in range(3):
+    image = pipe(prompt=prompt, image=init_image).images[0]
+```
+
+#### Stable Diffusion - inpainting
+
+```python
+from diffusers import StableDiffusionInpaintPipeline
+import requests
+import torch
+from PIL import Image
+from io import BytesIO
+
+url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
+
+def download_image(url):
+    response = requests.get(url)
+    return Image.open(BytesIO(response.content)).convert("RGB")
+
+
+img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
+mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"
+
+init_image = download_image(img_url).resize((512, 512))
+mask_image = download_image(mask_url).resize((512, 512))
+
+path = "runwayml/stable-diffusion-inpainting"
+
+run_compile = True  # Set True / False
+
+pipe = StableDiffusionInpaintPipeline.from_pretrained(path, torch_dtype=torch.float16)
+pipe = pipe.to("cuda")
+pipe.unet.to(memory_format=torch.channels_last)
+
+if run_compile:
+    print("Run torch compile")
+    pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
+
+prompt = "ghibli style, a fantasy landscape with castles"
+
+for _ in range(3):
+    image = pipe(prompt=prompt, image=init_image, mask_image=mask_image).images[0]
+```
+
+#### ControlNet
+
+```python
+from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
+import requests
+import torch
+from PIL import Image
+from io import BytesIO
+
+url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
+
+response = requests.get(url)
+init_image = Image.open(BytesIO(response.content)).convert("RGB")
+init_image = init_image.resize((512, 512))
+
+path = "runwayml/stable-diffusion-v1-5"
+
+run_compile = True  # Set True / False
+controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
+pipe = StableDiffusionControlNetPipeline.from_pretrained(
+    path, controlnet=controlnet, torch_dtype=torch.float16
+)
+
+pipe = pipe.to("cuda")
+pipe.unet.to(memory_format=torch.channels_last)
+pipe.controlnet.to(memory_format=torch.channels_last)
+
+if run_compile:
+    print("Run torch compile")
+    pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
+    pipe.controlnet = torch.compile(pipe.controlnet, mode="reduce-overhead", fullgraph=True)
+
+prompt = "ghibli style, a fantasy landscape with castles"
+
+for _ in range(3):
+    image = pipe(prompt=prompt, image=init_image).images[0]
+```
+
+#### IF text-to-image + upscaling
+
+```python
+from diffusers import DiffusionPipeline
+import torch
+
+run_compile = True  # Set True / False
+
+pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-M-v1.0", variant="fp16", text_encoder=None, torch_dtype=torch.float16)
+pipe.to("cuda")
+pipe_2 = DiffusionPipeline.from_pretrained("DeepFloyd/IF-II-M-v1.0", variant="fp16", text_encoder=None, torch_dtype=torch.float16)
+pipe_2.to("cuda")
+pipe_3 = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16)
+pipe_3.to("cuda")
+
+
+pipe.unet.to(memory_format=torch.channels_last)
+pipe_2.unet.to(memory_format=torch.channels_last)
+pipe_3.unet.to(memory_format=torch.channels_last)
+
+if run_compile:
+    pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
+    pipe_2.unet = torch.compile(pipe_2.unet, mode="reduce-overhead", fullgraph=True)
+    pipe_3.unet = torch.compile(pipe_3.unet, mode="reduce-overhead", fullgraph=True)
+
+prompt = "the blue hulk"
+
+prompt_embeds = torch.randn((1, 2, 4096), dtype=torch.float16)
+neg_prompt_embeds = torch.randn((1, 2, 4096), dtype=torch.float16)
+
+for _ in range(3):
+    image = pipe(prompt_embeds=prompt_embeds, negative_prompt_embeds=neg_prompt_embeds, output_type="pt").images
+    image_2 = pipe_2(image=image, prompt_embeds=prompt_embeds, negative_prompt_embeds=neg_prompt_embeds, output_type="pt").images
+    image_3 = pipe_3(prompt=prompt, image=image, noise_level=100).images
+```
+
+In the following tables, we report our findings in terms of the number of **_iterations processed per second_**.
 
 ### A100 (batch size: 1)
 
@@ -139,7 +307,7 @@ In the following tables, we report our findings in terms of the number of iterat
 | SD - img2img | 18.56 | 19.18 | 20.95 | 22.11 |
 | SD - inpaint | 19.14 | 19.06 | 21.08 | 22.20 |
 | SD - controlnet | 13.48 | 13.93 | 15.18 | 15.88 |
-| IF | 20.01 | 19.79 | ❌ | 55.75 |
+| IF | 20.01 / <br>9.08 / <br>23.34 | 19.79 / <br>8.98 / <br>24.10 | ❌ | 55.75 / <br>11.57 / <br>57.67 |
 
 ### V100 (batch size: 4)
 
@@ -159,11 +327,41 @@ In the following tables, we report our findings in terms of the number of iterat
 | SD - img2img | 1.65 | 1.65 | 1.91 | 1.89 |
 | SD - inpaint | 1.69 | 1.69 | 1.95 | 1.93 |
 | SD - controlnet | 1.19 | 1.19 | OOM after warmup | 1.36 |
-| IF * | 5.43 | 5.29 | ❌ | 7.06 |
+| IF | 5.43 | 5.29 | ❌ | 7.06 |
+
+### T4 (batch size: 1)
+
+| **Pipeline** | **torch 2.0 - <br>no compile** | **torch nightly - <br>no compile** | **torch 2.0 - <br>compile** | **torch nightly - <br>compile** |
+|:---:|:---:|:---:|:---:|:---:|
+| SD - txt2img | 6.9 | 6.95 | 7.3 | 7.56 |
+| SD - img2img | 6.84 | 6.99 | 7.04 | 7.55 |
+| SD - inpaint | 6.91 | 6.7 | 7.01 | 7.37 |
+| SD - controlnet | 4.89 | 4.86 | 5.35 | 5.48 |
+| IF | 17.42 / <br>2.47 / <br>18.52 | 16.96 / <br>2.45 / <br>18.69 | ❌ | 24.63 / <br>2.47 / <br>23.39 |
+
+### T4 (batch size: 4)
+
+| **Pipeline** | **torch 2.0 - <br>no compile** | **torch nightly - <br>no compile** | **torch 2.0 - <br>compile** | **torch nightly - <br>compile** |
+|:---:|:---:|:---:|:---:|:---:|
+| SD - txt2img | 1.79 | 1.79 | 2.03 | 1.99 |
+| SD - img2img | 1.77 | 1.77 | 2.05 | 2.04 |
+| SD - inpaint | 1.81 | 1.82 | 2.09 | 2.09 |
+| SD - controlnet | 1.34 | 1.27 | 1.47 | 1.46 |
+| IF | 5.79 | 5.61 | ❌ | 7.39 |
+
+### T4 (batch size: 16)
+
+| **Pipeline** | **torch 2.0 - <br>no compile** | **torch nightly - <br>no compile** | **torch 2.0 - <br>compile** | **torch nightly - <br>compile** |
+|:---:|:---:|:---:|:---:|:---:|
+| SD - txt2img | 2.34s | 2.30s | OOM after 2nd iteration | 1.99s |
+| SD - img2img | 2.35s | 2.31s | OOM after warmup | 2.00s |
+| SD - inpaint | 2.30s | 2.26s | OOM after 2nd iteration | 1.95s |
+| SD - controlnet | OOM after 2nd iteration | OOM after 2nd iteration | OOM after warmup | OOM after warmup |
+| IF * | 1.44 | 1.44 | ❌ | 1.94 |
 
 ## Notes
 
-* Follow [this PR](https://github.com/huggingface/diffusers/pull/3313) for more details on the code and the environment used for conducting the benchmarks.
+* Follow [this PR](https://github.com/huggingface/diffusers/pull/3313) for more details on the environment used for conducting the benchmarks.
 * For the IF pipeline and batch sizes > 1, we only used a batch size > 1 in the first IF pipeline (for text-to-image generation) and NOT for upscaling, which means the two upscaling pipelines received a batch size of 1.

From 268c2922ab778d2a65c3afdbebb324f79ed95544 Mon Sep 17 00:00:00 2001
From: Sayak Paul
Date: Tue, 9 May 2023 20:45:00 +0530
Subject: [PATCH 04/10] add: rtx 4090 stats

---
 docs/source/en/optimization/torch2.0.mdx | 32 ++++++++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/docs/source/en/optimization/torch2.0.mdx b/docs/source/en/optimization/torch2.0.mdx
index 2c56a24484cb..fce4c761f517 100644
--- a/docs/source/en/optimization/torch2.0.mdx
+++ b/docs/source/en/optimization/torch2.0.mdx
@@ -80,7 +80,6 @@ pip install --upgrade torch torchvision diffusers
   [torch compile docs](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html).
 
   ```python
-
   pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
   images = pipe(prompt, num_inference_steps=steps, num_images_per_prompt=batch_size).images
   ```
 
+### RTX 4090 (batch size: 1)
+
+| **Pipeline** | **torch 2.0 - <br>no compile** | **torch nightly - <br>no compile** | **torch 2.0 - <br>compile** | **torch nightly - <br>compile** |
+|:---:|:---:|:---:|:---:|:---:|
+| SD - txt2img | 40.5 | 41.89 | 44.65 | 49.81 |
+| SD - img2img | 40.39 | 41.95 | 44.46 | 49.8 |
+| SD - inpaint | 40.51 | 41.88 | 44.58 | 49.72 |
+| SD - controlnet | 29.27 | 30.29 | 32.26 | 36.03 |
+| IF | 69.71 / <br>18.78 / <br>85.49 | 69.13 / <br>18.80 / <br>85.56 | ❌ | 124.60 / <br>26.37 / <br>138.79 |
+
+### RTX 4090 (batch size: 4)
+
+| **Pipeline** | **torch 2.0 - <br>no compile** | **torch nightly - <br>no compile** | **torch 2.0 - <br>compile** | **torch nightly - <br>compile** |
+|:---:|:---:|:---:|:---:|:---:|
+| SD - txt2img | 12.62 | 12.84 | 15.32 | 15.59 |
+| SD - img2img | 12.61 | 12.79 | 15.35 | 15.66 |
+| SD - inpaint | 12.65 | 12.81 | 15.3 | 15.58 |
+| SD - controlnet | 9.1 | 9.25 | 11.03 | 11.22 |
+| IF | 31.88 | 31.14 | ❌ | 43.92 |
+
+### RTX 4090 (batch size: 16)
+
+| **Pipeline** | **torch 2.0 - <br>no compile** | **torch nightly - <br>no compile** | **torch 2.0 - <br>compile** | **torch nightly - <br>compile** |
+|:---:|:---:|:---:|:---:|:---:|
+| SD - txt2img | 3.17 | 3.2 | 3.84 | 3.85 |
+| SD - img2img | 3.16 | 3.2 | 3.84 | 3.85 |
+| SD - inpaint | 3.17 | 3.2 | 3.85 | 3.85 |
+| SD - controlnet | 2.23 | 2.3 | 2.7 | 2.75 |
+| IF | 9.26 | 9.2 | ❌ | 13.31 |
+
 ## Notes
 
 * Follow [this PR](https://github.com/huggingface/diffusers/pull/3313) for more details on the environment used for conducting the benchmarks.
 * For the IF pipeline and batch sizes > 1, we only used a batch size > 1 in the first IF pipeline (for text-to-image generation) and NOT for upscaling, which means the two upscaling pipelines received a batch size of 1.
 
-
 *Thanks to [Horace He](https://github.com/Chillee) from the PyTorch team for their support in improving our support of `torch.compile` in Diffusers.*
\ No newline at end of file

From 3a018354851ff8123250b1fce9cb7cdf5a6cbeba Mon Sep 17 00:00:00 2001
From: Sayak Paul
Date: Tue, 9 May 2023 20:52:55 +0530
Subject: [PATCH 05/10] =?UTF-8?q?=E2=9A=94=20benchmark=20reports=20done?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 docs/source/en/optimization/torch2.0.mdx | 30 ++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/docs/source/en/optimization/torch2.0.mdx b/docs/source/en/optimization/torch2.0.mdx
index fce4c761f517..211730a7cc15 100644
--- a/docs/source/en/optimization/torch2.0.mdx
+++ b/docs/source/en/optimization/torch2.0.mdx
@@ -358,6 +358,36 @@ In the following tables, we report our findings in terms of the number of **_ite
 | SD - controlnet | OOM after 2nd iteration | OOM after 2nd iteration | OOM after warmup | OOM after warmup |
 | IF * | 1.44 | 1.44 | ❌ | 1.94 |
 
+### RTX 3090 (batch size: 1)
+
+| **Pipeline** | **torch 2.0 - <br>no compile** | **torch nightly - <br>no compile** | **torch 2.0 - <br>compile** | **torch nightly - <br>compile** |
+|:---:|:---:|:---:|:---:|:---:|
+| SD - txt2img | 22.56 | 22.84 | 23.84 | 25.69 |
+| SD - img2img | 22.25 | 22.61 | 24.1 | 25.83 |
+| SD - inpaint | 22.22 | 22.54 | 24.26 | 26.02 |
+| SD - controlnet | 16.03 | 16.33 | 17.38 | 18.56 |
+| IF | 27.08 / <br>9.07 / <br>31.23 | 26.75 / <br>8.92 / <br>31.47 | ❌ | 68.08 / <br>11.16 / <br>65.29 |
+
+### RTX 3090 (batch size: 4)
+
+| **Pipeline** | **torch 2.0 - <br>no compile** | **torch nightly - <br>no compile** | **torch 2.0 - <br>compile** | **torch nightly - <br>compile** |
+|:---:|:---:|:---:|:---:|:---:|
+| SD - txt2img | 6.46 | 6.35 | 7.29 | 7.3 |
+| SD - img2img | 6.33 | 6.27 | 7.31 | 7.26 |
+| SD - inpaint | 6.47 | 6.4 | 7.44 | 7.39 |
+| SD - controlnet | 4.59 | 4.54 | 5.27 | 5.26 |
+| IF | 16.81 | 16.62 | ❌ | 21.57 |
+
+### RTX 3090 (batch size: 16)
+
+| **Pipeline** | **torch 2.0 - <br>no compile** | **torch nightly - <br>no compile** | **torch 2.0 - <br>compile** | **torch nightly - <br>compile** |
+|:---:|:---:|:---:|:---:|:---:|
+| SD - txt2img | 1.7 | 1.69 | 1.93 | 1.91 |
+| SD - img2img | 1.68 | 1.67 | 1.93 | 1.9 |
+| SD - inpaint | 1.72 | 1.71 | 1.97 | 1.94 |
+| SD - controlnet | 1.23 | 1.22 | 1.4 | 1.38 |
+| IF | 5.01 | 5.00 | ❌ | 6.33 |
+
 ### RTX 4090 (batch size: 1)
 
 | **Pipeline** | **torch 2.0 - <br>no compile** | **torch nightly - <br>no compile** | **torch 2.0 - <br>compile** | **torch nightly - <br>compile** |

From 60bbbbbe7879c21b50bbb3fc4a2efb12f3fed20f Mon Sep 17 00:00:00 2001
From: Sayak Paul
Date: Wed, 10 May 2023 16:13:25 +0530
Subject: [PATCH 06/10] Apply suggestions from code review

Co-authored-by: Pedro Cuenca
---
 docs/source/en/optimization/torch2.0.mdx | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/source/en/optimization/torch2.0.mdx b/docs/source/en/optimization/torch2.0.mdx
index 211730a7cc15..f8945e2259ad 100644
--- a/docs/source/en/optimization/torch2.0.mdx
+++ b/docs/source/en/optimization/torch2.0.mdx
@@ -12,7 +12,7 @@ specific language governing permissions and limitations under the License.
 
 # Accelerated PyTorch 2.0 support in Diffusers
 
-Starting from version `0.13.0`, Diffusers supports the latest optimization from [PyTorch 2.0](https://pytorch.org/get-started/pytorch-2.0/) release. These include:
+Starting from version `0.13.0`, Diffusers supports the latest optimization from [PyTorch 2.0](https://pytorch.org/get-started/pytorch-2.0/). These include:
 1. Support for accelerated transformers implementation with memory-efficient attention – no extra dependencies (such as `xformers`) required.
 2. [torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) support for extra performance boost when individual models are compiled.
 
@@ -60,7 +60,7 @@ pip install --upgrade torch torchvision diffusers
 
    This should be as fast and memory efficient as `xFormers`. More details [in our benchmark](#benchmark).
 
-   If you want to revert to use the vanilla attention processor ([`~AttnProcessor`]) which can help to make the pipeline more deterministic, you can use the [`~UNet2DModel.set_default_attn_processor`] function:
+   It is possible to revert to the vanilla attention processor ([`~AttnProcessor`]), which can be helpful to make the pipeline more deterministic, or if you need to convert a fine-tuned model to other formats such as [Core ML](https://huggingface.co/docs/diffusers/v0.16.0/en/optimization/coreml#how-to-run-stable-diffusion-with-core-ml). To use the normal attention processor you can use the [`~UNet2DModel.set_default_attn_processor`] function:
 
    ```Python
    import torch
@@ -91,7 +91,7 @@ pip install --upgrade torch torchvision diffusers
 
 ## Benchmark
 
-We conducted a comprehensive benchmark with PyTorch 2.0's efficient attention implementation and `torch.compile` across different GPUs and different batch sizes for five of our most used pipelines.
+We conducted a comprehensive benchmark with PyTorch 2.0's efficient attention implementation and `torch.compile` across different GPUs and batch sizes for five of our most used pipelines. We used `diffusers 0.17.0.dev0`, which [makes sure `torch.compile()` is leveraged optimally](https://github.com/huggingface/diffusers/pull/3286).
 
 ### Benchmarking code

From 9889051b19e8cc16ead1f53d9848b1e465d4089b Mon Sep 17 00:00:00 2001
From: Sayak Paul
Date: Wed, 10 May 2023 16:15:24 +0530
Subject: [PATCH 07/10] 3313 pr link.
---
 docs/source/en/optimization/torch2.0.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/en/optimization/torch2.0.mdx b/docs/source/en/optimization/torch2.0.mdx
index f8945e2259ad..385e1ae2a140 100644
--- a/docs/source/en/optimization/torch2.0.mdx
+++ b/docs/source/en/optimization/torch2.0.mdx
@@ -91,7 +91,7 @@ pip install --upgrade torch torchvision diffusers
 
 ## Benchmark
 
-We conducted a comprehensive benchmark with PyTorch 2.0's efficient attention implementation and `torch.compile` across different GPUs and batch sizes for five of our most used pipelines. We used `diffusers 0.17.0.dev0`, which [makes sure `torch.compile()` is leveraged optimally](https://github.com/huggingface/diffusers/pull/3286).
+We conducted a comprehensive benchmark with PyTorch 2.0's efficient attention implementation and `torch.compile` across different GPUs and batch sizes for five of our most used pipelines. We used `diffusers 0.17.0.dev0`, which [makes sure `torch.compile()` is leveraged optimally](https://github.com/huggingface/diffusers/pull/3313).
 
 ### Benchmarking code

From 0f14d1a2637e21f58cdd2d817c72651e314387c8 Mon Sep 17 00:00:00 2001
From: Sayak Paul
Date: Fri, 12 May 2023 10:10:42 +0530
Subject: [PATCH 08/10] add: plots.

Co-authored-by: Pedro
---
 docs/source/en/optimization/torch2.0.mdx | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/docs/source/en/optimization/torch2.0.mdx b/docs/source/en/optimization/torch2.0.mdx
index 385e1ae2a140..279f73eaca54 100644
--- a/docs/source/en/optimization/torch2.0.mdx
+++ b/docs/source/en/optimization/torch2.0.mdx
@@ -18,7 +18,7 @@ Starting from version `0.13.0`, Diffusers supports the latest optimization from
 
 ## Installation
 
-To benefit from the accelerated attention implementation and `torch.compile`, you just need to install the latest versions of PyTorch 2.0 from `pip`, and make sure you are on diffusers 0.13.0 or later. As explained below, `diffusers` automatically uses the optimized attention processor ([`~diffusers.models.attention_processor.AttnProcessor2_0`]) (but not `torch.compile`) when PyTorch 2.0 is available.
+To benefit from the accelerated attention implementation and `torch.compile`, you just need to install the latest versions of PyTorch 2.0 from `pip`, and make sure you are on diffusers 0.13.0 or later. As explained below, `diffusers` automatically uses the optimized attention processor ([`AttnProcessor2_0`](https://github.com/huggingface/diffusers/blob/1a5797c6d4491a879ea5285c4efc377664e0332d/src/diffusers/models/attention_processor.py#L798)) (but not `torch.compile`)
+when PyTorch 2.0 is available.
 
 ```bash
 pip install --upgrade torch torchvision diffusers
@@ -60,7 +60,7 @@ pip install --upgrade torch torchvision diffusers
 
    This should be as fast and memory efficient as `xFormers`. More details [in our benchmark](#benchmark).
 
-   It is possible to revert to the vanilla attention processor ([`~AttnProcessor`]), which can be helpful to make the pipeline more deterministic, or if you need to convert a fine-tuned model to other formats such as [Core ML](https://huggingface.co/docs/diffusers/v0.16.0/en/optimization/coreml#how-to-run-stable-diffusion-with-core-ml). To use the normal attention processor you can use the [`~UNet2DModel.set_default_attn_processor`] function:
+   It is possible to revert to the vanilla attention processor ([`AttnProcessor`](https://github.com/huggingface/diffusers/blob/1a5797c6d4491a879ea5285c4efc377664e0332d/src/diffusers/models/attention_processor.py#L402)), which can be helpful to make the pipeline more deterministic, or if you need to convert a fine-tuned model to other formats such as [Core ML](https://huggingface.co/docs/diffusers/v0.16.0/en/optimization/coreml#how-to-run-stable-diffusion-with-core-ml). To use the normal attention processor you can use the [`~diffusers.UNet2DConditionModel.set_default_attn_processor`] function:
 
    ```Python
    import torch
@@ -266,6 +267,22 @@ for _ in range(3):
     image_3 = pipe_3(prompt=prompt, image=image, noise_level=100).images
 ```
 
+To give you a pictorial overview of the possible speed-ups that can be obtained with PyTorch 2.0 and `torch.compile()`,
+here is a plot that shows relative speed-ups for the [Stable Diffusion text-to-image pipeline](StableDiffusionPipeline) across five
+different GPU families (with a batch size of 4):
+
+![t2i_speedup](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/pt2_benchmarks/t2i_speedup.png)
+
+To give you an even better idea of how this speed-up holds for the other pipelines presented above, consider the following
+plot that shows the benchmarking numbers from an A100 across three different batch sizes
+(with PyTorch 2.0 nightly and `torch.compile()`):
+
+![a100_numbers](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/pt2_benchmarks/a100_numbers.png)
+
+_(Our benchmarking metric for the plots above is **number of iterations/second**)_
+
+But we reveal all the benchmarking numbers in the interest of transparency!
+
 In the following tables, we report our findings in terms of the number of **_iterations processed per second_**.
 
 ### A100 (batch size: 1)

From 07dc8feb1d54c93adb9a7443a48a6741e1dbf43f Mon Sep 17 00:00:00 2001
From: Sayak Paul
Date: Fri, 12 May 2023 10:37:47 +0530
Subject: [PATCH 09/10] fix formattimg

---
 docs/source/en/optimization/torch2.0.mdx | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/source/en/optimization/torch2.0.mdx b/docs/source/en/optimization/torch2.0.mdx
index 279f73eaca54..a78626317577 100644
--- a/docs/source/en/optimization/torch2.0.mdx
+++ b/docs/source/en/optimization/torch2.0.mdx
@@ -18,7 +18,7 @@ Starting from version `0.13.0`, Diffusers supports the latest optimization from
 
 ## Installation
 
-To benefit from the accelerated attention implementation and `torch.compile`, you just need to install the latest versions of PyTorch 2.0 from `pip`, and make sure you are on diffusers 0.13.0 or later. As explained below, `diffusers` automatically uses the optimized attention processor ([`AttnProcessor2_0`](https://github.com/huggingface/diffusers/blob/1a5797c6d4491a879ea5285c4efc377664e0332d/src/diffusers/models/attention_processor.py#L798)) (but not `torch.compile`)
+To benefit from the accelerated attention implementation and `torch.compile()`, you just need to install the latest versions of PyTorch 2.0 from pip, and make sure you are on diffusers 0.13.0 or later. As explained below, diffusers automatically uses the optimized attention processor ([`AttnProcessor2_0`](https://github.com/huggingface/diffusers/blob/1a5797c6d4491a879ea5285c4efc377664e0332d/src/diffusers/models/attention_processor.py#L798)) (but not `torch.compile()`)
 when PyTorch 2.0 is available.
 
 ```bash
 pip install --upgrade torch torchvision diffusers
@@ -440,4 +440,4 @@ In the following tables, we report our findings in terms of the number of **_ite
 * Follow [this PR](https://github.com/huggingface/diffusers/pull/3313) for more details on the environment used for conducting the benchmarks.
 * For the IF pipeline and batch sizes > 1, we only used a batch size > 1 in the first IF pipeline (for text-to-image generation) and NOT for upscaling, which means the two upscaling pipelines received a batch size of 1.
 
-*Thanks to [Horace He](https://github.com/Chillee) from the PyTorch team for their support in improving our support of `torch.compile` in Diffusers.*
\ No newline at end of file
+*Thanks to [Horace He](https://github.com/Chillee) from the PyTorch team for their support in improving our support of `torch.compile()` in Diffusers.*
\ No newline at end of file

From b6de6a645497fcc1972cfabfc1d0f27ecdc81756 Mon Sep 17 00:00:00 2001
From: Sayak Paul
Date: Sat, 13 May 2023 15:03:51 +0530
Subject: [PATCH 10/10] update number percent.

---
 docs/source/en/optimization/torch2.0.mdx | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/en/optimization/torch2.0.mdx b/docs/source/en/optimization/torch2.0.mdx
index a78626317577..2bcf3fa82115 100644
--- a/docs/source/en/optimization/torch2.0.mdx
+++ b/docs/source/en/optimization/torch2.0.mdx
@@ -85,7 +85,7 @@ pip install --upgrade torch torchvision diffusers
   images = pipe(prompt, num_inference_steps=steps, num_images_per_prompt=batch_size).images
   ```
 
-   Depending on the type of GPU, `compile()` can yield between **3% - 56%** of _additional speed-up_ over the accelerated transformer optimizations. Note, however, that compilation is able to squeeze more performance improvements in more recent GPU architectures such as Ampere (A100, 3090), Ada (4090) and Hopper (H100).
+   Depending on the type of GPU, `compile()` can yield between **5% - 300%** of _additional speed-up_ over the accelerated transformer optimizations. Note, however, that compilation is able to squeeze more performance improvements in more recent GPU architectures such as Ampere (A100, 3090), Ada (4090) and Hopper (H100).
 
   Compilation takes some time to complete, so it is best suited for situations where you need to prepare your pipeline once and then perform the same type of inference operations multiple times. Calling the compiled pipeline on a different image size will re-trigger compilation which can be expensive.
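
Putting the series together, the final documentation recommends PyTorch 2.0's scaled-dot-product attention (enabled automatically) plus `torch.compile` on the UNet. The following is a minimal, self-contained sketch of that final recipe; the model ID, prompt, and step count are illustrative assumptions rather than values fixed by the patches above.

```python
# A minimal sketch consolidating the recommendations in this patch series.
# Assumes torch>=2.0 and diffusers>=0.13.0 are installed and a CUDA GPU is
# available. The model ID, prompt, and inference settings are illustrative.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# PyTorch 2.0's efficient attention (AttnProcessor2_0) is picked up
# automatically when available, so no explicit attention setup is needed.
pipe.unet.to(memory_format=torch.channels_last)
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

prompt = "ghibli style, a fantasy landscape with castles"

# The first call triggers compilation and is slow; later calls with the same
# image size reuse the compiled graph. Changing the image size re-triggers
# compilation, as the docs above note.
for _ in range(3):
    image = pipe(prompt=prompt, num_inference_steps=50).images[0]
```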