From c9da3dcba70f27869aa1e2c5f2011f2de7fced75 Mon Sep 17 00:00:00 2001
From: Steven Liu
Date: Tue, 9 May 2023 10:29:32 -0700
Subject: [PATCH 1/5] distributed inference

---
 docs/source/en/_toctree.yml                   |  2 +
 .../en/training/distributed_inference.mdx     | 97 +++++++++++++++++++
 2 files changed, 99 insertions(+)
 create mode 100644 docs/source/en/training/distributed_inference.mdx

diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml
index f205046ffc90..0350feff4692 100644
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@@ -78,6 +78,8 @@
       title: InstructPix2Pix Training
     - local: training/custom_diffusion
       title: Custom Diffusion
+    - local: training/distributed_inference
+      title: Distributed inference with multiple GPUs
     title: Training
 - sections:
   - local: using-diffusers/rl

diff --git a/docs/source/en/training/distributed_inference.mdx b/docs/source/en/training/distributed_inference.mdx
new file mode 100644
index 000000000000..c0444e9d5c07
--- /dev/null
+++ b/docs/source/en/training/distributed_inference.mdx
@@ -0,0 +1,97 @@

# Distributed inference with multiple GPUs

On distributed setups, you can run inference across multiple GPUs with [PyTorch Distributed](https://pytorch.org/tutorials/beginner/dist_overview.html) or 🤗 [Accelerate](https://huggingface.co/docs/accelerate/index), which is useful for generating multiple prompts in parallel.

This guide will show you how to use PyTorch Distributed and 🤗 Accelerate for distributed inference.

## PyTorch Distributed

PyTorch supports [`DistributedDataParallel`](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) which enables faster data parallelism.

To start, create a Python file and import `torch.distributed` and `torch.multiprocessing` to set up the distributed process group and to spawn the processes for inference on each GPU. You should also initialize a [`DiffusionPipeline`]:

```py
#!/usr/bin/env python3
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

from diffusers import DiffusionPipeline

sd = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
```

You'll want to create a function to run inference; [`init_process_group`](https://pytorch.org/docs/stable/distributed.html?highlight=init_process_group#torch.distributed.init_process_group) handles creating a distributed environment with the type of backend to use, the `rank` of the current process, and the `world_size` or the number of processes participating. If you're running inference in parallel over 2 GPUs, then the `world_size` would be 2.
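
For example, on a single machine you can derive `world_size` from the number of visible GPUs. This is a minimal sketch, assuming one process per GPU (the value of 2 used throughout this guide is simply the two-GPU case):

```py
import torch

# One process per GPU is the usual layout for data-parallel inference,
# so the number of visible CUDA devices is a natural choice for world_size.
world_size = torch.cuda.device_count()
print(f"Running inference with world_size={world_size}")
```
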

Move the [`DiffusionPipeline`] to `rank` and use `get_rank` to assign a GPU to each process, where each process handles a different prompt:

```py
def run_inference(rank, world_size):
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    sd.to(rank)

    if torch.distributed.get_rank() == 0:
        prompt = "a dog"
    elif torch.distributed.get_rank() == 1:
        prompt = "a cat"

    image = sd(prompt).images[0]
    image.save(f"./{'_'.join(prompt)}.png")
```

To run the distributed inference, call [`mp.spawn`](https://pytorch.org/docs/stable/multiprocessing.html#torch.multiprocessing.spawn) to run the `run_inference` function on the number of GPUs defined in `world_size`:

```py
def main():
    world_size = 2
    mp.spawn(run_inference, args=(world_size,), nprocs=world_size, join=True)


if __name__ == "__main__":
    main()
```

Once you've completed the inference script, run it like:

```bash
torchrun run_distributed.py
```

## 🤗 Accelerate

🤗 [Accelerate](https://huggingface.co/docs/accelerate/index) is a library designed to make it easy to train or run inference across distributed setups. It simplifies the process of setting up the distributed environment, allowing you to focus on your PyTorch code.

Start by initializing a [`accelerate.PartialState`] to create a distributed environment; your setup is automatically detected so you don't need to explicitly define the `rank` or `world_size`. Move the [`DiffusionPipeline`] to `state.device` to assign a GPU to each process and use `process_index` to assign a GPU to each prompt:

```py
#!/usr/bin/env python3
from accelerate import PartialState
from diffusers import DiffusionPipeline

sd = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)


def main():
    state = PartialState()

    sd.to(state.device)

    if state.process_index == 0:
        prompt = "a dog"
    elif state.process_index == 1:
        prompt = "a cat"

    image = sd(prompt).images[0]
    image.save(f"./{'_'.join(prompt)}.png")


if __name__ == "__main__":
    main()
```

Call `accelerate launch` to run the distributed inference script:

```bash
accelerate launch run_distributed.py
```
\ No newline at end of file

From 877bcaa00cfa66ecade5bf1a529239ea971945cd Mon Sep 17 00:00:00 2001
From: Steven Liu
Date: Tue, 9 May 2023 10:42:08 -0700
Subject: [PATCH 2/5] move to inference section

---
 docs/source/en/_toctree.yml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml
index 0350feff4692..a86d99f2a451 100644
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@@ -42,6 +42,8 @@
       title: Text-guided image-inpainting
     - local: using-diffusers/depth2img
       title: Text-guided depth-to-image
+    - local: training/distributed_inference
+      title: Distributed inference with multiple GPUs
     - local: using-diffusers/reusing_seeds
       title: Improve image quality with deterministic generation
     - local: using-diffusers/reproducibility
@@ -78,8 +80,6 @@
       title: InstructPix2Pix Training
     - local: training/custom_diffusion
       title: Custom Diffusion
-    - local: training/distributed_inference
-      title: Distributed inference with multiple GPUs
     title: Training
 - sections:
   - local: using-diffusers/rl

From fb1adb91fa49af76f08fddc646ac8bd27344d071 Mon Sep 17 00:00:00 2001
From: Steven Liu
Date: Wed, 10 May 2023 13:44:37 -0700
Subject: [PATCH 3/5] apply feedback

---
 .../en/training/distributed_inference.mdx | 18 ++++++++----------
 1 file changed, 8 insertions(+),
10 deletions(-)

diff --git a/docs/source/en/training/distributed_inference.mdx b/docs/source/en/training/distributed_inference.mdx
index c0444e9d5c07..89c3fb479e33 100644
--- a/docs/source/en/training/distributed_inference.mdx
+++ b/docs/source/en/training/distributed_inference.mdx
@@ -1,6 +1,6 @@
 # Distributed inference with multiple GPUs

-On distributed setups, you can run inference across multiple GPUs with [PyTorch Distributed](https://pytorch.org/tutorials/beginner/dist_overview.html) or 🤗 [Accelerate](https://huggingface.co/docs/accelerate/index), which is useful for generating multiple prompts in parallel.
+On distributed setups, you can run inference across multiple GPUs with [PyTorch Distributed](https://pytorch.org/tutorials/beginner/dist_overview.html) or 🤗 [Accelerate](https://huggingface.co/docs/accelerate/index), which is useful for generating with multiple prompts in parallel.

 This guide will show you how to use PyTorch Distributed and 🤗 Accelerate for distributed inference.

@@ -11,7 +11,6 @@
 PyTorch supports [`DistributedDataParallel`](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) which enables faster data parallelism.

 To start, create a Python file and import `torch.distributed` and `torch.multiprocessing` to set up the distributed process group and to spawn the processes for inference on each GPU. You should also initialize a [`DiffusionPipeline`]:

 ```py
-#!/usr/bin/env python3
 import torch
 import torch.distributed as dist
 import torch.multiprocessing as mp

@@ -21,13 +20,13 @@
 from diffusers import DiffusionPipeline

 sd = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
 ```

-You'll want to create a function to run inference; [`init_process_group`](https://pytorch.org/docs/stable/distributed.html?highlight=init_process_group#torch.distributed.init_process_group) handles creating a distributed environment with the type of backend to use, the `rank` of the current process, and the `world_size` or the number of processes participating. If you're running inference in parallel over 2 GPUs, then the `world_size` would be 2.
+You'll want to create a function to run inference; [`init_process_group`](https://pytorch.org/docs/stable/distributed.html?highlight=init_process_group#torch.distributed.init_process_group) handles creating a distributed environment with the type of backend to use, the `rank` of the current process, and the `world_size` or the number of processes participating. If you're running inference in parallel over 2 GPUs, then the `world_size` is 2.

 Move the [`DiffusionPipeline`] to `rank` and use `get_rank` to assign a GPU to each process, where each process handles a different prompt:

 ```py
 def run_inference(rank, world_size):
-    dist.init_process_group("gloo", rank=rank, world_size=world_size)
+    dist.init_process_group("nccl", rank=rank, world_size=world_size)

     sd.to(rank)

@@ -52,20 +51,19 @@
 if __name__ == "__main__":
     main()
 ```

-Once you've completed the inference script, run it like:
+Once you've completed the inference script, use the `--nproc_per_node` argument to specify the number of GPUs to use and call `torchrun` to run the script:

 ```bash
-torchrun run_distributed.py
+torchrun run_distributed.py --nproc_per_node=2
 ```

 ## 🤗 Accelerate

 🤗 [Accelerate](https://huggingface.co/docs/accelerate/index) is a library designed to make it easy to train or run inference across distributed setups. It simplifies the process of setting up the distributed environment, allowing you to focus on your PyTorch code.
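
As a rough illustration of what that automatic setup gives you — a minimal sketch, assuming a machine with at least one GPU and 🤗 Accelerate installed — `PartialState` exposes the device and process information that you would otherwise wire up yourself with `init_process_group`:

```py
from accelerate import PartialState

# PartialState reads the environment prepared by the launcher (e.g. `accelerate launch`)
# and reports which device and which process this copy of the script is.
state = PartialState()
print(state.device, state.process_index, state.num_processes)
```
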

-Start by initializing a [`accelerate.PartialState`] to create a distributed environment; your setup is automatically detected so you don't need to explicitly define the `rank` or `world_size`. Move the [`DiffusionPipeline`] to `state.device` to assign a GPU to each process and use `process_index` to assign a GPU to each prompt:
+Start by initializing an [`accelerate.PartialState`] to create a distributed environment; your setup is automatically detected so you don't need to explicitly define the `rank` or `world_size`. Move the [`DiffusionPipeline`] to `state.device` to assign a GPU to each process and use `process_index` to assign a GPU to each prompt:

 ```py
-#!/usr/bin/env python3
 from accelerate import PartialState
 from diffusers import DiffusionPipeline

@@ -90,8 +88,8 @@
 if __name__ == "__main__":
     main()
 ```

-Call `accelerate launch` to run the distributed inference script:
+Use the `--num_processes` argument to specify the number of GPUs to use, and call `accelerate launch` to run the script:

 ```bash
-accelerate launch run_distributed.py
+accelerate launch run_distributed.py --num_processes=2
 ```
\ No newline at end of file

From d8da73212933a852df42e44a32fb84dae4035dca Mon Sep 17 00:00:00 2001
From: Steven Liu
Date: Wed, 17 May 2023 12:16:37 -0700
Subject: [PATCH 4/5] update with split_between_processes

---
 .../en/training/distributed_inference.mdx | 32 ++++++++-----------
 1 file changed, 14 insertions(+), 18 deletions(-)

diff --git a/docs/source/en/training/distributed_inference.mdx b/docs/source/en/training/distributed_inference.mdx
index 89c3fb479e33..b5c6fb18c955 100644
--- a/docs/source/en/training/distributed_inference.mdx
+++ b/docs/source/en/training/distributed_inference.mdx
@@ -61,31 +61,27 @@
 torchrun run_distributed.py --nproc_per_node=2
 ```

 ## 🤗 Accelerate

 🤗 [Accelerate](https://huggingface.co/docs/accelerate/index) is a library designed to make it easy to train or run inference across distributed setups. It simplifies the process of setting up the distributed environment, allowing you to focus on your PyTorch code.

-Start by initializing an [`accelerate.PartialState`] to create a distributed environment; your setup is automatically detected so you don't need to explicitly define the `rank` or `world_size`. Move the [`DiffusionPipeline`] to `state.device` to assign a GPU to each process and use `process_index` to assign a GPU to each prompt:
-
-```py
-from accelerate import PartialState
-from diffusers import DiffusionPipeline
-
-sd = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
+To begin, create a Python file and initialize an [`accelerate.PartialState`] to create a distributed environment; your setup is automatically detected so you don't need to explicitly define the `rank` or `world_size`. Move the [`DiffusionPipeline`] to `distributed_state.device` to assign a GPU to each process.
+

-def main():
-    state = PartialState()
+To learn more about how to perform distributed inference, for example, if you have an odd number of GPUs or want to pad the prompts to the same length, take a look at the [Distributed Inference with 🤗 Accelerate](https://huggingface.co/docs/accelerate/en/usage_guides/distributed_inference#distributed-inference-with-accelerate) guide.

-    sd.to(state.device)
+

-    if state.process_index == 0:
-        prompt = "a dog"
-    elif state.process_index == 1:
-        prompt = "a cat"
+Now use the [`~accelerate.PartialState.split_between_processes`] utility as a context manager to automatically distribute the prompts between the number of processes.

-    image = sd(prompt).images[0]
-    image.save(f"./{'_'.join(prompt)}.png")
+```py
+from accelerate import PartialState
+from diffusers import DiffusionPipeline

+pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
+distributed_state = PartialState()
+pipeline.to(distributed_state.device)

-if __name__ == "__main__":
-    main()
+with distributed_state.split_across_processes(["a dog", "a cat"]) as prompt:
+    result = pipe(prompt).images[0]
+    result.save(f"result_{distributed_state.rank}.png")
 ```

 Use the `--num_processes` argument to specify the number of GPUs to use, and call `accelerate launch` to run the script:

 ```bash
 accelerate launch run_distributed.py --num_processes=2
 ```
\ No newline at end of file

From 01bcbc43c5defdd749fb81cbd48e3a74e5d1afaa Mon Sep 17 00:00:00 2001
From: Steven Liu
Date: Thu, 18 May 2023 09:44:29 -0700
Subject: [PATCH 5/5] apply feedback

---
 .../en/training/distributed_inference.mdx | 72 +++++++++----------
 1 file changed, 36 insertions(+), 36 deletions(-)

diff --git a/docs/source/en/training/distributed_inference.mdx b/docs/source/en/training/distributed_inference.mdx
index b5c6fb18c955..e85b3f11e238 100644
--- a/docs/source/en/training/distributed_inference.mdx
+++ b/docs/source/en/training/distributed_inference.mdx
@@ -1,12 +1,45 @@
 # Distributed inference with multiple GPUs

-On distributed setups, you can run inference across multiple GPUs with [PyTorch Distributed](https://pytorch.org/tutorials/beginner/dist_overview.html) or 🤗 [Accelerate](https://huggingface.co/docs/accelerate/index), which is useful for generating with multiple prompts in parallel.
+On distributed setups, you can run inference across multiple GPUs with 🤗 [Accelerate](https://huggingface.co/docs/accelerate/index) or [PyTorch Distributed](https://pytorch.org/tutorials/beginner/dist_overview.html), which is useful for generating with multiple prompts in parallel.

-This guide will show you how to use PyTorch Distributed and 🤗 Accelerate for distributed inference.
+This guide will show you how to use 🤗 Accelerate and PyTorch Distributed for distributed inference.
+
+## 🤗 Accelerate
+
+🤗 [Accelerate](https://huggingface.co/docs/accelerate/index) is a library designed to make it easy to train or run inference across distributed setups. It simplifies the process of setting up the distributed environment, allowing you to focus on your PyTorch code.
+
+To begin, create a Python file and initialize an [`accelerate.PartialState`] to create a distributed environment; your setup is automatically detected so you don't need to explicitly define the `rank` or `world_size`. Move the [`DiffusionPipeline`] to `distributed_state.device` to assign a GPU to each process.
+
+Now use the [`~accelerate.PartialState.split_between_processes`] utility as a context manager to automatically distribute the prompts between the number of processes.
+ +```py +from accelerate import PartialState +from diffusers import DiffusionPipeline + +pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16) +distributed_state = PartialState() +pipeline.to(distributed_state.device) + +with distributed_state.split_between_processes(["a dog", "a cat"]) as prompt: + result = pipeline(prompt).images[0] + result.save(f"result_{distributed_state.process_index}.png") +``` + +Use the `--num_processes` argument to specify the number of GPUs to use, and call `accelerate launch` to run the script: + +```bash +accelerate launch run_distributed.py --num_processes=2 +``` + + + +To learn more, take a look at the [Distributed Inference with 🤗 Accelerate](https://huggingface.co/docs/accelerate/en/usage_guides/distributed_inference#distributed-inference-with-accelerate) guide. + + ## PyTorch Distributed -PyTorch supports [`DistributedDataParallel`](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) which enables faster data parallelism. +PyTorch supports [`DistributedDataParallel`](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) which enables data parallelism. To start, create a Python file and import `torch.distributed` and `torch.multiprocessing` to set up the distributed process group and to spawn the processes for inference on each GPU. You should also initialize a [`DiffusionPipeline`]: @@ -55,37 +88,4 @@ Once you've completed the inference script, use the `--nproc_per_node` argument ```bash torchrun run_distributed.py --nproc_per_node=2 -``` - -## 🤗 Accelerate - -🤗 [Accelerate](https://huggingface.co/docs/accelerate/index) is a library designed to make it easy to train or run inference across distributed setups. It simplifies the process of setting up the distributed environment, allowing you to focus on your PyTorch code. - -To begin, create a Python file and initialize an [`accelerate.PartialState`] to create a distributed environment; your setup is automatically detected so you don't need to explicitly define the `rank` or `world_size`. Move the [`DiffusionPipeline`] to `distributed_state.device` to assign a GPU to each process. - - - -To learn more about how to perform distributed inference, for example, if you have an odd number of GPUs or want to pad the prompts to the same length, take a look at the [Distributed Inference with 🤗 Accelerate](https://huggingface.co/docs/accelerate/en/usage_guides/distributed_inference#distributed-inference-with-accelerate) guide. - - - -Now use the [`~accelerate.PartialState.split_between_processes`] utility as a context manager to automatically distribute the prompts between the number of processes. - -```py -from accelerate import PartialState -from diffusers import DiffusionPipeline - -pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16) -distributed_state = PartialState() -pipeline.to(distributed_state.device) - -with distributed_state.split_across_processes(["a dog", "a cat"]) as prompt: - result = pipe(prompt).images[0] - result.save(f"result_{distributed_state.rank}.png") -``` - -Use the `--num_processes` argument to specify the number of GPUs to use, and call `accelerate launch` to run the script: - -```bash -accelerate launch run_distributed.py --num_processes=2 ``` \ No newline at end of file