
Commit ca1a222

patrickvonplaten, sayakpaul, and pcuenca authored
[MS Text To Video] Add first text to video (#2738)
* [MS Text To Video} Add first text to video
* upload
* make first model example
* match unet3d params
* make sure weights are correcctly converted
* improve
* forward pass works, but diff result
* make forward work
* fix more
* finish
* refactor video output class.
* feat: add support for a video export utility.
* fix: opencv availability check.
* run make fix-copies.
* add: docs for the model components.
* add: standalone pipeline doc.
* edit docstring of the pipeline.
* add: right path to TransformerTempModel
* add: first set of tests.
* complete fast tests for text to video.
* fix bug
* up
* three fast tests failing.
* add: note on slow tests
* make work with all schedulers
* apply styling.
* add slow tests
* change file name
* update
* more correction
* more fixes
* finish
* up
* Apply suggestions from code review
* up
* finish
* make copies
* fix pipeline tests
* fix more tests
* Apply suggestions from code review

  Co-authored-by: Pedro Cuenca <[email protected]>

* apply suggestions
* up
* revert

---------

Co-authored-by: Sayak Paul <[email protected]>
Co-authored-by: Pedro Cuenca <[email protected]>
1 parent 7fe8861 commit ca1a222

40 files changed, +3236 -28 lines changed

docs/source/en/_toctree.yml

Lines changed: 2 additions & 0 deletions
@@ -192,6 +192,8 @@
       title: Stable unCLIP
     - local: api/pipelines/stochastic_karras_ve
       title: Stochastic Karras VE
+    - local: api/pipelines/text_to_video
+      title: Text-to-Video
     - local: api/pipelines/unclip
       title: UnCLIP
     - local: api/pipelines/latent_diffusion_uncond

docs/source/en/api/models.mdx

Lines changed: 12 additions & 0 deletions
@@ -37,6 +37,12 @@ The models are built on the base class ['ModelMixin'] that is a `torch.nn.module
 ## UNet2DConditionModel
 [[autodoc]] UNet2DConditionModel
 
+## UNet3DConditionOutput
+[[autodoc]] models.unet_3d_condition.UNet3DConditionOutput
+
+## UNet3DConditionModel
+[[autodoc]] UNet3DConditionModel
+
 ## DecoderOutput
 [[autodoc]] models.vae.DecoderOutput

@@ -58,6 +64,12 @@ The models are built on the base class ['ModelMixin'] that is a `torch.nn.module
 ## Transformer2DModelOutput
 [[autodoc]] models.transformer_2d.Transformer2DModelOutput
 
+## TransformerTemporalModel
+[[autodoc]] models.transformer_temporal.TransformerTemporalModel
+
+## TransformerTemporalModelOutput
+[[autodoc]] models.transformer_temporal.TransformerTemporalModelOutput
+
 ## PriorTransformer
 [[autodoc]] models.prior_transformer.PriorTransformer

docs/source/en/api/pipelines/overview.mdx

Lines changed: 1 addition & 0 deletions
@@ -77,6 +77,7 @@ available a colab notebook to directly try them out.
 | [stable_unclip](./stable_unclip) | **Stable unCLIP** | Text-to-Image Generation |
 | [stable_unclip](./stable_unclip) | **Stable unCLIP** | Image-to-Image Text-Guided Generation |
 | [stochastic_karras_ve](./stochastic_karras_ve) | [**Elucidating the Design Space of Diffusion-Based Generative Models**](https://arxiv.org/abs/2206.00364) | Unconditional Image Generation |
+| [text_to_video_sd](./api/pipelines/text_to_video) | [Modelscope's Text-to-video-synthesis Model in Open Domain](https://modelscope.cn/models/damo/text-to-video-synthesis/summary) | Text-to-Video Generation |
 | [unclip](./unclip) | [Hierarchical Text-Conditional Image Generation with CLIP Latents](https://arxiv.org/abs/2204.06125) | Text-to-Image Generation |
 | [versatile_diffusion](./versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Text-to-Image Generation |
 | [versatile_diffusion](./versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Image Variations Generation |

docs/source/en/api/pipelines/text_to_video.mdx

Lines changed: 122 additions & 0 deletions
@@ -0,0 +1,122 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Text-to-video synthesis

Text-to-video synthesis from [ModelScope](https://modelscope.cn/) can be considered the same as Stable Diffusion structure-wise, but it is extended to videos instead of static images. More specifically, this system allows us to generate videos from a natural language text prompt.

From the [model summary](https://huggingface.co/damo-vilab/modelscope-damo-text-to-video-synthesis):

*This model is based on a multi-stage text-to-video generation diffusion model, which inputs a description text and returns a video that matches the text description. Only English input is supported.*

Resources:

* [Website](https://modelscope.cn/models/damo/text-to-video-synthesis/summary)
* [GitHub repository](https://github.com/modelscope/modelscope/)
* [Spaces] (TODO)

## Available Pipelines:

| Pipeline | Tasks | Demo |
|---|---|:---:|
| [DiffusionPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/text_to_video_synthesis/pipeline_text_to_video_synth.py) | *Text-to-Video Generation* | [Spaces] (TODO) |

## Usage example

Let's start by generating a short video with the default length of 16 frames (2s at 8 fps):

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")
pipe = pipe.to("cuda")

prompt = "Spiderman is surfing"
video_frames = pipe(prompt).frames
video_path = export_to_video(video_frames)
video_path
```
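
Continuing from the snippet above, here is a minimal sketch of what the pipeline returns and what `export_to_video` produces. It assumes, as the export utility expects, that `.frames` is a list of `uint8` RGB NumPy arrays.

```python
# Minimal sketch, continuing from the example above.
# Assumption: `video_frames` is a list of H x W x 3 uint8 NumPy arrays.
print(len(video_frames))        # number of generated frames (16 by default)
print(video_frames[0].shape)    # (height, width, 3) for a single RGB frame

# `export_to_video` writes the frames to an .mp4 file and returns its path,
# so `video_path` points at a playable video on disk.
print(video_path)
```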

Diffusers supports different optimization techniques to improve the latency
and memory footprint of a pipeline. Since videos are often more memory-heavy than images,
we can enable CPU offloading and VAE slicing to keep the memory footprint at bay.

Let's generate a video of 8 seconds (64 frames) on the same GPU using CPU offloading and VAE slicing:

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")
pipe.enable_model_cpu_offload()

# memory optimization
pipe.enable_vae_slicing()

prompt = "Darth Vader surfing a wave"
video_frames = pipe(prompt, num_frames=64).frames
video_path = export_to_video(video_frames)
video_path
```

It takes just **7 GB of GPU memory** to generate the 64 video frames using PyTorch 2.0, "fp16" precision, and the techniques mentioned above.
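
To check the memory figure on your own hardware, you can wrap the call with PyTorch's peak-memory counters. This is a hedged sketch that reuses the `pipe` object from the snippet above and only standard `torch.cuda` statistics; the exact number will vary with hardware, resolution, frame count, and PyTorch version.

```python
# Hedged sketch: measure peak GPU memory for the 64-frame generation above.
import torch

torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()

video_frames = pipe("Darth Vader surfing a wave", num_frames=64).frames

peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak GPU memory: {peak_gb:.2f} GB")
```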

We can also swap in a different scheduler easily, using the same method we'd use for Stable Diffusion:

```python
import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

prompt = "Spiderman is surfing"
video_frames = pipe(prompt, num_inference_steps=25).frames
video_path = export_to_video(video_frames)
video_path
```
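
When comparing schedulers, it helps to fix the random seed so that the scheduler is the only thing changing between runs. Passing a `torch.Generator` is the usual Diffusers convention; its support by this particular pipeline is assumed here.

```python
# Hedged sketch: fix the seed so scheduler comparisons start from the same noise.
# The `generator` argument follows the usual Diffusers convention (assumed supported here).
import torch

generator = torch.Generator(device="cuda").manual_seed(0)
video_frames = pipe("Spiderman is surfing", num_inference_steps=25, generator=generator).frames
```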

Here are some sample outputs:

<table>
    <tr>
        <td><center>
        An astronaut riding a horse.
        <br>
        <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astr.gif"
            alt="An astronaut riding a horse."
            style="width: 300px;" />
        </center></td>
        <td ><center>
        Darth vader surfing in waves.
        <br>
        <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/vader.gif"
            alt="Darth vader surfing in waves."
            style="width: 300px;" />
        </center></td>
    </tr>
</table>
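
The clips above are GIFs. For a similar quick preview of your own output without a video player, the frames can be written to a GIF with Pillow. This is a hedged sketch that assumes the frames are `uint8` RGB NumPy arrays (as above) and the default 8 fps playback rate.

```python
# Hedged sketch: turn the generated frames into a GIF preview with Pillow.
from PIL import Image

images = [Image.fromarray(frame) for frame in video_frames]
images[0].save(
    "preview.gif",
    save_all=True,
    append_images=images[1:],
    duration=1000 // 8,  # milliseconds per frame at 8 fps
    loop=0,
)
```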

## Available checkpoints

* [damo-vilab/text-to-video-ms-1.7b](https://huggingface.co/damo-vilab/text-to-video-ms-1.7b/)
* [damo-vilab/text-to-video-ms-1.7b-legacy](https://huggingface.co/damo-vilab/text-to-video-ms-1.7b-legacy)
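
Under the hood, these checkpoints use the newly documented `UNet3DConditionModel` and `TransformerTemporalModel` components from this commit. A hedged sketch for peeking at them in a loaded pipeline, using only standard `diffusers`/PyTorch calls:

```python
# Hedged sketch: inspect the 3D components of the loaded text-to-video pipeline.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16"
)

# The denoiser is the new 3D UNet rather than UNet2DConditionModel.
print(type(pipe.unet).__name__)  # expected: "UNet3DConditionModel"

# TransformerTemporalModel blocks inside the UNet attend across the frame dimension;
# count them by class name to avoid assuming a particular import path.
temporal_blocks = [m for m in pipe.unet.modules() if type(m).__name__ == "TransformerTemporalModel"]
print(len(temporal_blocks))
```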

## DiffusionPipeline
[[autodoc]] DiffusionPipeline
    - all
    - __call__

docs/source/en/index.mdx

Lines changed: 2 additions & 1 deletion
@@ -84,8 +84,9 @@ The library has three main components:
 | [stable_unclip](./stable_unclip) | Stable unCLIP | Text-to-Image Generation |
 | [stable_unclip](./stable_unclip) | Stable unCLIP | Image-to-Image Text-Guided Generation |
 | [stochastic_karras_ve](./api/pipelines/stochastic_karras_ve) | [Elucidating the Design Space of Diffusion-Based Generative Models](https://arxiv.org/abs/2206.00364) | Unconditional Image Generation |
+| [text_to_video_sd](./api/pipelines/text_to_video) | [Modelscope's Text-to-video-synthesis Model in Open Domain](https://modelscope.cn/models/damo/text-to-video-synthesis/summary) | Text-to-Video Generation |
 | [unclip](./api/pipelines/unclip) | [Hierarchical Text-Conditional Image Generation with CLIP Latents](https://arxiv.org/abs/2204.06125)(implementation by [kakaobrain](https://github.com/kakaobrain/karlo)) | Text-to-Image Generation |
 | [versatile_diffusion](./api/pipelines/versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Text-to-Image Generation |
 | [versatile_diffusion](./api/pipelines/versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Image Variations Generation |
 | [versatile_diffusion](./api/pipelines/versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Dual Image and Text Guided Generation |
-| [vq_diffusion](./api/pipelines/vq_diffusion) | [Vector Quantized Diffusion Model for Text-to-Image Synthesis](https://arxiv.org/abs/2111.14822) | Text-to-Image Generation |
+| [vq_diffusion](./api/pipelines/vq_diffusion) | [Vector Quantized Diffusion Model for Text-to-Image Synthesis](https://arxiv.org/abs/2111.14822) | Text-to-Image Generation |

examples/community/stable_diffusion_controlnet_img2img.py

Lines changed: 1 addition & 1 deletion
@@ -216,7 +216,7 @@ def enable_model_cpu_offload(self, gpu_id=0):
         if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"):
             from accelerate import cpu_offload_with_hook
         else:
-            raise ImportError("`enable_model_offload` requires `accelerate v0.17.0` or higher.")
+            raise ImportError("`enable_model_cpu_offload` requires `accelerate v0.17.0` or higher.")
 
         device = torch.device(f"cuda:{gpu_id}")

examples/community/stable_diffusion_controlnet_inpaint.py

Lines changed: 1 addition & 1 deletion
@@ -314,7 +314,7 @@ def enable_model_cpu_offload(self, gpu_id=0):
         if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"):
             from accelerate import cpu_offload_with_hook
         else:
-            raise ImportError("`enable_model_offload` requires `accelerate v0.17.0` or higher.")
+            raise ImportError("`enable_model_cpu_offload` requires `accelerate v0.17.0` or higher.")
 
         device = torch.device(f"cuda:{gpu_id}")

examples/community/stable_diffusion_controlnet_inpaint_img2img.py

Lines changed: 1 addition & 1 deletion
@@ -314,7 +314,7 @@ def enable_model_cpu_offload(self, gpu_id=0):
         if is_accelerate_available() and is_accelerate_version(">=", "0.17.0.dev0"):
             from accelerate import cpu_offload_with_hook
         else:
-            raise ImportError("`enable_model_offload` requires `accelerate v0.17.0` or higher.")
+            raise ImportError("`enable_model_cpu_offload` requires `accelerate v0.17.0` or higher.")
 
         device = torch.device(f"cuda:{gpu_id}")
