[MS Text To Video] Add first text to video #2738
Conversation
The documentation is not available anymore as the PR was closed or merged.
@patrickvonplaten I checked the implementation of AutoencoderKL in ModelScope and verified it against what we have in diffusers. The implementations are functionally the same (check the Colab). We (99% likely) just have to convert the parameters.
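For illustration, a minimal sketch of the kind of equivalence check the Colab performs. It assumes you pass in two decode callables (e.g. the ModelScope decoder and the diffusers AutoencoderKL decoder, both already loaded with the converted weights; the names `ms_vae` and `hf_vae` below are hypothetical handles):
import torch

def decoders_match(decode_a, decode_b, latent_shape=(1, 4, 32, 32), atol=1e-4):
    """Run the same random latent through both decoders and compare the outputs."""
    latents = torch.randn(latent_shape)
    with torch.no_grad():
        return torch.allclose(decode_a(latents), decode_b(latents), atol=atol)

# e.g. decoders_match(lambda z: hf_vae.decode(z).sample, lambda z: ms_vae.decode(z))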
@patrickvonplaten here is a Colab Notebook that shows the minor changes needed to make it work with diffusers: https://colab.research.google.com/gist/sayakpaul/f1b55ebcd2c850fcdeda351f3a4599e8/scratchpad.ipynb
@patrickvonplaten I uploaded it here: https://huggingface.co/diffusers/ms-text-to-video-1.7b/tree/main/vae
Here's the Colab Notebook I used: https://colab.research.google.com/gist/sayakpaul/930d6f582e4c5e381db1b392b479141b/scratchpad.ipynb
One weird thing: after converting the model checkpoints from ModelScope to Diffusers, I am seeing a reduction in size. The original VAE checkpoint (https://huggingface.co/damo-vilab/modelscope-damo-text-to-video-synthesis/blob/main/VQGAN_autoencoder.pth) is ~5 GB, while ours (https://huggingface.co/diffusers/ms-text-to-video-1.7b/blob/main/vae/diffusion_pytorch_model.bin) is 335 MB.
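A rough sketch of how one could sanity-check that discrepancy (assumptions: both checkpoints are downloaded locally, and the original .pth may bundle extra training-only tensors or a different precision, which would account for part of the gap; comparing tensor counts is more informative than comparing file sizes):
import torch
from diffusers import AutoencoderKL

# Original ModelScope checkpoint: count every tensor it stores.
original = torch.load("VQGAN_autoencoder.pth", map_location="cpu")
state_dict = original.get("state_dict", original) if isinstance(original, dict) else original
num_original = sum(v.numel() for v in state_dict.values() if torch.is_tensor(v))

# Converted diffusers VAE: count only the parameters that were kept after conversion.
vae = AutoencoderKL.from_pretrained("diffusers/ms-text-to-video-1.7b", subfolder="vae")
num_converted = sum(p.numel() for p in vae.parameters())

print(f"original tensors: {num_original:,} | converted params: {num_converted:,}")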
Pipeline works. Seems to work also with other schedulers and fp16. Can run with just 7 GB of memory using Torch 2.0:
#!/usr/bin/env python3
import cv2
import tempfile
from huggingface_hub import HfApi
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
import torch
api = HfApi()
def write_video(video):
    output_video_path = tempfile.NamedTemporaryFile(suffix='.mp4').name
    fourcc = cv2.VideoWriter_fourcc(*'mp4v')
    h, w, c = video[0].shape
    video_writer = cv2.VideoWriter(
        output_video_path, fourcc, fps=8, frameSize=(w, h))
    for i in range(len(video)):
        img = cv2.cvtColor(video[i], cv2.COLOR_RGB2BGR)
        video_writer.write(img)
    video_writer.release()  # flush frames to disk before returning the path
    return output_video_path
pipe = DiffusionPipeline.from_pretrained("diffusers/ms-text-to-video-sd", variant="fp16", torch_dtype=torch.float16)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()
video = pipe("Spiderman is surfing", num_inference_steps=25).frames
video_path = write_video(video)
api.upload_file(
    path_or_fileobj=video_path,
    path_in_repo="video.mp4",
    repo_id="patrickvonplaten/videos",
    repo_type="dataset",
)
print("https://huggingface.co/datasets/patrickvonplaten/videos/blob/main/video.mp4")
Can generate up to 8 seconds on a V100 thanks to VAE slicing:
pipe.enable_model_cpu_offload()
pipe.enable_vae_slicing()
video = pipe("Darth Vader surfing a wave", num_frames=64, num_inference_steps=25).frames
@@ -0,0 +1,667 @@
# Copyright 2023 Alibaba DAMO-VILAB and The HuggingFace Team. All rights reserved.
Add Alibaba citation
Probably not a big deal, but the pipeline code is arguably mostly HF :)
Yeah true, removed it there since there is really no code from Alibaba except the tensor2vid function, which is tiny and where we left a link and a comment.
@@ -0,0 +1,492 @@
# Copyright 2023 Alibaba DAMO-VILAB and The HuggingFace Team. All rights reserved.
Add Alibaba copyright
src/diffusers/models/resnet.py
@@ -1,3 +1,18 @@
# Copyright 2023 Alibaba DAMO-VILAB and The HuggingFace Team. All rights reserved.
Add Alibaba copyright
class TemporalConvLayer(nn.Module):
    """
    Temporal convolutional layer that can be used for video (sequence of images) input. Code mostly copied from:
    https://github.com/modelscope/modelscope/blob/1509fdb973e5871f37148a4b5e5964cafd43e64d/modelscope/models/multi_modal/video_synthesis/unet_sd.py#L1016
Add comment that it's copied code
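For readers unfamiliar with the layer under discussion, a simplified, hand-written sketch of the idea behind such a temporal convolution block (not the exact class added in this PR): a 3D convolution over the frame axis only, wrapped in a residual connection.
import torch
import torch.nn as nn

class TinyTemporalConv(nn.Module):
    """Simplified temporal mixing block: convolve across frames, leave H/W untouched."""

    def __init__(self, channels: int):
        super().__init__()
        self.norm = nn.GroupNorm(32, channels)  # assumes channels is divisible by 32
        self.act = nn.SiLU()
        # kernel (3, 1, 1): 3 taps along the frame axis, 1x1 spatially
        self.conv = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1), padding=(1, 0, 0))

    def forward(self, hidden_states: torch.Tensor, num_frames: int) -> torch.Tensor:
        # hidden_states: (batch * frames, channels, height, width), as in the 2D UNet blocks
        bf, c, h, w = hidden_states.shape
        x = hidden_states.reshape(bf // num_frames, num_frames, c, h, w).permute(0, 2, 1, 3, 4)
        x = self.conv(self.act(self.norm(x)))
        x = x.permute(0, 2, 1, 3, 4).reshape(bf, c, h, w)
        return hidden_states + x  # residual, so the block can start close to identity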
Amazing work! Just pointed out a few minor questions, but this is totally good to go imo.
@unittest.skipIf(
    torch_device != "cuda" or not is_xformers_available(),
    reason="XFormers attention is only available with CUDA and `xformers` installed",
)
def test_xformers_enable_works(self):
This is always going to be skipped in our current CI, I think (it uses PyTorch 2 unless I'm mistaken).
Yeah we should maybe clean this up soon
assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2

def test_stable_diffusion_pix2pix_negative_prompt(self):
Does pix2pix work? 😮
Yeah not sure - removed this one 😅
Should actually work. Just needed a renaming.
Can we pop it back in?
Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
argument.
return_dict (`bool`, *optional*, defaults to `True`):
output_type is missing in the docstring.
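As context for the quoted docstring, a hedged sketch of how precomputed embeddings can be passed to the pipeline for prompt weighting. It assumes `pipe` is an already-loaded TextToVideoSDPipeline; the helper name `embed` is made up for illustration:
import torch

def embed(pipe, text):
    # Tokenize and run the pipeline's own text encoder, mirroring what the pipeline does internally.
    tokens = pipe.tokenizer(
        text,
        padding="max_length",
        max_length=pipe.tokenizer.model_max_length,
        truncation=True,
        return_tensors="pt",
    )
    with torch.no_grad():
        return pipe.text_encoder(tokens.input_ids.to(pipe.device))[0]

prompt_embeds = embed(pipe, "an astronaut riding a horse")
negative_prompt_embeds = embed(pipe, "low quality, blurry")
frames = pipe(prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_prompt_embeds).frames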
Co-authored-by: Pedro Cuenca <[email protected]>
>>> pipe = TextToVideoSDPipeline.from_pretrained(
...     "damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16"
... )
>>> pipe.enable_model_cpu_offload()
We had previously talked about standardizing example snippets on enabling all optimizations, including UniPCMultistepScheduler, xformers, and 20 steps. Is that worth adding here?
Yeah, good point. I'm somewhat assuming people use torch 2.0 now, so there's no need for xformers anymore. UniPC doesn't work well with the model, but DPM works well. We should maybe add it there.
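A sketch of what the updated docstring example could look like with DPM-Solver and model CPU offload (assuming the released checkpoint id from the snippet above and the `export_to_video` utility added in this PR):
import torch
from diffusers import TextToVideoSDPipeline, DPMSolverMultistepScheduler
from diffusers.utils import export_to_video

pipe = TextToVideoSDPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16"
)
# Swap in DPM-Solver so 25 steps are enough, and offload submodules to save GPU memory.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

video_frames = pipe("Spiderman is surfing", num_inference_steps=25).frames
video_path = export_to_video(video_frames)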
def to_np(tensor):
    if isinstance(tensor, torch.Tensor):
        tensor = tensor.detach().cpu().numpy()

    return tensor
What's the reason we need to convert to numpy arrays here?
Text to video returns PyTorch tensors
lgtm!
Tests passing locally, merging now
Awesome! Thank you for all the great work! Dreambooth is the next step, I guess
* [MS Text To Video} Add first text to video * upload * make first model example * match unet3d params * make sure weights are correcctly converted * improve * forward pass works, but diff result * make forward work * fix more * finish * refactor video output class. * feat: add support for a video export utility. * fix: opencv availability check. * run make fix-copies. * add: docs for the model components. * add: standalone pipeline doc. * edit docstring of the pipeline. * add: right path to TransformerTempModel * add: first set of tests. * complete fast tests for text to video. * fix bug * up * three fast tests failing. * add: note on slow tests * make work with all schedulers * apply styling. * add slow tests * change file name * update * more correction * more fixes * finish * up * Apply suggestions from code review * up * finish * make copies * fix pipeline tests * fix more tests * Apply suggestions from code review Co-authored-by: Pedro Cuenca <[email protected]> * apply suggestions * up * revert --------- Co-authored-by: Sayak Paul <[email protected]> Co-authored-by: Pedro Cuenca <[email protected]>
This PR adds the text-to-video model from ModelScope: https://modelscope.cn/models/damo/text-to-video-synthesis/summary
Also see: https://www.reddit.com/r/StableDiffusion/comments/11vbyei/first_open_source_text_to_video_17_billion/
The model consists of three components: a text encoder, a 3D denoising UNet, and a VAE.
Simple command to run the model:
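A minimal sketch of such a command (assuming the released checkpoint id `damo-vilab/text-to-video-ms-1.7b` and the `export_to_video` utility this PR adds):
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16"
).to("cuda")
video_frames = pipe("Spiderman is surfing", num_inference_steps=25).frames
print(export_to_video(video_frames))  # path to the written .mp4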
To reproduce results compared to the original model: