
[Pipeline] Add TextToVideoZeroPipeline #2954


Merged

Conversation

@19and99 (Contributor) commented on Apr 3, 2023

This pull request adds TextToVideoZeroPipeline to diffusers library.

Materials

Sample code for inference

import torch
import imageio
from diffusers import TextToVideoZeroPipeline

model_id = "runwayml/stable-diffusion-v1-5"
pipe = TextToVideoZeroPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

prompt = "A panda is playing guitar on times square"
result = pipe(prompt=prompt).images
imageio.mimsave("video.mp4", result, fps=4)

@HuggingFaceDocBuilderDev commented on Apr 3, 2023

The documentation is not available anymore as the PR was closed or merged.

@19and99 (Contributor, Author) commented on Apr 4, 2023

@sayakpaul

@sayakpaul (Member)

Hey @19and99! Thanks for the PR. We will review it soon.

Could you please ensure "Run code quality checks / check_repository_consistency" tests pass?

For that, I suggest:

  • Head over to the diffusers directory locally (the one you forked).
  • Activate your Python virtual environment for developing diffusers.
  • Run make fix-copies.
  • And then push the changes.

@19and99 (Contributor, Author) commented on Apr 4, 2023

I also added two resource files in the docs/source/en/api/pipelines/res folder. I guess these need to be moved to a Hugging Face dataset. @sayakpaul @patrickvonplaten

Comment on lines 17 to 22
[Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators](https://arxiv.org/abs/2303.13439) <br />
Levon Khachatryan,
Andranik Movsisyan,
Vahram Tadevosyan,
Roberto Henschel,
[Zhangyang Wang](https://www.ece.utexas.edu/people/faculty/atlas-wang), Shant Navasardyan, [Humphrey Shi](https://www.humphreyshi.com)
Member

This might break our doc-builder.

Suggested change
[Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators](https://arxiv.org/abs/2303.13439) <br />
Levon Khachatryan,
Andranik Movsisyan,
Vahram Tadevosyan,
Roberto Henschel,
[Zhangyang Wang](https://www.ece.utexas.edu/people/faculty/atlas-wang), Shant Navasardyan, [Humphrey Shi](https://www.humphreyshi.com)
[Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators](https://arxiv.org/abs/2303.13439) by Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, [Zhangyang Wang](https://www.ece.utexas.edu/people/faculty/atlas-wang), Shant Navasardyan, [Humphrey Shi](https://www.humphreyshi.com).

Contributor Author

Done!

Member

This isn't addressed @19and99

<br />
Results are temporally consistent and follow closely the guidance and textual prompts.

![img](./res/teaser_final.png)
Member

We keep the repository lightweight.

So, please open a PR to https://huggingface.co/datasets/huggingface/documentation-images

Contributor Author

Here it is: https://huggingface.co/datasets/huggingface/documentation-images/discussions/71
What about test resources? I can see that some tests download golden resources from https://huggingface.co/datasets/hf-internal-testing


Comment on lines 123 to 125
reader = imageio.get_reader('path/to/your/video', 'ffmpeg')
frame_count = 8
video = [Image.fromarray(reader.get_data(i)) for i in range(frame_count)]

Contributor Author

Done!


### Dreambooth specialization

Methods **Text-To-Video**, **Text-To-Video with Pose Control** and **Text-To-Video with Edge Control** can run with custom dreambooth models by simply setting the `model_id` to the corresponding model path or URL.
Member

Could you expand this a bit more with a code snippet? That will be useful for the users.

Suggested change
Methods **Text-To-Video**, **Text-To-Video with Pose Control** and **Text-To-Video with Edge Control** can run with custom dreambooth models by simply setting the `model_id` to the corresponding model path or URL.
Methods **Text-To-Video**, **Text-To-Video with Pose Control** and **Text-To-Video with Edge Control** can run with custom [DreamBooth](../training/dreambooth) models by simply setting the `model_id` to the corresponding model path or URL. You can find some available DreamBooth-trained models with [this link](https://huggingface.co/models?search=dreambooth).

Contributor Author

Done!
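
For readers following along, a minimal sketch of that swap; the DreamBooth checkpoint name and prompt below are only illustrative placeholders for any Stable Diffusion-compatible model id or local path:

import torch
import imageio
from diffusers import TextToVideoZeroPipeline

# Illustrative DreamBooth-finetuned checkpoint; substitute any compatible Hub repo id or local path.
model_id = "sd-dreambooth-library/mr-potato-head"
pipe = TextToVideoZeroPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

prompt = "sks mr potato head is playing guitar on Times Square"
result = pipe(prompt=prompt).images
imageio.mimsave("video.mp4", result, fps=4)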


model_id = "runwayml/stable-diffusion-v1-5"
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
Contributor

This is very cool! Does this work? The CrossFrameAttnProcessor is enough to make it work? Nice!

Contributor Author

I actually added 2 more lines that were missing :)

prompt = "A bear is playing a guitar on Times Square"
result = pipe(prompt=prompt, generator=generator).images
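
For context, a rough sketch of how the cross-frame attention processor might be wired into a ControlNet pipeline, following the pattern this PR documents; the import path, the `pose_images` placeholders, and the prompt are assumptions, not code from the diff:

import torch
from PIL import Image
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
# Assumed import path for the processor added in this PR; it may differ in the released package.
from diffusers.pipelines.text_to_video_synthesis.pipeline_text_to_video_zero import CrossFrameAttnProcessor

model_id = "runwayml/stable-diffusion-v1-5"
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    model_id, controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# Swap in cross-frame attention so every frame's self-attention looks at the first frame.
pipe.unet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2))
pipe.controlnet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2))

# `pose_images` stands in for per-frame pose maps; blank placeholders here.
pose_images = [Image.new("RGB", (512, 512)) for _ in range(8)]
# Share the same initial latents across frames.
latents = torch.randn((1, 4, 64, 64), device="cuda", dtype=torch.float16).repeat(len(pose_images), 1, 1, 1)
prompt = "A bear is playing a guitar on Times Square"
result = pipe(prompt=[prompt] * len(pose_images), image=pose_images, latents=latents).images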

expected_result = torch.load("docs/source/en/api/pipelines/res/A bear is playing a guitar on Times Square.pt")
Contributor

Suggested change
expected_result = torch.load("docs/source/en/api/pipelines/res/A bear is playing a guitar on Times Square.pt")
expected_result = torch.load("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/text-to-video/A%20bear%20is%20playing%20a%20guitar%20on%20Times%20Square.pt")

Contributor

Let's not upload tensors and images directly to the GitHub repo

Contributor Author

Done!

<br />
Results are temporally consistent and follow closely the guidance and textual prompts.

![img](./res/teaser_final.png)
Contributor

Suggested change
![img](./res/teaser_final.png)
![img](https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/text-to-video/teaser_final.png)


Contributor Author

I've uploaded this image to datasets/huggingface/documentation-images. You may remove it from hf-internal-testing

@patrickvonplaten (Contributor)

Besides some final things to change, as @sayakpaul pointed out, from my side we're good to merge this model.
Please make sure to delete the res folder, as we don't want to upload any heavy objects to the GitHub repo.

I've uploaded the data here for you: https://huggingface.co/datasets/hf-internal-testing/diffusers-images/tree/main/text-to-video

Let's make sure the tests pass and I think we're good to go :-)

@19and99 changed the title from "Add TextToVideoZeroPipeline" to "[Pipeline] Add TextToVideoZeroPipeline" on Apr 6, 2023
[Canny edge ControlNet model](https://huggingface.co/lllyasviel/sd-controlnet-canny) and
[Avatar style DreamBooth](https://huggingface.co/PAIR/text2video-zero-controlnet-canny-avatar) model

1. Download demo video from huggingface
Member

Suggested change
1. Download demo video from huggingface
Download a demo video
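
A minimal sketch of that download step, assuming the demo clip lives in a Hub repo; the repo id and filename below are illustrative, not taken from this PR, and any short local video works just as well:

from huggingface_hub import hf_hub_download

# Illustrative source for a short demo clip hosted on the Hub.
video_path = hf_hub_download(
    repo_type="space",
    repo_id="PAIR/Text2Video-Zero",
    filename="__assets__/poses_skeleton_gifs/dance1_corr.mp4",
)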

@sayakpaul (Member) left a comment

Excellent work here @19and99! Really well done and thanks so much for iterating so much.

I went ahead and changed a few nits. Hope that's okay.

@patrickvonplaten I think we should be ready to merge this one!

@patrickvonplaten (Contributor) left a comment

Amazing work @19and99

@patrickvonplaten merged commit ba49272 into huggingface:main on Apr 10, 2023
@Skquark commented on Apr 12, 2023

I got the TextToVideoZeroPipeline working and am able to save the frames to a video with imageio.mimsave; however, I'm struggling to save the individual frames as PNG image files after exporting the mp4. I'm using output_type="tensor", which was the recommended default (it didn't look like output_type np or pil was implemented), and the type shows as numpy.ndarray. I'm doing the standard `for image in images:` and have tried saving each image with imageio.imwrite, .imsave, cv2.imwrite, Image.fromarray, pipe.numpy_to_pil, converting to uint8, and a bunch of other methods that just result in type errors or black images. I couldn't find any examples or posted issues that gave me a working method. I previously struggled with the same thing with TextToVideoSDPipeline, but there the cv2.imwrite method worked. It's probably a simple answer, I'm just not getting it. Any help with saving those tensor frames? Thanks.

@sayakpaul (Member)

#3049 should make this more clear.

But if you do this (from the official documentation):

import torch
import imageio
from diffusers import TextToVideoZeroPipeline

model_id = "runwayml/stable-diffusion-v1-5"
pipe = TextToVideoZeroPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

prompt = "A panda is playing guitar on times square"
result = pipe(prompt=prompt).images
imageio.mimsave("video.mp4", result, fps=4)

then it should work.

Also, going forward please open a new issue as it's easier for us to keep track of them that way.

Cc: @19and99

@Skquark commented on Apr 12, 2023

I got that part working, using imageio.mimsave to write the mp4; that wasn't the problem. I'm trying to save those frame images as PNG files as well, and that was the issue...

@sayakpaul (Member)

Ah, sorry. I got lost in the longer message.

If that's the case, you can do:

from PIL import Image

# The images are `np.ndarray`.
result = pipe(prompt=prompt).images

result = [Image.fromarray((image * 255).astype("uint8")) for image in result]
for i, image in enumerate(result):
    image.save(f"{i}.png")

Does this work?

@Skquark commented on Apr 12, 2023

Nice, that worked, thanks. I tried something similar to that solution, but a little differently. Much appreciated.
Side note: when doing the imageio.mimsave, I get a series of these warnings:
WARNING:imageio:Lossy conversion from float32 to uint8. Range [0, 1]. Convert image to uint8 prior to saving to suppress this warning.
It still works, but is there a way around these warnings?
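
The warning text itself points at a likely workaround: cast the frames to uint8 before handing them to imageio. A minimal sketch, assuming `result` holds the float32 frames in [0, 1]:

import imageio
import numpy as np

# Convert the float [0, 1] frames to uint8 ourselves so imageio does not have to.
frames_uint8 = [(frame * 255).astype(np.uint8) for frame in result]
imageio.mimsave("video.mp4", frames_uint8, fps=4)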

w4ffl35 pushed a commit to w4ffl35/diffusers that referenced this pull request Apr 14, 2023
* add TextToVideoZeroPipeline and CrossFrameAttnProcessor

* add docs for text-to-video zero

* add teaser image for text-to-video zero docs

* Fix review changes. Add Documentation. Add test

* clean up the codes in pipeline_text_to_video.py. Add descriptive comments and docstrings

* make style && make quality

* make fix-copies

* make requested changes to docs. use huggingface server links for resources, delete res folder

* make style && make quality && make fix-copies

* make style && make quality

* Apply suggestions from code review

---------

Co-authored-by: Sayak Paul <[email protected]>
dg845 pushed a commit to dg845/diffusers that referenced this pull request May 6, 2023
yoonseokjin pushed a commit to yoonseokjin/diffusers that referenced this pull request Dec 25, 2023
AmericanPresidentJimmyCarter pushed a commit to AmericanPresidentJimmyCarter/diffusers that referenced this pull request Apr 26, 2024