[Pipeline] Add TextToVideoZeroPipeline #2954
Conversation
The documentation is not available anymore as the PR was closed or merged.
Hey @19and99! Thanks for the PR. We will review it soon. Could you please ensure "Run code quality checks / check_repository_consistency" tests pass? For that, I suggest:
I also added two resource files in …
[Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators](https://arxiv.org/abs/2303.13439) <br />
Levon Khachatryan,
Andranik Movsisyan,
Vahram Tadevosyan,
Roberto Henschel,
[Zhangyang Wang](https://www.ece.utexas.edu/people/faculty/atlas-wang), Shant Navasardyan, [Humphrey Shi](https://www.humphreyshi.com)
This might break our doc-builder.
Suggested change:

[Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators](https://arxiv.org/abs/2303.13439) by Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, [Zhangyang Wang](https://www.ece.utexas.edu/people/faculty/atlas-wang), Shant Navasardyan, [Humphrey Shi](https://www.humphreyshi.com).
Done!
This isn't addressed @19and99
<br />
Results are temporally consistent and closely follow the guidance and textual prompts.

 
We keep the repository lightweight.
So, please open a PR to https://huggingface.co/datasets/huggingface/documentation-images
Here it is https://huggingface.co/datasets/huggingface/documentation-images/discussions/71
What about test resources? I can see that some tests download golden resources from https://huggingface.co/datasets/hf-internal-testing
I think @patrickvonplaten already added them.
See here: https://huggingface.co/datasets/hf-internal-testing/diffusers-images/tree/main/text-to-video
import imageio
from PIL import Image

reader = imageio.get_reader('path/to/your/video', 'ffmpeg')
frame_count = 8
video = [Image.fromarray(reader.get_data(i)) for i in range(frame_count)]
Same as what I mentioned in https://github.com/huggingface/diffusers/pull/2954/files#r1158020490.
Done!
### Dreambooth specialization

Methods **Text-To-Video**, **Text-To-Video with Pose Control** and **Text-To-Video with Edge Control** can run with custom dreambooth models by simply set the `model_id` to corresponding model path or url.
Could you expand this a bit more with a code snippet? That will be useful for the users.
Suggested change:

Methods **Text-To-Video**, **Text-To-Video with Pose Control** and **Text-To-Video with Edge Control** can run with custom [DreamBooth](../training/dreambooth) models by simply setting the `model_id` to the corresponding model path or URL. You can browse available DreamBooth-trained models via [this link](https://huggingface.co/models?search=dreambooth).
Done!
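For readers following along, a minimal sketch of what such a snippet can look like (the DreamBooth checkpoint ID below is a placeholder, not a model shipped with this PR):

```python
import torch
import imageio
from diffusers import TextToVideoZeroPipeline

# Placeholder ID: any DreamBooth-finetuned Stable Diffusion checkpoint,
# given as a local path or a Hub repo ID, can be passed as `model_id`.
model_id = "your-username/your-dreambooth-model"
pipe = TextToVideoZeroPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

prompt = "A panda, in the DreamBooth subject's style, is playing guitar"
result = pipe(prompt=prompt).images

# Frames come back as `np.ndarray` in [0, 1]; scale to 8-bit before writing.
imageio.mimsave("video.mp4", [(frame * 255).astype("uint8") for frame in result], fps=4)
```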
model_id = "runwayml/stable-diffusion-v1-5"
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
This is very cool! Does this work? Is the CrossFrameAttnProcessor enough to make it work? Nice!
I actually added 2 more lines that were missing))
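For context, here is a sketch of how the cross-frame attention processor gets wired into a ControlNet pipeline, following the pattern from the Text2Video-Zero documentation (the exact two lines added in this PR may differ):

```python
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.pipelines.text_to_video_synthesis.pipeline_text_to_video_zero import CrossFrameAttnProcessor

model_id = "runwayml/stable-diffusion-v1-5"
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    model_id, controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# Swap the default attention processors for the cross-frame variant on both
# the UNet and the ControlNet, so later frames attend to the first frame.
pipe.unet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2))
pipe.controlnet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2))
```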
prompt = "A bear is playing a guitar on Times Square" | ||
result = pipe(prompt=prompt, generator=generator).images | ||
|
||
expected_result = torch.load("docs/source/en/api/pipelines/res/A bear is playing a guitar on Times Square.pt") |
expected_result = torch.load("docs/source/en/api/pipelines/res/A bear is playing a guitar on Times Square.pt") | |
expected_result = torch.load("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/text-to-video/A%20bear%20is%20playing%20a%20guitar%20on%20Times%20Square.pt") |
Let's not upload tensors and images directly to the GitHub repo
Done!
<br />
Results are temporally consistent and closely follow the guidance and textual prompts.

 
 | |
 |
I've uploaded this image to datasets/huggingface/documentation-images. You may remove it from hf-internal-testing.
Besides some final things to change that @sayakpaul pointed out, from my side we're good to merge for this model. I've uploaded the data here for you: https://huggingface.co/datasets/hf-internal-testing/diffusers-images/tree/main/text-to-video. Let's make sure the tests pass and I think we're good to go :-)
[Canny edge ControlNet model](https://huggingface.co/lllyasviel/sd-controlnet-canny) and
[Avatar style DreamBooth](https://huggingface.co/PAIR/text2video-zero-controlnet-canny-avatar) model

1. Download demo video from huggingface
Suggested change:

Download a demo video
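As a side note, fetching a demo clip from the Hub could look like the following sketch (the Space ID and file path are assumptions based on the Text2Video-Zero assets, not something pinned down in this PR):

```python
from huggingface_hub import hf_hub_download

# Assumed location: the Text2Video-Zero demo Space hosts example pose videos
# under __assets__/; adjust repo_id/filename to the clip you actually want.
video_path = hf_hub_download(
    repo_id="PAIR/Text2Video-Zero",
    repo_type="space",
    filename="__assets__/poses_skeleton_gifs/dance1_corr.mp4",
)
```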
Excellent work here @19and99! Really well done and thanks so much for iterating so much.
I went ahead and changed a few nits. Hope that's okay.
@patrickvonplaten I think we should be ready to merge this one!
Amazing work @19and99
I got the TextToVideoZeroPipeline working and am able to save the frames to video with imageio.mimsave, but I'm struggling to save the individual frames as PNG image files after exporting the MP4. The recommended default was output_type="tensor" (it didn't look like output_type np or pil was implemented), and the type shows as numpy.ndarray. I'm doing the standard `for image in images:` and have tried saving the image with imageio.imwrite, .imsave, cv2.imwrite, Image.fromarray, pipe.numpy_to_pil, converting to uint8, and a bunch of other methods that just result in type errors or black images. I couldn't find any examples or posted issues that gave me a working method. I previously struggled with the same thing in TextToVideoSDPipeline, but there the cv2.imwrite method worked. It's probably a simple answer, I'm just not getting it. Any help with saving those tensor frames? Thanks.
#3049 should make this more clear. But if you do this (from the official documentation):

import torch
import imageio
from diffusers import TextToVideoZeroPipeline

model_id = "runwayml/stable-diffusion-v1-5"
pipe = TextToVideoZeroPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

prompt = "A panda is playing guitar on times square"
result = pipe(prompt=prompt).images
imageio.mimsave("video.mp4", result, fps=4)

then it should work. Also, going forward please open a new issue as it's easier for us to keep track of them that way. Cc: @19and99
I got that part working using imageio.mimsave to an MP4; that wasn't the problem. I'm trying to save those frame images as PNG files as well, that was the issue...
Ah, sorry. I got lost in the longer message. If that's the case, you can do:

from PIL import Image

# The images are `np.ndarray`.
result = pipe(prompt=prompt).images
result = [Image.fromarray((image * 255).astype("uint8")) for image in result]

for i, image in enumerate(result):
    image.save(f"{i}.png")

Does this work?
Nice, that worked, thanks. I tried something similar to that solution, but a little differently. Much appreciated.
* add TextToVideoZeroPipeline and CrossFrameAttnProcessor
* add docs for text-to-video zero
* add teaser image for text-to-video zero docs
* Fix review changes. Add Documentation. Add test
* clean up the codes in pipeline_text_to_video.py. Add descriptive comments and docstrings
* make style && make quality
* make fix-copies
* make requested changes to docs. use huggingface server links for resources, delete res folder
* make style && make quality && make fix-copies
* make style && make quality
* Apply suggestions from code review

---------

Co-authored-by: Sayak Paul <[email protected]>
This pull request adds `TextToVideoZeroPipeline` to the diffusers library.

Materials

Sample code for inference
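The code under "Sample code for inference" did not survive the page capture; as a stand-in, here is a minimal sketch reconstructed from the usage shown earlier in this thread:

```python
import torch
import imageio
from diffusers import TextToVideoZeroPipeline

model_id = "runwayml/stable-diffusion-v1-5"
pipe = TextToVideoZeroPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

# The pipeline returns a list of `np.ndarray` frames with values in [0, 1].
prompt = "A panda is playing guitar on times square"
result = pipe(prompt=prompt).images

# Scale to 8-bit and write the frames out as an MP4 clip.
imageio.mimsave("video.mp4", [(frame * 255).astype("uint8") for frame in result], fps=4)
```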