[Pipeline] animatediff + vid2vid + controlnet #9337
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
a-r-r-o-w left a comment
Thank you very much, this is looking really good! I would appreciate seeing a few experimental results though, just to verify functionality:
- An example making use of `enforce_inference_steps=True`, possibly latent upscale from the other reference PRs or anything unique of your choice
- An example making use of the FreeNoise prompt travel feature
- An example with IPAdapter usage alongside ControlNet
Looking at the diffs, all changes look great to me 💯
Prompt travel

```python
strength = 0.8
pipe.set_ip_adapter_scale(0.0)

context_length = 16
context_stride = 4
pipe.enable_free_noise(context_length=context_length, context_stride=context_stride)

# Can be a single prompt, or a dictionary with frame timesteps
prompt = {
    0: "an astronaut on a winter day, sparkly leaves in the background, snow flakes, close up",
    6: "an astronaut on an autumn day, yellow leaves in the background, close up",
    12: "an astronaut on a rainy day, tropical leaves in the background, close up",
}
negative_prompt = "bad quality, worst quality"

with torch.inference_mode():
    video = pipe(
        video=video,
        prompt=prompt,
        negative_prompt=negative_prompt,
        num_inference_steps=10,
        guidance_scale=2.0,
        controlnet_conditioning_scale=0.75,
        conditioning_frames=conditioning_frames,
        strength=strength,
        generator=torch.Generator().manual_seed(42),
        ip_adapter_image=ip_adapter_image,
    ).frames[0]
```
Prompt travel on a TikTok dance video

```python
strength = 0.8
pipe.set_ip_adapter_scale(0.0)

context_length = 16
context_stride = 4
pipe.enable_free_noise(context_length=context_length, context_stride=context_stride)

# Can be a single prompt, or a dictionary with frame timesteps
prompt = {
    0: "a girl on a winter day, sparkly leaves in the background, snow flakes, close up",
    10: "a girl on an autumn day, yellow leaves in the background, close up",
    20: "a girl on a rainy day, tropical leaves in the background, close up",
}
negative_prompt = "bad quality, worst quality"

with torch.inference_mode():
    video = pipe(
        video=video,
        prompt=prompt,
        negative_prompt=negative_prompt,
        num_inference_steps=10,
        guidance_scale=2.0,
        controlnet_conditioning_scale=0.75,
        conditioning_frames=conditioning_frames,
        strength=strength,
        generator=torch.Generator().manual_seed(42),
        ip_adapter_image=ip_adapter_image,
    ).frames[0]
```
Latent upscaling also works

```python
strength = 0.8
pipe.set_ip_adapter_scale(0.0)

context_length = 16
context_stride = 4
pipe.enable_free_noise(context_length=context_length, context_stride=context_stride)

# Can be a single prompt, or a dictionary with frame timesteps
prompt = {
    0: "a girl on a winter day, sparkly leaves in the background, snow flakes, close up",
    10: "a girl on an autumn day, yellow leaves in the background, close up",
    20: "a girl on a rainy day, tropical leaves in the background, close up",
}
negative_prompt = "bad quality, worst quality"

with torch.inference_mode():
    latents = pipe(
        video=video,
        prompt=prompt,
        negative_prompt=negative_prompt,
        num_inference_steps=10,
        guidance_scale=2.0,
        controlnet_conditioning_scale=0.75,
        conditioning_frames=conditioning_frames,
        strength=strength,
        generator=torch.Generator().manual_seed(42),
        ip_adapter_image=ip_adapter_image,
        output_type="latent",
    ).frames

import torch.nn.functional as F

# Run latent upscaling
# Note that only naive upscaling is done here. Alternatively, a latent upscaler
# model could be used
batch_size, num_channels, num_frames, latent_height, latent_width = latents.shape

height = 512
width = 512
scale_factor = 1
scale_method = "nearest-exact"

upscaled_height = int(height * scale_factor)
upscaled_width = int(width * scale_factor)
upscaled_latent_height = int(latent_height * scale_factor)
upscaled_latent_width = int(latent_width * scale_factor)
strength = 0.5

upscaled_latents = []
for i in range(batch_size):
    latent = F.interpolate(latents[i], size=(upscaled_latent_height, upscaled_latent_width), mode=scale_method)
    upscaled_latents.append(latent.unsqueeze(0))
upscaled_latents = torch.cat(upscaled_latents, dim=0)

# Run pipeline for denoising upscaled latents
with torch.inference_mode():
    result = pipe(
        prompt=prompt,
        negative_prompt=negative_prompt,
        num_inference_steps=10,
        guidance_scale=2.0,
        controlnet_conditioning_scale=0.75,
        conditioning_frames=conditioning_frames,
        strength=strength,
        generator=torch.Generator().manual_seed(42),
        ip_adapter_image=ip_adapter_image,
        output_type="pil",
        latents=upscaled_latents,
        enforce_inference_steps=True,
    ).frames[0]

result = [frame.resize(conditioning_frames[0].size) for frame in result]
export_to_gif(result, "latent_upscaled.gif", fps=8)
```
a-r-r-o-w left a comment
LGTM. cc @DN6 if you're free to give this a look
I've updated the links to the input video and added an example output to the docs 🥳🥳
Could you run
Sure, I used to run it before every commit, but forgot last time. Done ✅
A couple of failing tests need to be addressed here before merge:
It wasn't easy to understand what value to use for the
Hey, this is looking good. The failing test is due to the tests being run on different machine types. To get the correct numbers, we'd have to get them from the specific CPU runners we use. I can get them to you some time tomorrow or over the weekend, since I'm caught up with other things at the moment.

Thank you <3, I appreciate it!
Off topic: isn't it weird that CPU tests depend on the CPU type?
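For context: the expected values in the fast tests are small hard-coded slices of the output generated on a reference CPU runner, and floating-point kernels differ slightly between CPU types. An illustrative check (not the actual diffusers test code) might look like this:

```python
# Illustrative sketch only (not the actual diffusers test suite): fast tests compare a small
# crop of the generated frames against numbers recorded on a reference CPU runner, so a
# different CPU's floating-point kernels can push the result outside the tolerance.
import numpy as np


def assert_matches_expected_slice(frames: np.ndarray, expected_slice: np.ndarray, atol: float = 1e-3) -> None:
    # Take the same small crop of the output that the expected values were recorded from.
    output_slice = frames[0, -3:, -3:, -1].flatten()
    max_diff = np.abs(output_slice - expected_slice).max()
    assert max_diff < atol, f"max difference {max_diff:.5f} exceeds tolerance {atol}"
```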
@a-r-r-o-w
Hey, sorry for the delay! Here you go:
Thanks a lot, Aryan!
a-r-r-o-w left a comment
Very cool PR @reallyigor! Everything was correct for the most part from the very beginning, which is quite rare to come across, so really great work :)
Thank you for your kind words 😍, I really appreciate it!

What are the next steps for merging?
Nothing much, it looks great! I'm just waiting to check if @DN6 would like to give this a review too by Monday, since it was his ask initially. Happy to merge by EOD tomorrow even if he isn't able to take a look, because it seems like the results are as expected :)
🥳
@DN6
Glad to have you as a first-time contributor! 🥳
@a-r-r-o-w @reallyigor Thank you for the great work! `from diffusers import AnimateDiffVideoToVideoControlNetPipeline`
This pipeline was not shipped with 0.30.3; the only changes in that release were CogVideoX vid2vid and img2vid. For now, you will have to install diffusers from the `main` branch (i.e. install from source).
@a-r-r-o-w, thank you for the help! I wonder if AnimateDiffVideoToVideoControlNetPipeline can support the Tile ControlNet from https://huggingface.co/lllyasviel/control_v11f1e_sd15_tile, similar to how it supports the OpenPose ControlNet in the example (`controlnet = ControlNetModel.from_pretrained(...)`). If it currently doesn't support this, is there a way to modify the pipeline code to add support for it?
Our ControlNetModel supports all the available ControlNets, I think. cc @asomoza in case I'm wrong. I believe it should work smoothly, but if you're facing any errors with loading/inference, feel free to open a new issue and we can try to help there (so that others from the diffusers team can take a look too).
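As a minimal sketch (untested here; only the checkpoint id comes from this thread), the Tile checkpoint loads like any other SD 1.5 ControlNet and can be passed to the pipeline as the `controlnet` component:

```python
# Minimal sketch, untested: load the SD 1.5 Tile ControlNet and pass it to the pipeline
# as the `controlnet` component, exactly like the OpenPose checkpoint in the examples.
import torch
from diffusers import ControlNetModel

tile_controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11f1e_sd15_tile", torch_dtype=torch.float16
)
# e.g. AnimateDiffVideoToVideoControlNetPipeline.from_pretrained(..., controlnet=tile_controlnet, ...)
# For Tile, the conditioning frames are usually the (optionally downscaled/blurred) input
# frames themselves rather than pose or edge maps.
```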
Thank you for the clarification! The Tile ControlNet loads smoothly into the pipeline, but it doesn't seem to have the expected effect on the output. Below are the results for comparison. For context, the second and third tests use the input video, the first test uses only the first frame of the input video, and all tests use the same prompt: “best quality, astronaut in space, dancing.” The output from AnimateDiffVideoToVideoControlNetPipeline with the Tile ControlNet appears noticeably more blurry than the output with the OpenPose ControlNet. Could you provide any guidance on how to resolve this issue?
Yes, we support all available ControlNets for SD 1.5. Personally I haven't tested the SD 1.5 Tile one, not even with a single image. To understand better: is the comparison you're doing between the single-image case and AnimateDiff? I've never seen someone use Tile for animations; maybe you can test it with something like ComfyUI to see whether we have something wrong or it's just that the Tile + AnimateDiff combination doesn't work.
The AnimateDiff motion adapters are known to have a bit of a blurring effect and poorer quality when it comes to following the image/video conditioning. I can do some tests soon, when I find time, to help you more with this. Just curious: do you notice this behaviour when using AnimateLCM with the Tile ControlNet? If not, it might be because the original motion adapters are not the best when it comes to high-resolution animation quality (even with a ControlNet). Typically, you have to involve more tricks like latent upscaling, unsampling, ADetailer, etc. for good results.
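For reference, a sketch of what the suggested AnimateLCM variant could look like: build the pipeline as in the examples above but with `MotionAdapter.from_pretrained("wangfuyun/AnimateLCM")` as the motion adapter, then attach the LCM scheduler and LoRA. The file and adapter names below follow the public `wangfuyun/AnimateLCM` repo and are assumptions here; `pipe` is the already-built pipeline.

```python
# Sketch (assumption): AnimateLCM setup on top of an already-built pipeline `pipe`.
from diffusers import LCMScheduler

pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config, beta_schedule="linear")
pipe.load_lora_weights(
    "wangfuyun/AnimateLCM", weight_name="AnimateLCM_sd15_t2v_lora.safetensors", adapter_name="lcm-lora"
)
pipe.set_adapters(["lcm-lora"], [0.8])
# AnimateLCM is usually run with few steps and low guidance,
# e.g. num_inference_steps=6-10 and guidance_scale around 1.5-2.0.
```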
Hey @HanLiii,
@reallyigor @a-r-r-o-w @asomoza, thanks for the reply! In my earlier comment, I included outputs from the AnimateDiffVideoToVideoControlNetPipeline using the OpenPose ControlNet. The results were vivid and detailed, which was encouraging. However, when I use the same pipeline with the Tile ControlNet, the outputs appear blurry and lack the vivid details present in the OpenPose results. To troubleshoot, I tried a few adjustments, but despite these changes the outputs remain blurry. Here's an example using the Tile ControlNet with `guidance_scale=3.0` and `DDIMScheduler`:
* add animatediff + vid2vid + controlnet
* post tests fixes
* PR discussion fixes
* update docs
* change input video to links on HF + update an example
* make quality fix
* fix ip adapter test
* fix ip adapter test input
* update ip adapter test












What does this PR do?
This PR adds ControlNet support to the AnimateDiff video-to-video pipeline.
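A minimal usage sketch of the new pipeline (the model ids, pose detector, and local video path below are illustrative assumptions, not prescribed by this PR):

```python
# Illustrative sketch: load the new pipeline with a motion adapter and an OpenPose ControlNet,
# prepare per-frame conditioning, and run vid2vid. Checkpoints and the input file are assumptions.
import torch
from controlnet_aux import OpenposeDetector
from diffusers import AnimateDiffVideoToVideoControlNetPipeline, ControlNetModel, MotionAdapter
from diffusers.utils import export_to_gif, load_video

controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16)
motion_adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-3", torch_dtype=torch.float16)
pipe = AnimateDiffVideoToVideoControlNetPipeline.from_pretrained(
    "emilianJR/epiCRealism",
    motion_adapter=motion_adapter,
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Input video frames and matching per-frame pose maps for the ControlNet.
video = load_video("input.mp4")  # hypothetical local file
open_pose = OpenposeDetector.from_pretrained("lllyasviel/Annotators")
conditioning_frames = [open_pose(frame) for frame in video]

output = pipe(
    video=video,
    prompt="astronaut dancing in space, best quality",
    negative_prompt="bad quality, worst quality",
    num_inference_steps=10,
    guidance_scale=2.0,
    strength=0.8,
    controlnet_conditioning_scale=0.75,
    conditioning_frames=conditioning_frames,
    generator=torch.Generator().manual_seed(42),
)
export_to_gif(output.frames[0], "output.gif", fps=8)
```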
Fixes:
See #9326
[Pipeline] AnimateDiff + VideoToVideo + ControlNet #9326

Results:
Default pipeline:
An example with IPAdapter usage alongside ControlNet
Prompt travel on a TikTok dance video
How to test:
I created new tests by adapting ones from the regular AnimateDiff Video-to-Video pipeline.
My code fails only on one test:
`test_from_pipe_consistent_forward_pass_cpu_offload`

```
========================================= 1 failed, 37 passed, 2 skipped, 37 warnings in 42.66s ==========================================
```

The problem is that the original AnimateDiff Video To Video (that is already in Diffusers) also fails on this test. I couldn't identify the problem. These are the results of the original AnimateDiff Video To Video:
```
========================================= 1 failed, 37 passed, 2 skipped, 37 warnings in 40.86s ==========================================
```

Who can review?
@DN6 @a-r-r-o-w