From 669d76f7bb085fac46a1de08af43668d789a6e63 Mon Sep 17 00:00:00 2001
From: Aryan
Date: Wed, 11 Sep 2024 14:49:30 +0200
Subject: [PATCH 1/6] update docs

---
 docs/source/en/api/pipelines/animatediff.md | 73 +++++++++++++++++++++
 1 file changed, 73 insertions(+)

diff --git a/docs/source/en/api/pipelines/animatediff.md b/docs/source/en/api/pipelines/animatediff.md
index 7cacad87d78c..ada3abf35cba 100644
--- a/docs/source/en/api/pipelines/animatediff.md
+++ b/docs/source/en/api/pipelines/animatediff.md
@@ -914,6 +914,79 @@ export_to_gif(frames, "animatelcm-motion-lora.gif")
+## Using FreeNoise
+
+[FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling](https://arxiv.org/abs/2310.15169) by Haonan Qiu, Menghan Xia, Yong Zhang, Yingqing He, Xintao Wang, Ying Shan, Ziwei Liu.
+
+FreeNoise is a sampling mechanism that allows the generation of longer videos with short-video generation models by employing noise-rescheduling, temporal attention over sliding windows, and weighted averaging of latent frames. It also can be used with multiple prompts to allow for interpolated video generations. More details are available in the paper.
+
+The currently supported AnimateDiff pipelines that can be used with FreeNoise are:
+- AnimateDiffPipeline
+- AnimateDiffControlNetPipeline
+- AnimateDiffVideoToVideoPipeline
+- AnimateDiffVideoToVideoControlNetPipeline
+
+```python
+import torch
+from diffusers import AutoencoderKL, AnimateDiffPipeline, LCMScheduler, MotionAdapter
+from diffusers.utils import export_to_video, load_image
+
+# Load pipeline
+dtype = torch.float16
+motion_adapter = MotionAdapter.from_pretrained("wangfuyun/AnimateLCM", torch_dtype=dtype)
+vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=dtype)
+
+pipe = AnimateDiffPipeline.from_pretrained("emilianJR/epiCRealism", motion_adapter=motion_adapter, vae=vae, torch_dtype=dtype)
+pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config, beta_schedule="linear")
+
+pipe.load_lora_weights(
+    "wangfuyun/AnimateLCM", weight_name="AnimateLCM_sd15_t2v_lora.safetensors", adapter_name="lcm_lora"
+)
+pipe.set_adapters(["lcm_lora"], [0.8])
+
+# Enable FreeNoise for long prompt generation
+pipe.enable_free_noise(context_length=16, context_stride=4)
+pipe.to("cuda")
+
+# Can be a single prompt, or a dictionary mapping frame indices to prompts
+prompt = {
+    0: "A caterpillar on a leaf, high quality, photorealistic",
+    40: "A caterpillar transforming into a cocoon, on a leaf, near flowers, photorealistic",
+    80: "A cocoon on a leaf, flowers in the background, photorealistic",
+    120: "A cocoon maturing and a butterfly being born, flowers and leaves visible in the background, photorealistic",
+    160: "A beautiful butterfly, vibrant colors, sitting on a leaf, flowers in the background, photorealistic",
+    200: "A beautiful butterfly, flying away in a forest, photorealistic",
+    240: "A cyberpunk butterfly, neon lights, glowing",
+}
+negative_prompt = "bad quality, worst quality, jpeg artifacts"
+
+# Run inference
+output = pipe(
+    prompt=prompt,
+    negative_prompt=negative_prompt,
+    num_frames=256,
+    guidance_scale=2.5,
+    num_inference_steps=10,
+    generator=torch.Generator("cpu").manual_seed(0),
+)
+
+# Save video
+frames = output.frames[0]
+export_to_video(frames, "output.mp4", fps=16)
+```
+
+#### FreeNoise memory savings
+
+Since FreeNoise processes multiple frames together, there are parts in the modeling where the memory required exceeds that available on normal consumer GPUs. The main memory bottlenecks that we identified are spatial and temporal attention blocks, upsampling and downsampling blocks, resnet blocks and feed-forward layers. Since most of these blocks operate effectively only on the channel/embedding dimension, one can perform chunked inference across the batch dimensions. The batch dimensions in AnimateDiff are either spatial (`[B x F, H x W, C]`) or temporal (`[B x H x W, F, C]`) in nature (note that it may seem counter-intuitive, but the batch dimensions here are correct, because spatial blocks process across the `B x F` dimension while the temporal blocks process across the `B x H x W` dimension). We introduce a `SplitInferenceModule` that makes it easier to chunk across any dimension and perform inference. This saves a lot of memory but comes at the cost of requiring more time for inference.
+
+```diff
+# Load pipeline and adapters
+# ...
++ pipe.enable_free_noise_split_inference()
++ pipe.unet.enable_forward_chunking(16)
+```
+
+The `pipe.enable_free_noise_split_inference` method accepts two parameters: `spatial_split_size` (defaults to `256`) and `temporal_split_size` (defaults to `16`). These can be configured based on how much VRAM you have available. A lower split size results in lower memory usage but slower inference, whereas a larger split size results in faster inference at the cost of more memory.
 
 ## Using `from_single_file` with the MotionAdapter

From beae195489a9c684e925f2e004a1a5614a8ba2c5 Mon Sep 17 00:00:00 2001
From: Aryan
Date: Wed, 11 Sep 2024 16:36:22 +0200
Subject: [PATCH 2/6] apply suggestions from review

---
 docs/source/en/api/pipelines/animatediff.md | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/docs/source/en/api/pipelines/animatediff.md b/docs/source/en/api/pipelines/animatediff.md
index ada3abf35cba..ed33ec3e2c56 100644
--- a/docs/source/en/api/pipelines/animatediff.md
+++ b/docs/source/en/api/pipelines/animatediff.md
@@ -921,10 +921,20 @@ export_to_gif(frames, "animatelcm-motion-lora.gif")
 FreeNoise is a sampling mechanism that allows the generation of longer videos with short-video generation models by employing noise-rescheduling, temporal attention over sliding windows, and weighted averaging of latent frames. It also can be used with multiple prompts to allow for interpolated video generations. More details are available in the paper.
 
 The currently supported AnimateDiff pipelines that can be used with FreeNoise are:
-- AnimateDiffPipeline
-- AnimateDiffControlNetPipeline
-- AnimateDiffVideoToVideoPipeline
-- AnimateDiffVideoToVideoControlNetPipeline
+- [AnimateDiffPipeline]
+- [AnimateDiffControlNetPipeline]
+- [AnimateDiffVideoToVideoPipeline]
+- [AnimateDiffVideoToVideoControlNetPipeline]
+
+In order to use FreeNoise, a single line needs to be added to the inference code after loading your pipelines.
+
+```diff
++ pipe.enable_free_noise()
+```
+
+After this, either a single prompt could be used, or multiple prompts can be passed as a dictionary of integer-string pairs. The integer keys of the dictionary correspond to the frame index at which the influence of that prompt would be maximum. Each frame index should map to a single string prompt. The prompts for intermediate frame indices, that are not passed in the dictionary, are created by interpolating between the frame prompts that are passed. By default, simple linear interpolation is used however one can customize this behaviour by a callback to the `prompt_interpolation_callback` parameter when enabling FreeNoise.
+
+Full example:
 
 ```python
 import torch

From 5752032db11793300c6a7392e1a8bd0f09f6ae72 Mon Sep 17 00:00:00 2001
From: Aryan
Date: Thu, 12 Sep 2024 00:36:11 +0530
Subject: [PATCH 3/6] Update docs/source/en/api/pipelines/animatediff.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---
 docs/source/en/api/pipelines/animatediff.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/en/api/pipelines/animatediff.md b/docs/source/en/api/pipelines/animatediff.md
index ed33ec3e2c56..8a8a71dc1bae 100644
--- a/docs/source/en/api/pipelines/animatediff.md
+++ b/docs/source/en/api/pipelines/animatediff.md
@@ -918,7 +918,7 @@ export_to_gif(frames, "animatelcm-motion-lora.gif")
 [FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling](https://arxiv.org/abs/2310.15169) by Haonan Qiu, Menghan Xia, Yong Zhang, Yingqing He, Xintao Wang, Ying Shan, Ziwei Liu.
 
-FreeNoise is a sampling mechanism that allows the generation of longer videos with short-video generation models by employing noise-rescheduling, temporal attention over sliding windows, and weighted averaging of latent frames. It also can be used with multiple prompts to allow for interpolated video generations. More details are available in the paper.
+FreeNoise is a sampling mechanism that can generate longer videos with short-video generation models by employing noise-rescheduling, temporal attention over sliding windows, and weighted averaging of latent frames. It also can be used with multiple prompts to allow for interpolated video generations. More details are available in the paper.
 
 The currently supported AnimateDiff pipelines that can be used with FreeNoise are:
 - [AnimateDiffPipeline]

From cd99de06a1ee68fc5f2d802acc308ee650e45499 Mon Sep 17 00:00:00 2001
From: Aryan
Date: Thu, 12 Sep 2024 00:36:20 +0530
Subject: [PATCH 4/6] Update docs/source/en/api/pipelines/animatediff.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---
 docs/source/en/api/pipelines/animatediff.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/en/api/pipelines/animatediff.md b/docs/source/en/api/pipelines/animatediff.md
index 8a8a71dc1bae..d0e6120d091e 100644
--- a/docs/source/en/api/pipelines/animatediff.md
+++ b/docs/source/en/api/pipelines/animatediff.md
@@ -932,7 +932,7 @@ In order to use FreeNoise, a single line needs to be added to the inference code
 + pipe.enable_free_noise()
 ```
 
-After this, either a single prompt could be used, or multiple prompts can be passed as a dictionary of integer-string pairs. The integer keys of the dictionary correspond to the frame index at which the influence of that prompt would be maximum. Each frame index should map to a single string prompt. The prompts for intermediate frame indices, that are not passed in the dictionary, are created by interpolating between the frame prompts that are passed. By default, simple linear interpolation is used however one can customize this behaviour by a callback to the `prompt_interpolation_callback` parameter when enabling FreeNoise.
+After this, either a single prompt can be used, or multiple prompts can be passed as a dictionary of integer-string pairs. The integer keys of the dictionary correspond to the frame index at which the influence of that prompt would be maximum. Each frame index should map to a single string prompt. The prompts for intermediate frame indices that are not passed in the dictionary are created by interpolating between the frame prompts that are passed. By default, simple linear interpolation is used. However, you can customize this behaviour with a callback to the `prompt_interpolation_callback` parameter when enabling FreeNoise.
 
 Full example:

From 6e7334f1f21d93f4f1c9c5adf1bc7c18c44558dd Mon Sep 17 00:00:00 2001
From: Aryan
Date: Thu, 12 Sep 2024 00:36:26 +0530
Subject: [PATCH 5/6] Update docs/source/en/api/pipelines/animatediff.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---
 docs/source/en/api/pipelines/animatediff.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/en/api/pipelines/animatediff.md b/docs/source/en/api/pipelines/animatediff.md
index d0e6120d091e..3069d18b546b 100644
--- a/docs/source/en/api/pipelines/animatediff.md
+++ b/docs/source/en/api/pipelines/animatediff.md
@@ -985,7 +985,7 @@ frames = output.frames[0]
 export_to_video(frames, "output.mp4", fps=16)
 ```
 
-#### FreeNoise memory savings
+### FreeNoise memory savings
 
 Since FreeNoise processes multiple frames together, there are parts in the modeling where the memory required exceeds that available on normal consumer GPUs. The main memory bottlenecks that we identified are spatial and temporal attention blocks, upsampling and downsampling blocks, resnet blocks and feed-forward layers. Since most of these blocks operate effectively only on the channel/embedding dimension, one can perform chunked inference across the batch dimensions. The batch dimensions in AnimateDiff are either spatial (`[B x F, H x W, C]`) or temporal (`[B x H x W, F, C]`) in nature (note that it may seem counter-intuitive, but the batch dimensions here are correct, because spatial blocks process across the `B x F` dimension while the temporal blocks process across the `B x H x W` dimension). We introduce a `SplitInferenceModule` that makes it easier to chunk across any dimension and perform inference. This saves a lot of memory but comes at the cost of requiring more time for inference.

From f5043925b878320106681699c529d32ecc4f4097 Mon Sep 17 00:00:00 2001
From: Aryan
Date: Wed, 11 Sep 2024 21:07:07 +0200
Subject: [PATCH 6/6] apply suggestions from review

---
 docs/source/en/api/pipelines/animatediff.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/docs/source/en/api/pipelines/animatediff.md b/docs/source/en/api/pipelines/animatediff.md
index 3069d18b546b..735901280362 100644
--- a/docs/source/en/api/pipelines/animatediff.md
+++ b/docs/source/en/api/pipelines/animatediff.md
@@ -921,10 +921,10 @@ export_to_gif(frames, "animatelcm-motion-lora.gif")
 FreeNoise is a sampling mechanism that can generate longer videos with short-video generation models by employing noise-rescheduling, temporal attention over sliding windows, and weighted averaging of latent frames. It also can be used with multiple prompts to allow for interpolated video generations. More details are available in the paper.
 
 The currently supported AnimateDiff pipelines that can be used with FreeNoise are:
-- [AnimateDiffPipeline]
-- [AnimateDiffControlNetPipeline]
-- [AnimateDiffVideoToVideoPipeline]
-- [AnimateDiffVideoToVideoControlNetPipeline]
+- [`AnimateDiffPipeline`]
+- [`AnimateDiffControlNetPipeline`]
+- [`AnimateDiffVideoToVideoPipeline`]
+- [`AnimateDiffVideoToVideoControlNetPipeline`]
 
 In order to use FreeNoise, a single line needs to be added to the inference code after loading your pipelines.
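For quick reference, the following minimal sketch combines the FreeNoise and split-inference settings introduced by the patches above for a lower-VRAM setup. It reuses the checkpoints from the FreeNoise example; the `spatial_split_size=128` and `temporal_split_size=8` values are illustrative assumptions rather than tuned recommendations, and it is assumed the two split sizes are accepted as keyword arguments.

```python
import torch
from diffusers import AnimateDiffPipeline, AutoencoderKL, LCMScheduler, MotionAdapter

# Same checkpoints as the FreeNoise example in PATCH 1/6.
dtype = torch.float16
motion_adapter = MotionAdapter.from_pretrained("wangfuyun/AnimateLCM", torch_dtype=dtype)
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=dtype)
pipe = AnimateDiffPipeline.from_pretrained(
    "emilianJR/epiCRealism", motion_adapter=motion_adapter, vae=vae, torch_dtype=dtype
)
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config, beta_schedule="linear")

# Enable FreeNoise for long video generation.
pipe.enable_free_noise(context_length=16, context_stride=4)

# Split inference across the spatial/temporal batch dimensions to lower peak memory.
# Assumed keyword arguments; smaller splits use less memory but run slower
# (the documented defaults are spatial_split_size=256 and temporal_split_size=16).
pipe.enable_free_noise_split_inference(spatial_split_size=128, temporal_split_size=8)
pipe.unet.enable_forward_chunking(16)

pipe.to("cuda")
```

As noted in the documentation text, lowering the split sizes trades inference speed for memory, so the values can be adjusted to fit the available VRAM.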