Merged
12 changes: 6 additions & 6 deletions docs/source/en/api/pipelines/i2vgenxl.md
@@ -18,11 +18,11 @@ The abstract from the paper is:

*Video synthesis has recently made remarkable strides benefiting from the rapid development of diffusion models. However, it still encounters challenges in terms of semantic accuracy, clarity and spatio-temporal continuity. They primarily arise from the scarcity of well-aligned text-video data and the complex inherent structure of videos, making it difficult for the model to simultaneously ensure semantic and qualitative excellence. In this report, we propose a cascaded I2VGen-XL approach that enhances model performance by decoupling these two factors and ensures the alignment of the input data by utilizing static images as a form of crucial guidance. I2VGen-XL consists of two stages: i) the base stage guarantees coherent semantics and preserves content from input images by using two hierarchical encoders, and ii) the refinement stage enhances the video's details by incorporating an additional brief text and improves the resolution to 1280×720. To improve the diversity, we collect around 35 million single-shot text-video pairs and 6 billion text-image pairs to optimize the model. By this means, I2VGen-XL can simultaneously enhance the semantic accuracy, continuity of details and clarity of generated videos. Through extensive experiments, we have investigated the underlying principles of I2VGen-XL and compared it with current top methods, which can demonstrate its effectiveness on diverse data. The source code and models will be publicly available at [this https URL](https://i2vgen-xl.github.io/).*

-The original codebase can be found [here](https://github.com/ali-vilab/i2vgen-xl/). The model checkpoints can be found [here](https://huggingface.co/ali-vilab/).
+The original codebase can be found [here](https://github.com/ali-vilab/i2vgen-xl/). The model checkpoints can be found [here](https://huggingface.co/ali-vilab/).

<Tip>

-Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. Also, to know more about reducing the memory usage of this pipeline, refer to the ["Reduce memory usage"] section [here](../../using-diffusers/svd#reduce-memory-usage).
+Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. Also, to know more about reducing the memory usage of this pipeline, refer to the ["Reduce memory usage"] section [here](../../using-diffusers/svd#reduce-memory-usage).

</Tip>

@@ -31,7 +31,7 @@ Sample output with I2VGenXL:
<table>
<tr>
<td><center>
-masterpiece, bestquality, sunset.
+library.
<br>
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/i2vgen-xl-example.gif"
alt="library"
@@ -43,9 +43,9 @@ Sample output with I2VGenXL:
## Notes

* I2VGenXL always uses a `clip_skip` value of 1. This means it leverages the penultimate layer representations from the text encoder of CLIP.
-* It can generate videos of quality that is often on par with [Stable Video Diffusion](../../using-diffusers/svd) (SVD).
-* Unlike SVD, it additionally accepts text prompts as inputs.
-* It can generate higher resolution videos.
+* It can generate videos of quality that is often on par with [Stable Video Diffusion](../../using-diffusers/svd) (SVD).
+* Unlike SVD, it additionally accepts text prompts as inputs.
+* It can generate higher resolution videos.
* When using the [`DDIMScheduler`] (which is default for this pipeline), less than 50 steps for inference leads to bad results.
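
As an illustration of the `clip_skip` note above, the following sketch uses a plain Python list as a stand-in for the per-layer hidden states of a text encoder. The helper name `select_hidden_state`, the layer labels, and the indexing convention `-(clip_skip + 1)` are assumptions made for illustration; this is not code taken from the pipeline.

```python
from typing import List, Optional


def select_hidden_state(hidden_states: List[str], clip_skip: Optional[int] = None) -> str:
    """Pick which layer output feeds the prompt embeddings (hypothetical helper)."""
    if clip_skip is None:
        # No skipping: use the final layer.
        return hidden_states[-1]
    # clip_skip=1 skips one layer from the end, i.e. the penultimate layer.
    return hidden_states[-(clip_skip + 1)]


# Stand-in for an encoder with 12 transformer layers; index 0 is the embedding output.
hidden_states = ["embeddings"] + [f"layer_{i}" for i in range(1, 13)]

final_layer = select_hidden_state(hidden_states)
penultimate_layer = select_hidden_state(hidden_states, clip_skip=1)
print(final_layer, penultimate_layer)
```

Under this assumed convention, `clip_skip=1` selects the second-to-last entry, which matches the "penultimate layer" wording in the note.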

## I2VGenXLPipeline
6 changes: 3 additions & 3 deletions docs/source/en/api/pipelines/pia.md
@@ -70,7 +70,7 @@ Here are some sample outputs:
<table>
<tr>
<td><center>
-masterpiece, bestquality, sunset.
+cat in a field.
<br>
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/pia-default-output.gif"
alt="cat in a field"
@@ -119,7 +119,7 @@ image = load_image(
"https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/pix2pix/cat_6.png?download=true"
)
image = image.resize((512, 512))
-prompt = "cat in a hat"
+prompt = "cat in a field"
negative_prompt = "wrong white balance, dark, sketches,worst quality,low quality"

generator = torch.Generator("cpu").manual_seed(0)
@@ -132,7 +132,7 @@ export_to_gif(frames, "pia-freeinit-animation.gif")
<table>
<tr>
<td><center>
-masterpiece, bestquality, sunset.
+cat in a field.
<br>
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/pia-freeinit-output-cat.gif"
alt="cat in a field"
10 changes: 5 additions & 5 deletions docs/source/en/api/pipelines/text_to_video.md
@@ -41,7 +41,7 @@ pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", tor
pipe = pipe.to("cuda")

prompt = "Spiderman is surfing"
-video_frames = pipe(prompt).frames
+video_frames = pipe(prompt).frames[0]

Review comment (Member): Have we clarified this part well enough?

video_path = export_to_video(video_frames)
video_path
```
@@ -64,7 +64,7 @@ pipe.enable_model_cpu_offload()
pipe.enable_vae_slicing()

prompt = "Darth Vader surfing a wave"
-video_frames = pipe(prompt, num_frames=64).frames
+video_frames = pipe(prompt, num_frames=64).frames[0]
video_path = export_to_video(video_frames)
video_path
```
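
The `.frames[0]` indexing in the examples above follows from the output being batched: `frames` holds one entry per generated video. A minimal sketch with a stand-in output class shows the nesting. `FakeVideoOutput`, `fake_pipe`, and the string placeholders for frames are hypothetical, chosen so that the snippet runs without loading a model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeVideoOutput:
    """Hypothetical stand-in for a video pipeline output; real frames would be images or arrays."""

    frames: List[List[str]]  # outer list: batch_size, inner list: num_frames


def fake_pipe(prompts: List[str], num_frames: int = 4) -> FakeVideoOutput:
    # Build one inner list of placeholder frames per prompt in the batch.
    return FakeVideoOutput(
        frames=[[f"{prompt}:frame{i}" for i in range(num_frames)] for prompt in prompts]
    )


output = fake_pipe(["Darth Vader surfing a wave"], num_frames=4)

batch = output.frames            # one entry per generated video
video_frames = output.frames[0]  # the frames of the first (here, only) video

print(len(batch), len(video_frames))
```

In this sketch, passing `output.frames` without the index to a helper that expects a single video would hand it the whole batch instead of one frame sequence.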
@@ -83,7 +83,7 @@ pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

prompt = "Spiderman is surfing"
-video_frames = pipe(prompt, num_inference_steps=25).frames
+video_frames = pipe(prompt, num_inference_steps=25).frames[0]
video_path = export_to_video(video_frames)
video_path
```
@@ -130,7 +130,7 @@ pipe.unet.enable_forward_chunking(chunk_size=1, dim=1)
pipe.enable_vae_slicing()

prompt = "Darth Vader surfing a wave"
-video_frames = pipe(prompt, num_frames=24).frames
+video_frames = pipe(prompt, num_frames=24).frames[0]
video_path = export_to_video(video_frames)
video_path
```
@@ -148,7 +148,7 @@ pipe.enable_vae_slicing()

video = [Image.fromarray(frame).resize((1024, 576)) for frame in video_frames]

-video_frames = pipe(prompt, video=video, strength=0.6).frames
+video_frames = pipe(prompt, video=video, strength=0.6).frames[0]
video_path = export_to_video(video_frames)
video_path
```
13 changes: 7 additions & 6 deletions src/diffusers/pipelines/animatediff/pipeline_output.py
@@ -11,12 +11,13 @@
@dataclass
class AnimateDiffPipelineOutput(BaseOutput):
r"""
-Output class for AnimateDiff pipelines.
+Output class for AnimateDiff pipelines.

-Args:
-frames (`List[List[PIL.Image.Image]]` or `torch.Tensor` or `np.ndarray`):
-List of PIL Images of length `batch_size` or torch.Tensor or np.ndarray of shape
-`(batch_size, num_frames, height, width, num_channels)`.
+Args:
+frames (`torch.Tensor`, `np.ndarray`, or List[List[PIL.Image.Image]]):
+List of video outputs. It can be a nested list of length `batch_size`, with each sub-list containing denoised
+PIL image sequences of length `num_frames`. It can also be a NumPy array or Torch tensor of shape
+`(batch_size, num_frames, channels, height, width)`.
"""

-frames: Union[List[List[PIL.Image.Image]], torch.Tensor, np.ndarray]
+frames: Union[torch.Tensor, np.ndarray, List[List[PIL.Image.Image]]]
14 changes: 8 additions & 6 deletions src/diffusers/pipelines/i2vgen_xl/pipeline_i2vgen_xl.py
@@ -46,6 +46,7 @@
```py
>>> import torch
>>> from diffusers import I2VGenXLPipeline
>>> from diffusers.utils import export_to_gif, load_image

>>> pipeline = I2VGenXLPipeline.from_pretrained("ali-vilab/i2vgen-xl", torch_dtype=torch.float16, variant="fp16")
>>> pipeline.enable_model_cpu_offload()
@@ -95,15 +96,16 @@ def tensor2vid(video: torch.Tensor, processor: "VaeImageProcessor", output_type:
@dataclass
class I2VGenXLPipelineOutput(BaseOutput):
r"""
-Output class for image-to-video pipeline.
+Output class for image-to-video pipeline.

-Args:
-frames (`List[np.ndarray]` or `torch.FloatTensor`)
-List of denoised frames (essentially images) as NumPy arrays of shape `(height, width, num_channels)` or as
-a `torch` tensor. The length of the list denotes the video length (the number of frames).
+Args:
+frames (`torch.Tensor`, `np.ndarray`, or List[List[PIL.Image.Image]]):
+List of video outputs. It can be a nested list of length `batch_size`, with each sub-list containing denoised
+PIL image sequences of length `num_frames`. It can also be a NumPy array or Torch tensor of shape
+`(batch_size, num_frames, channels, height, width)`.
"""

-frames: Union[List[np.ndarray], torch.FloatTensor]
+frames: Union[torch.Tensor, np.ndarray, List[List[PIL.Image.Image]]]


class I2VGenXLPipeline(DiffusionPipeline):
4 changes: 2 additions & 2 deletions src/diffusers/pipelines/pia/pipeline_pia.py
@@ -200,13 +200,13 @@ class PIAPipelineOutput(BaseOutput):
Output class for PIAPipeline.

Args:
-frames (`torch.Tensor`, `np.ndarray`, or List[PIL.Image.Image]):
+frames (`torch.Tensor`, `np.ndarray`, or List[List[PIL.Image.Image]]):
Nested list of length `batch_size` with denoised PIL image sequences of length `num_frames`,
NumPy array of shape `(batch_size, num_frames, channels, height, width)`, or
Torch tensor of shape `(batch_size, num_frames, channels, height, width)`.
"""

-frames: Union[torch.Tensor, np.ndarray, PIL.Image.Image]
+frames: Union[torch.Tensor, np.ndarray, List[List[PIL.Image.Image]]]


class PIAPipeline(DiffusionPipeline, TextualInversionLoaderMixin, IPAdapterMixin, LoraLoaderMixin):
@@ -2,6 +2,7 @@
from typing import List, Union

import numpy as np
+import PIL
import torch

from ...utils import (
@@ -12,12 +13,13 @@
@dataclass
class TextToVideoSDPipelineOutput(BaseOutput):
"""
-Output class for text-to-video pipelines.
+Output class for text-to-video pipelines.

-Args:
-frames (`List[np.ndarray]` or `torch.FloatTensor`)
-List of denoised frames (essentially images) as NumPy arrays of shape `(height, width, num_channels)` or as
-a `torch` tensor. The length of the list denotes the video length (the number of frames).
+Args:
+frames (`torch.Tensor`, `np.ndarray`, or List[List[PIL.Image.Image]]):
+List of video outputs. It can be a nested list of length `batch_size`, with each sub-list containing denoised
+PIL image sequences of length `num_frames`. It can also be a NumPy array or Torch tensor of shape
+`(batch_size, num_frames, channels, height, width)`.
"""

-frames: Union[List[np.ndarray], torch.FloatTensor]
+frames: Union[torch.Tensor, np.ndarray, List[List[PIL.Image.Image]]]