Merged
Commits (184)
bb7c412
let's see
sayakpaul Jan 5, 2024
d537b6c
better conditioning for class_embed_type
sayakpaul Jan 5, 2024
15f1607
determine in_channels programatically.
sayakpaul Jan 5, 2024
a329b73
worse condition
sayakpaul Jan 5, 2024
5660ba1
fix: sample_size.
sayakpaul Jan 5, 2024
eb8ea72
Merge branch 'main' into convert-i2vgen-xl
sayakpaul Jan 16, 2024
011329d
separte script for i2vgen
sayakpaul Jan 16, 2024
3e0015d
changes
sayakpaul Jan 16, 2024
f09c2dd
fix: basic transformer block init.
sayakpaul Jan 16, 2024
7dd0cb0
check
sayakpaul Jan 16, 2024
d6f1e6d
revert block_out_channels.
sayakpaul Jan 16, 2024
da5b83c
debug info
sayakpaul Jan 16, 2024
0ecef35
debug info
sayakpaul Jan 16, 2024
13ecc11
debug info
sayakpaul Jan 16, 2024
6778b3f
debug
sayakpaul Jan 16, 2024
20aeaf3
correct ffn inner dim
sayakpaul Jan 16, 2024
ef85c84
debug info
sayakpaul Jan 16, 2024
7d03162
input channels should be 8./
sayakpaul Jan 16, 2024
a736694
input channels corrected
sayakpaul Jan 16, 2024
34e7349
Revert "input channels corrected"
sayakpaul Jan 16, 2024
896d626
better input channels
sayakpaul Jan 16, 2024
02b76b5
Revert "better input channels"
sayakpaul Jan 16, 2024
15a6fbd
rectify conversion script
sayakpaul Jan 16, 2024
5a09722
conversion script.
sayakpaul Jan 16, 2024
bcccfdf
conversio
sayakpaul Jan 16, 2024
1c68e05
push_to_hub
sayakpaul Jan 16, 2024
3b5940b
remove print
sayakpaul Jan 16, 2024
aaae032
let's see.
sayakpaul Jan 16, 2024
1c72370
safeguard .
sayakpaul Jan 16, 2024
25527f8
device place,ent
sayakpaul Jan 16, 2024
c38ef7a
comment to remind that writing good code is important
sayakpaul Jan 16, 2024
4f4d4e6
device placement.
sayakpaul Jan 16, 2024
e717630
corrct layernorm condition.
sayakpaul Jan 16, 2024
292668e
norm3 condition
sayakpaul Jan 16, 2024
17a2418
correct norm3
sayakpaul Jan 16, 2024
d5b7693
incorporate einops
sayakpaul Jan 16, 2024
35b15f2
image_embeddings
sayakpaul Jan 16, 2024
693b2ce
okay
sayakpaul Jan 16, 2024
642cbe4
dtype debug
sayakpaul Jan 16, 2024
105ecc5
dtype fix
sayakpaul Jan 16, 2024
d7e6b2c
dtype fix.
sayakpaul Jan 16, 2024
5de4348
simplify code.
sayakpaul Jan 16, 2024
2852de1
remove print
sayakpaul Jan 16, 2024
76772c5
debug
sayakpaul Jan 22, 2024
600ffd8
debug
sayakpaul Jan 22, 2024
ecf0070
debug
sayakpaul Jan 22, 2024
b88d9a9
debug
sayakpaul Jan 22, 2024
87eff5e
debu
sayakpaul Jan 22, 2024
3178e74
debug
sayakpaul Jan 22, 2024
32f6151
remove print
sayakpaul Jan 22, 2024
87e70ab
add: dummy pipeline implementation too.
sayakpaul Jan 22, 2024
5e7f17f
pipeline draft
sayakpaul Jan 22, 2024
28b9d57
complete conversion script.
sayakpaul Jan 22, 2024
7943c91
add new unet to modules
sayakpaul Jan 22, 2024
7f3d559
enable chunked decoding on vae.
sayakpaul Jan 22, 2024
26d87c2
correct image latent behaviour
sayakpaul Jan 22, 2024
5d03574
remove comment
sayakpaul Jan 22, 2024
989c707
correct dtyp
sayakpaul Jan 22, 2024
7b88ad3
correct output type.
sayakpaul Jan 22, 2024
eec8791
Merge branch 'main' into convert-i2vgen-xl
sayakpaul Jan 22, 2024
6ff9606
init fix
sayakpaul Jan 22, 2024
b44e053
fix-copies
sayakpaul Jan 22, 2024
51fdf30
Merge branch 'main' into convert-i2vgen-xl
sayakpaul Jan 23, 2024
9bd5f16
chunked decoding should be optional
sayakpaul Jan 23, 2024
cc7e975
what happens if we take mode instead?
sayakpaul Jan 23, 2024
734274a
fix: type
sayakpaul Jan 23, 2024
b48f094
back to sampling and clean up tensorification
sayakpaul Jan 23, 2024
761c08e
better variable name
sayakpaul Jan 23, 2024
a0c00c0
try to follow the original implementation closely.
sayakpaul Jan 23, 2024
c6d35e2
proper repeatation
sayakpaul Jan 23, 2024
0ebad2e
fix: fps condition check
sayakpaul Jan 23, 2024
da309d5
fix: masking
sayakpaul Jan 23, 2024
5b0b5df
fix: masking
sayakpaul Jan 23, 2024
ef4dd34
go back to negative_image_image_latents.
sayakpaul Jan 23, 2024
670488e
make type casting for fps explicit
sayakpaul Jan 23, 2024
80a6f1a
original implementation image_latents.
sayakpaul Jan 23, 2024
85d364c
Revert "original implementation image_latents."
sayakpaul Jan 23, 2024
b0865dd
sinusoidal embedding?
sayakpaul Jan 23, 2024
87742e9
simple bilinear resizing.
sayakpaul Jan 23, 2024
8902440
remove the sinusoidal implementation from i2vgenxl
sayakpaul Jan 23, 2024
e9cd839
resolve conflicts
sayakpaul Jan 23, 2024
585a6b6
harmonize with main
sayakpaul Jan 23, 2024
90d91a8
fix: tensor2vid
sayakpaul Jan 23, 2024
4a7d4ae
fix: tensor2vid
sayakpaul Jan 23, 2024
ab9569f
fix: tensor2vid
sayakpaul Jan 23, 2024
58844fe
fix: doc
sayakpaul Jan 23, 2024
11fd646
fix model offload sequence.
sayakpaul Jan 23, 2024
6778e6b
update
DN6 Jan 26, 2024
9f73792
update
DN6 Jan 26, 2024
0ecd79b
add docs
DN6 Jan 26, 2024
eefa6cc
update
DN6 Jan 26, 2024
a9fecb3
update
DN6 Jan 28, 2024
2781791
update
DN6 Jan 28, 2024
0d1ea8c
update
DN6 Jan 28, 2024
1d3846d
update
DN6 Jan 29, 2024
db0213a
update
DN6 Jan 29, 2024
f2964ba
improve docs.
sayakpaul Jan 30, 2024
2d5071e
docstring to the pipeline
sayakpaul Jan 30, 2024
4cd0083
licensing in the pipeline scripts.
sayakpaul Jan 30, 2024
6012362
clean up the docstring of the UNet.
sayakpaul Jan 30, 2024
09519a1
Merge branch 'main' into convert-i2vgen-xl
sayakpaul Jan 30, 2024
23935a9
make _resize_bilinear and _center_crop_wide accept torch tensors as w…
sayakpaul Jan 30, 2024
57b20ee
data type fix
sayakpaul Jan 30, 2024
4d51fe8
unint8 > uint8
sayakpaul Jan 30, 2024
14404b2
channels_last
sayakpaul Jan 30, 2024
24d813e
debug
sayakpaul Jan 30, 2024
bf1eb40
fix download path for the example image
sayakpaul Jan 30, 2024
f35f3d8
fix: download path again
sayakpaul Jan 30, 2024
2880442
use cross_attention_dim to initialize
sayakpaul Jan 30, 2024
698f9c1
debug
sayakpaul Jan 30, 2024
68cbe59
debu
sayakpaul Jan 30, 2024
45c682e
reduce hidden size of the vision encoder
sayakpaul Jan 30, 2024
3a701a2
go
sayakpaul Jan 30, 2024
0a4c686
debug more
sayakpaul Jan 30, 2024
758acc0
reduce more hidden dim
sayakpaul Jan 30, 2024
0bfd042
remove callback and callback_steps from required params check
sayakpaul Jan 30, 2024
dd5a8f0
remove print
sayakpaul Jan 30, 2024
50d4606
assertions for the default case..
sayakpaul Jan 30, 2024
2a4c727
skip test_attention_slicing_forward_pass as it's depcrecated.
sayakpaul Jan 30, 2024
66034c5
feature_extractor.
sayakpaul Jan 30, 2024
b1819cd
feature_extractor.
sayakpaul Jan 30, 2024
48e7694
relax precision
sayakpaul Jan 30, 2024
0f230f9
relax more.
sayakpaul Jan 30, 2024
836fb67
torch.manual_seed(0)
sayakpaul Jan 30, 2024
a5cb5b1
relax precision
sayakpaul Jan 30, 2024
947e63a
uncomment batching tests
sayakpaul Jan 30, 2024
216b9dd
debug
sayakpaul Jan 30, 2024
7c81052
debug more
sayakpaul Jan 30, 2024
31409af
debug more
sayakpaul Jan 30, 2024
2faaffe
make the pt to pil utilities better
sayakpaul Jan 30, 2024
a29c201
debug
sayakpaul Jan 30, 2024
4adc851
format string
sayakpaul Jan 30, 2024
9cb0b84
okay
sayakpaul Jan 30, 2024
0b9a9ef
force_feature_extractor_resize
sayakpaul Jan 30, 2024
7cffe74
debug
sayakpaul Jan 30, 2024
3d0ef8b
expand to samples's shape
sayakpaul Jan 30, 2024
ca422ef
check
sayakpaul Jan 30, 2024
cfafe51
fix: batching behaviour for fps
sayakpaul Jan 30, 2024
d85bd2d
test_inference_batch_single_identical
sayakpaul Jan 30, 2024
7324291
relax test_inference_batch_single_identical
sayakpaul Jan 30, 2024
bb10302
relax a bit more.
sayakpaul Jan 30, 2024
5e79f3d
test_num_videos_per_prompt
sayakpaul Jan 30, 2024
f3c58a2
let's go.
sayakpaul Jan 30, 2024
7cb384c
remove extra prints.
sayakpaul Jan 30, 2024
f64c3d2
remove force_feature_extractor_resize
sayakpaul Jan 30, 2024
1675b07
remove force_feature_extractor_resize
sayakpaul Jan 30, 2024
8c20445
fix: test_num_videos_per_prompt
sayakpaul Jan 30, 2024
2221bbc
fix: test_num_videos_per_prompt
sayakpaul Jan 30, 2024
6584988
fix: test_num_videos_per_prompt
sayakpaul Jan 30, 2024
7fad585
fix a bit more
sayakpaul Jan 30, 2024
6109178
style
sayakpaul Jan 30, 2024
7b281b5
add: slow test
sayakpaul Jan 30, 2024
24cfc1d
flattened image slice
sayakpaul Jan 30, 2024
edd6cc5
variant
sayakpaul Jan 30, 2024
6ebb265
assertion
sayakpaul Jan 30, 2024
a03649f
add: note about memory optimization
sayakpaul Jan 30, 2024
c78617e
being to cpu before calling numpy()
sayakpaul Jan 30, 2024
76a42b3
finish slow test fixes
sayakpaul Jan 30, 2024
ca4a977
Merge branch 'main' into convert-i2vgen-xl
sayakpaul Jan 30, 2024
182447a
Empty-Commit
sayakpaul Jan 30, 2024
a54facc
pin peft dependencies.
sayakpaul Jan 30, 2024
72e466e
remove attention slicing and unload_lora
sayakpaul Jan 30, 2024
5ba9b4a
remove attention mask
sayakpaul Jan 30, 2024
693d827
timsteps.
sayakpaul Jan 30, 2024
88f03a3
add missing entries in the unet docstring
sayakpaul Jan 30, 2024
9bf706e
Apply suggestions from code review
sayakpaul Jan 30, 2024
a682dca
remove textual inversion
sayakpaul Jan 30, 2024
a9c23e8
remove _to_tensor on fps.
sayakpaul Jan 30, 2024
ec5694a
leverage VaeImageProcessor.
sayakpaul Jan 30, 2024
e6c07b5
remove unnecessary config vars.
sayakpaul Jan 30, 2024
8c980df
use num_attention_heads
sayakpaul Jan 30, 2024
5cbaf2e
clean up conv_out layer creation
sayakpaul Jan 30, 2024
be518c8
refactor attention logic for cleaning up norm handling
sayakpaul Jan 30, 2024
b8fde10
Apply suggestions from code review
sayakpaul Jan 31, 2024
a0e6db1
simplify norm_type checks in the forwards.
sayakpaul Jan 31, 2024
c6c0b31
add copied from statement where missing
sayakpaul Jan 31, 2024
5038c21
move _center_crop and _resize_bilinear out of the encode image function
sayakpaul Jan 31, 2024
6ac5ec5
Merge branch 'main' into convert-i2vgen-xl
sayakpaul Jan 31, 2024
6d7fb87
update
DN6 Jan 31, 2024
2c1caea
Merge branch 'main' into convert-i2vgen-xl
DN6 Jan 31, 2024
513ab1f
clean up
DN6 Jan 31, 2024
13fcc20
update
DN6 Jan 31, 2024
7b7f075
update
DN6 Jan 31, 2024
fe50995
change checkpoints.
sayakpaul Jan 31, 2024
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -284,6 +284,8 @@
       title: DiffEdit
     - local: api/pipelines/dit
       title: DiT
+    - local: api/pipelines/i2vgenxl
+      title: I2VGen-XL
     - local: api/pipelines/pix2pix
       title: InstructPix2Pix
     - local: api/pipelines/kandinsky
57 changes: 57 additions & 0 deletions docs/source/en/api/pipelines/i2vgenxl.md
@@ -0,0 +1,57 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# I2VGen-XL

[I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models](https://hf.co/papers/2311.04145.pdf) by Shiwei Zhang, Jiayu Wang, Yingya Zhang, Kang Zhao, Hangjie Yuan, Zhiwu Qing, Xiang Wang, Deli Zhao, and Jingren Zhou.

The abstract from the paper is:

*Video synthesis has recently made remarkable strides benefiting from the rapid development of diffusion models. However, it still encounters challenges in terms of semantic accuracy, clarity and spatio-temporal continuity. They primarily arise from the scarcity of well-aligned text-video data and the complex inherent structure of videos, making it difficult for the model to simultaneously ensure semantic and qualitative excellence. In this report, we propose a cascaded I2VGen-XL approach that enhances model performance by decoupling these two factors and ensures the alignment of the input data by utilizing static images as a form of crucial guidance. I2VGen-XL consists of two stages: i) the base stage guarantees coherent semantics and preserves content from input images by using two hierarchical encoders, and ii) the refinement stage enhances the video's details by incorporating an additional brief text and improves the resolution to 1280×720. To improve the diversity, we collect around 35 million single-shot text-video pairs and 6 billion text-image pairs to optimize the model. By this means, I2VGen-XL can simultaneously enhance the semantic accuracy, continuity of details and clarity of generated videos. Through extensive experiments, we have investigated the underlying principles of I2VGen-XL and compared it with current top methods, which can demonstrate its effectiveness on diverse data. The source code and models will be publicly available at [this https URL](https://i2vgen-xl.github.io/).*

The original codebase can be found [here](https://github.com/ali-vilab/i2vgen-xl/). The model checkpoints can be found [here](https://huggingface.co/ali-vilab/).

<Tip>

Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. Also, to learn more about reducing the memory usage of this pipeline, refer to the [Reduce memory usage](../../using-diffusers/svd#reduce-memory-usage) section of the SVD guide; a minimal sketch is shown right after this tip.

</Tip>
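
As a rough illustration of the memory tip above, here is a minimal sketch of the usual diffusers memory knobs applied to this pipeline. It assumes an `ali-vilab/i2vgen-xl` checkpoint with an `fp16` variant and an environment with `accelerate` installed; treat the checkpoint name and variant as assumptions and see the linked SVD guide for the full discussion.

```python
import torch

from diffusers import I2VGenXLPipeline

# Load the pipeline in half precision to roughly halve the memory footprint.
# The checkpoint id and the availability of an fp16 variant are assumptions here.
pipeline = I2VGenXLPipeline.from_pretrained(
    "ali-vilab/i2vgen-xl", torch_dtype=torch.float16, variant="fp16"
)

# Keep only the sub-model currently running on the GPU (requires accelerate).
pipeline.enable_model_cpu_offload()

# Decode the generated frames in slices instead of all at once.
pipeline.vae.enable_slicing()
```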

Sample output with I2VGenXL:

<table>
<tr>
<td><center>
masterpiece, bestquality, sunset.
<br>
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/i2vgen-xl-example.gif"
alt="library"
style="width: 300px;" />
</center></td>
</tr>
</table>

## Notes

* I2VGenXL always uses a `clip_skip` value of 1, which means it relies on the penultimate layer's representations from the CLIP text encoder.
* It can generate videos whose quality is often on par with [Stable Video Diffusion](../../using-diffusers/svd) (SVD).
* Unlike SVD, it also accepts text prompts as inputs.
* It can generate higher resolution videos.
* When using the [`DDIMScheduler`] (the default for this pipeline), fewer than 50 inference steps tend to produce poor results; see the usage sketch below.
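
The following is a minimal end-to-end sketch of calling the pipeline. The checkpoint id, the example image URL, and the prompt, seed, and guidance values are illustrative assumptions rather than fixed requirements:

```python
import torch

from diffusers import I2VGenXLPipeline
from diffusers.utils import export_to_gif, load_image

pipeline = I2VGenXLPipeline.from_pretrained(
    "ali-vilab/i2vgen-xl", torch_dtype=torch.float16, variant="fp16"
)
pipeline.enable_model_cpu_offload()

# Any RGB image can serve as the conditioning frame; this URL is only an example.
image = load_image(
    "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/i2vgen_xl_images/img_0009.png"
).convert("RGB")

prompt = "Papers were floating in the air on a table in the library"
negative_prompt = "Distorted, blurry, low resolution, static, disfigured"
generator = torch.manual_seed(8888)

# 50 steps with the default DDIMScheduler, in line with the note above.
frames = pipeline(
    prompt=prompt,
    image=image,
    negative_prompt=negative_prompt,
    num_inference_steps=50,
    guidance_scale=9.0,
    generator=generator,
).frames[0]

export_to_gif(frames, "i2v.gif")
```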

## I2VGenXLPipeline
[[autodoc]] I2VGenXLPipeline
- all
- __call__

## I2VGenXLPipelineOutput
[[autodoc]] pipelines.i2vgen_xl.pipeline_i2vgen_xl.I2VGenXLPipelineOutput