
Commit 104e886

add docs

wjay committed
1 parent 27d3f1b commit 104e886

3 files changed

+190
-0
lines changed

docs/source/en/_toctree.yml

Lines changed: 2 additions & 0 deletions
@@ -628,6 +628,8 @@
 - sections:
   - local: api/pipelines/allegro
     title: Allegro
+  - local: api/pipelines/chronoedit
+    title: ChronoEdit
   - local: api/pipelines/cogvideox
     title: CogVideoX
   - local: api/pipelines/consisid
Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
<!-- Copyright 2025 The ChronoEdit Team and HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# ChronoEditTransformer3DModel

A Diffusion Transformer model for 3D video-like data, introduced in [ChronoEdit: Towards Temporal Reasoning for Image Editing and World Simulation](https://huggingface.co/papers/2510.04290) from NVIDIA and University of Toronto, by Jay Zhangjie Wu, Xuanchi Ren, Tianchang Shen, Tianshi Cao, Kai He, Yifan Lu, Ruiyuan Gao, Enze Xie, Shiyi Lan, Jose M. Alvarez, Jun Gao, Sanja Fidler, Zian Wang, Huan Ling.

> **TL;DR:** ChronoEdit reframes image editing as a video generation task, using input and edited images as start/end frames to leverage pretrained video models with temporal consistency. A temporal reasoning stage introduces reasoning tokens to ensure physically plausible edits and visualize the editing trajectory.

The model can be loaded with the following code snippet.

```python
import torch
from diffusers import ChronoEditTransformer3DModel

transformer = ChronoEditTransformer3DModel.from_pretrained(
    "nvidia/ChronoEdit-14B-Diffusers", subfolder="transformer", torch_dtype=torch.bfloat16
)
```

## ChronoEditTransformer3DModel

[[autodoc]] ChronoEditTransformer3DModel

## Transformer2DModelOutput

[[autodoc]] models.modeling_outputs.Transformer2DModelOutput
Lines changed: 156 additions & 0 deletions
@@ -0,0 +1,156 @@
<!-- Copyright 2025 The ChronoEdit Team and HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License. -->

<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<a href="https://huggingface.co/docs/diffusers/main/en/tutorials/using_peft_for_inference" target="_blank" rel="noopener">
<img alt="LoRA" src="https://img.shields.io/badge/LoRA-d8b4fe?style=flat"/>
</a>
</div>
</div>

# ChronoEdit

[ChronoEdit: Towards Temporal Reasoning for Image Editing and World Simulation](https://huggingface.co/papers/2510.04290) from NVIDIA and University of Toronto, by Jay Zhangjie Wu, Xuanchi Ren, Tianchang Shen, Tianshi Cao, Kai He, Yifan Lu, Ruiyuan Gao, Enze Xie, Shiyi Lan, Jose M. Alvarez, Jun Gao, Sanja Fidler, Zian Wang, Huan Ling.

> **TL;DR:** ChronoEdit reframes image editing as a video generation task, using input and edited images as start/end frames to leverage pretrained video models with temporal consistency. A temporal reasoning stage introduces reasoning tokens to ensure physically plausible edits and visualize the editing trajectory.

*Recent advances in large generative models have greatly enhanced both image editing and in-context image generation, yet a critical gap remains in ensuring physical consistency, where edited objects must remain coherent. This capability is especially vital for world simulation related tasks. In this paper, we present ChronoEdit, a framework that reframes image editing as a video generation problem. First, ChronoEdit treats the input and edited images as the first and last frames of a video, allowing it to leverage large pretrained video generative models that capture not only object appearance but also the implicit physics of motion and interaction through learned temporal consistency. Second, ChronoEdit introduces a temporal reasoning stage that explicitly performs editing at inference time. Under this setting, target frame is jointly denoised with reasoning tokens to imagine a plausible editing trajectory that constrains the solution space to physically viable transformations. The reasoning tokens are then dropped after a few steps to avoid the high computational cost of rendering a full video. To validate ChronoEdit, we introduce PBench-Edit, a new benchmark of image-prompt pairs for contexts that require physical consistency, and demonstrate that ChronoEdit surpasses state-of-the-art baselines in both visual fidelity and physical plausibility. Project page for code and models: [this https URL](https://research.nvidia.com/labs/toronto-ai/chronoedit).*

The ChronoEdit pipeline is developed by the ChronoEdit Team. The original code is available on [GitHub](https://github.com/nv-tlabs/ChronoEdit), and pretrained models can be found in the [nvidia/ChronoEdit](https://huggingface.co/collections/nvidia/chronoedit) collection on Hugging Face.

### Image Editing

```py
import torch
import numpy as np
from diffusers import AutoencoderKLWan, ChronoEditTransformer3DModel, ChronoEditPipeline
from diffusers.utils import export_to_video, load_image
from transformers import CLIPVisionModel
from PIL import Image

model_id = "nvidia/ChronoEdit-14B-Diffusers"
# The image encoder and VAE are loaded in float32; the transformer runs in bfloat16.
image_encoder = CLIPVisionModel.from_pretrained(model_id, subfolder="image_encoder", torch_dtype=torch.float32)
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
transformer = ChronoEditTransformer3DModel.from_pretrained(model_id, subfolder="transformer", torch_dtype=torch.bfloat16)
pipe = ChronoEditPipeline.from_pretrained(model_id, image_encoder=image_encoder, transformer=transformer, vae=vae, torch_dtype=torch.bfloat16)
pipe.to("cuda")

image = load_image(
    "https://huggingface.co/spaces/nvidia/ChronoEdit/resolve/main/examples/3.png"
)
# Resize to roughly max_area pixels while keeping the aspect ratio,
# rounding each side down to a multiple of mod_value.
max_area = 720 * 1280
aspect_ratio = image.height / image.width
mod_value = pipe.vae_scale_factor_spatial * pipe.transformer.config.patch_size[1]
height = round(np.sqrt(max_area * aspect_ratio)) // mod_value * mod_value
width = round(np.sqrt(max_area / aspect_ratio)) // mod_value * mod_value
print("width", width, "height", height)
image = image.resize((width, height))
prompt = (
    "The user wants to transform the image by adding a small, cute mouse sitting inside the floral teacup, enjoying a spa bath. The mouse should appear relaxed and cheerful, with a tiny white bath towel draped over its head like a turban. It should be positioned comfortably in the cup’s liquid, with gentle steam rising around it to blend with the cozy atmosphere. "
    "The mouse’s pose should be natural—perhaps sitting upright with paws resting lightly on the rim or submerged in the tea. The teacup’s floral design, gold trim, and warm lighting must remain unchanged to preserve the original aesthetic. The steam should softly swirl around the mouse, enhancing the spa-like, whimsical mood."
)

# Temporal reasoning is disabled here; the last frame of the output is the edited image.
output = pipe(
    image=image,
    prompt=prompt,
    height=height,
    width=width,
    num_frames=5,
    num_inference_steps=50,
    guidance_scale=5.0,
    enable_temporal_reasoning=False,
    num_temporal_reasoning_steps=0,
).frames[0]
Image.fromarray((output[-1] * 255).clip(0, 255).astype("uint8")).save("output.png")
```
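
The resizing arithmetic in the example can be checked without loading the model. The following is a minimal sketch rather than part of the pipeline: `target_size` is a hypothetical helper, and the default `mod_value=16` is an assumption (a spatial VAE scale factor of 8 times a patch size of 2). In real use, read the value from `pipe.vae_scale_factor_spatial` and `pipe.transformer.config.patch_size` as the example does.

```python
import math


def target_size(image_width, image_height, max_area=720 * 1280, mod_value=16):
    """Return (width, height) near max_area pixels, each a multiple of mod_value.

    Hypothetical helper mirroring the arithmetic in the example.
    mod_value=16 is an assumed default, not a value read from the model.
    """
    aspect_ratio = image_height / image_width
    # Solve width * height = max_area with height / width = aspect_ratio,
    # then round each side down to a multiple of mod_value.
    height = round(math.sqrt(max_area * aspect_ratio)) // mod_value * mod_value
    width = round(math.sqrt(max_area / aspect_ratio)) // mod_value * mod_value
    return width, height


print(target_size(1280, 720))   # (1280, 720) for a 16:9 landscape input
print(target_size(1024, 1024))  # (960, 960) for a square input
```

Rounding down to a multiple of `mod_value` keeps both sides divisible by the combined VAE and patch stride, at the cost of a slightly smaller area than `max_area` for most aspect ratios.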

Optionally, enable **temporal reasoning** for improved physical consistency. This snippet reuses `pipe`, `image`, `prompt`, `height`, and `width` from the Image Editing example:

```py
output = pipe(
    image=image,
    prompt=prompt,
    height=height,
    width=width,
    num_frames=29,
    num_inference_steps=50,
    guidance_scale=5.0,
    enable_temporal_reasoning=True,
    num_temporal_reasoning_steps=50,
).frames[0]
# Save the full editing trajectory as a video and the last frame as the edited image.
export_to_video(output, "output.mp4", fps=16)
Image.fromarray((output[-1] * 255).clip(0, 255).astype("uint8")).save("output.png")
```
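
The two `num_frames` settings (5 and 29) are easier to compare in latent space. As a rough sketch, assume the Wan VAE compresses time by a factor of 4, with the first frame encoded on its own and each further group of 4 frames mapped to one latent frame. This factor is an assumption, not something stated on this page, and `latent_frames` is a hypothetical helper rather than a pipeline function.

```python
def latent_frames(num_frames, temporal_factor=4):
    """Number of latent frames for a clip of num_frames pixel frames.

    Assumes num_frames = temporal_factor * k + 1, the usual shape constraint
    for Wan-style video VAEs; temporal_factor=4 is an assumption here.
    """
    if (num_frames - 1) % temporal_factor != 0:
        raise ValueError("num_frames - 1 must be divisible by temporal_factor")
    return (num_frames - 1) // temporal_factor + 1


print(latent_frames(5))   # 2
print(latent_frames(29))  # 8
```

Under this assumption, `num_frames=5` leaves two latent frames, which would correspond to the input image and the edited result, while `num_frames=29` adds six intermediate latent frames that plausibly play the role of the reasoning tokens described in the abstract.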

### Inference with 8-Step Distillation LoRA

```py
import torch
import numpy as np
from diffusers import AutoencoderKLWan, ChronoEditTransformer3DModel, ChronoEditPipeline, UniPCMultistepScheduler
from diffusers.utils import export_to_video, load_image
from huggingface_hub import hf_hub_download
from transformers import CLIPVisionModel
from PIL import Image

model_id = "nvidia/ChronoEdit-14B-Diffusers"
image_encoder = CLIPVisionModel.from_pretrained(model_id, subfolder="image_encoder", torch_dtype=torch.float32)
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
transformer = ChronoEditTransformer3DModel.from_pretrained(model_id, subfolder="transformer", torch_dtype=torch.bfloat16)
pipe = ChronoEditPipeline.from_pretrained(model_id, image_encoder=image_encoder, transformer=transformer, vae=vae, torch_dtype=torch.bfloat16)
# Download the distillation LoRA, fuse it into the base weights, and adjust the scheduler's flow shift.
lora_path = hf_hub_download(repo_id=model_id, filename="lora/chronoedit_distill_lora.safetensors")
pipe.load_lora_weights(lora_path)
pipe.fuse_lora(lora_scale=1.0)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config, flow_shift=2.0)
pipe.to("cuda")

image = load_image(
    "https://huggingface.co/spaces/nvidia/ChronoEdit/resolve/main/examples/3.png"
)
max_area = 720 * 1280
aspect_ratio = image.height / image.width
mod_value = pipe.vae_scale_factor_spatial * pipe.transformer.config.patch_size[1]
height = round(np.sqrt(max_area * aspect_ratio)) // mod_value * mod_value
width = round(np.sqrt(max_area / aspect_ratio)) // mod_value * mod_value
print("width", width, "height", height)
image = image.resize((width, height))
prompt = (
    "The user wants to transform the image by adding a small, cute mouse sitting inside the floral teacup, enjoying a spa bath. The mouse should appear relaxed and cheerful, with a tiny white bath towel draped over its head like a turban. It should be positioned comfortably in the cup’s liquid, with gentle steam rising around it to blend with the cozy atmosphere. "
    "The mouse’s pose should be natural—perhaps sitting upright with paws resting lightly on the rim or submerged in the tea. The teacup’s floral design, gold trim, and warm lighting must remain unchanged to preserve the original aesthetic. The steam should softly swirl around the mouse, enhancing the spa-like, whimsical mood."
)

# The distilled model runs in 8 steps with guidance_scale=1.0.
output = pipe(
    image=image,
    prompt=prompt,
    height=height,
    width=width,
    num_frames=5,
    num_inference_steps=8,
    guidance_scale=1.0,
    enable_temporal_reasoning=False,
    num_temporal_reasoning_steps=0,
).frames[0]
export_to_video(output, "output.mp4", fps=16)
Image.fromarray((output[-1] * 255).clip(0, 255).astype("uint8")).save("output.png")
```
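
The distilled setup uses `guidance_scale=1.0`. Assuming the pipeline combines predictions with the standard classifier-free guidance formula (an assumption about the implementation, not something this page states), a scale of 1.0 reduces to the conditional prediction alone. The sketch below uses plain Python lists in place of tensors, and `apply_cfg` is a hypothetical name.

```python
def apply_cfg(noise_uncond, noise_cond, guidance_scale):
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + guidance_scale * (c - u) for u, c in zip(noise_uncond, noise_cond)]


uncond = [0.5, -1.0, 2.0]
cond = [1.0, 0.0, 1.5]

print(apply_cfg(uncond, cond, 1.0))  # [1.0, 0.0, 1.5], identical to cond
print(apply_cfg(uncond, cond, 5.0))  # [3.0, 4.0, -0.5], pushed away from uncond
```

With a scale of 1.0 the unconditional branch contributes nothing to the result, which is consistent with running a few-step distilled model without guidance.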

## ChronoEditPipeline

[[autodoc]] ChronoEditPipeline
  - all
  - __call__

## ChronoEditPipelineOutput

[[autodoc]] pipelines.chronoedit.pipeline_output.ChronoEditPipelineOutput
