Add MAGI-1: Autoregressive Video Generation at Scale #11713
Draft: tolgacangoz wants to merge 132 commits into huggingface:main from tolgacangoz:add-magi-1
+5,929
−4
…te attention mechanism accordingly. Updated initialization parameters and reshaping logic.
…tering and equal split ratio. Add utility functions for resizing and cropping images while preserving aspect ratio. Enhance 3D rotary positional embeddings. Adds `center_grid_hw_indices` and `equal_split_ratio` parameters to the 3D rotary positional embedding function for more flexible configuration. The `center_grid_hw_indices` option centers the spatial grid indices around zero. The `equal_split_ratio` parameter provides an alternative way to divide the embedding dimension equally among the temporal and spatial axes. Updates the Magi1 VAE to utilize these new embedding features, introducing helper functions to prepare the embeddings dynamically based on input tensor dimensions.
Replaces the initial causal 3D convolution in the encoder with a standard `Conv3d` patch embedding layer. This simplifies the model and makes its input processing more consistent with Diffusion Transformer (DiT) architectures. Additionally, this change:
- Removes the unused `Magi1CausalConv3d` class.
- Updates the attention mechanism to use the standard `scaled_dot_product_attention`.
- Sets the default for `sample_posterior` to `True` in the forward pass.
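As an illustration of the `Conv3d` patch embedding idea described above, here is a minimal sketch. The patch size, channel counts, and variable names are assumptions chosen for the example and are not taken from the PR.

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only; not the MAGI-1 configuration.
patch_size = (4, 8, 8)  # (time, height, width)
in_channels, embed_dim = 3, 64

# kernel_size == stride yields non-overlapping patches: one output position per patch.
patch_embedding = nn.Conv3d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)

video = torch.randn(1, in_channels, 8, 32, 32)  # (batch, channels, frames, height, width)
features = patch_embedding(video)               # (batch, embed_dim, 2, 4, 4)

# Flatten the patch grid into a token sequence, as a transformer block expects.
tokens = features.flatten(2).transpose(1, 2)    # (batch, num_patches, embed_dim)
```

Each token then corresponds to one 4x8x8 block of the input, which is the usual DiT/ViT-style tokenization.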
Removes the feature caching logic (`feat_cache`, `feat_idx`) from the encoder, decoder, and their sub-modules. This change significantly simplifies the forward pass implementation by removing stateful cache management. Additionally, this commit replaces the custom `Magi1RMS_norm` with a standard `nn.LayerNorm` and updates several custom causal convolution layers to use standard `nn.Linear` or `nn.Conv3d` layers.
Moves the positional embedding and dropout layers from the main autoencoder class into the decoder module. This improves encapsulation as the embedding is only used within the decoder. The decoder's forward pass is updated to apply the positional embedding and to remove the class token before the final output convolution. Additionally, `quant_conv` is renamed to `quant_linear` to accurately reflect the layer type.
Updates the `Magi1Decoder3d` from a convolutional design to a Transformer-like structure that operates on patches. This change replaces the initial convolutional and middle blocks with a linear projection layer, positional embeddings, and a class token. The logic for these components is moved from the parent `AutoencoderKLMagi1` model into the decoder for better encapsulation.
Removes several custom modules, including `Magi1ResidualBlock`, `Magi1Resample`, and `Magi1UpBlock`. Replaces the previous `Magi1MidBlock` with a more standard transformer-style `Magi1Block`. This change simplifies the overall VAE architecture by consolidating complex, specialized blocks into a more conventional design.
Replaces the custom `Magi1AttentionBlock` with the more generic `diffusers.Attention` module, combined with a new (?) `Magi1AttnProcessor2_0`. This change aligns the implementation with standard library patterns and leverages PyTorch 2.0's `scaled_dot_product_attention` for improved efficiency. The `Magi1Block` is also refactored into a more conventional transformer block structure using `Attention` and `FeedForward` modules.
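A rough sketch of what an attention path built on PyTorch 2.0's `scaled_dot_product_attention` can look like. The function name and the way the projections are passed in are assumptions for illustration; this is not the `Magi1AttnProcessor2_0` code from the PR.

```python
import torch
import torch.nn.functional as F


def sdpa_self_attention(hidden_states, to_q, to_k, to_v, to_out, num_heads):
    """Self-attention over (batch, seq_len, dim) inputs using the fused SDPA kernel."""
    batch, seq_len, dim = hidden_states.shape
    head_dim = dim // num_heads

    def split_heads(x):
        # (batch, seq_len, dim) -> (batch, num_heads, seq_len, head_dim)
        return x.view(batch, seq_len, num_heads, head_dim).transpose(1, 2)

    query = split_heads(to_q(hidden_states))
    key = split_heads(to_k(hidden_states))
    value = split_heads(to_v(hidden_states))

    # Computes softmax(Q K^T / sqrt(head_dim)) V without materializing it by hand.
    attended = F.scaled_dot_product_attention(query, key, value)

    # Merge heads back: (batch, num_heads, seq_len, head_dim) -> (batch, seq_len, dim)
    attended = attended.transpose(1, 2).reshape(batch, seq_len, dim)
    return to_out(attended)
```

The projection layers are passed in explicitly here so the sketch stays independent of any particular `Attention` module layout.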
Refactors the Magi1 VAE decoder to use a more standard transformer-based architecture. This change replaces the previous U-Net-like upsampling blocks with a series of standard transformer blocks, each containing self-attention and a feed-forward network. The custom rotary positional embedding logic and its helper functions have been removed, and the attention processor is simplified to work with the standard `Attention` module. This simplifies the overall model implementation.
Replaces the previous convolutional U-Net style encoder with a Vision Transformer (ViT) based implementation. This new architecture processes the input by dividing it into patches, adding positional embeddings, and then passing the sequence through a series of transformer blocks. The attention processor is also updated to support attention masks, and the model's configuration is adjusted to accommodate the new transformer-specific parameters.
Removes complex and unused parameters from the Magi1 VAE, encoder, and decoder modules. This change refactors the model to use a more standard Transformer architecture, eliminating the previous U-Net-like structure with dimension multipliers and residual blocks. The configuration is now more direct, improving clarity and maintainability.
Simplifies the initialization of the Magi1 VAE, encoder, and decoder. Reorders constructor parameters for clarity and removes unused arguments. The spatial and temporal compression ratios are now derived directly from the `patch_size` configuration, making the relationship more explicit. The pipeline is updated to use these new VAE attributes.
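To make the relationship between `patch_size` and the compression ratios concrete, here is a small sketch. It assumes `patch_size` is ordered `(temporal, height, width)` with square spatial patches; the helper names are hypothetical and not the attributes defined in the PR.

```python
def compression_ratios(patch_size):
    """Return (temporal_ratio, spatial_ratio) for an assumed (t, h, w) patch size."""
    temporal, height, width = patch_size
    if height != width:
        # Assumption of this sketch: a single spatial ratio requires square patches.
        raise ValueError("this sketch assumes square spatial patches")
    return temporal, height


def latent_grid(num_frames, height, width, patch_size):
    """Size of the latent grid when every patch maps to one latent position."""
    temporal_ratio, spatial_ratio = compression_ratios(patch_size)
    return (num_frames // temporal_ratio, height // spatial_ratio, width // spatial_ratio)


print(latent_grid(num_frames=16, height=256, width=256, patch_size=(4, 8, 8)))
# prints (4, 32, 32)
```

A pipeline can use ratios like these to size the initial latent tensor from the requested video dimensions.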
Simplifies the model architecture by removing the quantization and post-quantization convolution layers. This streamlines the `encode` and `decode` methods. The decoder is also updated to process the entire latent tensor at once, removing the previous frame-by-frame processing loop. Additionally, this change updates an import path for the `timm` library and renames an internal variable for consistency.
Updates the conversion script for the MAGI-1 VAE to correctly handle its Vision Transformer (ViT) based architecture. The state dictionary mapping is rewritten to align with the ViT structure. This includes adding logic to split the original checkpoint's combined QKV weights into separate query, key, and value tensors for the `diffusers` model. The model class and its configuration are also updated to reflect the appropriate ViT parameters, ensuring a correct conversion.
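The QKV split mentioned above might look roughly like the following. The key names are hypothetical, and the sketch assumes the fused weight stacks query, key, and value along the first dimension in that order; some checkpoints interleave them per head instead, which would need a different reshape.

```python
import torch


def split_fused_qkv(state_dict, fused_key, q_key, k_key, v_key):
    """Return a new dict where one fused QKV tensor is replaced by three tensors.

    Assumes the fused tensor has shape (3 * dim, ...) laid out as [Q; K; V].
    """
    converted = {name: tensor for name, tensor in state_dict.items() if name != fused_key}
    query, key, value = torch.chunk(state_dict[fused_key], 3, dim=0)
    # clone() so the new tensors do not share storage with the fused one.
    converted[q_key] = query.clone()
    converted[k_key] = key.clone()
    converted[v_key] = value.clone()
    return converted


# Hypothetical key names, for illustration only.
original = {"blocks.0.attn.qkv.weight": torch.arange(48.0).reshape(12, 4)}
converted = split_fused_qkv(
    original,
    fused_key="blocks.0.attn.qkv.weight",
    q_key="blocks.0.attn.to_q.weight",
    k_key="blocks.0.attn.to_k.weight",
    v_key="blocks.0.attn.to_v.weight",
)
```

The same function can be applied to the fused bias, since `torch.chunk` along `dim=0` works for 1-D tensors as well.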
Renames the Magi autoencoder class to align with the "MAGI-1" model name. This refactoring improves consistency and clarity throughout the codebase, including documentation and tests.
Aligns the model naming with the source paper, "MAGI-1". This change refactors the model class, associated files, tests, and documentation to use the `Magi1` prefix for better clarity and consistency.
…ross multiple files
Improve compatibility by handling various PyTorch checkpoint formats. The loader now correctly extracts the state dictionary when it is nested under common keys like "model" or "state_dict". Ensure consistent loading of sharded safetensors by sorting the checkpoint files before merging them.
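The two loader behaviors described above can be sketched with plain dictionaries. The function names are hypothetical, and the shard contents stand in for tensors that would normally be loaded from `.safetensors` files.

```python
def unwrap_state_dict(checkpoint, wrapper_keys=("model", "state_dict")):
    """Return the nested state dict if the checkpoint wraps it under a known key."""
    for key in wrapper_keys:
        nested = checkpoint.get(key)
        if isinstance(nested, dict):
            return nested
    # No wrapper found: assume the checkpoint already is the state dict.
    return checkpoint


def merge_shards(shards_by_filename):
    """Merge shard dicts in sorted filename order so the result is deterministic."""
    merged = {}
    for filename in sorted(shards_by_filename):
        merged.update(shards_by_filename[filename])
    return merged


wrapped = {"state_dict": {"layer.weight": 1.0}, "epoch": 3}
print(unwrap_state_dict(wrapped))  # prints {'layer.weight': 1.0}

shards = {
    "model-00002-of-00002.safetensors": {"b.weight": 2.0},
    "model-00001-of-00002.safetensors": {"a.weight": 1.0},
}
print(list(merge_shards(shards)))  # prints ['a.weight', 'b.weight']
```

Lexicographic sorting matches numeric order only when shard indices are zero-padded, as in the filenames used here.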
Updates the MAGI-1 transformer to use a real-number-based rotary position embedding implementation, replacing the previous complex-number-based approach. This improves compilability. The `Magi1RotaryPosEmbed` class is aligned with the `Wan` implementation, now generating separate cosine and sine frequency tensors. The attention processor is updated accordingly to apply these embeddings. Additionally, the transformer block is simplified by removing an unnecessary linear projection layer.
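To show why the real-number formulation can replace the complex one, the sketch below implements both and they agree numerically. It assumes channels are paired in interleaved order `(x0, x1), (x2, x3), ...`; the function names and the frequency base are illustrative and not taken from the PR.

```python
import torch


def rope_complex(x, angles):
    """Rotate channel pairs by multiplying with exp(i * angle) in complex space."""
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    rotation = torch.polar(torch.ones_like(angles), angles)
    return torch.view_as_real(x_complex * rotation).flatten(-2)


def rope_real(x, cos, sin):
    """Same rotation written with separate cosine and sine tensors, no complex dtype."""
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    rotated = torch.stack((x_even * cos - x_odd * sin, x_even * sin + x_odd * cos), dim=-1)
    return rotated.flatten(-2)


seq_len, dim = 6, 8
positions = torch.arange(seq_len, dtype=torch.float32)
inv_freq = 1.0 / (10000.0 ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
angles = torch.outer(positions, inv_freq)  # (seq_len, dim // 2)

x = torch.randn(seq_len, dim)
out_complex = rope_complex(x, angles)
out_real = rope_real(x, angles.cos(), angles.sin())
```

The real-valued version uses only elementwise multiplies and adds, which is typically easier for compilers and export tooling to handle than complex dtypes.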
Replaces the standard `LayerNorm` for query and key projections with `FP32LayerNorm` to ensure normalization operations are performed in full precision. This improves numerical stability during training and inference, especially when using mixed precision. Additionally, removes unused code, including a kv projection function and an unnecessary layer norm attribute.
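The idea of normalizing in full precision can be sketched as follows. This is a minimal stand-in written for illustration, under the assumption that the input is upcast, normalized in float32, and cast back; it is not the library's `FP32LayerNorm` source.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FP32LayerNormSketch(nn.LayerNorm):
    """LayerNorm that computes in float32 and returns the input's original dtype."""

    def forward(self, inputs):
        weight = self.weight.float() if self.weight is not None else None
        bias = self.bias.float() if self.bias is not None else None
        # Mean and variance are accumulated in float32, avoiding half-precision rounding.
        normalized = F.layer_norm(inputs.float(), self.normalized_shape, weight, bias, self.eps)
        return normalized.to(inputs.dtype)


norm_q = FP32LayerNormSketch(16)
query = torch.randn(2, 4, 16).to(torch.bfloat16)
normalized_query = norm_q(query)  # still bfloat16, but statistics were computed in float32
```

Only the normalization itself runs in float32, so the surrounding attention computation can stay in the lower-precision dtype.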
Updates the MAGI-1 transformer implementation to more closely match the original paper's architecture and hyperparameters. Key changes include:
- Revises the rotary position embedding to use a 3D spatial grid, supporting fine-tuning with rescaled feature shapes.
- Updates attention block parameters, normalization layers, and projection dimensions.
- Introduces distillation logic for timestep embeddings.
- Refactors text, time, and image embedding modules, removing unused positional embedding logic.
- Adds support for variable-length cross-attention sequences using an attention mask.
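For the variable-length cross-attention item, one possible approach is to build a boolean mask from per-sample text lengths and pass it to `scaled_dot_product_attention`. The function name and tensor layout are assumptions for this sketch, not the PR's implementation.

```python
import torch
import torch.nn.functional as F


def cross_attention_with_lengths(query, key, value, text_lengths):
    """Cross-attention that ignores padded key/value positions.

    query: (batch, heads, q_len, head_dim); key, value: (batch, heads, kv_len, head_dim)
    text_lengths: (batch,) number of valid key/value tokens per sample (each >= 1).
    """
    kv_len = key.shape[2]
    positions = torch.arange(kv_len, device=key.device)
    keep = positions[None, :] < text_lengths[:, None]  # (batch, kv_len), True = attend
    # Add singleton head and query axes so the mask broadcasts over both.
    attn_mask = keep[:, None, None, :]
    return F.scaled_dot_product_attention(query, key, value, attn_mask=attn_mask)
```

With a boolean mask, `True` marks positions that take part in attention, so padded tokens contribute nothing to the output. A sample with zero valid tokens would produce an all-masked row, which is why the sketch requires each length to be at least 1.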
… replaces its usage with the VAE encoder.
Updates the `Magi1RotaryPosEmbed` implementation to more closely match the original repo, improving support for multi-resolution and multi-aspect ratio training. The new implementation:
- Simplifies the rotary embedding module by removing unused parameters.
- Calculates a rescale factor to handle varying input resolutions.
- Corrects the embedding generation and application logic within the main transformer model.
- Adds support for `half_channel_vae` and input rescaling.
Updates the application of rotary position embeddings (RoPE) to align with standard transformer practices. The rotary embedding logic is modified to apply rotations to only a partial dimension of the query and key vectors. This also involves changing the rotary embedding module to return separate cosine and sine frequency tensors instead of a concatenated one.
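Partial rotary application can be sketched like this: only the leading `rot_dim` channels of each head are rotated and the rest pass through unchanged. Which channels are rotated and how they are paired (interleaved here) are assumptions of the sketch, as is the function name.

```python
import torch


def apply_partial_rope(x, cos, sin):
    """Rotate the first rot_dim channels of x and leave the remaining channels untouched.

    x: (..., seq_len, head_dim); cos, sin: (seq_len, rot_dim // 2) with rot_dim <= head_dim.
    """
    rot_dim = cos.shape[-1] * 2
    x_rot, x_pass = x[..., :rot_dim], x[..., rot_dim:]
    x_even, x_odd = x_rot[..., 0::2], x_rot[..., 1::2]
    rotated = torch.stack((x_even * cos - x_odd * sin, x_even * sin + x_odd * cos), dim=-1)
    return torch.cat((rotated.flatten(-2), x_pass), dim=-1)


seq_len, head_dim, rot_dim = 5, 16, 8
positions = torch.arange(seq_len, dtype=torch.float32)
inv_freq = 1.0 / (10000.0 ** (torch.arange(0, rot_dim, 2, dtype=torch.float32) / rot_dim))
angles = torch.outer(positions, inv_freq)  # (seq_len, rot_dim // 2)

query = torch.randn(2, 4, seq_len, head_dim)  # (batch, heads, seq_len, head_dim)
rotated_query = apply_partial_rope(query, angles.cos(), angles.sin())
```

Returning cosine and sine as separate tensors lets the rotated width be inferred from their shape, so the same function handles full rotation when `rot_dim == head_dim`.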
…mage or video input into the initial random tensor
Thanks for the opportunity to fix #11519!
Original repo: https://github.com/SandAI-org/MAGI-1
- AutoencoderKLMagi1: Tiling option by @kuantuna -> Feat: Implement tiling in VAE tolgacangoz/diffusers#6
- Magi1Transformer3DModel: Almost done...
- kernels.
- MAGI-1-Diffusers
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.