Conversation

@sayakpaul
Member

What does this PR do?

Follow-up of #6396.

This PR adds support for saving a big model's state dict into multiple shards for efficient portability and loading. Adds support for loading the sharded checkpoints, too.

This is much akin to how big models like T5XXL are handled.

Also added a test to ensure that models with _no_split_modules specified can be sharded and loaded back to perform inference, with numerical assertions on the outputs.
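
A minimal sketch of such a save-and-reload round trip (the tiny checkpoint id, shard size, and tolerance below are illustrative assumptions, not the exact test added in this PR):

import tempfile

import torch
from diffusers import UNet2DConditionModel

# Illustrative only: any diffusers model works the same way; the repo id below is an assumption.
model = UNet2DConditionModel.from_pretrained("hf-internal-testing/tiny-stable-diffusion-torch", subfolder="unet")
model.eval()

sample = torch.randn(1, model.config.in_channels, model.config.sample_size, model.config.sample_size)
timestep = torch.tensor([1])
encoder_hidden_states = torch.randn(1, 77, model.config.cross_attention_dim)

with torch.no_grad():
    expected = model(sample, timestep, encoder_hidden_states).sample

with tempfile.TemporaryDirectory() as tmp_dir:
    # A deliberately tiny max_shard_size so even a tiny model gets split into multiple shards.
    model.save_pretrained(tmp_dir, max_shard_size="100KB")
    reloaded = UNet2DConditionModel.from_pretrained(tmp_dir)

with torch.no_grad():
    actual = reloaded(sample, timestep, encoder_hidden_states).sample

assert torch.allclose(expected, actual, atol=1e-5)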

Here's a real use-case. Consider this Transformer2DModel checkpoint: https://huggingface.co/sayakpaul/actual_bigger_transformer/.

It was serialized like so:

from diffusers import Transformer2DModel
from accelerate.utils import compute_module_sizes
from accelerate import init_empty_weights
import torch.nn as nn

def bytes_to_giga_bytes(bytes):
    return f"{(bytes / 1024 / 1024 / 1024):.3f}"

with init_empty_weights():
    # Instantiate on the meta device so we can size things up without allocating memory.
    pixart_transformer = Transformer2DModel.from_config("PixArt-alpha/PixArt-XL-2-1024-MS", subfolder="transformer")
    bigger_transformer = Transformer2DModel.from_config(
        pixart_transformer.config, num_layers=72, num_attention_heads=36, cross_attention_dim=2592,
    )
    module_size = bytes_to_giga_bytes(compute_module_sizes(bigger_transformer)[""])
    print(f"{module_size=} GB")
    pytorch_total_params = sum(p.numel() for p in bigger_transformer.parameters()) / 1e9
    print(f"{pytorch_total_params=} B")

    # A plain stack of Linears of comparable size, just as a sanity check on the numbers above.
    model = nn.Sequential(*[nn.Linear(8944, 8944) for _ in range(1000)])
    module_size = bytes_to_giga_bytes(compute_module_sizes(model)[""])
    print(f"{module_size=} GB")
    pytorch_total_params = sum(p.numel() for p in model.parameters()) / 1e9
    print(f"{pytorch_total_params=} B")

# Materialize real weights this time and serialize them as 10GB shards, pushing to the Hub.
actual_bigger_transformer = Transformer2DModel.from_config(
    pixart_transformer.config, num_layers=72, num_attention_heads=36, cross_attention_dim=2592
)
actual_bigger_transformer.save_pretrained("/raid/.cache/actual_bigger_transformer", max_shard_size="10GB", push_to_hub=True)
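
A quick way to confirm what actually landed on the Hub (a small sketch using huggingface_hub):

from huggingface_hub import list_repo_files

# Should list config.json, an index file (diffusion_pytorch_model.safetensors.index.json),
# and the individual shard files.
print(list_repo_files("sayakpaul/actual_bigger_transformer"))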

As we can see from the Hub repo, its state dict is sharded. To perform inference with the model, all we have to do is this:

from diffusers import Transformer2DModel
import tempfile
import torch

def get_inputs():
    # Dummy PixArt-style inputs: a 128x128 latent, a random timestep, and T5-sized text embeddings.
    sample = torch.randn(1, 4, 128, 128)
    timestep = torch.randint(0, 1000, size=(1, ))
    encoder_hidden_states = torch.randn(1, 120, 4096)

    resolution = torch.tensor([1024, 1024]).repeat(1, 1)
    aspect_ratio = torch.tensor([1.]).repeat(1, 1)
    added_cond_kwargs = {"resolution": resolution, "aspect_ratio": aspect_ratio}
    return sample, timestep, encoder_hidden_states, added_cond_kwargs

with torch.no_grad():
    # max_memory = {0: "15GB"} # reasonable estimate for a consumer-gpu.
    with tempfile.TemporaryDirectory() as tmp_dir:
        # Downloads the sharded checkpoint and dispatches it across the available devices.
        new_model = Transformer2DModel.from_pretrained(
            "sayakpaul/actual_bigger_transformer",
            device_map="auto",
        )

        sample, timestep, encoder_hidden_states, added_cond_kwargs = get_inputs()
        out = new_model(
            hidden_states=sample,
            encoder_hidden_states=encoder_hidden_states,
            timestep=timestep,
            added_cond_kwargs=added_cond_kwargs
        ).sample
        print(f"{out.shape=}, {out.device=}")

I have purposefully not added documentation because all of this will become useful once we use it in the context of a full-fledged pipeline execution (up next) :)

@sayakpaul sayakpaul requested review from SunMarc and yiyixuxu May 1, 2024 10:46
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@sayakpaul
Member Author

@yiyixuxu @SunMarc a gentle ping here.

@yiyixuxu yiyixuxu requested a review from BenjaminBossan May 13, 2024 22:24
Member

@BenjaminBossan BenjaminBossan left a comment


Always delightful to deal with the from_pretrained code ;)

I don't really have any bigger comments, as this should hopefully work well since it's based on the transformers implementation. Only some smaller comments.

Member

@SunMarc SunMarc left a comment


Thanks for your work @sayakpaul! Left a suggestion (not a blocker, we can do it afterwards if needed)! No major comments since @BenjaminBossan did a very thorough review already!

@sayakpaul
Member Author

sayakpaul commented May 29, 2024

> I'd rather have another pair of eyes reviewing it, given it's fairly easy to miss something when iterating/reviewing several times on the same code.

Yeah. @yiyixuxu would be the final approver here :)

Collaborator

@yiyixuxu yiyixuxu left a comment


thanks for the PR!!
I left some comments and questions :)

  revision = kwargs.pop("revision", None)
  _ = kwargs.pop("mirror", None)
- subfolder = kwargs.pop("subfolder", None)
+ subfolder = kwargs.pop("subfolder", None) or ""
Collaborator


why don't we handle it where it fails, then?

we would only need to change one place, no?
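
A hypothetical illustration of that idea (keeping subfolder as None and normalizing it only where the path is built; the names below are placeholders, not the actual diffusers code):

import os

# Hypothetical sketch: `subfolder` stays None until the single place that builds the path.
subfolder = None  # what kwargs.pop("subfolder", None) might return
weights_name = "diffusion_pytorch_model.safetensors"
weights_path = os.path.join(subfolder or "", weights_name)
print(weights_path)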

raise EnvironmentError(
    f"{pretrained_model_name_or_path} does not appear to have a file named {weights_name}."
)
# This should correspond to a shard index file.
Collaborator

@yiyixuxu yiyixuxu May 31, 2024


why do we need to return something different when we can't find the shard index file?

can we do

try:
    model_file = _get_model_file(...)
    ...
except ...:
    model_file = None

Collaborator


I guess I still have this question: why do we need to return None when we can't find a shard index file, whereas for any other file we can't find we raise errors?
Where in the code is this needed?

@sayakpaul
Member Author

sayakpaul commented Jun 3, 2024

@yiyixuxu do the recent changes work for you?

(I have run the tests)

@sayakpaul sayakpaul requested a review from yiyixuxu June 3, 2024 12:35
Collaborator

@yiyixuxu yiyixuxu left a comment


thanks!
I have one question! The rest looks good to me.

raise EnvironmentError(
    f"{pretrained_model_name_or_path} does not appear to have a file named {weights_name}."
)
# This should correspond to a shard index file.
Collaborator


I guess I still have this question: why do we need to return None when we can't find a shard index file, whereas for any other file we can't find we raise errors?
Where in the code is this needed?

Collaborator

@yiyixuxu yiyixuxu left a comment


Never mind - I got confused in my last review!
Good to merge!

@sayakpaul sayakpaul merged commit 7d88711 into main Jun 7, 2024
@sayakpaul sayakpaul deleted the feat-save-sharded-ckpt branch June 7, 2024 09:19
@Wauplin
Collaborator

Wauplin commented Jun 7, 2024

Yay! Great job @sayakpaul ! 🎉

sayakpaul added a commit that referenced this pull request Dec 23, 2024
* feat: support saving a model in sharded checkpoints.

* feat: make loading of sharded checkpoints work.

* add tests

* cleanse the loading logic a bit more.

* more resilience while loading from the Hub.

* parallelize shard downloads by using snapshot_download().

* default to a shard size.

* more fix

* Empty-Commit

* debug

* fix

* quality

* more debugging

* fix more

* initial comments from Benjamin

* move certain methods to loading_utils

* add test to check if the correct number of shards are present.

* add a test to check if loading of sharded checkpoints from the Hub is okay

* clarify the unit when passed as an int.

* use hf_hub for sharding.

* remove unnecessary code

* remove unnecessary function

* lucain's comments.

* fixes

* address high-level comments.

* fix test

* subfolder shenanigans.

* Update src/diffusers/utils/hub_utils.py

Co-authored-by: Lucain <[email protected]>

* Apply suggestions from code review

Co-authored-by: Lucain <[email protected]>

* remove _huggingface_hub_version as not needed.

* address more feedback.

* add a test for local_files_only=True.

* need hf hub to be at least 0.23.2

* style

* final comment.

* clean up subfolder.

* deal with suffixes in code.

* _add_variant default.

* use weights_name_pattern

* remove add_suffix_keyword

* clean up downloading of sharded ckpts.

* don't return something special when using index.json

* fix more

* don't use bare except

* remove comments and catch the errors better

* fix a couple of things when using is_file()

* empty

---------

Co-authored-by: Lucain <[email protected]>