accelerate + FSDP + T2I train saving ckpt error #6705

@Forainest

Describe the bug

I used /examples/text_to_image/train_text_to_image_sdxl.py to fine-tune SDXL with accelerate 0.25.0 + FSDP. When saving a checkpoint, training hangs and a complete checkpoint is never written. I also tried DeepSpeed, and it hangs in the same way. I did not change any code in train_text_to_image_sdxl.py.
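
One possible cause, not confirmed in this report: with a sharded backend, gathering the full weights for a checkpoint is a collective operation, so every rank has to enter the save call. If the save is guarded so that only the main process reaches it, that rank can wait indefinitely for the others. The sketch below shows the guard logic in isolation. The helper name `must_enter_save` is hypothetical, and the backend strings are assumed to match the values of accelerate's `DistributedType` enum.

```python
# Backends where checkpoint saving is assumed to be a collective operation.
# The strings are assumed to match accelerate's DistributedType enum values.
SHARDED_BACKENDS = {"FSDP", "DEEPSPEED"}


def must_enter_save(is_main_process: bool, distributed_type: str) -> bool:
    """Return True if this rank should call the checkpoint-saving routine.

    With a sharded backend every rank must participate; otherwise only the
    main process needs to write the checkpoint.
    """
    return is_main_process or distributed_type in SHARDED_BACKENDS


# Hypothetical use inside a training loop (requires accelerate, not run here):
# if must_enter_save(accelerator.is_main_process,
#                    accelerator.distributed_type.value):
#     accelerator.save_state(save_path)
```

Checking whether the script's save call sits inside an `is_main_process` branch would be a quick way to rule this explanation in or out.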

Reproduction

The accelerate config is:

compute_environment: LOCAL_MACHINE
debug: true
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
    fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
    fsdp_backward_prefetch_policy: BACKWARD_PRE
    fsdp_cpu_ram_efficient_loading: true
    fsdp_forward_prefetch: true
    fsdp_offload_params: true
    fsdp_sharding_strategy: 1
    fsdp_state_dict_type: FULL_STATE_DICT
    fsdp_sync_module_state: true
    fsdp_transformer_layer_cls_to_wrap: UNet2DConditionModel, DownBlock2D, CrossAttnDownBlock2D, UpBlock2D, CrossAttnUpBlock2D
    fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false 
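
Because a config like this is easy to mistype by hand, a small sanity check on the `fsdp_transformer_layer_cls_to_wrap` value can help before launching. The following is a minimal sketch, assuming the value is interpreted as a comma-separated list of class names. The function name `split_wrap_classes` and the sample class names are hypothetical.

```python
from __future__ import annotations

import re

# A class name should be a plain Python identifier.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def split_wrap_classes(value: str) -> tuple[list[str], list[str]]:
    """Split a comma-separated class list into (valid, suspicious) entries."""
    names = [part.strip() for part in value.split(",") if part.strip()]
    valid = [name for name in names if _IDENT.match(name)]
    suspicious = [name for name in names if not _IDENT.match(name)]
    return valid, suspicious


# A period typed in place of a comma fuses two names into one bad entry.
valid, suspicious = split_wrap_classes("BlockA, BlockB. BlockC, BlockD")
print(valid)       # ['BlockA', 'BlockD']
print(suspicious)  # ['BlockB. BlockC']
```

Any entry reported as suspicious would not match a real module class, so the layers it was meant to name would not be wrapped as intended.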

code: /examples/text_to_image/train_text_to_image_sdxl.py
pretrained model and dataset: exactly as described in the README

Logs

No response

System Info

Linux localhost.localdomain 4.14.0-115.el7a.0.1.aarch64

Who can help?

@yiyixuxu @sayakpaul

Labels

bug, stale
