Closed
Labels
bug (Something isn't working), stale (Issues that haven't received updates)
Description
Describe the bug
I used /examples/text_to_image/train_text_to_image_sdxl.py to fine-tune SDXL with accelerate 0.25.0 + FSDP. When saving a checkpoint, the process hangs and never writes a complete checkpoint. I also tried DeepSpeed, and it hangs as well. I did not change any code in train_text_to_image_sdxl.py.
Reproduction
The accelerate config is:
compute_environment: LOCAL_MACHINE
debug: true
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: true
  fsdp_offload_params: true
  fsdp_sharding_strategy: 1
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sync_module_state: true
  fsdp_transformer_layer_cls_to_wrap: UNet2DConditionModel, DownBlock2D, CrossAttnDownBlock2D, UpBlock2D, CrossAttnUpBlock2D
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
Code: /examples/text_to_image/train_text_to_image_sdxl.py
Pretrained model and dataset: exactly as described in the README.
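Before digging into the checkpoint hang itself, it may be worth ruling out a malformed fsdp_transformer_layer_cls_to_wrap value, since a stray character in that list produces a name that matches no module class. The sketch below assumes the value is read as a comma-separated list of class names; split_wrap_classes is a hypothetical helper written for this illustration and is not part of accelerate. It uses only the standard library and does not touch the model or the distributed setup.

```python
import re

# Hypothetical helper (not an accelerate API): assumes the config value is a
# comma-separated list of Python class names.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def split_wrap_classes(value):
    """Split the list on commas and separate well-formed names from the rest."""
    names = [part.strip() for part in value.split(",") if part.strip()]
    valid = [name for name in names if _IDENTIFIER.fullmatch(name)]
    invalid = [name for name in names if not _IDENTIFIER.fullmatch(name)]
    return valid, invalid


if __name__ == "__main__":
    # Value taken from the config in this report.
    raw = (
        "UNet2DConditionModel, DownBlock2D, CrossAttnDownBlock2D, "
        "UpBlock2D, CrossAttnUpBlock2D"
    )
    valid, invalid = split_wrap_classes(raw)
    print("valid:", valid)
    print("invalid:", invalid)
```

An entry such as "CrossAttnDownBlock2D. UpBlock2D" (a period in place of a comma) would land in the invalid list, which makes this kind of typo easy to spot. A clean result here only rules out a formatting problem; it says nothing about why saving hangs.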
Logs
No response
System Info
Linux localhost.localdomain 4.14.0-115.el7a.0.1.aarch64
Who can help?