
[Latent Consistency Distillation] training stuck at 0% #5743

@rdcoder33


Describe the bug

When using the LCM-LoRA training script, training gets stuck at 0%. I tried many different parameters and got the same behavior every time.

Reproduction

Follow the steps on the [Latent Consistency Distillation Example] page.

I have used multiple datasets from Hugging Face and hit the same behavior on all of them.

My current dataset is in this format:

data.tar => shard00001.tar ......... shard00030.tar

Each shard contains three files per sample; for example, shard00001.tar has:

file_1.jpg
file_1.json
file_1.txt

I am not sure whether the issue is caused by the dataset or by something else.
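
To check the dataset side, here is a minimal sketch (assuming the `webdataset` package, which the `_wds` script appears to use) that prints the keys WebDataset actually sees in the shards. This is my own debugging snippet, not part of the training script:

```python
import webdataset as wds

# Print the keys of the first few samples. webdataset groups tar members by
# basename, so each sample should show jpg/json/txt for the layout above.
# If the keys show "tar" instead, output.tar contains nested shard tars, and
# the inner shards themselves would need to be passed to the script instead.
dataset = wds.WebDataset("/workspace/output.tar")

for i, sample in enumerate(dataset):
    print(sample["__key__"], sorted(k for k in sample if not k.startswith("__")))
    if i >= 2:
        break
```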

Logs

root@55a014677aad:/workspace/diffusers/examples/consistency_distillation# accelerate launch train_lcm_distill_lora_sdxl_wds.py \
  --pretrained_teacher_model=$MODEL_DIR \
  --pretrained_vae_model_name_or_path=madebyollin/sdxl-vae-fp16-fix \
  --mixed_precision=fp16 \
  --resolution=512 \
  --lora_rank=64 \
  --learning_rate=1e-6 \
  --loss_type="huber" \
  --use_fix_crop_and_size \
  --adam_weight_decay=0.0 \
  --max_train_steps=1000 \
  --max_train_samples=4000 \
  --dataloader_num_workers=1 \
  --train_shards_path_or_url='/workspace/output.tar' \
  --validation_steps=200 \
  --checkpointing_steps=200 \
  --checkpoints_total_limit=10 \
  --train_batch_size=1 \
  --gradient_checkpointing \
  --enable_xformers_memory_efficient_attention \
  --gradient_accumulation_steps=1 \
  --use_8bit_adam \
  --resume_from_checkpoint=latest \
  --report_to=wandb \
  --seed=453645634 \
  --push_to_hub
/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py:384: UserWarning: `log_with=wandb` was passed but no supported trackers are currently installed.
  warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
11/10/2023 12:16:43 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

{'clip_sample_range', 'dynamic_thresholding_ratio', 'variance_type', 'thresholding'} was not found in config. Values will be initialized to default values.
/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py:486: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.
  warnings.warn(
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'attention_type', 'dropout', 'reverse_transformer_layers_per_block'} was not found in config. Values will be initialized to default values.
{'attention_type', 'dropout', 'reverse_transformer_layers_per_block'} was not found in config. Values will be initialized to default values.
11/10/2023 12:16:58 - INFO - __main__ - ***** Running training *****
11/10/2023 12:16:58 - INFO - __main__ -   Num batches each epoch = 4000
11/10/2023 12:16:58 - INFO - __main__ -   Num Epochs = 1
11/10/2023 12:16:58 - INFO - __main__ -   Instantaneous batch size per device = 1
11/10/2023 12:16:58 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 1
11/10/2023 12:16:58 - INFO - __main__ -   Gradient Accumulation steps = 1
11/10/2023 12:16:58 - INFO - __main__ -   Total optimization steps = 1000
Checkpoint 'latest' does not exist. Starting a new training run.
Steps:   0%|                                                                                 | 0/1000 [00:00<?, ?it/s]
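
Since the bar never moves and no exception is raised, my suspicion is that the WebDataset pipeline yields no samples at all (the 0/1000 total just comes from --max_train_steps, so it shows even when the dataloader blocks). A minimal smoke test, under the assumption that the script consumes jpg/txt pairs from the shards:

```python
import webdataset as wds
from torch.utils.data import DataLoader

# Smoke test (assumption: the trainer reads jpg/txt pairs). If this hangs or
# raises, the shards are the problem rather than the training loop itself.
dataset = (
    wds.WebDataset("/workspace/output.tar")
    .decode("torchrgb")      # decode images to CHW float tensors
    .to_tuple("jpg", "txt")  # fails fast if a sample lacks these keys
)
loader = DataLoader(dataset, batch_size=1, num_workers=0)

image, caption = next(iter(loader))
print(image.shape, caption)
```

Separately, the `log_with=wandb` warning at the top of the log suggests wandb is not actually installed in the container even though --report_to=wandb is passed; that should be unrelated to the hang, but run tracking will be silently disabled.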

System Info

On a RunPod cloud GPU (A5000, 24 GB):

runpod/pytorch:2.1.0-py3.10-cuda11.8.0-devel-ubuntu22.04

root@55a014677aad:/workspace# diffusers-cli env

- `diffusers` version: 0.23.0.dev0
- Platform: Linux-5.15.0-84-generic-x86_64-with-glibc2.35
- Python version: 3.10.12
- PyTorch version (GPU?): 2.1.0+cu118 (True)
- Huggingface_hub version: 0.17.3
- Transformers version: 4.35.0
- Accelerate version: 0.24.1
- xFormers version: 0.0.22.post7+cu118
- Using GPU in script?: Yes (single A5000)
- Using distributed or parallel set-up in script?: No

Who can help?

@sayakpaul @patil-suraj @pcuenca @yiyixuxu
