
Error in Research Projects Consistency Training Script (DistributedDataParallel Error) at line 1198 #8477

@KetanMann

Description


Model/Pipeline/Scheduler description

ConsistencyModelPipeline
In the `diffusers/examples/research_projects/consistency_training/` example, when I launch with multiple GPUs, each worker process fails with this error:

```
Traceback (most recent call last):
  File "/kaggle/working/train_cm_ct_unconditional.py", line 1438, in <module>
    main(args)
  File "/kaggle/working/train_cm_ct_unconditional.py", line 1198, in main
    args.huber_c = 0.00054 * args.resolution * math.sqrt(unet.config.in_channels)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1695, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'DistributedDataParallel' object has no attribute 'config'
[2024-06-11 19:37:38,530] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 149) of binary: /opt/conda/bin/python3.10
Traceback (most recent call last):
  File "/opt/conda/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1073, in launch_command
    multi_gpu_launcher(args)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 718, in multi_gpu_launcher
    distrib_run.run(args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
```
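The traceback points at line 1198, where the script reads `unet.config.in_channels`. Under a multi-GPU launch, `accelerator.prepare()` wraps the UNet in `DistributedDataParallel`, and the wrapper does not proxy the underlying model's `config` attribute, so the lookup raises `AttributeError`. Below is a minimal, self-contained sketch of one way to avoid this, using `accelerator.unwrap_model`; the names mirror the training script, but the hard-coded `resolution` and model sizes are placeholder assumptions and this is an illustration, not the committed fix:

```python
import math

from accelerate import Accelerator
from diffusers import UNet2DModel

accelerator = Accelerator()
unet = UNet2DModel(sample_size=32, in_channels=3, out_channels=3)
unet = accelerator.prepare(unet)  # multi-GPU: returns a DistributedDataParallel wrapper

resolution = 32  # assumption; the real script uses args.resolution

# Line 1198 as written fails under DDP, because the wrapper has no `config`:
#   huber_c = 0.00054 * resolution * math.sqrt(unet.config.in_channels)

# Unwrapping first resolves `config` on the real UNet2DModel. In a
# single-process run, unwrap_model() simply returns the model unchanged,
# so the same line works for both launch modes.
base_unet = accelerator.unwrap_model(unet)
huber_c = 0.00054 * resolution * math.sqrt(base_unet.config.in_channels)
print(huber_c)
```

An equivalent option is to compute `huber_c` before `accelerator.prepare()` is called, while `unet` is still the plain model.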

Open source status

  • The model implementation is available.
  • The model weights are available (Only relevant if addition is not a scheduler).

Provide useful links for the implementation

@vanakema
https://github.com/huggingface/diffusers/blob/main/examples/research_projects/consistency_training/train_cm_ct_unconditional.py
```
!accelerate launch train_cm_ct_unconditional.py \
  --dataset_name="cifar10" \
  --dataset_image_column_name="img" \
  --output_dir="/kaggle/working/" \
  --mixed_precision="no" \
  --resolution=32 \
  --max_train_steps=1000 \
  --max_train_samples=10000 \
  --dataloader_num_workers=4 \
  --noise_precond_type="cm" \
  --input_precond_type="cm" \
  --train_batch_size=4 \
  --learning_rate=1e-04 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --use_8bit_adam \
  --use_ema \
  --validation_steps=100 \
  --eval_batch_size=4 \
  --checkpointing_steps=10000 \
  --checkpoints_total_limit=10 \
  --class_conditional \
  --num_classes=10
```
@dg845
