
After an interrupted experiment is restarted, how can wandb resume the previous run instead of starting a new one? #7855


Open
1 task done
FloSophorae opened this issue Apr 26, 2025 · 0 comments
Labels: bug (Something isn't working), pending (This problem is yet to be addressed)

Comments


FloSophorae commented Apr 26, 2025

Reminder

  • I have read the above rules and searched the existing issues.

System Info

  • llamafactory version: 0.9.3.dev0
  • Platform: Linux-5.4.250-2-velinux1u1-amd64-x86_64-with-glibc2.31
  • Python version: 3.10.16
  • PyTorch version: 2.6.0+cu124 (GPU)
  • Transformers version: 4.51.3
  • Datasets version: 3.5.0
  • Accelerate version: 1.6.0
  • PEFT version: 0.15.1
  • TRL version: 0.9.6
  • GPU type: NVIDIA A800-SXM4-80GB
  • GPU number: 4
  • GPU memory: 79.35GB
  • DeepSpeed version: 0.16.6
  • vLLM version: 0.8.4
  • Git commit: d07983d

Reproduction

My experiment was interrupted partway through, and I now want to rerun it. I use wandb and set run_name and my own wandb key. I am resuming training from the previous checkpoint and keeping the wandb run_name and key exactly the same, but wandb still opens a new run. How can I make wandb continue logging to the previous run instead? Thanks for your help.
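
For context, below is a minimal sketch of how run resumption generally works in wandb itself; this is the standard wandb API rather than a confirmed LLaMA-Factory option, and the run ID, project, and run name are hypothetical placeholders. wandb identifies a run by its immutable run ID, not by run_name, so reusing the same run_name alone still creates a new run:

```python
import os
import wandb

# Point wandb at the interrupted run before the trainer calls wandb.init().
# The ID is visible in the old run's URL; "abc123xy" is a hypothetical placeholder.
os.environ["WANDB_RUN_ID"] = "abc123xy"
os.environ["WANDB_RESUME"] = "must"   # fail loudly rather than silently start a fresh run

# When the training framework later calls wandb.init() (e.g. via report_to="wandb"),
# it picks up these environment variables and appends new steps to the existing run.
run = wandb.init(project="my-project", name="my-run-name")
print(run.id, run.resumed)
wandb.finish()
```

If you control the `wandb.init()` call directly, the same effect can be achieved by passing `id="abc123xy"` and `resume="must"` as arguments instead of setting the environment variables.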

Others

No response

FloSophorae added the bug and pending labels on Apr 26, 2025