[BUG]: Hybrid Parallel Plugin, zero_stage=1, zero_cpu_offload=true: terminate called after throwing an instance of 'c10::Error' what(): CUDA error: unspecified launch failure. CUDA kernel errors might be asynchronously reported at some other API call #6266

Open
2 tasks done
happynaruto opened this issue Apr 16, 2025 · 0 comments
Labels
bug Something isn't working

Comments

@happynaruto

Is there an existing issue for this bug?

  • I have searched the existing issues

The bug has not been fixed in the latest main branch

  • I have checked the latest main branch

Do you feel comfortable sharing a concise (minimal) script that reproduces the error? :)

Yes, I will share a minimal reproducible script.

🐛 Describe the bug

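colossal=0.4.9, Hybrid Parallel Plugin, zero_stage=1, zero_cpu_offload=true. I am training QWQ32B on eight A100 GPUs. With pp=2 and tp=4 the program runs normally, but GPU memory usage is very low: only about 20 GB of each 80 GB card is used, while CPU memory usage is high and fills the server's host memory. After increasing max_length, the run fails with:

terminate called after throwing an instance of 'c10::Error' what(): CUDA error: unspecified launch failure. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.

How can I increase GPU memory usage and reduce CPU memory usage?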
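For context, the reported memory split can be sanity-checked with a back-of-the-envelope estimate. The sketch below is not taken from the report and makes several assumptions: roughly 32e9 parameters, 16-bit working weights and gradients, Adam with fp32 master weights (12 bytes of optimizer state per parameter), and all optimizer state moved to host memory when CPU offload is on. The function `estimate_memory` and its arguments are hypothetical names used only for illustration; activations and framework overhead are ignored.

```python
GB = 1e9  # decimal gigabytes, to keep the arithmetic easy to follow


def estimate_memory(num_params, tp_size, pp_size, world_size, cpu_offload):
    """Rough per-rank memory split for mixed-precision Adam with ZeRO-1.

    Hypothetical helper: all byte counts below are assumptions, not measurements.
    """
    # Ranks not consumed by tensor or pipeline parallelism form the data-parallel group.
    dp_size = world_size // (tp_size * pp_size)
    # Tensor and pipeline parallelism split the parameters across tp * pp ranks.
    params_per_rank = num_params / (tp_size * pp_size)

    weights = 2 * params_per_rank  # assumed 16-bit working weights
    grads = 2 * params_per_rank    # assumed 16-bit gradients
    # Assumed fp32 master weights + Adam first and second moments = 12 bytes per
    # parameter; ZeRO-1 shards this state across the data-parallel group only.
    optim = 12 * params_per_rank / dp_size

    gpu = weights + grads + (0 if cpu_offload else optim)
    cpu = optim if cpu_offload else 0
    return {
        "dp_size": dp_size,
        "gpu_gb_per_rank": gpu / GB,
        "cpu_gb_per_rank": cpu / GB,
        "cpu_gb_total": cpu * world_size / GB,
    }


est = estimate_memory(num_params=32e9, tp_size=4, pp_size=2, world_size=8, cpu_offload=True)
print(est)
```

Under these assumptions, tp=4 and pp=2 on eight GPUs leave a data-parallel group of size 1, so ZeRO stage 1 has nothing to shard across. Each GPU then holds about 16 GB of weights and gradients, and the host holds about 384 GB of optimizer state in total. That is roughly in line with the ~20 GB per GPU and the exhausted host memory described above, though the real figures depend on precision, optimizer, and activation memory.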

Environment

No response

@happynaruto happynaruto added the bug Something isn't working label Apr 16, 2025