[BUG]: Hybrid Parallel Plugin, zero_stage=1, zero_cpu_offload=true: terminate called after throwing an instance of 'c10::Error' what(): CUDA error: unspecified launch failure. CUDA kernel errors might be asynchronously reported at some other API call #6266
Labels
bug
Is there an existing issue for this bug?
The bug has not been fixed in the latest main branch
Do you feel comfortable sharing a concise (minimal) script that reproduces the error? :)
Yes, I will share a minimal reproducible script.
🐛 Describe the bug
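ColossalAI 0.4.9, Hybrid Parallel Plugin, zero_stage=1, zero_cpu_offload=true. I am training QwQ-32B on eight A100 GPUs. With pp=2 and tp=4 the program runs normally, but GPU memory usage is very low: only about 20 GB of each 80 GB card is used, while CPU memory usage is high and fills the server's host memory. After increasing max_length, the run fails with: terminate called after throwing an instance of 'c10::Error' what(): CUDA error: unspecified launch failure. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. How can I increase GPU memory usage and reduce CPU memory usage?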
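For context, the reported memory split is roughly what a back-of-the-envelope estimate predicts. The sketch below is only an approximation under stated assumptions, none of which come from the report: about 32 billion parameters, 2-byte (bf16/fp16) weights and gradients, 12 bytes per parameter of mixed-precision Adam state (fp32 master weights, momentum, variance), and even sharding across the tp x pp ranks. `estimate_memory_gb` is a hypothetical helper, not a ColossalAI function, and it ignores activations and temporary buffers.

```python
def estimate_memory_gb(n_params, tp_size, pp_size, n_gpus,
                       bytes_param=2, bytes_grad=2, bytes_optim=12):
    """Rough memory estimate in GB (1 GB = 1e9 bytes); activations are ignored.

    Assumptions (not from the issue): parameters are split evenly over
    tp_size * pp_size ranks, and ZeRO stage 1 shards optimizer states over
    the remaining data-parallel ranks before offloading them to host memory.
    """
    model_shards = tp_size * pp_size
    dp_size = n_gpus // model_shards          # data-parallel degree left over
    params_per_gpu = n_params / model_shards

    # Weights and gradients stay on the GPU.
    gpu_gb = params_per_gpu * (bytes_param + bytes_grad) / 1e9

    # Optimizer states: sharded over dp ranks, then moved to the host.
    optim_bytes_per_gpu = params_per_gpu * bytes_optim / dp_size
    cpu_total_gb = optim_bytes_per_gpu * n_gpus / 1e9
    return dp_size, gpu_gb, cpu_total_gb


if __name__ == "__main__":
    dp, gpu_gb, cpu_gb = estimate_memory_gb(32e9, tp_size=4, pp_size=2, n_gpus=8)
    print(f"dp_size={dp}, per-GPU weights+grads={gpu_gb:.0f} GB, "
          f"host optimizer states={cpu_gb:.0f} GB")
```

Under these assumptions, pp=2 and tp=4 on eight GPUs leave a data-parallel degree of 1, so ZeRO stage 1 has nothing to shard and the whole optimizer state lands in host memory once offload is enabled. That would be consistent with low GPU usage and saturated CPU memory, though it does not by itself explain the launch failure.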
Environment
No response