flux.1-dev device_map didn't work #9127
Comments
What did you try? What did not work? What environment are you running in? Could you update the description with a minimal reproducible snippet that replicates the behaviour you are facing, and run it? cc @sayakpaul |
I tried to use device_map in FluxPipeline: pipe = FluxPipeline.from_pretrained(model_path, torch_dtype=torch.bfloat16, use_safetensors=True, device_map="balanced"), but it did not work. |
I am running on Ubuntu, and I have 8 GPUs. |
Please post the error trace. |
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling |
Provide the full error trace please. |
Now I ran it; this is my code: prompt = "A real person, with long red hair, hot body, sexy, lovely, 8k" |
This is the full error trace, and the code: prompt = "A real person, with long red hair, hot body, sexy, lovely, 8k" |
I am able to run the following: from diffusers import FluxPipeline
import torch
model_path = "black-forest-labs/FLUX.1-dev"
pipe = FluxPipeline.from_pretrained(
model_path, torch_dtype=torch.bfloat16, device_map="balanced"
)
print(pipe.hf_device_map)
image = pipe(
prompt="dog",
height=1024,
width=1024,
guidance_scale=3.5,
output_type="pil",
num_inference_steps=50,
max_sequence_length=512,
).images[0]
I am running this on two A100s. This is what the device map looks like: {'transformer': 0, 'vae': 0, 'text_encoder_2': 1, 'text_encoder': 1} |
Able to do this with the following setup too: pipe = FluxPipeline.from_pretrained(
model_path, torch_dtype=torch.bfloat16, device_map="balanced", max_memory={0:"24GB", 1:"24GB"}
) |
Also able to run with pipe = FluxPipeline.from_pretrained(
model_path, torch_dtype=torch.bfloat16, device_map="balanced", max_memory={0:"16GB", 1:"16GB", 2:"24GB"}
) |
this is my code: import torch
from diffusers import FluxPipeline
from accelerate import PartialState
model_path = "/home/omnisky/flux.1-dev/FLUX.1-dev"
pipe = FluxPipeline.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map="balanced", max_memory={0:"16GB", 1:"16GB", 2:"16GB", 3:"16GB"})
#pipe.enable_model_cpu_offload() #save some VRAM by offloading the model to CPU. Remove this if you have enough GPU power
#pipe.enable_sequential_cpu_offload()
print(pipe.hf_device_map)
prompt = "A real person, with long red hair, hot body, sexy, lovely, 8k"
image = pipe(
prompt,
height=1024,
width=1024,
guidance_scale=3.5,
output_type="pil",
num_inference_steps=50,
max_sequence_length=512,
generator=torch.Generator("cpu").manual_seed(0)
).images[0]
image.save("flux-dev.png") the device map is : |
I removed generator=torch.Generator("cpu").manual_seed(0), but it always shows the transformer on the CPU. |
Probably because of the low VRAM. Maybe try increasing that to 20GB since it's supported? |
max_memory={0:"16GB", 1:"16GB", 2:"16GB", 3:"16GB"}) this 16gb increasing to 20GB? |
It's still not working. |
I don't know what is happening then. I have provided three different scenarios where it is working. Maybe there's something wrong with your env? |
Thank you. Maybe I should get some better GPUs. |
Possible to try out the following? pipe = FluxPipeline.from_pretrained(
model_path, torch_dtype=torch.bfloat16, device_map="balanced", max_memory={0:"20GB", 1:"20GB", 2:"22GB"}
)
And if it's erroring out, please post the error trace you are seeing with proper formatting. It's really hard to read through raw traces. |
Did you mean this? (flux.1-dev) root@master:/home/omnisky/flux.1-dev/FLUX.1-dev# python test.py
If I use max_memory={0:"20GB", 1:"20GB", 2:"22GB"}, GPU 2 will run out of memory. The image below shows the GPU usage when I don't have it running: |
Can you ensure your GPUs don't have any other processes running? You can do so by first doing |
Also, possible to try this after updating CUDA and your PyTorch? |
It prints None. |
It's not supposed to print anything. It's supposed to clear your CUDA cache if there's any residue remaining. |
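(For reference, the exact snippet referred to above was not preserved in this thread; based on the description it was presumably a CUDA cache-clearing call roughly along these lines, a minimal sketch rather than the original code:)

import torch

# Release unoccupied memory held by PyTorch's CUDA caching allocator so it
# shows up as free in nvidia-smi. It does not free tensors that are still
# referenced, and it returns None, which is why printing it shows nothing useful.
torch.cuda.empty_cache()

Checking that no other processes are holding GPU memory is typically done from the shell with nvidia-smi before launching the script.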
My CUDA version is 12.1 and my PyTorch version is 2.4.0+cu121; I can't update CUDA. |
Now I can run it, but it always runs on the CPU. It's very slow, and GPU 0's Volatile GPU-Util shows 98%. My code is: pipe = FluxPipeline.from_pretrained( |
#9159 should probably work better for you. |
@hznnnnnn did you try it out? |
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored. |
Gentle ping to request the status of this issue. If it has been resolved, I think we can close it. As far as I understand, we were unable to replicate the problems but errors persisted on @hznnnnnn's end, which might have been due to env errors or may have been resolved with the linked PR. |
Hi @sayakpaul, I am able to use device_map="balanced" and see multiple GPUs computing; however, the performance is much worse than using only one GPU. I see that different models are mapped to different GPUs, but does it help with performance? Ideally, we should see a performance improvement when turning on sharding, right? What is the underlying logic of device_map? I also see the models are mapped pretty randomly, sometimes on a GPU and sometimes even on the CPU. I am using 8 AMD MI300X GPUs. |
No, because the computation is not along the lines of "context-parallel". The computation is still sequential. However, the upside is that instead of having to place all the models on a single low-memory GPU, we can use multiple low-memory ones and put them to some use. Say we have two 16GB cards. If we were to use a single 16GB card, we couldn't have placed all the models on that card, but with two such GPUs, we can at least place the different models across them and run inference. Sharding a single model across different GPUs is also possible, see: https://huggingface.co/docs/diffusers/main/en/training/distributed_inference#model-sharding. If you're looking for context-parallelism, then we don't support that natively (we might very soon). For now, your best options are to explore xDiT and https://github.com/chengzeyi/ParaAttention. Hope that helps. |
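As a rough sketch of the model-sharding route mentioned above (based on the linked guide; the model ID, max_memory values, and other arguments are illustrative and may differ across diffusers versions), the large transformer alone can be split across several GPUs:

import torch
from diffusers import FluxTransformer2DModel

# Shard only the Flux transformer across two GPUs, capping memory per device.
# The linked guide shows how to run the text encoders, the sharded transformer,
# and the VAE as separate stages to complete inference end to end.
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    device_map="auto",
    max_memory={0: "24GB", 1: "24GB"},
    torch_dtype=torch.bfloat16,
)
print(transformer.hf_device_map)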
I am going to close this issue because, as shown above, #9127 (comment) sheds light on context-parallelism and its support. |
I tried to use device_map to use multiple GPUs, but it did not work. How can I use all my GPUs?