flux.1-dev device_map didn't work #9127


Closed
hznnnnnn opened this issue Aug 8, 2024 · 33 comments

Comments

@hznnnnnn commented Aug 8, 2024

I tried to use device_map to use multiple GPUs, but it didn't work. How can I use all of my GPUs?

@a-r-r-o-w (Member)

What did you try? What did not work? What environment are you running in? Could you update the description with a minimal reproducible snippet that replicates the behaviour you are facing, and run diffusers-cli env to tell us about your environment?

cc @sayakpaul

@hznnnnnn (Author) commented Aug 8, 2024

> What did you try? What did not work? What environment are you running in? Could you update the description with a minimal reproducible snippet that replicates the behaviour you are facing, and run diffusers-cli env to tell us about your environment?
>
> cc @sayakpaul

I tried to use device_map in FluxPipeline: pipe = FluxPipeline.from_pretrained(model_path, torch_dtype=torch.bfloat16, use_safetensors=True, device_map="balanced"), but it didn't work.
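
To see where accelerate actually placed each component, printing the pipeline's device map right after loading is helpful. A minimal sketch, assuming model_path points at a local FLUX.1-dev checkout:

import torch
from diffusers import FluxPipeline

model_path = "/path/to/FLUX.1-dev"  # assumed local checkout
pipe = FluxPipeline.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, use_safetensors=True, device_map="balanced"
)
# hf_device_map shows which device (GPU index or 'cpu') each component was assigned to.
print(pipe.hf_device_map)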

@hznnnnnn (Author) commented Aug 8, 2024

> What did you try? What did not work? What environment are you running in? Could you update the description with a minimal reproducible snippet that replicates the behaviour you are facing, and run diffusers-cli env to tell us about your environment?
>
> cc @sayakpaul

I run on Ubuntu, and I have 8 GPUs.

@sayakpaul (Member)

Please post the error trace.

@hznnnnnn (Author) commented Aug 8, 2024

> Please post the error trace.

RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling cublasCreate(handle)
But I have 8 GPUs, and only 2 were being used.

@sayakpaul (Member)

Provide the full error trace please.

@hznnnnnn (Author) commented Aug 8, 2024

> Provide the full error trace please.

Now I got it to run. This is my code:
max_memory = {0:"20GB", 1:"20GB", 2:"20GB", 3:"20GB", 4:"20GB", 5:"20GB", 6:"20GB", 7:"20GB"}
pipe = FluxPipeline.from_pretrained(model_path, torch_dtype=torch.bfloat16, use_safetensors=True, device_map="balanced", max_memory=max_memory)
#pipe.enable_model_cpu_offload() #save some VRAM by offloading the model to CPU. Remove this if you have enough GPU power
#pipe.enable_sequential_cpu_offload()

prompt = "A real person, with long red hair, hot body, sexy, lovely, 8k"
image = pipe(
    prompt,
    height=1024,
    width=1024,
    guidance_scale=3.5,
    output_type="pil",
    num_inference_steps=50,
    max_sequence_length=512,
    generator=torch.Generator("cpu").manual_seed(0)
).images[0]
image.save("flux-dev.png")

But now only one GPU is doing any work.
[screenshot: GPU utilization]
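
One way to confirm which GPUs actually hold weights is to query PyTorch's per-device allocation counters after the pipeline has loaded. A minimal sketch, assuming the pipe object from the snippet above has already been created:

import torch

# Report how much memory this process has allocated on each visible GPU.
for i in range(torch.cuda.device_count()):
    allocated_gib = torch.cuda.memory_allocated(i) / 1024**3
    print(f"cuda:{i}: {allocated_gib:.2f} GiB allocated")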

@hznnnnnn (Author) commented Aug 8, 2024

> Provide the full error trace please.

This is the full error trace:
(flux.1-dev) root@master:/home/omnisky/flux.1-dev/FLUX.1-dev# python test.py
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00, 1.10it/s]
Loading pipeline components...: 86%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 6/7 [00:08<00:02, 2.48s/it]You set add_prefix_space. The tokenizer needs to be converted from the slow tokenizers
Loading pipeline components...: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:09<00:00, 1.31s/it]
0%| | 0/50 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/omnisky/flux.1-dev/FLUX.1-dev/test.py", line 13, in
image = pipe(
File "/root/anaconda3/envs/flux.1-dev/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/root/anaconda3/envs/flux.1-dev/lib/python3.10/site-packages/diffusers/pipelines/flux/pipeline_flux.py", line 696, in call
noise_pred = self.transformer(
File "/root/anaconda3/envs/flux.1-dev/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/anaconda3/envs/flux.1-dev/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/root/anaconda3/envs/flux.1-dev/lib/python3.10/site-packages/accelerate/hooks.py", line 169, in new_forward
output = module._old_forward(*args, **kwargs)
File "/root/anaconda3/envs/flux.1-dev/lib/python3.10/site-packages/diffusers/models/transformers/transformer_flux.py", line 362, in forward
hidden_states = self.x_embedder(hidden_states)
File "/root/anaconda3/envs/flux.1-dev/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/anaconda3/envs/flux.1-dev/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/root/anaconda3/envs/flux.1-dev/lib/python3.10/site-packages/accelerate/hooks.py", line 169, in new_forward
output = module._old_forward(*args, **kwargs)
File "/root/anaconda3/envs/flux.1-dev/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 117, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling cublasCreate(handle)

And this is the code:
pipe = FluxPipeline.from_pretrained(model_path, torch_dtype=torch.bfloat16, use_safetensors=True, device_map="balanced")
#pipe.enable_model_cpu_offload() #save some VRAM by offloading the model to CPU. Remove this if you have enough GPU power
#pipe.enable_sequential_cpu_offload()

prompt = "A real person, with long red hair, hot body, sexy, lovely, 8k"
image = pipe(
    prompt,
    height=1024,
    width=1024,
    guidance_scale=3.5,
    output_type="pil",
    num_inference_steps=50,
    max_sequence_length=512,
    generator=torch.Generator("cpu").manual_seed(0)
).images[0]
image.save("flux-dev.png")

@sayakpaul (Member) commented Aug 8, 2024

I am able to run the following:

from diffusers import FluxPipeline 
import torch 


model_path = "black-forest-labs/FLUX.1-dev"
pipe = FluxPipeline.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="balanced"
)
print(pipe.hf_device_map) 
image = pipe(
    prompt="dog",
    height=1024,
    width=1024,
    guidance_scale=3.5,
    output_type="pil",
    num_inference_steps=50,
    max_sequence_length=512,
).images[0]

I am running this on two A100s.

This is what the device map looks like:

{'transformer': 0, 'vae': 0, 'text_encoder_2': 1, 'text_encoder': 1}

@sayakpaul (Member)

Able to do this with the following setup too:

pipe = FluxPipeline.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="balanced", max_memory={0:"24GB", 1:"24GB"}
)

@sayakpaul (Member)

Also able to run with

pipe = FluxPipeline.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="balanced", max_memory={0:"16GB", 1:"16GB", 2:"24GB"}
)

@hznnnnnn (Author) commented Aug 9, 2024

> Also able to run with
>
> pipe = FluxPipeline.from_pretrained(
>     model_path, torch_dtype=torch.bfloat16, device_map="balanced", max_memory={0:"16GB", 1:"16GB", 2:"24GB"}
> )

This is my code:

import torch
from diffusers import FluxPipeline
from accelerate import PartialState

model_path = "/home/omnisky/flux.1-dev/FLUX.1-dev"

pipe = FluxPipeline.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map="balanced", max_memory={0:"16GB", 1:"16GB", 2:"16GB", 3:"16GB"})
#pipe.enable_model_cpu_offload() #save some VRAM by offloading the model to CPU. Remove this if you have enough GPU power
#pipe.enable_sequential_cpu_offload()

print(pipe.hf_device_map)

prompt = "A real person, with long red hair, hot body, sexy, lovely, 8k"
image = pipe(
    prompt,
    height=1024,
    width=1024,
    guidance_scale=3.5,
    output_type="pil",
    num_inference_steps=50,
    max_sequence_length=512,
    generator=torch.Generator("cpu").manual_seed(0)
).images[0]
image.save("flux-dev.png")

The device map is: {'transformer': 'cpu', 'text_encoder_2': 0, 'text_encoder': 1, 'vae': 2}
Why does the transformer end up on the CPU? Is it because my GPUs are bad? My GPUs are Tesla P40s, 8 of them.

@hznnnnnn (Author) commented Aug 9, 2024

> Also able to run with
>
> pipe = FluxPipeline.from_pretrained(
>     model_path, torch_dtype=torch.bfloat16, device_map="balanced", max_memory={0:"16GB", 1:"16GB", 2:"24GB"}
> )

I removed generator=torch.Generator("cpu").manual_seed(0), but it always shows the transformer on the CPU.

@sayakpaul (Member)

Probably because of the low VRAM. Maybe try increasing that to 20GB since it's supported?
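
For rough context, a back-of-the-envelope estimate (assuming the FLUX.1-dev transformer has roughly 12B parameters): in bfloat16 the transformer alone needs about 22 GiB, so under a 16GB per-GPU cap no single GPU can hold it, and the balanced map, which assigns whole components to devices, falls back to the CPU for it.

# Rough bf16 footprint of the transformer (assumes ~12B parameters).
params = 12e9          # approximate parameter count
bytes_per_param = 2    # bfloat16
print(f"~{params * bytes_per_param / 1024**3:.0f} GiB")  # ~22 GiB, above a 16GB or 20GB per-GPU budget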

@hznnnnnn (Author) commented Aug 9, 2024

> Probably because of the low VRAM. Maybe try increasing that to 20GB since it's supported?

max_memory={0:"16GB", 1:"16GB", 2:"16GB", 3:"16GB"}) this 16gb increasing to 20GB?

@hznnnnnn (Author) commented Aug 9, 2024

> Probably because of the low VRAM. Maybe try increasing that to 20GB since it's supported?

It's still not working.

@sayakpaul (Member)

I don't know what is happening then. I have provided three different scenarios where it is working. Maybe there's something wrong with your env?

@hznnnnnn (Author) commented Aug 9, 2024

> I don't know what is happening then. I have provided three different scenarios where it is working. Maybe there's something wrong with your env?

Thank you. Maybe I should get some better GPUs.

@sayakpaul (Member) commented Aug 9, 2024

Possible to try out the following?

pipe = FluxPipeline.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="balanced", max_memory={0:"20GB", 1:"20GB", 2:"22GB"}
)

And if it's erroring out, please post the error trace you are seeing with proper formatting. It's really hard to read through raw traces.

@hznnnnnn (Author) commented Aug 9, 2024

> Possible to try out the following?
>
> pipe = FluxPipeline.from_pretrained(
>     model_path, torch_dtype=torch.bfloat16, device_map="balanced", max_memory={0:"20GB", 1:"20GB", 2:"22GB"}
> )
>
> And if it's erroring out, please post the error trace you are seeing with proper formatting. It's really hard to read through raw traces.

Did you mean this?

(flux.1-dev) root@master:/home/omnisky/flux.1-dev/FLUX.1-dev# python test.py
Loading checkpoint shards: 50%|████████████████████████████████████████████████████████████████████████ | 1/2 [00:02<00:02, 2.08s/it]
Loading pipeline components...: 0%| | 0/7 [00:02<?, ?it/s]
Traceback (most recent call last):
File "/home/omnisky/flux.1-dev/FLUX.1-dev/test.py", line 8, in
pipe = FluxPipeline.from_pretrained(
File "/root/anaconda3/envs/flux.1-dev/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
return fn(*args, **kwargs)
File "/root/anaconda3/envs/flux.1-dev/lib/python3.10/site-packages/diffusers/pipelines/pipeline_utils.py", line 876, in from_pretrained
loaded_sub_model = load_sub_model(
File "/root/anaconda3/envs/flux.1-dev/lib/python3.10/site-packages/diffusers/pipelines/pipeline_loading_utils.py", line 700, in load_sub_model
loaded_sub_model = load_method(os.path.join(cached_folder, name), **loading_kwargs)
File "/root/anaconda3/envs/flux.1-dev/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3926, in from_pretrained
) = cls._load_pretrained_model(
File "/root/anaconda3/envs/flux.1-dev/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4400, in _load_pretrained_model
new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
File "/root/anaconda3/envs/flux.1-dev/lib/python3.10/site-packages/transformers/modeling_utils.py", line 936, in _load_state_dict_into_meta_model
set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
File "/root/anaconda3/envs/flux.1-dev/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 416, in set_module_tensor_to_device
new_value = value.to(device)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 2 has a total capacity of 22.38 GiB of which 36.19 MiB is free. Process 7193 has 14.19 GiB memory in use. Including non-PyTorch memory, this process has 8.14 GiB memory in use. Of the allocated memory 8.00 GiB is allocated by PyTorch, and 1.66 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

If I use max_memory={0:"20GB", 1:"20GB", 2:"22GB"}, GPU 2 runs out of memory. The image below shows the GPU usage when my script is not running:
[screenshot: GPU memory usage]

@sayakpaul (Member)

Can you ensure your GPUs don't have any other processes running?

You can do so by first calling torch.cuda.empty_cache() and then starting your script.
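
A minimal sketch for checking that the GPUs are actually free before loading the pipeline; torch.cuda.mem_get_info reports free and total memory per device, which also reflects memory held by other processes:

import torch

torch.cuda.empty_cache()  # release memory cached by this process

# Print free vs. total memory for every visible GPU.
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"cuda:{i}: {free / 1024**3:.1f} GiB free of {total / 1024**3:.1f} GiB")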

@sayakpaul (Member)

Also, possible to try this after updating CUDA and your PyTorch?

@hznnnnnn (Author) commented Aug 9, 2024

> torch.cuda.empty_cache()

It prints None.

@sayakpaul (Member)

It's not supposed to print anything. It's supposed to clear your CUDA cache if there's any residue remaining.

@hznnnnnn (Author) commented Aug 9, 2024

> Also, possible to try this after updating CUDA and your PyTorch?

My CUDA version is 12.1 and my PyTorch version is 2.4.0+cu121; I can't update CUDA.

@hznnnnnn (Author) commented Aug 9, 2024

> Also, possible to try this after updating CUDA and your PyTorch?

Now I can run it, but the transformer always runs on the CPU:
(flux.1-dev) root@master:/home/omnisky/flux.1-dev/FLUX.1-dev# python test.py
None
Loading pipeline components...: 0%| | 0/7 [00:00<?, ?it/s]Some parameters are on the meta device device because they were offloaded to the cpu.
Some parameters are on the meta device device because they were offloaded to the cpu.
Loading pipeline components...: 57%|███████████████████████████████████████████████████████████████████████████████▍ | 4/7 [00:01<00:01, 2.82it/s]You set add_prefix_space. The tokenizer needs to be converted from the slow tokenizers
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00, 1.01s/it]
Loading pipeline components...: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:04<00:00, 1.72it/s]
{'transformer': 'cpu', 'text_encoder_2': 6, 'text_encoder': 7, 'vae': 0}
4%|██████▊ | 2/50 [00:51<20:32, 25.67s/it]

It's very slow, and GPU 0's Volatile GPU-Util is at 98%. My code is:

pipe = FluxPipeline.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="balanced", max_memory={0:"20GB", 3:"20GB", 6:"22GB", 7:"22GB"}
)

@sayakpaul (Member)

#9159 should probably work better for you.

@sayakpaul (Member)

@hznnnnnn did you try it out?

@github-actions (bot)

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot added the stale ("Issues that haven't received updates") label on Sep 14, 2024
@a-r-r-o-w (Member)

Gentle ping to request the status of this issue. If it has been resolved, I think we can close it. As far as I understand, we were unable to replicate the problems, but errors persisted on @hznnnnnn's end, which might have been due to env errors or possibly resolved with the linked PR.

github-actions bot removed the stale ("Issues that haven't received updates") label on Nov 19, 2024
@lizamd commented Nov 25, 2024

Hi @sayakpaul, I am able to use device_map="balanced" and see multiple GPUs computing; however, the performance is much worse than only using 1 GPU. I see that different models are mapped to different GPUs, but does it help with performance? Ideally, we should see a performance improvement when turning on sharding, right? What is the underlying logic of device_map? I also see the models are mapped pretty randomly, sometimes on a GPU and sometimes even on the CPU. I am using 8 AMD MI300x GPUs.

@sayakpaul (Member)

> I am able to use device_map="balanced" and see multiple GPUs computing; however, the performance is much worse than only using 1 GPU. I see that different models are mapped to different GPUs, but does it help with performance? Ideally, we should see a performance improvement when turning on sharding, right?

No, because the computation is not along the lines of "context-parallel". The computation is still sequential. However, the upside is that instead of having to place all the models on a single low-memory GPU, we can use multiple low-memory ones and put them to some use.

Say we have two 16GB cards. If we were to use a single 16GB card, we couldn't have placed all the models on that card, but with two GPUs, we can at least place the different models and run inference. Sharding a single model across different GPUs is also possible; see: https://huggingface.co/docs/diffusers/main/en/training/distributed_inference#model-sharding.
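
For reference, the model-sharding approach in that doc splits the large transformer itself across GPUs instead of assigning whole components to devices. A minimal sketch of the transformer-loading step (the max_memory budgets are illustrative; the doc runs the text encoders and VAE as separate stages):

import torch
from diffusers import FluxTransformer2DModel

# Shard the transformer's weights across two GPUs rather than trying to fit it on one card.
transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    device_map="auto",
    max_memory={0: "16GB", 1: "16GB"},
    torch_dtype=torch.bfloat16,
)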

If you're looking for context-parallelism, then we don't support that natively (we might very soon). For now, your best options to explore are xDiT and https://github.com/chengzeyi/ParaAttention.

Hope that helps.

> Ideally, we should see a performance improvement when turning on sharding, right? What is the underlying logic of device_map? I also see the models are mapped pretty randomly, sometimes on a GPU and sometimes even on the CPU. I am using 8 AMD MI300x GPUs.

The underlying logic of how components get assigned to devices is in _assign_components_to_devices() in diffusers.

@sayakpaul (Member)

I am going to close this issue because we have shown device_map works. Feel free to re-open if not.

#9127 (comment) sheds light on context-parallelism and its support.
