Conversation

@sayakpaul
Member

What does this PR do?

Currently, the fast pipeline tests take about 20 minutes to fully execute. https://github.com/huggingface/diffusers/actions/runs/8430950874/artifacts/1358187212 gives an idea.

Below are the tests that take more than 10 seconds to run:

slowest durations
20.46s call     tests/pipelines/pia/test_pia.py::PIAPipelineFastTests::test_inference_batch_single_identical
20.17s call     tests/pipelines/animatediff/test_animatediff.py::AnimateDiffPipelineFastTests::test_inference_batch_single_identical
19.71s call     tests/pipelines/animatediff/test_animatediff.py::AnimateDiffPipelineFastTests::test_vae_slicing
19.43s call     tests/pipelines/stable_diffusion_2/test_stable_diffusion_attend_and_excite.py::StableDiffusionAttendAndExcitePipelineFastTests::test_karras_schedulers_shape
14.49s call     tests/pipelines/pia/test_pia.py::PIAPipelineFastTests::test_dict_tuple_outputs_equivalent
14.37s call     tests/pipelines/pia/test_pia.py::PIAPipelineFastTests::test_ip_adapter_single
14.28s call     tests/pipelines/pia/test_pia.py::PIAPipelineFastTests::test_ip_adapter_multi
13.95s call     tests/pipelines/animatediff/test_animatediff.py::AnimateDiffPipelineFastTests::test_save_load_local
13.90s call     tests/pipelines/pia/test_pia.py::PIAPipelineFastTests::test_save_load_optional_components
13.55s call     tests/pipelines/animatediff/test_animatediff.py::AnimateDiffPipelineFastTests::test_dict_tuple_outputs_equivalent
13.50s call     tests/pipelines/pia/test_pia.py::PIAPipelineFastTests::test_save_load_local
13.25s call     tests/pipelines/animatediff/test_animatediff.py::AnimateDiffPipelineFastTests::test_save_load_optional_components
10.60s call     tests/pipelines/audioldm2/test_audioldm2.py::AudioLDM2PipelineFastTests::test_audioldm2_num_waveforms_per_prompt
10.55s call     tests/pipelines/stable_diffusion_xl/test_stable_diffusion_xl_adapter.py::StableDiffusionXLAdapterPipelineFastTests::test_StableDiffusionMixin_component

IMO we should try to optimize them a little. This PR is a first attempt at that. I tried a lot but couldn't meaningfully optimize the other tests that fall under this category. LMK if you folks have any suggestions.

@sayakpaul sayakpaul requested review from DN6 and yiyixuxu March 26, 2024 10:19
@yiyixuxu
Collaborator

yiyixuxu commented Mar 26, 2024

thanks for starting this! @sayakpaul

For the ip_adapter tests, I wonder if, instead of getting a baseline output by running the pipeline without the IP-Adapter,

output_without_adapter = pipe(**inputs)[0]

maybe we can skip this and just compare the output against a static slice.

The same can be applied to other tests, e.g. test_dict_tuple_outputs_equivalent and test_inference_batch_single_identical.
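
A minimal sketch of the static-slice idea (the slice values here are placeholders, and pipe / inputs come from the test's dummy-component setup, so this is an illustration rather than actual reference numbers):

import numpy as np

# hard-coded reference slice, generated once on CPU (placeholder values here)
expected_slice = np.array([0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5])

# single pipeline run; no second baseline run without the IP-Adapter
output = pipe(**inputs)[0]

# compare only a small corner slice of the output against the stored values
output_slice = output[0, -3:, -3:, -1].flatten()
assert np.abs(output_slice - expected_slice).max() < 1e-3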

@sayakpaul
Member Author

Love the idea, @yiyixuxu. Let me work on this in this PR itself. Thank you!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@sayakpaul
Member Author

@yiyixuxu I made some changes to check against static slices for "test_ip_adapter_single". If the plan looks okay to you, I will do the same for "test_ip_adapter_multi". And then in a future PR, I will tackle the following

same can be applied in other tests e.g. test_dict_tuple_outputs_equivalent, test_inference_batch_single_identical

I think this will be easier for the reviewers too.

Have left some comments in line to explain some thoughts.

LMK.

Comment on lines +115 to +122
if max_torch_print:
    # print full tensors instead of torch's truncated summary
    torch.set_printoptions(threshold=10_000)

test_name = os.environ.get("PYTEST_CURRENT_TEST")
if not torch.is_tensor(tensor):
    tensor = torch.from_numpy(tensor)
if limit_to_slices:
    # keep only the corner slice that the tests compare against
    tensor = tensor[0, -3:, -3:, -1]
Member Author

Convenience options. They are not harmful.

Comment on lines +134 to +136
def test_ip_adapter_single(self):
    expected_pipe_slice = None
    if torch_device == "cpu":
Member Author

We make use of torch_device, which can change based on the environment. The slices below were obtained on Intel CPUs, which is our default mode and what the PR tests run on too.

For accelerators, the tests will run much faster anyway (even on MPS). So, I think this is the way to go here. But I welcome any other ideas too.
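
Concretely, the pattern looks like this (a sketch with placeholder slice values; each pipeline test records its own CPU slice, and the usual test imports such as numpy as np and torch_device from diffusers.utils.testing_utils are assumed):

def test_ip_adapter_single(self):
    expected_pipe_slice = None
    if torch_device == "cpu":
        # slice recorded once on an Intel CPU; placeholder values shown here
        expected_pipe_slice = np.array([0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5])
    # on accelerators expected_pipe_slice stays None and the base test keeps
    # using its original baseline comparison
    return super().test_ip_adapter_single(expected_pipe_slice=expected_pipe_slice)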

@DN6
Collaborator

DN6 commented Mar 27, 2024

The block output sizes in the models can also be reduced. A lot of them use 32 and 64; these can be reduced to 4 and 8.
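
For example, something along these lines (a hypothetical dummy UNet config, not copied from an actual test; the norm and attention settings just have to stay compatible with the smaller widths):

from diffusers import UNet2DConditionModel

# hypothetical dummy UNet: shrinking block_out_channels from (32, 64) to (4, 8)
# cuts the parameter count and the per-step compute of the fast tests
unet = UNet2DConditionModel(
    sample_size=8,
    in_channels=4,
    out_channels=4,
    layers_per_block=1,
    block_out_channels=(4, 8),
    norm_num_groups=2,  # must divide the smallest block width
    attention_head_dim=2,
    down_block_types=("CrossAttnDownBlock2D", "DownBlock2D"),
    up_block_types=("UpBlock2D", "CrossAttnUpBlock2D"),
    cross_attention_dim=8,
)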

@sayakpaul
Member Author

The block output sizes in the models can also be reduced. A lot of them use 32 and 64; these can be reduced to 4 and 8.

Agree. Could you take care of them in a separate PR? @DN6

@sayakpaul
Member Author

From the CI:

https://github.com/huggingface/diffusers/actions/runs/8447126617/job/23136985269?pr=7477#step:6:12714

It's passing on the CPU GCP VM I am testing this on. I can increase the tolerance to 9e-4.

@yiyixuxu
Collaborator

how come it takes 23 min now?

@sayakpaul
Member Author

how come it takes 23 min now?

Probably because of worker overload. I can confirm it takes less time locally. Even on the CI, the timing is variable; at one point I saw it take 18 minutes for a commit.

Collaborator

@yiyixuxu yiyixuxu left a comment

Very nice!
I saw tests/pipelines/pia/test_pia.py::PIAPipelineFastTests::test_ip_adapter_single go from 14 seconds to 9.97 seconds.

expected_pipe_slice = np.array(
    [0.7331, 0.5907, 0.5667, 0.6029, 0.5679, 0.5968, 0.4033, 0.4761, 0.5090]
)
return super().test_ip_adapter_single(expected_pipe_slice=expected_pipe_slice)
Collaborator

Should we pass device="cpu" to test_ip_adapter_single when using the slice? Would it help precision?
I saw a test failing just now...

Member Author

See: #7477 (comment).

expected_pipe_slice is always None when torch_device is not "cpu".

@yiyixuxu
Collaborator

cc @DN6 for a final review

we can open this task to the community too!

If the plan looks okay to you, I will do the same for "test_ip_adapter_multi". And then in a future PR, I will tackle the following

same can be applied in other tests e.g. test_dict_tuple_outputs_equivalent, test_inference_batch_single_identical

Collaborator

@DN6 DN6 left a comment

Cool with me. Additionally, if you drop the cross_attention_dim in the dummy components it should lead to a big speedup for all these tests.
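
For example (hypothetical numbers; if cross_attention_dim is reduced, the dummy text encoder's hidden size has to shrink with it so the shapes still match):

from transformers import CLIPTextConfig, CLIPTextModel

# hypothetical dummy text encoder sized for a reduced cross_attention_dim of 8
text_encoder_config = CLIPTextConfig(
    vocab_size=1000,
    hidden_size=8,  # matches the UNet's cross_attention_dim
    intermediate_size=16,
    num_hidden_layers=2,
    num_attention_heads=2,
    max_position_embeddings=77,
    projection_dim=8,
)
text_encoder = CLIPTextModel(text_encoder_config)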

@sayakpaul
Member Author

cc @DN6 for a final review

we can open this task to the community too!

If the plan looks okay to you, I will do the same for "test_ip_adapter_multi". And then in a future PR, I will tackle the following

same can be applied in other tests e.g. test_dict_tuple_outputs_equivalent, test_inference_batch_single_identical

@yiyixuxu I feel like this is the kind of thing we should tackle completely ourselves, because it would greatly improve the CI experience not just for contributors but also for us maintainers. So, I prefer to do this myself.

Cool with me. Additionally, if you drop the cross_attention_dim in the dummy components it should lead to a big speedup for all these tests.

@DN6 could you tackle this and #7477 (comment) in a PR?

@sayakpaul
Member Author

Okay, so the current failing test is different from the one I saw yesterday. The tolerance bump needed isn't that big, I believe. LMK.

@sayakpaul sayakpaul merged commit fac7616 into main Mar 29, 2024
@sayakpaul sayakpaul deleted the speed-up-fastpipelines branch March 29, 2024 08:41
noskill pushed a commit to noskill/diffusers that referenced this pull request Apr 5, 2024
* speed up test_vae_slicing in animatediff

* speed up test_karras_schedulers_shape for attend and excite.

* style.

* get the static slices out.

* specify torch print options.

* modify

* test run with controlnet

* specify kwarg

* fix: things

* not None

* flatten

* controlnet img2img

* complete controlet sd

* finish more

* finish more

* finish more

* finish more

* finish the final batch

* add cpu check for expected_pipe_slice.

* finish the rest

* remove print

* style

* fix ssd1b controlnet test

* checking ssd1b

* disable the test.

* make the test_ip_adapter_single controlnet test more robust

* fix: simple inpaint

* multi

* disable panorama

* enable again

* panorama is shaky so leave it for now

* remove print

* raise tolerance.
sayakpaul added a commit that referenced this pull request Dec 23, 2024