[Hardware][Intel] Generate custom activation ops using torch.compile for CPU backend. #5446
Conversation
Force-pushed from bc39609 to 34c2c58.
Hi, do you face the same problem as #3985? You are explicitly separating the [...]
@youkaichao Yes, the guard logging shows different [...]
Force-pushed from 34c2c58 to b33bc09.
FYI: PyTorch has a flag to change this behavior, and we are working with the PyTorch team to make it the default, so hopefully you won't need it in the future.
Force-pushed from b33bc09 to e524e0c.
@youkaichao Thanks for the information! The flag worked. It is strange that only [...] Anyway, adding [...]
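For reference, the flag is not named in the thread; one Dynamo option that matches the description, and was later made the default by the PyTorch team, is inline_inbuilt_nn_modules. A sketch under that assumption:

```python
import torch
import torch._dynamo

# Assumption: the flag referred to above is Dynamo's
# inline_inbuilt_nn_modules. When enabled, nn.Module methods are
# traced like ordinary Python code instead of being guarded per
# module instance, so torch.compile does not recompile once for
# every activation/norm object in the model.
torch._dynamo.config.inline_inbuilt_nn_modules = True
```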
@bigPYJ1151 Thanks for the PR! QQ: Can we remove any C++ kernels after merging this PR?
vllm/worker/cpu_model_runner.py (outdated)
Why do we need a lazy import here?
For now, we only compile the CustomOps, so we import the class for type identification.
Oh actually my question was why we import these "lazily".
No special reason. I think importing the related classes at the local scope makes maintenance more convenient (for example, adding more transformations or moving the procedure elsewhere).
Thanks for the explanation. I personally think it's always good to avoid lazy imports whenever possible, but I agree that it can be a matter of personal preference. I'm ok with keeping it.
vllm/worker/cpu_model_runner.py (outdated)
Add return here for clarity?
vllm/worker/cpu_model_runner.py (outdated)
What is the purpose of the profiling run? Is the goal to invoke torch.compile for different input shapes? Or is it for measuring CPU memory usage?
Yes, to invoke torch.compile for batchsize=1 and batchsize=others.
Why are we profiling for those particular shapes? IIRC, torch.compile supports dynamic shapes unless some advanced features are used (e.g., CUDA graphs).
Because I noticed torch.compile will generate different code for batchsize=1 and batchsize=others under the dynamic mode, so we should invoke them all.
Got it. Thanks for the clarification.
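For context, a minimal sketch of that warm-up idea (the silu_and_mul helper and the shapes below are illustrative, not the PR's actual code):

```python
import torch
import torch.nn.functional as F

def silu_and_mul(x: torch.Tensor) -> torch.Tensor:
    # Native SiLU-and-mul activation as used by gated MLP layers.
    d = x.shape[-1] // 2
    return F.silu(x[..., :d]) * x[..., d:]

compiled = torch.compile(silu_and_mul, dynamic=True)

# Even in dynamic mode, Dynamo specializes size-1 dimensions, so
# batchsize=1 and batchsize>1 trigger different compiled code paths.
# Calling both once up front avoids compilation stalls while serving.
for batch_size in (1, 16):
    compiled(torch.randn(batch_size, 2 * 128))
```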
Force-pushed from 12b4884 to f26024a.
Hi @WoosukKwon, thanks for your review! I have fixed the code style and added some notes to address your comments; please check them. With this PR, the C++ activation functions can be removed.
@bigPYJ1151 Thanks for updating the PR! Left more comments.
vllm/worker/cpu_model_runner.py (outdated)
Just wondering, can we do this in CustomOp.forward_cpu instead?
I tried this, but it didn't work.
- The forward actually uses _forward_method, so we should replace _forward_method.
- If we replaced _forward_method, torch.compile will raise an error: mutable rms_norm.default is not supported with cpp_wrapper. Seems cpp_wrapper is not compatible with RMSNorm.forward_cuda.
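For reference, the replacement being described is roughly the following sketch (hypothetical; the CustomOp import path and attribute names follow vLLM's layout but are assumptions, and as noted above this route hit the cpp_wrapper error):

```python
import torch
from vllm.model_executor.custom_op import CustomOp

def compile_custom_ops(model: torch.nn.Module) -> None:
    # Swap each CustomOp's dispatch target (_forward_method) for a
    # compiled version of its native PyTorch implementation.
    for module in model.modules():
        if isinstance(module, CustomOp):
            module._forward_method = torch.compile(
                module.forward_native, dynamic=True)
```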
I see. This doesn't look aesthetically good to me, but I don't have an alternative solution... 😞
@youkaichao Could you please take a look at this part of the code if you have time? Just wondering if you have any suggestions, as the code doesn't look ideal to me.
Force-pushed from 9bc23c2 to 1aaccff.
@bigPYJ1151 shall we close this PR as it's been superseded by #7110?
Generate custom activation ops using torch.compile for CPU backend.

Main changes to vLLM:
- Add _forward_native_impl to each custom op to avoid recompilation caused by tracing self.

For vicuna-7b-v1.5, there is no significant regression:
For gpt-j-6b, vectorized math functions provide some improvement:
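To illustrate the idea behind that change, here is a hypothetical sketch (SiluAndMul and the exact names are illustrative, not the PR's code): keeping the math in a free function that never sees self lets torch.compile cache graphs per shape rather than per module instance.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def _silu_and_mul_native_impl(x: torch.Tensor) -> torch.Tensor:
    # Pure-tensor implementation: no reference to `self`, so tracing
    # it does not install per-instance guards.
    d = x.shape[-1] // 2
    return F.silu(x[..., :d]) * x[..., d:]

class SiluAndMul(nn.Module):
    def __init__(self, compile_native: bool = False) -> None:
        super().__init__()
        # Compile the shared free function once; every SiluAndMul
        # instance then reuses the same compiled artifact.
        self._impl = (torch.compile(_silu_and_mul_native_impl, dynamic=True)
                      if compile_native else _silu_and_mul_native_impl)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self._impl(x)
```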
PR Checklist
Thank you for your contribution to vLLM! Before submitting the pull request, please ensure the PR meets the following criteria. This helps vLLM maintain the code quality and improve the efficiency of the review process.
PR Title and Classification
Only specific types of PRs will be reviewed. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:
- [Bugfix] for bug fixes.
- [CI/Build] for build or continuous integration improvements.
- [Doc] for documentation fixes and improvements.
- [Model] for adding a new model or improving an existing model. Model name should appear in the title.
- [Frontend] for changes on the vLLM frontend (e.g., OpenAI API server, LLM class, etc.)
- [Kernel] for changes affecting CUDA kernels or other compute kernels.
- [Core] for changes in the core vLLM logic (e.g., LLMEngine, AsyncLLMEngine, Scheduler, etc.)
- [Hardware][Vendor] for hardware-specific changes. Vendor name should appear in the prefix (e.g., [Hardware][AMD]).
- [Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.
Code Quality
The PR needs to meet the following code quality standards:
- Please use format.sh to format your code.
- Please add documentation to docs/source/ if the PR modifies the user-facing behaviors of vLLM. It helps vLLM users understand and utilize the new features or changes.

Notes for Large Changes
Please keep the changes as concise as possible. For major architectural changes (>500 LOC excluding kernel/data/config/test), we would expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag it with rfc-required and might not go through the PR.

What to Expect for the Reviews
The goal of the vLLM team is to be a transparent reviewing machine. We would like to make the review process transparent and efficient and make sure no contributor feels confused or frustrated. However, the vLLM team is small, so we need to prioritize some PRs over others. Here is what you can expect from the review process:
- The reviewer will put an action-required label on the PR if there are changes required. The contributor should address the comments and ping the reviewer to re-review the PR.

Thank You
Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM. Your contributions make vLLM a great tool for everyone!