[Hardware][Intel] Generate custom activation ops using torch.compile for CPU backend. #5446
Conversation
Force-pushed from bc39609 to 34c2c58.
Hi, do you face the same problem as #3985? You are explicitly separating the [...]
@youkaichao Yes, the guard logging shows different [...]
Force-pushed from 34c2c58 to b33bc09.
FYI: PyTorch has a flag to change this behavior, and we are working with the PyTorch team to make it the default, so hopefully you won't need it in the future.
Force-pushed from b33bc09 to e524e0c.
@youkaichao Thanks for the information! The flag worked. It is strange that only [...] Anyway, adding [...]
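For reference, the flag is not named in the thread; one Dynamo option that matches the description, and was later made the default by the PyTorch team, is inline_inbuilt_nn_modules. A sketch under that assumption:

```python
import torch
import torch._dynamo

# Assumption: the flag referred to above is Dynamo's
# inline_inbuilt_nn_modules. When enabled, nn.Module methods are
# traced like ordinary Python code instead of being guarded per
# module instance, so torch.compile does not recompile once for
# every activation/norm object in the model.
torch._dynamo.config.inline_inbuilt_nn_modules = True
```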
@bigPYJ1151 Thanks for the PR! QQ: Can we remove any C++ kernels after merging this PR?
vllm/worker/cpu_model_runner.py (outdated)
Why do we need a lazy import here?
For now, we only compile the CustomOps, so we import the class for type identification.
Oh actually my question was why we import these "lazily".
No special reason. I think importing the related classes at the local scope makes maintenance more convenient (for example, adding more transformations or moving the procedure elsewhere).
Thanks for the explanation. I personally think it's always good to avoid lazy imports whenever possible, but I agree that it can be a matter of personal preference. I'm ok with keeping it.
vllm/worker/cpu_model_runner.py (outdated)
Add return here for clarity?
vllm/worker/cpu_model_runner.py (outdated)
What is the purpose of the profiling run? Is the goal to invoke torch.compile for different input shapes? Or is it for measuring CPU memory usage?
Yes, to invoke torch.compile for batchsize=1 and batchsize=others.
Why are we profiling for those particular shapes? IIRC, torch.compile supports dynamic shapes unless some advanced features are used (e.g., CUDA graphs).
Because I noticed torch.compile will generate different code for batchsize=1 and batchsize=others under the dynamic mode, so we should invoke them all.
Got it. Thanks for the clarification.
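For context, a minimal sketch of that warm-up idea (the silu_and_mul helper and the shapes below are illustrative, not the PR's actual code):

```python
import torch
import torch.nn.functional as F

def silu_and_mul(x: torch.Tensor) -> torch.Tensor:
    # Native SiLU-and-mul activation as used by gated MLP layers.
    d = x.shape[-1] // 2
    return F.silu(x[..., :d]) * x[..., d:]

compiled = torch.compile(silu_and_mul, dynamic=True)

# Even in dynamic mode, Dynamo specializes size-1 dimensions, so
# batchsize=1 and batchsize>1 trigger different compiled code paths.
# Calling both once up front avoids compilation stalls while serving.
for batch_size in (1, 16):
    compiled(torch.randn(batch_size, 2 * 128))
```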
Force-pushed from 12b4884 to f26024a.
Hi @WoosukKwon, thanks for your review! I have fixed the code style and added some notes to address your comments; please check them. With this PR, the C++ activation functions can be removed.
@bigPYJ1151 Thanks for updating the PR! Left more comments.
vllm/worker/cpu_model_runner.py (outdated)
Just wondering, can we do this in CustomOp.forward_cpu instead?
I tried this, but it didn't work.
- The forward actually uses _forward_method, so we should replace _forward_method.
- If we replaced _forward_method, torch.compile will raise an error: mutable rms_norm.default is not supported with cpp_wrapper. Seems cpp_wrapper is not compatible with RMSNorm.forward_cuda.
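For reference, the replacement being described is roughly the following sketch (hypothetical; the CustomOp import path and attribute names follow vLLM's layout but are assumptions, and as noted above this route hit the cpp_wrapper error):

```python
import torch
from vllm.model_executor.custom_op import CustomOp

def compile_custom_ops(model: torch.nn.Module) -> None:
    # Swap each CustomOp's dispatch target (_forward_method) for a
    # compiled version of its native PyTorch implementation.
    for module in model.modules():
        if isinstance(module, CustomOp):
            module._forward_method = torch.compile(
                module.forward_native, dynamic=True)
```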
I see. This doesn't look aesthetically good to me, but I don't have an alternative solution... 😞
@youkaichao Could you please take a look at this part of the code if you have time? Just wondering if you have any suggestions, as the code doesn't look ideal to me.
Force-pushed from 9bc23c2 to 1aaccff.
@bigPYJ1151 shall we close this PR as it's been superseded by #7110?
Generate custom activation ops using torch.compile for CPU backend.

Main changes to vLLM:
- Add _forward_native_impl to each custom op to avoid recompilation caused by tracing self.

For vicuna-7b-v1.5, there is no significant regression:
For gpt-j-6b, vectorized math functions provide some improvement:
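To illustrate the idea behind that change, here is a hypothetical sketch (SiluAndMul and the exact names are illustrative, not the PR's code): keeping the math in a free function that never sees self lets torch.compile cache graphs per shape rather than per module instance.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def _silu_and_mul_native_impl(x: torch.Tensor) -> torch.Tensor:
    # Pure-tensor implementation: no reference to `self`, so tracing
    # it does not install per-instance guards.
    d = x.shape[-1] // 2
    return F.silu(x[..., :d]) * x[..., d:]

class SiluAndMul(nn.Module):
    def __init__(self, compile_native: bool = False) -> None:
        super().__init__()
        # Compile the shared free function once; every SiluAndMul
        # instance then reuses the same compiled artifact.
        self._impl = (torch.compile(_silu_and_mul_native_impl, dynamic=True)
                      if compile_native else _silu_and_mul_native_impl)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self._impl(x)
```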
PR Checklist
Thank you for your contribution to vLLM! Before submitting the pull request, please ensure the PR meets the following criteria. This helps vLLM maintain the code quality and improve the efficiency of the review process.
PR Title and Classification
Only specific types of PRs will be reviewed. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:
- [Bugfix] for bug fixes.
- [CI/Build] for build or continuous integration improvements.
- [Doc] for documentation fixes and improvements.
- [Model] for adding a new model or improving an existing model. Model name should appear in the title.
- [Frontend] for changes on the vLLM frontend (e.g., OpenAI API server, LLM class, etc.)
- [Kernel] for changes affecting CUDA kernels or other compute kernels.
- [Core] for changes in the core vLLM logic (e.g., LLMEngine, AsyncLLMEngine, Scheduler, etc.)
- [Hardware][Vendor] for hardware-specific changes. Vendor name should appear in the prefix (e.g., [Hardware][AMD]).
- [Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.
Code Quality
The PR needs to meet the following code quality standards:
- Please use format.sh to format your code.
- Please add documentation to docs/source/ if the PR modifies the user-facing behaviors of vLLM. It helps vLLM users understand and utilize the new features or changes.

Notes for Large Changes
Please keep the changes as concise as possible. For major architectural changes (>500 LOC excluding kernel/data/config/test), we would expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag it with rfc-required and might not go through the PR.

What to Expect for the Reviews
The goal of the vLLM team is to be a transparent reviewing machine. We would like to make the review process transparent and efficient and make sure no contributor feels confused or frustrated. However, the vLLM team is small, so we need to prioritize some PRs over others. Here is what you can expect from the review process:
- The reviewer will put an action-required label on the PR if there are changes required. The contributor should address the comments and ping the reviewer to re-review the PR.

Thank You
Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM. Your contributions make vLLM a great tool for everyone!