Qualcomm AI Engine Direct - Add QNN support for to_edge_transform_and_lower #9643


Merged

Conversation

shewu-quic
Collaborator

Summary:

  • Support to_edge_transform_and_lower

    • Replace capture_program with new API to_edge_transform_and_lower_to_qnn
    • Replace capture_program with to_edge_transform_and_lower_to_qnn for unit_test
    • Replace capture_program with to_edge_transform_and_lower_to_qnn for examples
    • Replace capture_program with to_edge_transform_and_lower_to_qnn for llama
  • Add QnnPassManager to manage all passes in different stages

    • Deprecate _transform in export_llama_lib in favor of qnn_pass_manager
    • Add transform_for_export_pipeline for LiftConstantScalarOperands to avoid creating temporary tensors in the operation builder. Note that this pass creates a get_attr node, which must be converted into a lifted tensor constant by lift_constant_tensor_pass. If placed in to_edge_transform_passes, it would run after lift_constant_tensor_pass, and the operation builder would then fail to retrieve the parameter via get_parameter for the get_attr node.
  • Refactor the passes

    • Fix an output dtype mismatch at runtime after building quant IO
    • Combine constant_i64_to_i32 and tensor_i64_to_i32 into i64_to_i32
    • Replace convert_to_linear pass with fixed_linear_keep_dim pass
      • Since the QNN linear op does not keep dims, squeeze and unsqueeze nodes need to be added around the linear node
    • Add TagQuantIO pass to tag io nodes to avoid inserting q/dq in qnn_preprocess
    • Add prelu, leaky_relu, linear, rms_norm into decompose_table
      • Remove recompose_prelu.py
    • Remove unused variables in insert_requantize.py and replace_index_put_input.py
  • Support aten.split_with_sizes_copy.default

  • Support leaky_relu with inplace=True
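
The stage-based pass management described above can be sketched in plain Python. This is a minimal illustration of the idea only, not the actual QnnPassManager API; the class, stage, and pass names below are hypothetical:

```python
# Minimal sketch of a stage-based pass manager, loosely mirroring the
# QnnPassManager idea described above. All names are illustrative.

class PassManager:
    """Registers graph passes per pipeline stage and runs them in order."""

    def __init__(self):
        self._stages = {}  # stage name -> ordered list of pass functions

    def register(self, stage, pass_fn):
        self._stages.setdefault(stage, []).append(pass_fn)
        return pass_fn

    def run(self, stage, graph):
        # Apply every pass registered for this stage, in registration order.
        for pass_fn in self._stages.get(stage, []):
            graph = pass_fn(graph)
        return graph


pm = PassManager()

# Ordering matters: a scalar-lifting pass must run in the export stage,
# before constant lifting happens in the to_edge stage (echoing why
# LiftConstantScalarOperands cannot live in to_edge_transform_passes).
pm.register("export", lambda g: g + ["lift_scalar_operands"])
pm.register("to_edge", lambda g: g + ["lift_constant_tensors"])
pm.register("to_edge", lambda g: g + ["i64_to_i32"])

graph = pm.run("to_edge", pm.run("export", []))
print(graph)  # ['lift_scalar_operands', 'lift_constant_tensors', 'i64_to_i32']
```

Centralizing passes in one manager keeps the stage ordering explicit, which is the main benefit over scattering transform calls across the export code.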


pytorch-bot bot commented Mar 26, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/9643

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 3732242 with merge base 1ea101e (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 26, 2025
@shewu-quic
Collaborator Author

Hi @cccclai ,
This PR is to support the API to_edge_transform_and_lower and refactor the pass management.
Could you please help take a look?

Thanks!

@cccclai
Contributor

cccclai commented Mar 26, 2025

Hi, thank you so much for adding support for to_edge_transform_and_lower; we will merge it soon. One question: it looks like there are a lot of changes in this refactor. Is there any feedback for us on how to reduce the engineering work and make your lives easier?

@shewu-quic
Collaborator Author

Hi, thank you so much for adding support for to_edge_transform_and_lower; we will merge it soon. One question: it looks like there are a lot of changes in this refactor. Is there any feedback for us on how to reduce the engineering work and make your lives easier?

Thank you for your effort. We are trying to align closely with the official API rather than relying on the wrapper API to_edge_transform_and_lower_to_qnn. Therefore, we have revisited our passes and are attempting to either remove them or move them into the QNN preprocess or QNN partitioner. I have the following points:

  1. Is there an official way to perform decomposition, such as custom decomposition, exception lists, or something similar?
    We currently have some ops that need to be decomposed into a group of nodes before partitioning so they can be delegated to QNN, such as DecomposeScaledDotProductAttention, DecomposeLinalgVectorNorm, and DecomposeAny.

  2. Changes to the graph are not allowed in qnn_partitioner.
    We would like to move some passes to qnn_partitioner to avoid the wrapper API, but there are some passes that insert or remove nodes in the graph, such as decomposition passes and i64_to_i32.

  3. Can I get the edge graph module in the new API?
    Our debugger might need to compare it with the graph module before to_backend.

  4. Is it possible to move lift_constant_tensor_pass after edge_manager.transform(transform_passes)?
    In the LiftConstantScalarOperands pass, we create a getattr node to lift scalars. However, if we apply this pass after to_edge, we will get a getattr node in qnn_partitioner, and it seems to miss the constant value in qnn_preprocess. From my investigation, lift_constant_tensor_pass in to_edge converts the get_attr node to a placeholder node and adds the constant into the state_dict in exported_program. Therefore, we need to call LiftConstantScalarOperands before that. Alternatively, is there another way to lift scalars instead of using the getattr node?
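
For point 1, a custom decomposition hook can be sketched as a plain dispatch from operator names to rewrite functions applied before partitioning. This only illustrates the concept being asked about; the op names, decompositions, and helper names below are hypothetical, not an actual partitioner API:

```python
# Illustrative pre-partition decomposition table: each entry rewrites one
# composite op into simpler ops the backend can delegate. All names are
# hypothetical; graphs are modeled as flat lists of op names for brevity.

def decompose_prelu(op):
    # prelu(x, w) == max(0, x) + w * min(0, x)
    return ["maximum", "minimum", "mul", "add"]

def decompose_rms_norm(op):
    # rms_norm(x) == x / sqrt(mean(x^2) + eps) * weight
    return ["pow", "mean", "add", "rsqrt", "mul", "mul"]

DECOMPOSE_TABLE = {
    "aten.prelu.default": decompose_prelu,
    "aten.rms_norm.default": decompose_rms_norm,
}

def decompose_graph(graph):
    """Replace each op that has a table entry with its decomposition."""
    out = []
    for op in graph:
        if op in DECOMPOSE_TABLE:
            out.extend(DECOMPOSE_TABLE[op](op))
        else:
            out.append(op)
    return out

print(decompose_graph(["aten.add.Tensor", "aten.prelu.default"]))
# ['aten.add.Tensor', 'maximum', 'minimum', 'mul', 'add']
```

An official hook of roughly this shape (an exception list plus per-op rewrites, applied before the partitioner runs) would let a backend decide per op whether to delegate the composite form or its decomposition.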

@shewu-quic
Collaborator Author

It seems there are some conflicts. I will rebase this PR ASAP.

@shewu-quic shewu-quic force-pushed the dev1/hutton/qnn-partitioner-update branch from 7b367bd to db112a0 Compare March 27, 2025 17:34
@shewu-quic shewu-quic force-pushed the dev1/hutton/qnn-partitioner-update branch from db112a0 to 3732242 Compare March 27, 2025 17:39
@shewu-quic
Collaborator Author

shewu-quic commented Mar 27, 2025

I have rebased the branch. Additionally, I tested static_llama with story llama and confirmed that we get the same results before and after this PR.

INFO:root:Results[0]:
Once upon a time, there was a little girl named Lily. She loved to play with her toys and her favorite toy was a big, red ball. One day, Lily's mom asked her to help her with the laundry. Lily was happy to help and she put all the clothes in the washing machine. 
After the clothes were washed, Lily's mom asked her to help her hang them up to dry. Lily saw a big, black iron on the counter and asked her mom what it was for. Her mom explained that it was used to make clothes smooth

@facebook-github-bot
Contributor

@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@cccclai
Contributor

cccclai commented Mar 27, 2025

Thank you for the detailed notes! We'll work on those.

@cccclai cccclai added the release notes: qualcomm Changes to the Qualcomm backend delegate label Mar 27, 2025
Contributor

@cccclai cccclai left a comment


Looks good. There are some internal errors and I'll send a forward fix.

@cccclai cccclai merged commit 2f408dd into pytorch:main Apr 2, 2025
87 of 90 checks passed
cccclai added a commit to cccclai/executorch-1 that referenced this pull request Apr 3, 2025
Summary: forward fix for pytorch#9643

Reviewed By: kirklandsign

Differential Revision: D72353830
@cccclai cccclai mentioned this pull request Apr 3, 2025
cccclai added a commit to cccclai/executorch-1 that referenced this pull request Apr 3, 2025
Summary:
Pull Request resolved: pytorch#9864

forward fix for pytorch#9643

Reviewed By: kirklandsign

Differential Revision: D72353830
@cccclai
Contributor

cccclai commented Apr 3, 2025

Hello! It looks like there are some CI failures from this PR: https://hud.pytorch.org/pytorch/executorch/commit/2f408dd79d9656c8bfb90b1e8fd990ed326ea36f. Can you take a look? These are trunk jobs (which take longer to run), so they weren't triggered right away on the PR.

cccclai added a commit to cccclai/executorch-1 that referenced this pull request Apr 4, 2025
Summary: As title, it's broken in pytorch#9643

Differential Revision: D72472098
cccclai added a commit to cccclai/executorch-1 that referenced this pull request Apr 4, 2025
Summary: As title, it's broken in pytorch#9643

Differential Revision: D72472098
cccclai added a commit to cccclai/executorch-1 that referenced this pull request Apr 4, 2025
Summary:

As title, it's broken in pytorch#9643

Differential Revision: D72472098
cccclai added a commit to cccclai/executorch-1 that referenced this pull request Apr 5, 2025
Summary:

As title, it's broken in pytorch#9643

Differential Revision: D72472098
shewu-quic added a commit to CodeLinaro/executorch that referenced this pull request Apr 7, 2025
cccclai pushed a commit that referenced this pull request Apr 7, 2025
kirklandsign pushed a commit that referenced this pull request Apr 11, 2025
kirklandsign pushed a commit that referenced this pull request Apr 11, 2025
keyprocedure pushed a commit to keyprocedure/executorch that referenced this pull request Apr 21, 2025