[Mlas] optimize MlasConv using thread partition opt #25255
Conversation
Pull Request Overview
This PR introduces a thread partitioning optimization for `MlasConv` that distributes convolution work across the batch and group dimensions, improving multi-threading performance in scenarios with small batch sizes or grouped convolutions.
Key changes:
- Implements a new threaded execution path for `MlasConvAlgorithmExpandThenGemmSegmented` that partitions work by batch-group pairs (sketched below)
- Adds dynamic working buffer size calculation based on batch count and group count
- Includes a new threaded benchmark function to evaluate multi-threaded performance improvements
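To make the batch-group partitioning idea concrete, here is a minimal standalone sketch, not the actual MLAS code: all names (`ProcessBatchGroup`, `ParallelConvSketch`) are hypothetical. Each (batch, group) pair becomes one work item, the thread count is capped by the number of work items, and each thread owns a private slice of one shared working buffer.

```cpp
// Illustrative sketch of batch-group thread partitioning; hypothetical names.
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Stand-in for the per-(batch, group) convolution work; the real code would
// run the im2col + GEMM path with pointers offset into input/filter/output.
void ProcessBatchGroup(size_t Batch, size_t Group, float* ThreadWorkingBuffer) {
    (void)Batch; (void)Group; (void)ThreadWorkingBuffer;
}

void ParallelConvSketch(size_t BatchCount, size_t GroupCount,
                        size_t WorkingBufferSizePerThread) {
    const size_t BatchGroupCount = BatchCount * GroupCount;

    // Never spawn more threads than there are (batch, group) work items.
    size_t ThreadCount = std::min<size_t>(std::thread::hardware_concurrency(),
                                          BatchGroupCount);
    ThreadCount = std::max<size_t>(ThreadCount, 1);

    // One contiguous working buffer, sliced per thread so threads never share.
    std::vector<float> WorkingBuffer(ThreadCount * WorkingBufferSizePerThread);

    std::vector<std::thread> Threads;
    for (size_t Tid = 0; Tid < ThreadCount; ++Tid) {
        Threads.emplace_back([=, &WorkingBuffer]() {
            float* ThreadBuffer =
                WorkingBuffer.data() + Tid * WorkingBufferSizePerThread;
            // Static partition: thread Tid handles items Tid, Tid+ThreadCount, ...
            for (size_t Item = Tid; Item < BatchGroupCount; Item += ThreadCount) {
                ProcessBatchGroup(Item / GroupCount, Item % GroupCount,
                                  ThreadBuffer);
            }
        });
    }
    for (auto& t : Threads) {
        t.join();
    }
}

int main() {
    ParallelConvSketch(/*BatchCount=*/4, /*GroupCount=*/8,
                       /*WorkingBufferSizePerThread=*/1024);
    return 0;
}
```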
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| onnxruntime/test/mlas/bench/bench_sconv.cpp | Adds threaded benchmark function with 4-thread threadpool for performance testing |
| onnxruntime/core/mlas/lib/convolve.cpp | Implements batch-group partitioning logic and optimized working buffer allocation |
Comments suppressed due to low confidence (1)
onnxruntime/core/mlas/lib/convolve.cpp:1379

Variable name `WorkingBufferSizePreThread` should be `WorkingBufferSizePerThread` (fix typo: 'Pre' -> 'Per').

```cpp
size_t WorkingBufferSizePreThread = std::max(Parameters->OutputSize * Parameters->K,
```
```cpp
*WorkingBufferSize = TargetThreadCount * MLAS_CONV_WORKING_BUFFER_SIZE_PER_THREAD;

if(Parameters->BatchCount >1 || Parameters->GroupCount > 1){
```
Copilot AI commented on Jul 15, 2025:
Inconsistent indentation: line uses tab character while surrounding code uses spaces. This should be indented with spaces to match the existing code style.
Suggested change (replace the tab indentation with spaces; the statement itself is unchanged):

```diff
-	if(Parameters->BatchCount >1 || Parameters->GroupCount > 1){
+    if(Parameters->BatchCount >1 || Parameters->GroupCount > 1){
```
Copilot AI commented on Jul 15, 2025:
Missing space after 'if' and around operators. Should be formatted as 'if (Parameters->BatchCount > 1 || Parameters->GroupCount > 1) {' to match C++ style conventions.
Suggested change:

```diff
-if(Parameters->BatchCount >1 || Parameters->GroupCount > 1){
+if (Parameters->BatchCount > 1 || Parameters->GroupCount > 1) {
```
```cpp
const size_t OutputGroupSize = FilterCount * OutputSize;
const size_t FilterGroupSize = FilterCount * K;

// std::cout << "Address of WorkBlock->WorkingBuffer" << WorkBlock->WorkingBuffer << std::endl;
```
Copilot AI commented on Jul 15, 2025:
Debug output statement should be removed from production code. This commented-out debug line should be deleted.
Suggested change (delete the line):

```diff
-// std::cout << "Address of WorkBlock->WorkingBuffer" << WorkBlock->WorkingBuffer << std::endl;
```
```cpp
const size_t BatchGroupCount = BatchCount * GroupCount;

int32_t TargetThreadCount = MlasGetMaximumThreadCount(ThreadPool);
// TargetThreadCount = 16;
```
Copilot AI commented on Jul 15, 2025:
Commented-out hardcoded thread count should be removed from production code. This appears to be leftover debugging code.
Suggested change (delete the line):

```diff
-// TargetThreadCount = 16;
```
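For readers following the review context above, the snippets imply a working-buffer sizing scheme roughly like the sketch below. This is a reconstruction for illustration only, not code copied from the PR: the clamp to `BatchGroupCount` and the fallback floor constant are assumptions, and `ComputeWorkingBufferSize` is a hypothetical name.

```cpp
// Hypothetical reconstruction of the per-thread working-buffer sizing.
#include <algorithm>
#include <cstddef>

constexpr size_t kWorkingBufferFloor = 16384;  // assumed minimum, not from the PR

size_t ComputeWorkingBufferSize(size_t BatchCount, size_t GroupCount,
                                size_t OutputSize, size_t K,
                                size_t TargetThreadCount) {
    // One work item per (batch, group) pair; more threads than items is wasted.
    const size_t BatchGroupCount = BatchCount * GroupCount;
    const size_t ThreadCount = std::min(TargetThreadCount, BatchGroupCount);

    // Each thread needs room for one im2col expansion of its current item,
    // mirroring the reviewed line: std::max(Parameters->OutputSize * Parameters->K, ...).
    const size_t WorkingBufferSizePerThread =
        std::max(OutputSize * K, kWorkingBufferFloor);

    return ThreadCount * WorkingBufferSizePerThread;
}
```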
Could you please address Copilot's review comments?
…tion opt (#26103)

### Description
This is an internal branch dupe of #25255 + some minor cosmetic changes to account for Copilot feedback.

### Motivation and Context
Improve performance of NCHW Conv - both grouped convolutions and batched inputs should benefit from this change. For a detailed understanding of the perf improvement, please refer to the numbers in #25255. Credit to @zoeczy and team for this improvement and code change.

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Edward Chen <[email protected]>
Closing as its internal dupe has been merged - thanks for the contribution!
Adds the following commits to the release-1.23.2 branch for ORT 1.23.2:
- [TensorRT] Fix DDS output bug during engine update - PR: #26272 - commit id: 00e85dd
- Fix shape inference failure with in-memory external data - PR: #26263 - commit id: d955476
- [CUDA] replace 90a-virtual by 90-virtual for forward compatible - PR: #26230 - commit id: b58911f
- [QNN-EP] Fix logic flow bug - PR: #26148 - commit id: b282379
- Internal Dupe of #25255 - [MLAS] Optimize MlasConv using thread partition opt - PR: #26103 - commit id: 7362518
- Update qMoE spec to support block quantization - PR: #25641 - commit id: 7a8ffa8
- [VitisAI] add new api to VitisAI to save graph as a string - PR: #25602 - commit id: 3361d72
- [Build] Lock torch, onnxscript and onnx-ir versions to latest - PR: #26315 - commit id: ea69c4d

Co-authored-by: Hariharan Seshadri <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Edward Chen <[email protected]>
Co-authored-by: Yateng Hong <[email protected]>
Co-authored-by: Changming Sun <[email protected]>
Co-authored-by: Dmitri Smirnov <[email protected]>
Co-authored-by: Tianlei Wu <[email protected]>
Co-authored-by: quic-calvnguy <[email protected]>
Co-authored-by: quic_calvnguy <quic_calvnguy@quic_inc.com>
Co-authored-by: yifei410 <[email protected]>
Co-authored-by: yifei <[email protected]>
Description
This PR enhances `MlasConv` in ONNX Runtime by introducing a thread partitioning strategy based on Batch Size (bs) and Group Count (group). This change improves multi-threading efficiency in convolution scenarios where scaling with core/thread count was previously limited. The PR also includes updates to the `bench_sconv` utility to support and evaluate multi-threaded performance under the new partitioning strategy.

Benchmarks were pinned to cores with `numactl -C core_num0-core_num_1 ./onnxruntime_mlas_benchmark --benchmark_filter=Teams`. Compared to the current master implementation, the optimized version exhibits nearly 3× performance improvement, showing effective scaling with thread count. In contrast, the master branch shows no meaningful performance gain when increasing the number of threads, due to insufficient parallelization in the original implementation.
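As a rough illustration of what a threaded `bench_sconv`-style benchmark registration looks like, here is a sketch using the Google Benchmark API that the MLAS benchmarks are built on. `RunConvOnce` is a hypothetical stand-in for the real `MlasConvPrepare` + `MlasConv` calls, and the argument shapes are illustrative only.

```cpp
// Sketch of a threaded convolution micro-benchmark in the style of
// bench_sconv.cpp; RunConvOnce is a placeholder for the real MlasConv call.
#include <benchmark/benchmark.h>
#include <cstddef>

void RunConvOnce(size_t BatchCount, size_t GroupCount) {
    // Placeholder for MlasConvPrepare + MlasConv driven by a thread pool.
    benchmark::DoNotOptimize(BatchCount + GroupCount);
}

static void BM_SconvThreaded(benchmark::State& state) {
    const size_t BatchCount = static_cast<size_t>(state.range(0));
    const size_t GroupCount = static_cast<size_t>(state.range(1));
    for (auto _ : state) {
        RunConvOnce(BatchCount, GroupCount);
    }
}

// Shapes where the new partitioning should help: batched and grouped convs.
BENCHMARK(BM_SconvThreaded)->Args({4, 1})->Args({1, 32})->Args({4, 32});

BENCHMARK_MAIN();
```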
Motivation and Context
`MlasConv` exhibited minimal performance gains when increasing the number of threads or CPU cores in scenarios with small batch sizes or grouped convolutions. Benchmarks with `bench_sconv` show a noticeable performance improvement in multi-threaded runs, especially on multi-core CPUs.

Related Issues
#25152