
Conversation


@zoeczy zoeczy commented Jul 2, 2025

Description

This PR enhances MlasConv in ONNX Runtime by introducing a thread partitioning strategy based on Batch Size (bs) and Group Count (group). This change improves multi-threading efficiency in convolution scenarios where scaling with core/thread count was previously limited.

The PR also includes updates to the bench_sconv utility to support and evaluate multi-threaded performance under the new partitioning strategy.

  • Command to run multi-threaded benchmarks under core binding: numactl -C core_num0-core_num_1 ./onnxruntime_mlas_benchmark --benchmark_filter=Teams
  • The following results demonstrate the performance improvement of the optimized MlasConv under a 4-thread configuration (the thread-pool setup for this configuration is sketched after this list):
    [benchmark results image]
    Compared to the current master implementation, the optimized version shows nearly a 3× performance improvement and scales effectively with thread count. In contrast, the master branch shows no meaningful gain as the number of threads increases, due to insufficient parallelization in the original implementation.
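
For context, the multi-threaded runs above use a fixed 4-thread pool (the bench_sconv change reviewed below wires one up). A minimal sketch of how such a pool is typically constructed for the MLAS benchmarks follows; the exact wiring in bench_sconv.cpp may differ, and the helper name MakeBenchThreadPool is illustrative only.

```cpp
#include <memory>

#include "core/platform/env.h"          // onnxruntime::Env
#include "core/platform/threadpool.h"   // onnxruntime::concurrency::ThreadPool
#include "core/util/thread_utils.h"     // OrtThreadPoolParams, CreateThreadPool

// Build a 4-thread intra-op thread pool; MLAS entry points such as MlasConv
// take this pool through their MLAS_THREADPOOL* parameter when built in ORT.
static std::unique_ptr<onnxruntime::concurrency::ThreadPool> MakeBenchThreadPool() {
  OrtThreadPoolParams tpo;
  tpo.thread_pool_size = 4;        // matches the 4-thread configuration above
  tpo.auto_set_affinity = true;    // keep workers pinned when core binding is used
  return onnxruntime::concurrency::CreateThreadPool(
      &onnxruntime::Env::Default(), tpo,
      onnxruntime::concurrency::ThreadPoolType::INTRA_OP);
}
```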

Motivation and Context

  • Previously, MlasConv exhibited minimal performance gains when increasing the number of threads or CPU cores in scenarios with small batch sizes or grouped convolutions.
  • This change introduces a finer-grained workload distribution across threads by splitting the work along the batch and group dimensions (see the sketch after this list).
  • Benchmarks using bench_sconv show a noticeable performance improvement in multi-threaded runs, especially on multi-core CPUs.
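
To make the new distribution concrete, below is a minimal, self-contained sketch of the batch-group partitioning idea; the function name and partition math are illustrative and do not correspond one-to-one to the code in convolve.cpp.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Flatten the batch and group dimensions into BatchCount * GroupCount work
// items so that small batches and grouped convolutions still expose enough
// parallelism, then hand a contiguous range of items to each thread.
void PartitionByBatchAndGroup(size_t BatchCount, size_t GroupCount,
                              int32_t TargetThreadCount) {
  const size_t BatchGroupCount = BatchCount * GroupCount;
  if (BatchGroupCount == 0) {
    return;
  }

  // Never use more workers than there are (batch, group) pairs.
  const size_t ThreadsUsed = std::min<size_t>(
      static_cast<size_t>(std::max<int32_t>(TargetThreadCount, 1)), BatchGroupCount);

  const size_t WorkPerThread = BatchGroupCount / ThreadsUsed;
  const size_t WorkExtra = BatchGroupCount % ThreadsUsed;

  for (size_t ThreadId = 0; ThreadId < ThreadsUsed; ++ThreadId) {
    // The first WorkExtra threads each take one extra work item.
    const size_t Start = ThreadId * WorkPerThread + std::min(ThreadId, WorkExtra);
    const size_t Count = WorkPerThread + (ThreadId < WorkExtra ? 1 : 0);

    for (size_t Index = Start; Index < Start + Count; ++Index) {
      const size_t Batch = Index / GroupCount;  // which input in the batch
      const size_t Group = Index % GroupCount;  // which convolution group
      (void)Batch;
      (void)Group;
      // ... expand and run the GEMM for this (batch, group) slice here ...
    }
  }
}
```

With BatchCount = 1 and GroupCount = 32, for example, this exposes 32 independent work items, so a 4-thread pool can process 8 per thread instead of leaving most cores idle.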

Related Issues

#25152

@skottmckay skottmckay requested a review from Copilot July 15, 2025 01:29
Copilot AI left a comment

Pull Request Overview

This PR introduces thread partitioning optimization for MlasConv by distributing convolution work across batch and group dimensions to improve multi-threading performance in scenarios with small batch sizes or grouped convolutions.

Key changes:

  • Implements a new threaded execution path for MlasConvAlgorithmExpandThenGemmSegmented that partitions work by batch-group pairs
  • Adds dynamic working buffer size calculation based on batch count and group count (a rough sketch follows this list)
  • Includes new threaded benchmark function to evaluate multi-threaded performance improvements
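
Based on the fragments quoted in the review comments below, the per-thread buffer sizing roughly follows the shape sketched here. MinimumPerThread stands in for the existing MLAS_CONV_WORKING_BUFFER_SIZE_PER_THREAD constant; treating it as the second argument of the std::max call is an assumption, since the quoted source line is truncated.

```cpp
#include <algorithm>
#include <cstddef>

// Sketch only: size the shared working buffer so that every worker thread
// gets a slice large enough to expand one (batch, group) output block.
size_t ComputeWorkingBufferSize(size_t OutputSize, size_t K,
                                size_t TargetThreadCount, size_t MinimumPerThread) {
  // Each thread expands at most one (batch, group) slice at a time, so its
  // slice must hold OutputSize * K elements, but never less than the
  // existing per-thread minimum.
  const size_t WorkingBufferSizePerThread = std::max(OutputSize * K, MinimumPerThread);
  return TargetThreadCount * WorkingBufferSizePerThread;
}
```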

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| onnxruntime/test/mlas/bench/bench_sconv.cpp | Adds a threaded benchmark function with a 4-thread threadpool for performance testing |
| onnxruntime/core/mlas/lib/convolve.cpp | Implements batch-group partitioning logic and optimized working buffer allocation |
Comments suppressed due to low confidence (1)

onnxruntime/core/mlas/lib/convolve.cpp:1379

  • Variable name 'WorkingBufferSizePreThread' should be 'WorkingBufferSizePerThread' (fix typo: 'Pre' -> 'Per').
            size_t WorkingBufferSizePreThread = std::max(Parameters->OutputSize * Parameters->K,


*WorkingBufferSize = TargetThreadCount * MLAS_CONV_WORKING_BUFFER_SIZE_PER_THREAD;

if(Parameters->BatchCount >1 || Parameters->GroupCount > 1){

Copilot AI Jul 15, 2025

Inconsistent indentation: line uses tab character while surrounding code uses spaces. This should be indented with spaces to match the existing code style.

Suggested change (the replacement differs only in whitespace: the leading tab is replaced with spaces)
if(Parameters->BatchCount >1 || Parameters->GroupCount > 1){
if(Parameters->BatchCount >1 || Parameters->GroupCount > 1){



*WorkingBufferSize = TargetThreadCount * MLAS_CONV_WORKING_BUFFER_SIZE_PER_THREAD;

if(Parameters->BatchCount >1 || Parameters->GroupCount > 1){

Copilot AI Jul 15, 2025

Missing space after 'if' and around operators. Should be formatted as 'if (Parameters->BatchCount > 1 || Parameters->GroupCount > 1) {' to match C++ style conventions.

Suggested change
if(Parameters->BatchCount >1 || Parameters->GroupCount > 1){
if (Parameters->BatchCount > 1 || Parameters->GroupCount > 1) {


const size_t OutputGroupSize = FilterCount * OutputSize;
const size_t FilterGroupSize = FilterCount * K;

// std::cout << "Address of WorkBlock->WorkingBuffer" << WorkBlock->WorkingBuffer << std::endl;

Copilot AI Jul 15, 2025

Debug output statement should be removed from production code. This commented-out debug line should be deleted.

Suggested change
// std::cout << "Address of WorkBlock->WorkingBuffer" << WorkBlock->WorkingBuffer << std::endl;
// Line removed.


const size_t BatchGroupCount = BatchCount * GroupCount;

int32_t TargetThreadCount = MlasGetMaximumThreadCount(ThreadPool);
// TargetThreadCount = 16;

Copilot AI Jul 15, 2025

Commented-out hardcoded thread count should be removed from production code. This appears to be leftover debugging code.

Suggested change
// TargetThreadCount = 16;



hariharans29 commented Aug 5, 2025

Could you please address Copilot's review comments?

@hariharans29 hariharans29 reopened this Aug 12, 2025
hariharans29 added a commit that referenced this pull request Oct 16, 2025
…tion opt (#26103)

### Description
This is an internal branch dupe of
#25255 + some minor
cosmetic changes to account for Copilot feedback

### Motivation and Context
Improve performance of NCHW Conv - Both grouped convolutions and batched
inputs should benefit from this change. For a detailed understanding of
perf improvement, please refer to the numbers in
#25255.

Credit to @zoeczy and team for this improvement and code change

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Edward Chen <[email protected]>

Closing as its internal dupe has been merged - thanks for the contribution!

apsonawane pushed a commit that referenced this pull request Oct 17, 2025
…tion opt (#26103)

apsonawane pushed a commit that referenced this pull request Oct 20, 2025
…tion opt (#26103)

apsonawane added a commit that referenced this pull request Oct 21, 2025
Adds the following commits to the release-1.23.2 branch for ORT 1.23.2:

- [TensorRT] Fix DDS output bug during engine update
  - PR: #26272
  - commit id: 00e85dd
- Fix shape inference failure with in-memory external data
   - PR: #26263
   - commit id: d955476
- [CUDA] replace 90a-virtual by 90-virtual for forward compatible 
  - PR: #26230
  - commit id: b58911f
- [QNN-EP] Fix logic flow bug
  - PR: #26148
  - commit id: b282379
- Internal Dupe of #25255 - [MLAS] Optimize MlasConv using thread
partition opt
  - PR: #26103
  - commit id: 7362518
- Update qMoE spec to support block quantization
  - PR: #25641
  - commit id: 7a8ffa8
- [VitisAI] add new api to VitisAI to save graph as a string
  - PR: #25602
  - commit id: 3361d72
- [Build] Lock torch, onnxscript and onnx-ir versions to latest
  - PR: #26315
  - commit id: ea69c4d

---------

Co-authored-by: Hariharan Seshadri <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Edward Chen <[email protected]>
Co-authored-by: Yateng Hong <[email protected]>
Co-authored-by: Changming Sun <[email protected]>
Co-authored-by: Dmitri Smirnov <[email protected]>
Co-authored-by: Tianlei Wu <[email protected]>
Co-authored-by: quic-calvnguy <[email protected]>
Co-authored-by: quic_calvnguy <quic_calvnguy@quic_inc.com>
Co-authored-by: yifei410 <[email protected]>
Co-authored-by: yifei <[email protected]>
JonathanC-ARM pushed a commit to JonathanC-ARM/onnxruntime that referenced this pull request Oct 24, 2025
…ead partition opt (microsoft#26103)
