
Conversation


@zoeczy zoeczy commented Jul 2, 2025

Description

This PR enhances MlasConv in ONNX Runtime by introducing a thread partitioning strategy based on Batch Size (bs) and Group Count (group). This change improves multi-threading efficiency in convolution scenarios where scaling with core/thread count was previously limited.

The PR also includes updates to the bench_sconv utility to support and evaluate multi-threaded performance under the new partitioning strategy.

  • Command to run multi-threaded benchmarks under core binding: numactl -C core_num0-core_num_1 ./onnxruntime_mlas_benchmark --benchmark_filter=Teams
  • The following results demonstrate the performance improvement of the optimized MlasConv under a 4-thread configuration (the thread-pool setup for this configuration is sketched after this list):
    [benchmark results image]
    Compared to the current master implementation, the optimized version shows nearly a 3× performance improvement and scales effectively with thread count. In contrast, the master branch shows no meaningful gain as the number of threads increases, due to insufficient parallelization in the original implementation.
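
For context, the multi-threaded runs above use a fixed 4-thread pool (the bench_sconv change reviewed below wires one up). A minimal sketch of how such a pool is typically constructed for the MLAS benchmarks follows; the exact wiring in bench_sconv.cpp may differ, and the helper name MakeBenchThreadPool is illustrative only.

```cpp
#include <memory>

#include "core/platform/env.h"          // onnxruntime::Env
#include "core/platform/threadpool.h"   // onnxruntime::concurrency::ThreadPool
#include "core/util/thread_utils.h"     // OrtThreadPoolParams, CreateThreadPool

// Build a 4-thread intra-op thread pool; MLAS entry points such as MlasConv
// take this pool through their MLAS_THREADPOOL* parameter when built in ORT.
static std::unique_ptr<onnxruntime::concurrency::ThreadPool> MakeBenchThreadPool() {
  OrtThreadPoolParams tpo;
  tpo.thread_pool_size = 4;        // matches the 4-thread configuration above
  tpo.auto_set_affinity = true;    // keep workers pinned when core binding is used
  return onnxruntime::concurrency::CreateThreadPool(
      &onnxruntime::Env::Default(), tpo,
      onnxruntime::concurrency::ThreadPoolType::INTRA_OP);
}
```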

Motivation and Context

  • Previously, MlasConv exhibited minimal performance gains when increasing the number of threads or CPU cores in scenarios with small batch sizes or grouped convolutions.
  • This change introduces a finer-grained workload distribution across threads by splitting the work along the batch and group dimensions (see the sketch after this list).
  • Benchmarks using bench_sconv show a noticeable performance improvement in multi-threaded runs, especially on multi-core CPUs.
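
To make the new distribution concrete, below is a minimal, self-contained sketch of the batch-group partitioning idea; the function name and partition math are illustrative and do not correspond one-to-one to the code in convolve.cpp.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Flatten the batch and group dimensions into BatchCount * GroupCount work
// items so that small batches and grouped convolutions still expose enough
// parallelism, then hand a contiguous range of items to each thread.
void PartitionByBatchAndGroup(size_t BatchCount, size_t GroupCount,
                              int32_t TargetThreadCount) {
  const size_t BatchGroupCount = BatchCount * GroupCount;
  if (BatchGroupCount == 0) {
    return;
  }

  // Never use more workers than there are (batch, group) pairs.
  const size_t ThreadsUsed = std::min<size_t>(
      static_cast<size_t>(std::max<int32_t>(TargetThreadCount, 1)), BatchGroupCount);

  const size_t WorkPerThread = BatchGroupCount / ThreadsUsed;
  const size_t WorkExtra = BatchGroupCount % ThreadsUsed;

  for (size_t ThreadId = 0; ThreadId < ThreadsUsed; ++ThreadId) {
    // The first WorkExtra threads each take one extra work item.
    const size_t Start = ThreadId * WorkPerThread + std::min(ThreadId, WorkExtra);
    const size_t Count = WorkPerThread + (ThreadId < WorkExtra ? 1 : 0);

    for (size_t Index = Start; Index < Start + Count; ++Index) {
      const size_t Batch = Index / GroupCount;  // which input in the batch
      const size_t Group = Index % GroupCount;  // which convolution group
      (void)Batch;
      (void)Group;
      // ... expand and run the GEMM for this (batch, group) slice here ...
    }
  }
}
```

With BatchCount = 1 and GroupCount = 32, for example, this exposes 32 independent work items, so a 4-thread pool can process 8 per thread instead of leaving most cores idle.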

Related Issues

#25152

@skottmckay skottmckay requested a review from Copilot July 15, 2025 01:29
Copilot AI left a comment

Pull Request Overview

This PR introduces thread partitioning optimization for MlasConv by distributing convolution work across batch and group dimensions to improve multi-threading performance in scenarios with small batch sizes or grouped convolutions.

Key changes:

  • Implements a new threaded execution path for MlasConvAlgorithmExpandThenGemmSegmented that partitions work by batch-group pairs
  • Adds dynamic working buffer size calculation based on batch count and group count (a rough sketch follows this list)
  • Includes new threaded benchmark function to evaluate multi-threaded performance improvements
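
Based on the fragments quoted in the review comments below, the per-thread buffer sizing roughly follows the shape sketched here. MinimumPerThread stands in for the existing MLAS_CONV_WORKING_BUFFER_SIZE_PER_THREAD constant; treating it as the second argument of the std::max call is an assumption, since the quoted source line is truncated.

```cpp
#include <algorithm>
#include <cstddef>

// Sketch only: size the shared working buffer so that every worker thread
// gets a slice large enough to expand one (batch, group) output block.
size_t ComputeWorkingBufferSize(size_t OutputSize, size_t K,
                                size_t TargetThreadCount, size_t MinimumPerThread) {
  // Each thread expands at most one (batch, group) slice at a time, so its
  // slice must hold OutputSize * K elements, but never less than the
  // existing per-thread minimum.
  const size_t WorkingBufferSizePerThread = std::max(OutputSize * K, MinimumPerThread);
  return TargetThreadCount * WorkingBufferSizePerThread;
}
```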

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| onnxruntime/test/mlas/bench/bench_sconv.cpp | Adds a threaded benchmark function with a 4-thread threadpool for performance testing |
| onnxruntime/core/mlas/lib/convolve.cpp | Implements batch-group partitioning logic and optimized working buffer allocation |
Comments suppressed due to low confidence (1)

onnxruntime/core/mlas/lib/convolve.cpp:1379

  • Variable name 'WorkingBufferSizePreThread' should be 'WorkingBufferSizePerThread' (fix typo: 'Pre' -> 'Per').
            size_t WorkingBufferSizePreThread = std::max(Parameters->OutputSize * Parameters->K,


*WorkingBufferSize = TargetThreadCount * MLAS_CONV_WORKING_BUFFER_SIZE_PER_THREAD;

if(Parameters->BatchCount >1 || Parameters->GroupCount > 1){

Copilot AI Jul 15, 2025

Inconsistent indentation: line uses tab character while surrounding code uses spaces. This should be indented with spaces to match the existing code style.

Suggested change (the replacement differs only in whitespace: the leading tab is replaced with spaces)
if(Parameters->BatchCount >1 || Parameters->GroupCount > 1){
if(Parameters->BatchCount >1 || Parameters->GroupCount > 1){



*WorkingBufferSize = TargetThreadCount * MLAS_CONV_WORKING_BUFFER_SIZE_PER_THREAD;

if(Parameters->BatchCount >1 || Parameters->GroupCount > 1){

Copilot AI Jul 15, 2025

Missing space after 'if' and around operators. Should be formatted as 'if (Parameters->BatchCount > 1 || Parameters->GroupCount > 1) {' to match C++ style conventions.

Suggested change
if(Parameters->BatchCount >1 || Parameters->GroupCount > 1){
if (Parameters->BatchCount > 1 || Parameters->GroupCount > 1) {


const size_t OutputGroupSize = FilterCount * OutputSize;
const size_t FilterGroupSize = FilterCount * K;

// std::cout << "Address of WorkBlock->WorkingBuffer" << WorkBlock->WorkingBuffer << std::endl;

Copilot AI Jul 15, 2025

Debug output statement should be removed from production code. This commented-out debug line should be deleted.

Suggested change
// std::cout << "Address of WorkBlock->WorkingBuffer" << WorkBlock->WorkingBuffer << std::endl;
// Line removed.


const size_t BatchGroupCount = BatchCount * GroupCount;

int32_t TargetThreadCount = MlasGetMaximumThreadCount(ThreadPool);
// TargetThreadCount = 16;

Copilot AI Jul 15, 2025

Commented-out hardcoded thread count should be removed from production code. This appears to be leftover debugging code.

Suggested change
// TargetThreadCount = 16;



hariharans29 commented Aug 5, 2025

Could you please address Copilot's review comments?

@hariharans29 hariharans29 reopened this Aug 12, 2025
hariharans29 added a commit that referenced this pull request Oct 16, 2025
…tion opt (#26103)

### Description
This is an internal branch dupe of
#25255 + some minor
cosmetic changes to account for Copilot feedback

### Motivation and Context
Improve performance of NCHW Conv - Both grouped convolutions and batched
inputs should benefit from this change. For a detailed understanding of
perf improvement, please refer to the numbers in
#25255.

Credit to @zoeczy and team for this improvement and code change

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Edward Chen <[email protected]>

Closing as its internal dupe has been merged - thanks for the contribution!

apsonawane pushed a commit that referenced this pull request Oct 17, 2025
…tion opt (#26103)

apsonawane pushed a commit that referenced this pull request Oct 20, 2025
…tion opt (#26103)

apsonawane added a commit that referenced this pull request Oct 21, 2025
Adds the following commits to the release-1.23.2 branch for ORT 1.23.2:

- [TensorRT] Fix DDS output bug during engine update
  - PR: #26272
  - commit id: 00e85dd
- Fix shape inference failure with in-memory external data
   - PR: #26263
   - commit id: d955476
- [CUDA] replace 90a-virtual by 90-virtual for forward compatible 
  - PR: #26230
  - commit id: b58911f
- [QNN-EP] Fix logic flow bug
  - PR: #26148
  - commit id: b282379
- Internal Dupe of #25255 - [MLAS] Optimize MlasConv using thread
partition opt
  - PR: #26103
  - commit id: 7362518
- Update qMoE spec to support block quantization
  - PR: #25641
  - commit id: 7a8ffa8
- [VitisAI] add new api to VitisAI to save graph as a string
  - PR: #25602
  - commit id: 3361d72
- [Build] Lock torch, onnxscript and onnx-ir versions to latest
  - PR: #26315
  - commit id: ea69c4d

---------

Co-authored-by: Hariharan Seshadri <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Edward Chen <[email protected]>
Co-authored-by: Yateng Hong <[email protected]>
Co-authored-by: Changming Sun <[email protected]>
Co-authored-by: Dmitri Smirnov <[email protected]>
Co-authored-by: Tianlei Wu <[email protected]>
Co-authored-by: quic-calvnguy <[email protected]>
Co-authored-by: quic_calvnguy <quic_calvnguy@quic_inc.com>
Co-authored-by: yifei410 <[email protected]>
Co-authored-by: yifei <[email protected]>
JonathanC-ARM pushed a commit to JonathanC-ARM/onnxruntime that referenced this pull request Oct 24, 2025
…ead partition opt (microsoft#26103)
