
Conversation

@ankitm3k

Description

Sync with 1.22 release candidate commit -
microsoft@39e585f

jambayk and others added 30 commits February 27, 2025 09:30
… testing (microsoft#23801)

Summary of changes:
- Changed openVINO test case to use --enable_generic_interface
- changed tensorRT test case to use --enable_generic_interface
- Fixed ORT builds to USE_FULL_PROTOBUF as openVINO/TensorRT requires
them
- Fixed a pre-processor macro definition that was accidentally removed when
ORT is built without an EP


Co-authored-by: Karim Vadsariya <[email protected]>
…icrosoft#23825)

### Description
Increase [npm package
pipeline](https://aiinfra.visualstudio.com/Lotus/_build?definitionId=1080&_a=summary)
ReactNative_CI_iOS timeout to 120 mins



### Description

In GemmBatch, the target matrix is cut into blocks that are dispatched to
multiple threads for intra-op parallelism.

Currently the block size is hard-coded to 16. If the CPU has more than 16
cores, the cores are not fully utilized in one op.

This change removes the cap on the number of blocks in various MatMul implementations.
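
As a rough illustration of the idea (a hypothetical helper, not the actual MLAS code), the dispatch can scale the number of blocks with the available thread count instead of a fixed constant:

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical sketch (not the actual MLAS code): pick how many blocks a
// GemmBatch-style dispatcher hands out. Capping at 16 leaves cores idle on
// wide machines; scaling with the thread count keeps them busy.
std::size_t ChooseBlockCount(std::size_t rows, std::size_t num_threads) {
  // Old behavior, for comparison: at most 16 blocks.
  // std::size_t blocks = std::min<std::size_t>(rows, 16);
  std::size_t blocks = std::min(rows, num_threads);  // scale with threads
  return std::max<std::size_t>(blocks, 1);
}
```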

__Benchmark results__

Model:
llmlingua-2-bert-base-multilingual-cased-meetingbank--add-force-token-100--max-seq-len-512-CPU-INT8.onnx
Setup: 96-core x86 Linux

Before: 
Setting intra_op_num_threads to 64
Overriding dimension with name, batch_size, to 3
Session creation time cost: 0.485097 s
First inference time cost: 356 ms
Total inference time cost: 17.731 s
Total inference requests: 50
__Average inference time cost: 354.619 ms__
Total inference run time: 17.7312 s
Number of inferences per second: 2.81989
Avg CPU usage: 65 %
Peak working set size: 542265344 bytes

After:

Setting intra_op_num_threads to 32
Overriding dimension with name, batch_size, to 3
Session creation time cost: 0.523394 s
First inference time cost: 316 ms
Total inference time cost: 12.2739 s
Total inference requests: 50
__Average inference time cost: 245.478 ms__
Total inference run time: 12.2741 s
Number of inferences per second: 4.07362
Avg CPU usage: 33 %
Peak working set size: 611241984 bytes


Setting intra_op_num_threads to 64
Overriding dimension with name, batch_size, to 3
Session creation time cost: 0.497698 s
First inference time cost: 289 ms
Total inference time cost: 9.49205 s
Total inference requests: 50
__Average inference time cost: 189.841 ms__
Total inference run time: 9.49226 s
Number of inferences per second: 5.26745
Avg CPU usage: 65 %
Peak working set size: 548470784 bytes

### Motivation and Context
This issue is reported by M365 research team.
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
This change fixes GQA for Flash Attention on Nvidia GPUs. The root cause
appears to be the
`k_start + capped_sg_id < seq_causal_length`
check. This is either because
a. seq_causal_length varies per lane, so the check becomes non-uniform
control flow, which interacts badly with subgroupShuffle,
or
b. the check itself is incorrect and is wiping out values of v based on
the source lane's seq_causal_length, whereas the values of v actually
need to be causal with respect to the lane that multiplies them with qkt.

qkt is already causal because earlier values of qk for out-of-bounds k
are set to min_value, and exp(< -4) is effectively 0.

This fix works by removing that causal check and relying on qk having been
wiped out earlier. Documentation of the causality behavior for GQA is
missing, so it is not possible to determine which of these reasons is the
true one.

Prior to this, prompts with sequence length greater than 16 but less than
32, or of about 1k, would break with Phi 4, but smaller prompts would work.
Tested on Intel Alderlake and Nvidia 4070.
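
As a plain-C++ sketch of the causality scheme described above (not the actual WGSL shader), wiping out-of-range qk scores before softmax is enough to zero out the matching v contributions without a separate check on the v side:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Plain C++ sketch, not the actual WGSL shader: out-of-bounds qk scores are
// set to a very negative value before softmax, so their exp() underflows to 0
// and the corresponding v values drop out of the weighted sum automatically.
float CausalAttentionRow(const std::vector<float>& qk,  // one row of q*k^T
                         const std::vector<float>& v,   // matching v values
                         std::size_t seq_causal_length) {
  std::vector<float> scores(qk);
  for (std::size_t k = seq_causal_length; k < scores.size(); ++k) {
    scores[k] = std::numeric_limits<float>::lowest();  // masked position
  }
  const float max_score = *std::max_element(scores.begin(), scores.end());
  std::vector<float> weights(scores.size());
  float sum = 0.f;
  for (std::size_t k = 0; k < scores.size(); ++k) {
    weights[k] = std::exp(scores[k] - max_score);  // 0 for masked positions
    sum += weights[k];
  }
  float out = 0.f;
  for (std::size_t k = 0; k < scores.size(); ++k) {
    out += (weights[k] / sum) * v[k];
  }
  return out;
}
```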
### Description
Supports creating a model programmatically using the ORT C or C++ API. 
Supports augmenting an existing model to add nodes.

### Description
Fixed a typo in function names related to the Upsample CUDA kernel.
Changed incorrect spelling Upample to Upsample across relevant
functions.


### Motivation and Context
This change is necessary to maintain consistency and prevent potential
confusion caused by incorrect function names.
)

### Description
Fix typos in csharp/src/Microsoft.ML.OnnxRuntime/


…oft#23788)

Change the logic to generate the default ep context file name

### Description
Applies to all EPs: replace the .onnx extension with _ctx.onnx instead of directly appending the extra string _ctx.onnx to the existing model path. In QNN EP, also make the context binary .bin file name shorter by removing QNNExecutionProvider_ from the file name.
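
A minimal sketch of that renaming rule (a hypothetical helper, not the actual EP code): "model.onnx" becomes "model_ctx.onnx" rather than "model.onnx_ctx.onnx".

```cpp
#include <string>

// Hypothetical helper illustrating the rule above (not the actual EP code).
std::string GetDefaultEpContextPath(const std::string& model_path) {
  const std::string ext = ".onnx";
  if (model_path.size() >= ext.size() &&
      model_path.compare(model_path.size() - ext.size(), ext.size(), ext) == 0) {
    // Replace the ".onnx" extension with "_ctx.onnx".
    return model_path.substr(0, model_path.size() - ext.size()) + "_ctx.onnx";
  }
  // Fall back to appending when there is no ".onnx" suffix.
  return model_path + "_ctx.onnx";
}
```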
### Description
Make
[QNN_Nuget_Windows](https://aiinfra.visualstudio.com/Lotus/_build?definitionId=1234)
1ES compliant



…osoft#23827)

### Description

Resolve microsoft#23817



### Motivation and Context
This PR fixes the errors in the ConvTranspose optimization and adds
tests to ensure the correctness of the implementation.
### Description
Fix a warning with std::move usage



### Motivation and Context
Possibly allows building without the --compile_no_warning_as_error flag,
and keeps ORT compatible with the latest GSL library. Without this fix we
get:

```
onnxruntime\core\providers\cpu\controlflow\loop.cc(247): error C4996: 'gsl::byte': Use std::byte instead.
```
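
A minimal sketch of the kind of change that error asks for (an assumed example, not the exact loop.cc code): use std::byte instead of the deprecated gsl::byte when viewing raw memory.

```cpp
#include <cstddef>
#include <gsl/gsl>

// Assumed example of the gsl::byte -> std::byte migration (not the exact
// loop.cc change): view raw memory through a span of std::byte.
void ZeroRawBytes(void* data, std::size_t size_in_bytes) {
  gsl::span<std::byte> bytes(static_cast<std::byte*>(data), size_in_bytes);
  for (std::byte& b : bytes) {
    b = std::byte{0};
  }
}
```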
### Description

#### Background

From code search, the following EPs use
`onnxruntime::GetCpuPreferredNodes()` in their `GetCapabilities()`
methods:
- CANN
- CUDA
- DML
- JS
- ROCM
- WebGPU

However, the source file that implements
`onnxruntime::GetCpuPreferredNodes()` is excluded when minimal build is
ON:
https://github.com/microsoft/onnxruntime/blob/6df0973e58ba5399fcaa98686f70ed9a9e59aaef/cmake/onnxruntime_framework.cmake#L38-L42

This means that none of the EPs mentioned above can compile with a
minimal build.

#### Solution

The excluded file `core/framework/fallback_cpu_capability.cc` cannot
build in minimal build because some of its dependencies are not included
in the minimal build. However, in extended minimal build mode, all
dependencies are available.

This PR loosens the restriction and allows this file to be compiled in an
extended minimal build. After this change, those EPs are able to compile
in an extended minimal build.
### Description

Add `dawn` to ThirdPartyNotices.
…3702)

### Description
Enable QNN EP weight sharing generation using the public API instead of internal interfaces, so that users can integrate it into their own toolchain. The change shares the QnnBackendManager across ORT sessions if ep.share_ep_contexts is enabled. There is also an extra option to end the sharing, so that we know when to remove the shared QnnBackendManager from the singleton.

Change the tool name from onnxruntime_qnn_ctx_gen to ep_weight_sharing_ctx_gen, so that it can be shared by other EPs.
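
A sketch of how a toolchain could opt in through the public API, assuming only the session config key named above ("ep.share_ep_contexts"); appending the QNN EP itself and the option that ends the sharing are omitted since their exact spellings are not given here.

```cpp
#include <onnxruntime_cxx_api.h>

// Sketch only: opts two sessions into EP context sharing via the public API.
// "ep.share_ep_contexts" is the key named in this change; QNN provider
// options and the "end share" option are omitted here.
int main() {
  Ort::Env env{ORT_LOGGING_LEVEL_WARNING, "ep_weight_sharing"};

  Ort::SessionOptions so1;
  so1.AddConfigEntry("ep.share_ep_contexts", "1");  // share the QnnBackendManager
  Ort::Session session1{env, ORT_TSTR("model_1_ctx.onnx"), so1};

  Ort::SessionOptions so2;
  so2.AddConfigEntry("ep.share_ep_contexts", "1");  // reuse the shared backend
  Ort::Session session2{env, ORT_TSTR("model_2_ctx.onnx"), so2};
  return 0;
}
```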
…microsoft#23892)

### Description
When using the enable_htp_shared_memory feature, we see that the address
of the buffer passed to rpcmem_free is incorrect, so the RPC buffers are
not freed, leading to memory exhaustion.

### Motivation and Context
When using the enable_htp_shared_memory_allocator feature for QNN in
GenAI extensions, it leads to inference failures during the second
prompt. Because GenAI memory demands are higher, the issue surfaces sooner
in GenAI use cases.

Co-authored-by: Ashish Garg <[email protected]>
The build option --enable_pix_capture is broken. This fixes the problem.

---------

Co-authored-by: wp <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
…t#23887)

### Description
* Add dynamo export for Sam2 image encoder
* Verify fp32 onnx model with CPU EP (to avoid error message from TRT
EP).
* Update benchmark script:
  - output ORT profiling
  - output torch compiled code and unique kernel name for compiled kernel
  - add an option for nightly package installation
  - uninstall existing ort packages before installing

The node metadata of the dynamo-exported model can help map nodes in the
ONNX model back to the PyTorch modeling script. Currently, graph
optimization is not done on the dynamo-exported model, so it is
experimental for now.

### Motivation and Context

To support profiling of torch compiled CUDA kernel.
### Description
This PR improves the workaround for bundlers in onnxruntime-web.
Specifically, the following changes have been made:

- Use [this
workaround](xenova@9c50aa2)
as suggested by @xenova in
huggingface/transformers.js#1161 (comment)

- Use `url > "file:" && url < "file;"` instead of
`url.startsWith("file:")` to allow minifiers to remove dead code
correctly.

This change makes it possible to remove unnecessary dependencies on the
file parsed from `new URL("ort.bundle.min.js", import.meta.url)` in Vite,
and to optimize code like `if("file://filepath.js".startsWith("file:"))
{do_sth1(); } else {do_sth2();}` into `do_sth1()` for webpack/terser
usage.

Resolves huggingface/transformers.js#1161
)

### Description
This change restores the MatMulNBits workgroup size from (8, 8, 1) back
to (16, 8, 1) to resolve a performance regression observed on Intel
iGPUs during token generation (M=1).

### Motivation and Context
As above.

Signed-off-by: Jianhui Dai <[email protected]>
…icrosoft#23894)

Float16Array is now shipping and the WebNN Chromium implementation has
accepted it. We should allow it in the WebNN EP as well.
…icrosoft#23888)

### Description
CMake 4.0 release candidate 2.0 is available, and it cannot compile all
of OnnxRuntime out-of-the-box. There are portions of the OnnxRuntime
codebase that specify a `cmake_minimum_required` version of 3.0, and
CMake 4.0 has removed support for compatibility with CMake < 3.5 - the
following error is reported:

```
CMake Error at winml_sdk_helpers.cmake:4 (cmake_minimum_required):
  Compatibility with CMake < 3.5 has been removed from CMake.

  Update the VERSION argument <min> value.  Or, use the <min>...<max> syntax
  to tell CMake that the project requires at least <min> but has been updated
  to work with policies introduced by <max> or earlier.

  Or, add -DCMAKE_POLICY_VERSION_MINIMUM=3.5 to try configuring anyway.
```

Since CMake 3.5 appears to have shipped in 2016, it seems reasonable to
set that as a minimum version to fix the error. The root CMakeLists.txt
does ask for a minimum version of 3.28, so we could snap to that, but
I'm still ramping up on the build, so wanted to propose a minimally
sufficient fix.

### Motivation and Context
Being able to build with the latest CMake - when it ships - reduces the
barrier to entry to building OnnxRuntime, and allows OnnxRuntime to
leverage the latest and greatest tooling.
…icrosoft#23898)

This PR removes the deprecated subgroups-f16 from the WebGPU native and JS
EPs, and also removes the unused deviceInfo in the WebGPU JS EP.
fs-eire and others added 26 commits April 7, 2025 06:37
…microsoft#24315)

### Description

This PR is one of a series of changes for optimization of Dawn API
usage. See microsoft#24281

Reduce the calls to wgpuBufferAddRef and wgpuBufferRelease (part 1).
### Description
SessionOptions now has a new property - load_cancelation_flag.
If set to true, this flag causes model loading and initialization to be
aborted, which is useful for huge models.

### Motivation and Context
Some users requested the ability to abandon model loading and
initialization if it exceeds certain time limits.
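
The description names the property but not its accessor, so here is only a generic sketch of the cancellation-flag pattern it describes (not the actual ORT API): a watchdog thread flips a flag that the loader checks periodically.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Generic sketch of the cancellation-flag idea (not the actual ORT API):
// the loader polls an atomic flag and aborts when another thread sets it.
std::atomic<bool> load_cancelation_flag{false};

bool LoadHugeModel() {
  for (int step = 0; step < 1000; ++step) {           // e.g. per-initializer work
    if (load_cancelation_flag.load()) return false;   // abort load/initialization
    std::this_thread::sleep_for(std::chrono::milliseconds(5));
  }
  return true;
}

int main() {
  std::thread watchdog([] {
    std::this_thread::sleep_for(std::chrono::seconds(1));  // the time limit
    load_cancelation_flag.store(true);
  });
  const bool completed = LoadHugeModel();
  watchdog.join();
  return completed ? 0 : 1;
}
```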
)


Co-authored-by: Yulong Wang <[email protected]>
…24327)

### Description
Exclude WebGPU from Conv3D tests 



### Motivation and Context
Fix failing tests in packaging pipelines.
### Description
[VitisAI EP] export InferShapes to VitisAIEP

---------

Co-authored-by: Wang Chunye <[email protected]>
Co-authored-by: Zhenze <[email protected]>
This PR adds flash decoding support to optimize the generation speed when
the total sequence length is large. Previously, when the total sequence
length was big enough, the softmax and softmax * v shaders became the
bottleneck since they only use limited gpu cores. In this change, we add
flash decoding support to split the present key/value based on the total
sequence length, then do a reduce to get the final result.

On NV RTX 2000 Ada, the TPS becomes 41.4 from 34.4 for 1K tokens for
phi4 static kv cache.
On Meteor Lake, the TPS becomes 19 from 16 for 1K tokens for phi4 static
kv cache.

Side effect of this PR:
It adds two extra buffers to store 1) metadata (max and exp_sum in each
split), and 2) the split qkv results with shape [B, N, split_k, H], which
increases the memory size.

TODO:
Ideally, there should only be two shaders, which would also reduce the
intermediate memory. The computeQKT could be merged into the split shader
and the final softmax adjustment done in the reduce shader. However, I hit
an issue where the result becomes garbage when the total sequence length
exceeds a certain value. Since I can't resolve it in a short time, this is
left as a TODO to fix in the future.
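
As a plain-C++ sketch of the reduce step described above (not the actual WGSL shaders): each split stores its local max and exp-sum (the metadata) plus an unnormalized partial output, and the reduce rescales them to a shared max before combining.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Plain C++ sketch of the split-k reduce (not the actual WGSL shaders).
struct SplitResult {
  float max_score;    // max qk score inside this split
  float exp_sum;      // sum of exp(score - max_score) inside this split
  float partial_out;  // sum of exp(score - max_score) * v inside this split
};

float ReduceSplits(const std::vector<SplitResult>& splits) {
  float global_max = splits.front().max_score;
  for (const SplitResult& s : splits) global_max = std::max(global_max, s.max_score);

  float total_exp_sum = 0.f;
  float total_out = 0.f;
  for (const SplitResult& s : splits) {
    const float scale = std::exp(s.max_score - global_max);  // rescale to the global max
    total_exp_sum += scale * s.exp_sum;
    total_out += scale * s.partial_out;
  }
  return total_out / total_exp_sum;  // final softmax(qk) * v for this output element
}
```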
### Description
Use wasm_f32x4_relaxed_max and wasm_f32x4_relaxed_min in WASM relaxed
SIMD build.


### Motivation and Context
This PR replaces wasm_f32x4_min/max with the relaxed SIMD counterparts
wasm_f32x4_relaxed_min/max in WASM relaxed SIMD build.

According to [relaxed SIMD
proposal](https://github.com/WebAssembly/relaxed-simd/blob/main/proposals/relaxed-simd/Overview.md#relaxed-min-and-max),
the wasm_f32x4_relaxed_min/max allow implementation-defined behavior on
NaN propagation and -0.0 vs +0.0. This enables WASM runtimes to use
minps/maxps on x64 platforms and improves the performance.

e.g. for wasm_f32x4_max -> wasm_f32x4_relaxed_max
wasm_f32x4_max: [implementation in
V8](https://source.chromium.org/chromium/chromium/src/+/main:v8/src/codegen/shared-ia32-x64/macro-assembler-shared-ia32-x64.cc;l=231)
wasm_f32x4_relaxed_max: maxps

This change affects kernel functions that rely on MlasMaximumFloat32x4
and MlasMinimumFloat32x4, including various activations and reduced
min/max kernels. In the mlas micro bench "COMPUTESOFTMAXINPLACE...", this
change provides a performance improvement of up to 60% on x64 devices.
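
A sketch of what that substitution can look like for the MLAS WASM path (the guard macro and the local typedef are assumptions; only the intrinsic and helper names come from the text above):

```cpp
#include <wasm_simd128.h>

// Sketch only: the __wasm_relaxed_simd__ guard and the local typedef are
// assumptions; MlasMaximumFloat32x4 is the helper named above. Relaxed
// min/max may differ from the strict versions only for NaN and +/-0.0,
// which lets runtimes lower them to minps/maxps on x64.
typedef v128_t MLAS_FLOAT32X4;

MLAS_FLOAT32X4 MlasMaximumFloat32x4(MLAS_FLOAT32X4 a, MLAS_FLOAT32X4 b) {
#if defined(__wasm_relaxed_simd__)
  return wasm_f32x4_relaxed_max(a, b);  // relaxed semantics
#else
  return wasm_f32x4_max(a, b);          // strict IEEE semantics
#endif
}
```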
WebGPU support for DequantizeLinear
### Description
`nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH` [has been
deprecated since
10.0](https://docs.nvidia.com/deeplearning/tensorrt/latest/_static/c-api/namespacenvinfer1.html#aa8f406be96c14b7dbea548cf19f09a08)
and is always implicitly set for versions 10.0+. Change the EP code to
only set this flag for TRT versions 8 and below.
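
A sketch of the version guard this implies (an assumed form, not the exact EP code):

```cpp
#include <cstdint>
#include <NvInfer.h>

// Assumed form of the guard (not the exact EP code): only pass
// kEXPLICIT_BATCH on TensorRT 8 and below; newer versions set it implicitly.
nvinfer1::INetworkDefinition* CreateNetwork(nvinfer1::IBuilder& builder) {
#if NV_TENSORRT_MAJOR < 9
  const uint32_t flags =
      1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
#else
  const uint32_t flags = 0;
#endif
  return builder.createNetworkV2(flags);
}
```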

### Motivation and Context

Removes deprecated API usages in the TRT EP code.

Signed-off-by: Kevin Chen <[email protected]>
"channels" should be validated before divided by "components".
"components" should be passed to program inputs and outputs. Rename
"input" to "x" to match "ErfImpl".
Correct the last dimension of output shape.
"channels" should be validated before divided by "components". 
"components" should be passed to program inputs and outputs.
If batch_size and sequence_length are both one, split the hidden_size to
improve parallelism.

hipClang does not support -Wno-interference-size, so remove the option to
avoid a build error.
### Description

This PR revises the flag `ort.env.wasm.simd` to enhance its usage so
that more use scenarios are covered.
- Allow setting it to `false` explicitly to disable SIMD checking. Resolves
microsoft#24292 (@Eldow)
- Allow setting it to `'relaxed'` to enable Relaxed SIMD checking. Relaxed
SIMD was first introduced in microsoft#22794 (@jing-bao)
- Behavior is not changed when the flag is not set (i.e. `undefined`) or set
to `true`
- Added a warning message when the flag is set to an unknown value; it is
reset to `false` in this case
Cherry-pick the following changes into
[rel-1.21.0](https://github.com/microsoft/onnxruntime/tree/rel-1.21.0).
- (microsoft#23791)
- (microsoft#23710)
- (microsoft#23789)
- (microsoft#23829)

---------

Co-authored-by: Edward Chen <[email protected]>
Co-authored-by: Yifan Li <[email protected]>
Co-authored-by: Ankit Maheshkar <[email protected]>
Co-authored-by: n1harika <[email protected]>
Co-authored-by: Changming Sun <[email protected]>
@ankitm3k ankitm3k requested a review from jatinwadhwa921 April 10, 2025 07:05

@jatinwadhwa921 jatinwadhwa921 left a comment


lgtm

@jatinwadhwa921 jatinwadhwa921 merged commit 4f97b3b into msb_release Apr 10, 2025
3 of 6 checks passed