
Conversation

@ankitm3k

Description

Sync with 1.22 release candidate commit -
microsoft@39e585f

jambayk and others added 30 commits February 27, 2025 09:30
… testing (microsoft#23801)

Summary of changes:
- Changed openVINO test case to use --enable_generic_interface
- changed tensorRT test case to use --enable_generic_interface
- Fixed ORT builds to USE_FULL_PROTOBUF as openVINO/TensorRT requires
them
- Fixed a pre-processor macro definition that was accidentally removed when
ORT is built without an EP


Co-authored-by: Karim Vadsariya <[email protected]>
…icrosoft#23825)

### Description
Increase [npm package
pipeline](https://aiinfra.visualstudio.com/Lotus/_build?definitionId=1080&_a=summary)
ReactNative_CI_iOS timeout to 120 mins



### Description

In GemmBatch, the target matrix is cut into blocks that are dispatched to
multiple threads for intra-op parallelism.

Currently the block size is hard-coded to 16. If the CPU has more than 16
cores, the cores are not fully utilized in one op.

This change removes the cap on the number of blocks in various MatMul implementations.
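
As a rough illustration of the idea (a hypothetical helper, not the actual MLAS code), the dispatch can scale the number of blocks with the available thread count instead of a fixed constant:

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical sketch (not the actual MLAS code): pick how many blocks a
// GemmBatch-style dispatcher hands out. Capping at 16 leaves cores idle on
// wide machines; scaling with the thread count keeps them busy.
std::size_t ChooseBlockCount(std::size_t rows, std::size_t num_threads) {
  // Old behavior, for comparison: at most 16 blocks.
  // std::size_t blocks = std::min<std::size_t>(rows, 16);
  std::size_t blocks = std::min(rows, num_threads);  // scale with threads
  return std::max<std::size_t>(blocks, 1);
}
```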

__Benchmark results__

Model:
llmlingua-2-bert-base-multilingual-cased-meetingbank--add-force-token-100--max-seq-len-512-CPU-INT8.onnx
Setup: 96-core x86 Linux

Before: 
Setting intra_op_num_threads to 64
Overriding dimension with name, batch_size, to 3
Session creation time cost: 0.485097 s
First inference time cost: 356 ms
Total inference time cost: 17.731 s
Total inference requests: 50
__Average inference time cost: 354.619 ms__
Total inference run time: 17.7312 s
Number of inferences per second: 2.81989
Avg CPU usage: 65 %
Peak working set size: 542265344 bytes

After:

Setting intra_op_num_threads to 32
Overriding dimension with name, batch_size, to 3
Session creation time cost: 0.523394 s
First inference time cost: 316 ms
Total inference time cost: 12.2739 s
Total inference requests: 50
__Average inference time cost: 245.478 ms__
Total inference run time: 12.2741 s
Number of inferences per second: 4.07362
Avg CPU usage: 33 %
Peak working set size: 611241984 bytes


Setting intra_op_num_threads to 64
Overriding dimension with name, batch_size, to 3
Session creation time cost: 0.497698 s
First inference time cost: 289 ms
Total inference time cost: 9.49205 s
Total inference requests: 50
__Average inference time cost: 189.841 ms__
Total inference run time: 9.49226 s
Number of inferences per second: 5.26745
Avg CPU usage: 65 %
Peak working set size: 548470784 bytes

### Motivation and Context
This issue is reported by M365 research team.
### Description
<!-- Describe your changes. -->



### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
This change fixes GQA for Flash Attention on Nvidia GPUs. The root cause
appears to be the
`k_start + capped_sg_id < seq_causal_length`
check. This is either because
a. seq_causal_length varies per lane, so the check becomes non-uniform
control flow, which interacts badly with subgroupShuffle,
or
b. the check itself is incorrect and is wiping out values of v based on
the source lane's seq_causal_length, whereas the values of v actually
need to be causal with respect to the lane that multiplies them with qkt.

qkt is already causal because earlier values of qk for out-of-bounds k
are set to min_value, and exp(< -4) is effectively 0.

This fix works by removing that causal check and relying on qk having been
wiped out earlier. Documentation of the causality behavior for GQA is
missing, so it is not possible to determine which of these reasons is the
true one.

Prior to this, prompts with sequence length greater than 16 but less than
32, or of about 1k, would break with Phi 4, but smaller prompts would work.
Tested on Intel Alderlake and Nvidia 4070.
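
As a plain-C++ sketch of the causality scheme described above (not the actual WGSL shader), wiping out-of-range qk scores before softmax is enough to zero out the matching v contributions without a separate check on the v side:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Plain C++ sketch, not the actual WGSL shader: out-of-bounds qk scores are
// set to a very negative value before softmax, so their exp() underflows to 0
// and the corresponding v values drop out of the weighted sum automatically.
float CausalAttentionRow(const std::vector<float>& qk,  // one row of q*k^T
                         const std::vector<float>& v,   // matching v values
                         std::size_t seq_causal_length) {
  std::vector<float> scores(qk);
  for (std::size_t k = seq_causal_length; k < scores.size(); ++k) {
    scores[k] = std::numeric_limits<float>::lowest();  // masked position
  }
  const float max_score = *std::max_element(scores.begin(), scores.end());
  std::vector<float> weights(scores.size());
  float sum = 0.f;
  for (std::size_t k = 0; k < scores.size(); ++k) {
    weights[k] = std::exp(scores[k] - max_score);  // 0 for masked positions
    sum += weights[k];
  }
  float out = 0.f;
  for (std::size_t k = 0; k < scores.size(); ++k) {
    out += (weights[k] / sum) * v[k];
  }
  return out;
}
```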
### Description
Supports creating a model programmatically using the ORT C or C++ API. 
Supports augmenting an existing model to add nodes.

### Description
Fixed a typo in function names related to the Upsample CUDA kernel.
Changed incorrect spelling Upample to Upsample across relevant
functions.


### Motivation and Context
This change is necessary to maintain consistency and prevent potential
confusion caused by incorrect function names.
)

### Description
Fix typos in csharp/src/Microsoft.ML.OnnxRuntime/


…oft#23788)

Change the logic to generate the default ep context file name

### Description
Applies to all EPs: replace the .onnx extension with _ctx.onnx instead of directly appending the extra string _ctx.onnx to the existing model path. In QNN EP, also make the context binary .bin file name shorter by removing QNNExecutionProvider_ from the file name.
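
A minimal sketch of that renaming rule (a hypothetical helper, not the actual EP code): "model.onnx" becomes "model_ctx.onnx" rather than "model.onnx_ctx.onnx".

```cpp
#include <string>

// Hypothetical helper illustrating the rule above (not the actual EP code).
std::string GetDefaultEpContextPath(const std::string& model_path) {
  const std::string ext = ".onnx";
  if (model_path.size() >= ext.size() &&
      model_path.compare(model_path.size() - ext.size(), ext.size(), ext) == 0) {
    // Replace the ".onnx" extension with "_ctx.onnx".
    return model_path.substr(0, model_path.size() - ext.size()) + "_ctx.onnx";
  }
  // Fall back to appending when there is no ".onnx" suffix.
  return model_path + "_ctx.onnx";
}
```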
### Description
Make
[QNN_Nuget_Windows](https://aiinfra.visualstudio.com/Lotus/_build?definitionId=1234)
1ES compliant



…osoft#23827)

### Description

Resolve microsoft#23817



### Motivation and Context
This PR fixes the errors in the ConvTranspose optimization and adds
tests to ensure the correctness of the implementation.
### Description
Fix a warning with std::move usage



### Motivation and Context
Possibly allows building without the --compile_no_warning_as_error flag,
and keeps ORT compatible with the latest GSL library. Without this fix we
get:

```
onnxruntime\core\providers\cpu\controlflow\loop.cc(247): error C4996: 'gsl::byte': Use std::byte instead.
```
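
A minimal sketch of the kind of change that error asks for (an assumed example, not the exact loop.cc code): use std::byte instead of the deprecated gsl::byte when viewing raw memory.

```cpp
#include <cstddef>
#include <gsl/gsl>

// Assumed example of the gsl::byte -> std::byte migration (not the exact
// loop.cc change): view raw memory through a span of std::byte.
void ZeroRawBytes(void* data, std::size_t size_in_bytes) {
  gsl::span<std::byte> bytes(static_cast<std::byte*>(data), size_in_bytes);
  for (std::byte& b : bytes) {
    b = std::byte{0};
  }
}
```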
### Description

#### Background

From code search, the following EPs use
`onnxruntime::GetCpuPreferredNodes()` in their `GetCapabilities()`
methods:
- CANN
- CUDA
- DML
- JS
- ROCM
- WebGPU

However, the source file that implements
`onnxruntime::GetCpuPreferredNodes()` is excluded when minimal build is
ON:
https://github.com/microsoft/onnxruntime/blob/6df0973e58ba5399fcaa98686f70ed9a9e59aaef/cmake/onnxruntime_framework.cmake#L38-L42

This means that none of the EPs mentioned above can compile with a
minimal build.

#### Solution

The excluded file `core/framework/fallback_cpu_capability.cc` cannot
build in minimal build because some of its dependencies are not included
in the minimal build. However, in extended minimal build mode, all
dependencies are available.

This PR loosens the restriction and allows this file to be compiled in an
extended minimal build. After this change, those EPs are able to compile
in an extended minimal build.
### Description

Add `dawn` to ThirdPartyNotices.
…3702)

### Description
Enable QNN EP weight sharing generation using the public API instead of internal interfaces, so that users can integrate it into their own toolchain. The change shares the QnnBackendManager across ORT sessions if ep.share_ep_contexts is enabled. There is also an extra option to end the sharing, so that we know when to remove the shared QnnBackendManager from the singleton.

Change the tool name from onnxruntime_qnn_ctx_gen to ep_weight_sharing_ctx_gen, so that it can be shared by other EPs.
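
A sketch of how a toolchain could opt in through the public API, assuming only the session config key named above ("ep.share_ep_contexts"); appending the QNN EP itself and the option that ends the sharing are omitted since their exact spellings are not given here.

```cpp
#include <onnxruntime_cxx_api.h>

// Sketch only: opts two sessions into EP context sharing via the public API.
// "ep.share_ep_contexts" is the key named in this change; QNN provider
// options and the "end share" option are omitted here.
int main() {
  Ort::Env env{ORT_LOGGING_LEVEL_WARNING, "ep_weight_sharing"};

  Ort::SessionOptions so1;
  so1.AddConfigEntry("ep.share_ep_contexts", "1");  // share the QnnBackendManager
  Ort::Session session1{env, ORT_TSTR("model_1_ctx.onnx"), so1};

  Ort::SessionOptions so2;
  so2.AddConfigEntry("ep.share_ep_contexts", "1");  // reuse the shared backend
  Ort::Session session2{env, ORT_TSTR("model_2_ctx.onnx"), so2};
  return 0;
}
```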
…microsoft#23892)

### Description
When using the enable_htp_shared_memory feature, we see that the address
of the buffer passed to rpcmem_free is incorrect, so the RPC buffers are
not freed, leading to memory exhaustion.

### Motivation and Context
When using the enable_htp_shared_memory_allocator feature for QNN in
GenAI extensions, it leads to inference failures during the second
prompt. Because GenAI memory demands are higher, the issue surfaces sooner
in GenAI use cases.

Co-authored-by: Ashish Garg <[email protected]>
The build option --enable_pix_capture is broken. This fixes the problem.

---------

Co-authored-by: wp <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
…t#23887)

### Description
* Add dynamo export for Sam2 image encoder
* Verify fp32 onnx model with CPU EP (to avoid error message from TRT
EP).
* Update benchmark script:
  - output ORT profiling
  - output torch compiled code and unique kernel name for compiled kernel
  - add an option for nightly package installation
  - uninstall existing ort packages before installing

The node metadata of the dynamo-exported model can help map nodes in the
ONNX model back to the PyTorch modeling script. Currently, graph
optimization is not done on the dynamo-exported model, so it is
experimental for now.

### Motivation and Context

To support profiling of torch compiled CUDA kernel.
### Description
This PR improves the workaround for bundlers in onnxruntime-web.
Specifically, the following changes have been made:

- Use [this
workaround](xenova@9c50aa2)
as suggested by @xenova in
huggingface/transformers.js#1161 (comment)

- Use `url > "file:" && url < "file;"` instead of
`url.startsWith("file:")` to allow minifiers to remove dead code
correctly.

This change makes it possible to remove unnecessary dependencies on the
file parsed from `new URL("ort.bundle.min.js", import.meta.url)` in Vite,
and to optimize code like `if("file://filepath.js".startsWith("file:"))
{do_sth1(); } else {do_sth2();}` into `do_sth1()` for webpack/terser
usage.

Resolves huggingface/transformers.js#1161
)

### Description
This change restores the MatMulNBits workgroup size from (8, 8, 1) back
to (16, 8, 1) to resolve a performance regression observed on Intel
iGPUs during token generation (M=1).

### Motivation and Context
As above.

Signed-off-by: Jianhui Dai <[email protected]>
…icrosoft#23894)

Float16Array is now shipping and the WebNN Chromium implementation has
accepted it. We should allow it in the WebNN EP as well.
…icrosoft#23888)

### Description
CMake 4.0 release candidate 2.0 is available, and it cannot compile all
of OnnxRuntime out-of-the-box. There are portions of the OnnxRuntime
codebase that specify a `cmake_minimum_required` version of 3.0, and
CMake 4.0 has removed support for compatibility with CMake < 3.5 - the
following error is reported:

```
CMake Error at winml_sdk_helpers.cmake:4 (cmake_minimum_required):
  Compatibility with CMake < 3.5 has been removed from CMake.

  Update the VERSION argument <min> value.  Or, use the <min>...<max> syntax
  to tell CMake that the project requires at least <min> but has been updated
  to work with policies introduced by <max> or earlier.

  Or, add -DCMAKE_POLICY_VERSION_MINIMUM=3.5 to try configuring anyway.
```

Since CMake 3.5 appears to have shipped in 2016, it seems reasonable to
set that as a minimum version to fix the error. The root CMakeLists.txt
does ask for a minimum version of 3.28, so we could snap to that, but
I'm still ramping up on the build, so wanted to propose a minimally
sufficient fix.

### Motivation and Context
Being able to build with the latest CMake - when it ships - reduces the
barrier to entry to building OnnxRuntime, and allows OnnxRuntime to
leverage the latest and greatest tooling.
…icrosoft#23898)

This PR removes the deprecated subgroups-f16 from the WebGPU native and JS
EPs, and also removes the unused deviceInfo in the WebGPU JS EP.
fs-eire and others added 26 commits April 7, 2025 06:37
…microsoft#24315)

### Description

This PR is one of a series of changes for optimization of Dawn API
usage. See microsoft#24281

Reduce the calls to wgpuBufferAddRef and wgpuBufferRelease (part 1).
### Description
SessionOptions now has a new property - load_cancelation_flag.
If set to true, this flag causes model loading and initialization to be
aborted, which is useful for huge models.

### Motivation and Context
Some users requested the ability to abandon model loading and
initialization if it exceeds certain time limits.
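
The description names the property but not its accessor, so here is only a generic sketch of the cancellation-flag pattern it describes (not the actual ORT API): a watchdog thread flips a flag that the loader checks periodically.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Generic sketch of the cancellation-flag idea (not the actual ORT API):
// the loader polls an atomic flag and aborts when another thread sets it.
std::atomic<bool> load_cancelation_flag{false};

bool LoadHugeModel() {
  for (int step = 0; step < 1000; ++step) {           // e.g. per-initializer work
    if (load_cancelation_flag.load()) return false;   // abort load/initialization
    std::this_thread::sleep_for(std::chrono::milliseconds(5));
  }
  return true;
}

int main() {
  std::thread watchdog([] {
    std::this_thread::sleep_for(std::chrono::seconds(1));  // the time limit
    load_cancelation_flag.store(true);
  });
  const bool completed = LoadHugeModel();
  watchdog.join();
  return completed ? 0 : 1;
}
```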
)


Co-authored-by: Yulong Wang <[email protected]>
…24327)

### Description
Exclude WebGPU from Conv3D tests 



### Motivation and Context
Fix failing tests in packaging pipelines.
### Description
[VitisAI EP] export InferShapes to VitisAIEP

---------

Co-authored-by: Wang Chunye <[email protected]>
Co-authored-by: Zhenze <[email protected]>
This PR adds flash decoding support to optimize the generation speed when
the total sequence length is large. Previously, when the total sequence
length was big enough, the softmax and softmax * v shaders became the
bottleneck since they only use limited gpu cores. In this change, we add
flash decoding support to split the present key/value based on the total
sequence length, then do a reduce to get the final result.

On NV RTX 2000 Ada, the TPS becomes 41.4 from 34.4 for 1K tokens for
phi4 static kv cache.
On Meteor Lake, the TPS becomes 19 from 16 for 1K tokens for phi4 static
kv cache.

Side effect of this PR:
It adds two extra buffers to store 1) metadata (max and exp_sum in each
split), and 2) the split qkv results with shape [B, N, split_k, H], which
increases the memory size.

TODO:
Ideally, there should only be two shaders, which would also reduce the
intermediate memory. The computeQKT could be merged into the split shader
and the final softmax adjustment done in the reduce shader. However, I hit
an issue where the result becomes garbage when the total sequence length
exceeds a certain value. Since I can't resolve it in a short time, this is
left as a TODO to fix in the future.
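
As a plain-C++ sketch of the reduce step described above (not the actual WGSL shaders): each split stores its local max and exp-sum (the metadata) plus an unnormalized partial output, and the reduce rescales them to a shared max before combining.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Plain C++ sketch of the split-k reduce (not the actual WGSL shaders).
struct SplitResult {
  float max_score;    // max qk score inside this split
  float exp_sum;      // sum of exp(score - max_score) inside this split
  float partial_out;  // sum of exp(score - max_score) * v inside this split
};

float ReduceSplits(const std::vector<SplitResult>& splits) {
  float global_max = splits.front().max_score;
  for (const SplitResult& s : splits) global_max = std::max(global_max, s.max_score);

  float total_exp_sum = 0.f;
  float total_out = 0.f;
  for (const SplitResult& s : splits) {
    const float scale = std::exp(s.max_score - global_max);  // rescale to the global max
    total_exp_sum += scale * s.exp_sum;
    total_out += scale * s.partial_out;
  }
  return total_out / total_exp_sum;  // final softmax(qk) * v for this output element
}
```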
### Description
Use wasm_f32x4_relaxed_max and wasm_f32x4_relaxed_min in WASM relaxed
SIMD build.


### Motivation and Context
This PR replaces wasm_f32x4_min/max with the relaxed SIMD counterparts
wasm_f32x4_relaxed_min/max in WASM relaxed SIMD build.

According to [relaxed SIMD
proposal](https://github.com/WebAssembly/relaxed-simd/blob/main/proposals/relaxed-simd/Overview.md#relaxed-min-and-max),
the wasm_f32x4_relaxed_min/max allow implementation-defined behavior on
NaN propagation and -0.0 vs +0.0. This enables WASM runtimes to use
minps/maxps on x64 platforms and improves the performance.

e.g. for wasm_f32x4_max -> wasm_f32x4_relaxed_max
wasm_f32x4_max: [implementation in
V8](https://source.chromium.org/chromium/chromium/src/+/main:v8/src/codegen/shared-ia32-x64/macro-assembler-shared-ia32-x64.cc;l=231)
wasm_f32x4_relaxed_max: maxps

This change affects kernel functions that rely on MlasMaximumFloat32x4
and MlasMinimumFloat32x4, including various activations and reduced
min/max kernels. In the mlas micro bench "COMPUTESOFTMAXINPLACE...", this
change provides a performance improvement of up to 60% on x64 devices.
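
A sketch of what that substitution can look like for the MLAS WASM path (the guard macro and the local typedef are assumptions; only the intrinsic and helper names come from the text above):

```cpp
#include <wasm_simd128.h>

// Sketch only: the __wasm_relaxed_simd__ guard and the local typedef are
// assumptions; MlasMaximumFloat32x4 is the helper named above. Relaxed
// min/max may differ from the strict versions only for NaN and +/-0.0,
// which lets runtimes lower them to minps/maxps on x64.
typedef v128_t MLAS_FLOAT32X4;

MLAS_FLOAT32X4 MlasMaximumFloat32x4(MLAS_FLOAT32X4 a, MLAS_FLOAT32X4 b) {
#if defined(__wasm_relaxed_simd__)
  return wasm_f32x4_relaxed_max(a, b);  // relaxed semantics
#else
  return wasm_f32x4_max(a, b);          // strict IEEE semantics
#endif
}
```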
WebGPU support for DequantizeLinear
### Description
`nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH` [has been
deprecated since
10.0](https://docs.nvidia.com/deeplearning/tensorrt/latest/_static/c-api/namespacenvinfer1.html#aa8f406be96c14b7dbea548cf19f09a08)
and is always implicitly set for versions 10.0+. Change the EP code to
only set this flag for TRT versions 8 and below.
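
A sketch of the version guard this implies (an assumed form, not the exact EP code):

```cpp
#include <cstdint>
#include <NvInfer.h>

// Assumed form of the guard (not the exact EP code): only pass
// kEXPLICIT_BATCH on TensorRT 8 and below; newer versions set it implicitly.
nvinfer1::INetworkDefinition* CreateNetwork(nvinfer1::IBuilder& builder) {
#if NV_TENSORRT_MAJOR < 9
  const uint32_t flags =
      1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
#else
  const uint32_t flags = 0;
#endif
  return builder.createNetworkV2(flags);
}
```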

### Motivation and Context

Removes deprecated API usages in the TRT EP code.

Signed-off-by: Kevin Chen <[email protected]>
"channels" should be validated before divided by "components".
"components" should be passed to program inputs and outputs. Rename
"input" to "x" to match "ErfImpl".
Correct the last dimension of output shape.
"channels" should be validated before divided by "components". 
"components" should be passed to program inputs and outputs.
If batch_size and sequence_length are both one, split the hidden_size to
improve parallelism.

hipClang does not support -Wno-interference-size, so remove the option to
avoid a build error.
### Description

This PR revises the flag `ort.env.wasm.simd` to enhance its usage so
that more use scenarios are covered.
- Allow setting it to `false` explicitly to disable SIMD checking. Resolves
microsoft#24292 (@Eldow)
- Allow setting it to `'relaxed'` to enable Relaxed SIMD checking. Relaxed
SIMD was first introduced in microsoft#22794 (@jing-bao)
- Behavior is not changed when the flag is not set (i.e. `undefined`) or set
to `true`
- Added a warning message when the flag is set to an unknown value; it is
reset to `false` in this case
Cherry-pick the following changes into
[rel-1.21.0](https://github.com/microsoft/onnxruntime/tree/rel-1.21.0).
- (microsoft#23791)
- (microsoft#23710)
- (microsoft#23789)
- (microsoft#23829)

---------

Co-authored-by: Edward Chen <[email protected]>
Co-authored-by: Yifan Li <[email protected]>
Co-authored-by: Ankit Maheshkar <[email protected]>
Co-authored-by: n1harika <[email protected]>
Co-authored-by: Changming Sun <[email protected]>
@ankitm3k ankitm3k requested a review from jatinwadhwa921 April 10, 2025 07:05

@jatinwadhwa921 jatinwadhwa921 left a comment


lgtm

@jatinwadhwa921 jatinwadhwa921 merged commit 4f97b3b into msb_release Apr 10, 2025
3 of 6 checks passed