
Conversation

@jatinwadhwa921

Backmerging with msft commits

chuteng-quic and others added 30 commits March 20, 2025 09:26
…24026)

### Description
- Add a new run option called `lora_config` to feed the information from the LoRA binary
- Parse and apply the LoRA binary in `OnRunStart`

### Motivation and Context
- Support LoRA adapter binaries with QNN context binary usage
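
For illustration, a minimal sketch of feeding this option from Python via the run-config mechanism. The `lora_config` key comes from this PR; the model path, input name, and the value format (assumed here to be a LoRA adapter binary path) are assumptions:

```python
import numpy as np
import onnxruntime as ort

# Hypothetical QNN-context model path.
sess = ort.InferenceSession("model_qnn_ctx.onnx",
                            providers=["QNNExecutionProvider"])

run_options = ort.RunOptions()
# Key name per this PR; the value format (a LoRA adapter binary path here)
# is an assumption. The EP parses and applies the binary in OnRunStart.
run_options.add_run_config_entry("lora_config", "adapter.bin")

feeds = {"input_ids": np.zeros((1, 8), dtype=np.int32)}  # hypothetical input
outputs = sess.run(None, feeds, run_options)
```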
### Description
<!-- Describe your changes. -->
* Update to TensorRT 10.9
* OSS parser tested (testing method: https://onnxruntime.ai/docs/build/eps.html#note-to-ort-1210-open-sourced-parser-users)

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This PR applies DP4A to the generation shader, and also supports any
block_size where block_size % 32 == 0.
### Description
Add sliding-window support in cutlass fused attention

### Motivation and Context

The change was originally created by Ye in
microsoft#21926.
I merged the change and resolved some conflicts. I also reverted some of
Ye's changes in kernel_forward.h so that our code stays consistent with
the PyTorch code.
…icrosoft#24103)

### Description
Rename class HIPPinnedAllocator to MIGraphXPinnedAllocator

### Motivation and Context
To align allocators' naming for the MIGraphX EP
…t#24104)

### Description
For newer CMake versions, suppress warnings about incorrect letter case
in package names.

### Motivation and Context
To avoid newer CMake versions reporting that a package name contains
uppercase letters where lowercase is required.
…rosoft#23852)

### Description
To honor the SessionOptions API contract, the ordering of AddConfigOption
and AppendExecutionProvider_OpenVINO should not matter. This PR fixes
that issue.

### Motivation and Context
This PR fixes a regression, introduced in the previous PR, in the
ordering of SessionOptions.
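
A hedged Python sketch of the contract being honored; the model path is hypothetical, and `session.disable_cpu_ep_fallback` is a real session-config key used only as an example here:

```python
import onnxruntime as ort

so = ort.SessionOptions()
# With this fix, config entries take effect regardless of whether they are
# set before or after the OpenVINO EP is appended / selected.
so.add_session_config_entry("session.disable_cpu_ep_fallback", "1")

sess = ort.InferenceSession(
    "model.onnx",  # hypothetical path
    sess_options=so,
    providers=["OpenVINOExecutionProvider"],
)
```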
This adds Max and Average pool operators for webgpu-native. Basically,
this is a rewrite of the corresponding JSEP operators with some
improvements:
1) 'dilations' support
2) pooling with kernelShape.length > 2 for the NHWC format
3) code cleanup

However, there are still a few missing features:
1) 'ceil_mode' support
2) column-major 'storage_order'
3) the 'Indices' output for Max pools.
…ft#24122)

### Description

Move `GetMaxComponents` and `SumVector` into one place.

Fix a bug in `SumVector` (the components were summed in the wrong order):

```diff
-      return "(" + x + ".x + " + x + ".y + " + x + ".w + " + x + ".z" + ")";
+      return "(" + x + ".x + " + x + ".y + " + x + ".z + " + x + ".w" + ")";
```
### Description
It is not common for a dev machine to have MPI installed. Skip the test
if MPI is not installed.

### Motivation and Context

Make it easy to run pytest on a dev machine without the need to skip the
test manually.
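A hedged sketch of the skip pattern (not necessarily the exact check used in this PR):

```python
import shutil

import pytest

# Skip the whole test module when no MPI launcher is available.
mpi_missing = shutil.which("mpirun") is None and shutil.which("mpiexec") is None
pytestmark = pytest.mark.skipif(mpi_missing, reason="MPI is not installed")
```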
### Description
This PR integrates Arm® KleidiAI™ to provide optimized assembly kernels
for matrix multiplication with 4-bit quantized weights. These changes
target the MlasQNBitGemm functions, and can be utilized via the
MatMulNBits operator.
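
For illustration, a hedged sketch of producing a MatMulNBits model that these kernels can serve, using onnxruntime's 4-bit MatMul quantizer; the paths are hypothetical, and the class and its `block_size`/`is_symmetric` parameters are from recent onnxruntime releases (check your version):

```python
import onnx
from onnxruntime.quantization.matmul_4bits_quantizer import MatMul4BitsQuantizer

model = onnx.load("model_fp32.onnx")  # hypothetical path
# Rewrites eligible MatMul nodes into MatMulNBits with 4-bit block quantization.
quantizer = MatMul4BitsQuantizer(model, block_size=32, is_symmetric=True)
quantizer.process()
quantizer.model.save_model_to_file("model_int4.onnx")
```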
### Description

This PR enables web tests (NPM suite tests) for WebGPU EP.

Some test failures are expected, so the specific job is marked as
"continueOnError".

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
…t#24118)

### Description
<!-- Describe your changes. -->
This PR continues the work started at
microsoft#19401.

### Motivation and Context
An overridable initializer should not have a fixed value baked into a
WebNN model, as it could be changed at runtime. The current check doesn't
validate that the initializer is constant.
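
For context, a hedged Python sketch of what "overridable" means in ORT generally: an initializer that also appears as a graph input can be replaced in the feeds at run time (the model path and the "scale" name are hypothetical):

```python
import numpy as np
import onnxruntime as ort

# Hypothetical model where initializer "scale" is also a graph input,
# i.e. an overridable initializer.
sess = ort.InferenceSession("model.onnx")
x = np.ones((1, 4), dtype=np.float32)

out_default = sess.run(None, {"x": x})  # uses the baked-in "scale" value
out_overridden = sess.run(None, {"x": x,
                                 "scale": np.array(2.0, dtype=np.float32)})
```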
### Description
Deleted the constant SKIP_CUDA_TEST_WITH_DML. It does not seem to be
used anywhere.

### Motivation and Context
The constant SKIP_CUDA_TEST_WITH_DML prevents onnxruntime from being
compiled when both of the flags --use_cuda and --use_dml are set.

Co-authored-by: Andreas Hussing <[email protected]>
Previously, the encoder onnx model added extra initialization for the
decoder to generate the KV cache from the prompt. That is not necessary.
Here we redesign the onnx export for the T5 model to output two separate
models, one for the encoder and one for the decoder.

The Linear layers that generate cross-attention features based on
encoder_hidden_states move into the encoder onnx model. This way, the
encoder does not need to output encoder_hidden_states, and only needs to
output the features for cross attention used in the decoder.

Major changes:
- [x] update t5 onnx export script
- [x] update convert_generation script
- [x] update beam search to support changes of inputs and outputs
(details can be found below)
- [x] add a tiny t5 model, and enable the generation test for T5 in Linux
CI pipelines

Example change in inputs and outputs for one layer model:
**Encoder Inputs**:
- encoder_input_ids: int32 (B, encode_sequence_length)
- encoder_attention_mask: int32 (B, encode_sequence_length)
- ~~decoder_input_ids: int32 (B, 1)~~

**Encoder Outputs**:
- ~~logits: (B, 1, vocab_size)~~
- ~~encoder_hidden_states: (B, encode_sequence_length,
encoder_hidden_size)~~
- ~~present_key_self_0: (B, num_heads, 1, head_size)~~
- ~~present_value_self_0: (B, num_heads, 1, head_size)~~
- present_key_cross_0: (B, num_heads, encode_sequence_length, head_size)
- present_value_cross_0: (B, num_heads, encode_sequence_length,
head_size)

**Decoder Inputs**:
- input_ids: int32 (B, 1)
- ~~encoder_input_ids: int32 (B, encode_sequence_length) (optional for
old format; removed in new format)~~
- encoder_attention_mask: int32 (B, encode_sequence_length)
- ~~encoder_hidden_states: (B, encode_sequence_length,
encoder_hidden_size) (optional for old format; removed in new format)~~
- past_key_self_0: (B, num_heads, past_decode_sequence_length,
head_size)
- past_value_self_0: (B, num_heads, past_decode_sequence_length,
head_size)
- past_key_cross_0: (B, num_heads, encode_sequence_length, head_size)
- past_value_cross_0: (B, num_heads, encode_sequence_length, head_size)

**Decoder Outputs**:
- logits: (B, 1, vocab_size)
- present_key_self_0: (B, num_heads, past_decode_sequence_length + 1,
head_size)
- present_value_self_0: (B, num_heads, past_decode_sequence_length + 1,
head_size)

Known issues:
- Some postprocessing (like converting to use decoder masked MHA, past
and present buffer sharing) is not done. Could be a future work item to
integrate with onnxruntime-genai.

### Motivation and Context

Make the encoder onnx model simpler and more efficient in inference (no
need to output encoder_hidden_states).
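
For illustration, a minimal greedy-decoding sketch against the two exported models, assuming the exact input/output names and orderings listed above, a single layer, fp32 KV caches, batch size 1, an empty initial self-attention cache, and hypothetical file names, token ids, and head dimensions (none of which come from this PR):

```python
import numpy as np
import onnxruntime as ort

encoder = ort.InferenceSession("t5_encoder.onnx")  # hypothetical paths
decoder = ort.InferenceSession("t5_decoder.onnx")
num_heads, head_size = 8, 64                       # hypothetical model dims

ids = np.array([[13, 8, 1]], dtype=np.int32)       # (B, encode_sequence_length)
mask = np.ones_like(ids)

# The encoder runs once and yields the cross-attention KV features.
key_cross, value_cross = encoder.run(
    None, {"encoder_input_ids": ids, "encoder_attention_mask": mask})

past = {
    "past_key_cross_0": key_cross,
    "past_value_cross_0": value_cross,
    # Self-attention cache starts empty: past_decode_sequence_length == 0.
    "past_key_self_0": np.zeros((1, num_heads, 0, head_size), np.float32),
    "past_value_self_0": np.zeros((1, num_heads, 0, head_size), np.float32),
}

token = np.array([[0]], dtype=np.int32)            # decoder start token id
for _ in range(8):                                 # greedy decoding, 8 steps
    logits, past["past_key_self_0"], past["past_value_self_0"] = decoder.run(
        None, {"input_ids": token, "encoder_attention_mask": mask, **past})
    token = np.array([[int(logits[0, -1].argmax())]], dtype=np.int32)
    print(int(token[0, 0]))
```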
### Description
Adding back the missing dist folder.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
This PR moves the CUDA memcpy for the QK output when type `T` is equal
to type `QK` from `attention_impl.cu` into `attention_qk.cu`.

### Motivation and Context
This PR fixes a linkage error when type `T` and type `QK` are the same
in `attention_qk.cu`.
…/nextjs-default (microsoft#24132)

Bumps [next](https://github.com/vercel/next.js) from 15.1.2 to 15.2.3.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/vercel/next.js/releases">next's
releases</a>.</em></p>
<blockquote>
<h2>v15.2.3</h2>
<blockquote>
<p>[!NOTE]<br />
This release is backporting bug fixes. It does <strong>not</strong>
include all pending features/changes on canary.</p>
</blockquote>
<h3>Core Changes</h3>
<ul>
<li>Update default allowed origins list (<a
href="https://redirect.github.com/vercel/next.js/issues/77212">#77212</a>)</li>
<li>unify allowed origin detection handling (<a
href="https://redirect.github.com/vercel/next.js/issues/77053">#77053</a>)</li>
<li>Add dev warning for cross-origin and stabilize allowedDevOrigins (<a
href="https://redirect.github.com/vercel/next.js/issues/77044">#77044</a>)</li>
<li>Ensure deploymentId is used for CSS preloads (<a
href="https://redirect.github.com/vercel/next.js/issues/77210">#77210</a>)</li>
<li>Update middleware request header (<a
href="https://redirect.github.com/vercel/next.js/issues/77201">#77201</a>)</li>
<li>[metadata] remove the default segement check for metadata rendering
(<a
href="https://redirect.github.com/vercel/next.js/issues/77119">#77119</a>)</li>
<li>[ts-hint] fix vscode type hint plugin enabling (<a
href="https://redirect.github.com/vercel/next.js/issues/77099">#77099</a>)</li>
<li>[metadata] re-insert icons to head for streamed metadata (<a
href="https://redirect.github.com/vercel/next.js/issues/76915">#76915</a>)</li>
</ul>
<h3>Credits</h3>
<p>Huge thanks to <a
href="https://github.com/ijjk"><code>@​ijjk</code></a>, <a
href="https://github.com/ztanner"><code>@​ztanner</code></a>, and <a
href="https://github.com/huozhi"><code>@​huozhi</code></a> for
helping!</p>
<h2>v15.2.2</h2>
<h3>Core Changes</h3>
<ul>
<li>[dev-overlay] fix styling on overflow error messages, add button
hover state: <a
href="https://redirect.github.com/vercel/next.js/issues/76771">#76771</a></li>
<li>Fix: respond 405 status code on OPTIONS request to SSG page: <a
href="https://redirect.github.com/vercel/next.js/issues/76767">#76767</a></li>
<li>[dev-overlay] Always show relative paths: <a
href="https://redirect.github.com/vercel/next.js/issues/76742">#76742</a></li>
<li>[metadata] remove the duplicate metadata in the error boundary: <a
href="https://redirect.github.com/vercel/next.js/issues/76791">#76791</a></li>
<li>Upgrade React from <code>d55cc79b-20250228</code> to
<code>443b7ff2-20250303</code>: <a
href="https://redirect.github.com/vercel/next.js/issues/76804">#76804</a></li>
<li>[dev-overlay] Ignore animations on page load: <a
href="https://redirect.github.com/vercel/next.js/issues/76834">#76834</a></li>
<li>fix: remove useless set-cookie in action-handler: <a
href="https://redirect.github.com/vercel/next.js/issues/76839">#76839</a></li>
<li>Turbopack: handle task cancelation: <a
href="https://redirect.github.com/vercel/next.js/issues/76831">#76831</a></li>
<li>Upgrade React from <code>443b7ff2-20250303</code> to
<code>e03ac20f-20250305</code>: <a
href="https://redirect.github.com/vercel/next.js/issues/76842">#76842</a></li>
<li>add types for <code>__next_app__</code> module loading functions: <a
href="https://redirect.github.com/vercel/next.js/issues/74566">#74566</a></li>
<li>fix duplicated noindex when server action is triggered: <a
href="https://redirect.github.com/vercel/next.js/issues/76847">#76847</a></li>
<li>fix: don't drop queued actions when navigating: <a
href="https://redirect.github.com/vercel/next.js/issues/75362">#75362</a></li>
<li>[dev-overlay]: remove dependency on platform for focus trapping: <a
href="https://redirect.github.com/vercel/next.js/issues/76849">#76849</a></li>
<li>Turbopack: Add <strong>turbopack_load_by_url</strong>: <a
href="https://redirect.github.com/vercel/next.js/issues/76814">#76814</a></li>
<li>Add handling of origin in dev mode: <a
href="https://redirect.github.com/vercel/next.js/issues/76880">#76880</a></li>
<li>[dev-overlay] Stop grouping callstack frames into ignored vs. not
ignored: <a
href="https://redirect.github.com/vercel/next.js/issues/76861">#76861</a></li>
<li>Upgrade React from <code>e03ac20f-20250305</code> to
<code>029e8bd6-20250306</code>: <a
href="https://redirect.github.com/vercel/next.js/issues/76870">#76870</a></li>
<li>[dev-overlay] Increase padding if no <code>x</code> button present:
<a
href="https://redirect.github.com/vercel/next.js/issues/76898">#76898</a></li>
<li>fix: prevent incorrect searchParams being applied on certain navs:
<a
href="https://redirect.github.com/vercel/next.js/issues/76914">#76914</a></li>
<li>[dev-overlay] Dim ignore-listed callstack frames when shown: <a
href="https://redirect.github.com/vercel/next.js/issues/76862">#76862</a></li>
</ul>
<h3>Example Changes</h3>
<ul>
<li>chore(cna): update tailwind styles to be closer to non-tw cna: <a
href="https://redirect.github.com/vercel/next.js/issues/76647">#76647</a></li>
</ul>
<h3>Misc Changes</h3>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/vercel/next.js/commit/535e26d3c69de49df8bd17618a424cbe65ec897b"><code>535e26d</code></a>
v15.2.3</li>
<li><a
href="https://github.com/vercel/next.js/commit/2fcae1d7e3079874ff633b5b8311adb584c80ce6"><code>2fcae1d</code></a>
Update default allowed origins list (<a
href="https://redirect.github.com/vercel/next.js/issues/77212">#77212</a>)</li>
<li><a
href="https://github.com/vercel/next.js/commit/adf5462b5f269963395b0a2ef12a1b66e8cadabc"><code>adf5462</code></a>
unify allowed origin detection handling (<a
href="https://redirect.github.com/vercel/next.js/issues/77053">#77053</a>)</li>
<li><a
href="https://github.com/vercel/next.js/commit/5e59da1f5c8b9e8b3a759048bd371efcd77813ae"><code>5e59da1</code></a>
Add dev warning for cross-origin and stabilize allowedDevOrigins (<a
href="https://redirect.github.com/vercel/next.js/issues/77044">#77044</a>)</li>
<li><a
href="https://github.com/vercel/next.js/commit/8151cb6ce921cb1b9faeab6fb88551146dc206b7"><code>8151cb6</code></a>
Ensure deploymentId is used for CSS preloads (<a
href="https://redirect.github.com/vercel/next.js/issues/77210">#77210</a>)</li>
<li><a
href="https://github.com/vercel/next.js/commit/52a078da3884efe6501613c7834a3d02a91676d2"><code>52a078d</code></a>
Update middleware request header (<a
href="https://redirect.github.com/vercel/next.js/issues/77201">#77201</a>)</li>
<li><a
href="https://github.com/vercel/next.js/commit/4698ad6478cc85a7283a8c41edfbba023dadf57d"><code>4698ad6</code></a>
[metadata] remove the default segement check for metadata rendering (<a
href="https://redirect.github.com/vercel/next.js/issues/77119">#77119</a>)</li>
<li><a
href="https://github.com/vercel/next.js/commit/1e1ff403a28703b08e68758cfcbb7b6c97c4bd2a"><code>1e1ff40</code></a>
[ts-hint] fix vscode type hint plugin enabling (<a
href="https://redirect.github.com/vercel/next.js/issues/77099">#77099</a>)</li>
<li><a
href="https://github.com/vercel/next.js/commit/88deb12b03c90f5146b1270cd7bea3517cf90083"><code>88deb12</code></a>
[metadata] re-insert icons to head for streamed metadata (<a
href="https://redirect.github.com/vercel/next.js/issues/76915">#76915</a>)</li>
<li><a
href="https://github.com/vercel/next.js/commit/f4552826e1ed15fbeb951be552d67c5a08ad0672"><code>f455282</code></a>
v15.2.2</li>
<li>Additional commits viewable in <a
href="https://github.com/vercel/next.js/compare/v15.1.2...v15.2.3">compare
view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=next&package-manager=npm_and_yarn&previous-version=15.1.2&new-version=15.2.3)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
Dependabot will merge this PR once CI passes on it, as requested by
@fs-eire.

[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the
[Security Alerts
page](https://github.com/microsoft/onnxruntime/network/alerts).

</details>

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…ps (microsoft#24090)

### Description
<!-- Describe your changes. -->

Support shape inference for the QLinearAdd and QLinearMul ops, which were
missing in symbolic_shape_infer.py.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
This change is required to enable shape inference for models with
"QLinearAdd" ops, which are defined in the com.microsoft domain and whose
shapes cannot be inferred using onnx shape_inference alone.

Fixes issue microsoft#24028
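
A hedged usage sketch of the script this PR extends, via its Python entry point (the model path is hypothetical; `SymbolicShapeInference.infer_shapes` and its `auto_merge` parameter are from the onnxruntime tools package):

```python
import onnx
from onnxruntime.tools.symbolic_shape_infer import SymbolicShapeInference

# Hypothetical model containing com.microsoft QLinearAdd / QLinearMul nodes.
model = onnx.load("qlinear_model.onnx")
inferred = SymbolicShapeInference.infer_shapes(model, auto_merge=True)
onnx.save(inferred, "qlinear_model_inferred.onnx")
```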

---------

Signed-off-by: Praveen G <[email protected]>
…ine (microsoft#23580)

### Description
Follow-up to microsoft#23551 

Adds the BrowserStack testing stage for Android to the NuGet packaging
pipeline.

This tests that the produced NuGet package can be imported and works
correctly on an Android device.

[Pipeline run that shows what a failing unit test would look
like](https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=670961&view=results)

---------

Co-authored-by: Edward Chen <[email protected]>
### Description
Add fp16 support to sparse attention


### Motivation and Context
Generalize models for CPU and GPU
### Description

This PR refactors the mac CI pipeline:

- Use a composite action and a reusable workflow to consolidate
duplicated code
- Separate each EP
### Description
Create separate template overloads to address the Windows Debug build
warning 'unreachable code'.
…oft#24115)

### Description
This PR introduces a new WebGPU EP option `preserveDevice`.

Before this change, a WebGPU device would be destroyed when no inference
session used it. Destroying a WebGPU device cleans up both the buffer
cache and the shader cache.

With this option introduced, when the option is ON (the default is OFF),
the device is no longer destroyed and is always kept alive. This is
helpful in 2 scenarios:
- a server that is always on
- unit tests, so that bugs caused by an incorrect shader cache may be
detected (thanks to @qjia7 for the suggestion)
…oft#24014)

### Description
<!-- Describe your changes. -->

This gives a way for webapp developers to customize the bundler behavior
regarding whether to bundle the wasm.

To avoid treating ort-wasm-threaded-simd.jsep.mjs and
ort-wasm-threaded-simd.jsep.wasm as dependencies during bundling, use the
import condition `onnxruntime-web-use-extern-wasm`.

For webpack:
```
module.exports = {
  //...
  resolve: {
    conditionNames: ['onnxruntime-web-use-extern-wasm', 'import', 'module'],
  },
};

```

For esbuild:
```
await esbuild.build({
  //...
  conditions: ['onnxruntime-web-use-extern-wasm', 'import', 'module'],
})
```

For rollup:
```
import { nodeResolve } from '@rollup/plugin-node-resolve';

export default {
  //...
  plugins: [nodeResolve({
    exportConditions: ['onnxruntime-web-use-extern-wasm', 'import', 'module', 'development|production']
  })]
};
```


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->

- microsoft#24009
…oft#23937)

### Description

Add API for accessing metadata of a model's input/output.

Currently, the implementation only covers the WebAssembly backend and
the Node.js binding. For WebGL, there is so far no plan to implement this
API; for React Native, the implementation will be done later and is not
included in this PR.

#### Example usage:

```js
const mySession = await ort.InferenceSession.create( ... );

console.log(`there are ${mySession.inputMetadata.length} inputs:`);
for (let i = 0; i < mySession.inputMetadata.length; i++) {
  let info;
  if (mySession.inputMetadata[i].isTensor) {
    info = `tensor: ${mySession.inputMetadata[i].type}, shape: ${mySession.inputMetadata[i].shape}`;
  } else {
    info = `non-tensor`;
  }
  console.log(`input ${i}: ${mySession.inputMetadata[i].name}: ${info}`);
}

```

possible output:
```
there are 1 inputs:
input 0: input: tensor: float32, shape: [batch, 3, 224, 224]

```

Resolves:
- microsoft#22682
- microsoft#22949
### Description

Add a cache "onnxnodetests" for node tests.

This fixes random network errors when downloading onnx node test data.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
Add native MatMul (`MatMulNaive`, `MatMulPacked` and `MatMulPackedVec4`).


### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
### Description
The big model pipelines are still using CUDA 11.8. This updates the
pipeline to use CUDA 12.x.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
tianleiwu and others added 3 commits March 25, 2025 07:44
…icrosoft#24151)

### Description
Show a proper error message when an fp16 model is used for Beam Search
on CPU.

Before:
```
2025-02-15 20:15:02.999160115 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running BeamSearch node. Name:'beam_search' Status Message: bad_function_call
```

After:
```
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running BeamSearch node. Name:'beam_search' Status Message: onnxruntime/onnxruntime/contrib_ops/cpu/transformers/beam_search.cc:309 virtual onnxruntime::common::Status onnxruntime::contrib::transformers::BeamSearch::Compute(onnxruntime::OpKernelContext*) const BeamSearch does not support float16 model on CPU execution provider. Use float32 model or CUDA execution provider instead.
```

### Motivation and Context
microsoft#23728
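A hedged note on the workaround the new message points to: use an fp32 model, or run the same fp16 model with the CUDA EP (the model path below is hypothetical):

```python
import onnxruntime as ort

# On CPU this fp16 BeamSearch model now fails with the clear
# RUNTIME_EXCEPTION shown above; preferring the CUDA EP avoids it.
sess = ort.InferenceSession(
    "beam_search_fp16.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
```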
### Description
As titled.



### Motivation and Context
We have the last MatMul in the phi-4-mini onnx model with b_shape =
{3072, 200064}, and
packed_b_size = MlasGemmPackBSize(N, K)
is `3072 * 200064 * sizeof(float) = 2458386432`.
This is larger than 2,147,483,647, so it is out of the int boundary on a
32-bit system, and len then becomes negative.
We therefore change the type to size_t, and the model can be loaded
successfully after the change.
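
A quick worked check of the arithmetic above, with values straight from the description (the exact MlasGemmPackBSize result may additionally include alignment padding):

```python
# b_shape of the last MatMul in phi-4-mini, per the description above.
N, K = 200064, 3072
packed_b_size = N * K * 4            # sizeof(float) == 4 bytes
print(packed_b_size)                 # 2458386432
print(packed_b_size > 2**31 - 1)     # True: a signed 32-bit int goes negative
```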
@jatinwadhwa921 jatinwadhwa921 requested a review from ankitm3k March 25, 2025 16:34
@jatinwadhwa921 jatinwadhwa921 merged commit 2c61a3a into ovep-develop Mar 26, 2025
6 of 11 checks passed
@jatinwadhwa921 jatinwadhwa921 deleted the sync_msft_25_3_25 branch April 15, 2025 05:37