[webgpu] Migrate sparseToDense to the atomic-based kernel #6552
Conversation
```ts
}
`;
const atomicRMW = (ptr: string, val: string) => {
  const atomicAddSnippet = `
```
Maybe keep the original atomicAdd path, which is better if the data is int32?
```ts
const atomicAddSnippet = this.type === 'int32' ?
    `atomicAdd(${ptr}, i32(${val}));` :
    `
    {
      var oldBits = 0;
      var newBits = bitcast<i32>(${val});
      loop {
        let info = atomicCompareExchangeWeak(${ptr}, oldBits, newBits);
        if (info.exchanged) {
          break;
        }
        oldBits = info.old_value;
        let oldValue =
            bitcast<${mapToWgslTypes(this.type, false)}>(oldBits);
        let newValue = oldValue + (${val});
        newBits = bitcast<i32>(newValue);
      }
    }
    `;
```
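For intuition, the same compare-exchange trick can be sketched host-side. This is a hypothetical TypeScript illustration (not tfjs code, names are made up): floats are accumulated through an `Int32Array` view of the same buffer, because atomic add is only defined for integers.

```typescript
// Bit-level casts between f32 and i32 (assumes a little-endian host).
const f2i = (f: number): number =>
    new Int32Array(new Float32Array([f]).buffer)[0];
const i2f = (i: number): number =>
    new Float32Array(new Int32Array([i]).buffer)[0];

const buf = new SharedArrayBuffer(4);
const asI32 = new Int32Array(buf);
const asF32 = new Float32Array(buf);

// Emulated float atomic add: retry compareExchange until no other
// writer raced us, mirroring the WGSL atomicCompareExchangeWeak loop.
function atomicAddF32(arr: Int32Array, idx: number, val: number): void {
  let oldBits = Atomics.load(arr, idx);
  for (;;) {
    const newBits = f2i(i2f(oldBits) + val);
    const seen = Atomics.compareExchange(arr, idx, oldBits, newBits);
    if (seen === oldBits) break;  // exchange succeeded
    oldBits = seen;               // retry with the freshly observed bits
  }
}

asF32[0] = 1.5;
atomicAddF32(asI32, 0, 2.25);
console.log(asF32[0]);  // 3.75
```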
Let me wait for more input to see if it's worth the complexity.
I also think the i32 path should be kept for performance.
```diff
-    let d0 = index / uniforms.updatesShape[1];
-    let d1 = index - d0 * uniforms.updatesShape[1];
+    let d0 = index / uniforms.outShape[1];
+    let d1 = index - d0 * uniforms.outShape[1];
```
Why change it from `uniforms.updatesShape[1]` to `uniforms.outShape[1]`?
This is a hack. When updates is a scalar tensor, coords can no longer be derived from its shape as the tensor is "compressed", but outShape[1] happens to provide the slice size here.
How do you ensure that uniforms.updatesShape[1] is equal to uniforms.outShape[1]?
It seems tensors are reshaped this way by design:
| | ScatterNd | SparseToDense |
|---|---|---|
| indices shape | [numUpdates, sliceRank] | [numUpdates, sliceRank] |
| updates shape | [numUpdates, sliceSize] | [numUpdates, sliceSize] (or [ ]) |
| result shape | [outputSize / sliceSize, sliceSize] | [outputSize / sliceSize, sliceSize] |
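A tiny sketch of the table above (hypothetical helper, just illustrating the contract): because the result is reshaped to `[outputSize / sliceSize, sliceSize]`, `outShape[1]` coincides with the updates slice size whenever `updates` has rank 2.

```typescript
// Shape contract shared by ScatterNd and SparseToDense after reshaping.
function scatterShapes(numUpdates: number, sliceRank: number,
                       sliceSize: number, outputSize: number) {
  return {
    indicesShape: [numUpdates, sliceRank],
    updatesShape: [numUpdates, sliceSize],  // may be [] for a scalar update
    resultShape: [outputSize / sliceSize, sliceSize],
  };
}

const s = scatterShapes(3, 1, 4, 20);
console.log(s.resultShape);  // [5, 4]; resultShape[1] === updatesShape[1]
```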
It's weird to use `uniforms.outShape[1]` instead of `uniforms.updatesShape[1]` in `fn getUpdatesCoordsFromFlatIndex`.
When `updates` is a scalar tensor, you probably need to do it like below. In your hack, when `updates` is a scalar tensor, `getUpdatesCoordsFromFlatIndex`'s result is not even needed.
```ts
switch (this.updatesRank) {
  case 0:
    outCoordsString = 'flattenedIndex';
    getUpdatesCoordsFromFlatIndex = '';
    break;
  case 1:
    outCoordsString = 'flattenedIndex';
    getUpdatesCoordsFromFlatIndex = `
      fn getUpdatesCoordsFromFlatIndex(index : i32) -> i32 {
        return index;
      }
    `;
    break;
  case 2:
    outCoordsString = 'vec2<i32>(flattenedIndex, coords[1])';
    getUpdatesCoordsFromFlatIndex = `
      fn getUpdatesCoordsFromFlatIndex(index : i32) -> vec2<i32> {
        let d0 = index / uniforms.updatesShape[1];
        let d1 = index - d0 * uniforms.updatesShape[1];
        return vec2<i32>(d0, d1);
      }
    `;
    break;
  default:
    throw new Error(`Unsupported updates rank: ${this.updatesRank}`);
}
```
flattenedIndex also depends on coords[0]. Can you clarify how to handle this?
Please change it as we discussed offline.
```ts
  }
}
`;
const atomicStoreSnippet = `atomicStore(${ptr}, bitcast<i32>(${val}));`;
```
Why not just store 0 so that we don't need to upload a zero tensor as input?
This fits my taste better: the kernel does something (accumulation or overwriting) according to indices and values given by the user.
```ts
    program, [sparseValues, sparseIndices, defaultValue], sparseValues.dtype,
    uniformData);
{
  // First replace the default value with 0 at indices.
```
In the API doc, it says:

> If `sparseIndices` is scalar, `dense[i] = (i == sparseIndices ? sparseValues : defaultValue)`.

In this case, we don't need three trips to finish sparseToDense. Instead, we can iterate over the output dense tensor, check whether the current index is equal to `sparseIndices`, and store the right value to the output. Can you add a comment that there is an optimization opportunity for this kind of situation, in case someone wants to do it in the future?
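The suggested single-pass formula can be sketched on the CPU like this (a hypothetical illustration of the API-doc semantics, not the kernel code; the function name is made up):

```typescript
// Scalar-sparseIndices case from the API doc:
// dense[i] = (i == sparseIndices ? sparseValues : defaultValue).
function sparseToDenseScalar(sparseIndices: number, sparseValues: number,
                             defaultValue: number,
                             outputSize: number): number[] {
  const dense = new Array<number>(outputSize);
  for (let i = 0; i < outputSize; ++i) {
    // One trip over the output; no scatter or zeroing pass needed.
    dense[i] = i === sparseIndices ? sparseValues : defaultValue;
  }
  return dense;
}

console.log(sparseToDenseScalar(2, 7, 0, 5));  // [0, 0, 7, 0, 0]
```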
I would discourage making things too fragmented, but this reminds me of a change like this:

- Fill the output tensor with the default value.
- Replace the default value with ~~zero~~ `sparseValues` at indices.
```diff
-    let d0 = index / uniforms.updatesShape[1];
-    let d1 = index - d0 * uniforms.updatesShape[1];
+    let d0 = index / uniforms.outShape[1];
+    let d1 = index - d0 * uniforms.outShape[1];
```
Please change it as we discussed offline.
qjia7
left a comment
LGTM with one nit.
```ts
case 0:
  break;
case 1:
  if (true) {
```
nit: Can you remove `if (true)` here and in some other similar places?
Auto format looks bad to me without this.
gyagp
left a comment
LGTM
PERF
The sparseToDense op takes an optional default value. Unlike scatterNd, the output cannot be initialized with fill(), since the default value is a scalar tensor (which could be the result of a previous op) rather than a scalar number. The (horrible!) workaround here is to broadcast the value with tile().
The other challenge is whether the kernel should discard the original value at an index or accumulate onto it. The magic is performed by splitting the op into two "scatter" steps: 1) replace the default value with 0, and 2) add the input sparse values to 0 or whatever is there. This avoids a bitmap for recording whether the output element at an index has been updated by another invocation.
Closes #6525
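The two "scatter" steps above can be sketched sequentially on the CPU (a hypothetical illustration for the accumulating case; the function name and plain-array types are made up, and `fill()` here stands in for the tile() broadcast):

```typescript
// Two-step sparseToDense: zero out the defaults at the touched indices,
// then add the sparse values, so duplicates accumulate from 0 rather
// than from defaultValue.
function sparseToDenseTwoStep(indices: number[], values: number[],
                              defaultValue: number,
                              outputSize: number): number[] {
  // Broadcast the default value across the output.
  const out = new Array<number>(outputSize).fill(defaultValue);
  // Step 1: replace the default value with 0 at every index.
  for (const i of indices) out[i] = 0;
  // Step 2: add the sparse values (atomicAdd in the real kernel).
  indices.forEach((idx, k) => out[idx] += values[k]);
  return out;
}

console.log(sparseToDenseTwoStep([1, 1, 3], [2, 5, 4], 9, 5));
// [9, 7, 9, 4, 9] — index 1 accumulated 2 + 5, untouched slots keep 9
```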