[webgpu] Migrate sparseToDense to the atomic-based kernel #6552
Conversation
```ts
}
`;
const atomicRMW = (ptr: string, val: string) => {
  const atomicAddSnippet = `
```
Maybe keep the original atomicAdd path, which is better if the data is int32?
```ts
const atomicAddSnippet = this.type === 'int32' ?
    `atomicAdd(${ptr}, i32(${val}));` :
    `
    {
      var oldBits = 0;
      var newBits = bitcast<i32>(${val});
      loop {
        let info = atomicCompareExchangeWeak(${ptr}, oldBits, newBits);
        if (info.exchanged) {
          break;
        }
        oldBits = info.old_value;
        let oldValue =
            bitcast<${mapToWgslTypes(this.type, false)}>(oldBits);
        let newValue = oldValue + (${val});
        newBits = bitcast<i32>(newValue);
      }
    }
    `;
```
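For intuition, the same compare-exchange trick can be sketched host-side. This is a hypothetical TypeScript illustration (not tfjs code, names are made up): floats are accumulated through an `Int32Array` view of the same buffer, because atomic add is only defined for integers.

```typescript
// Bit-level casts between f32 and i32 (assumes a little-endian host).
const f2i = (f: number): number =>
    new Int32Array(new Float32Array([f]).buffer)[0];
const i2f = (i: number): number =>
    new Float32Array(new Int32Array([i]).buffer)[0];

const buf = new SharedArrayBuffer(4);
const asI32 = new Int32Array(buf);
const asF32 = new Float32Array(buf);

// Emulated float atomic add: retry compareExchange until no other
// writer raced us, mirroring the WGSL atomicCompareExchangeWeak loop.
function atomicAddF32(arr: Int32Array, idx: number, val: number): void {
  let oldBits = Atomics.load(arr, idx);
  for (;;) {
    const newBits = f2i(i2f(oldBits) + val);
    const seen = Atomics.compareExchange(arr, idx, oldBits, newBits);
    if (seen === oldBits) break;  // exchange succeeded
    oldBits = seen;               // retry with the freshly observed bits
  }
}

asF32[0] = 1.5;
atomicAddF32(asI32, 0, 2.25);
console.log(asF32[0]);  // 3.75
```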
Let me wait for more input to see if it's worth the complexity.
I also think the i32 path should be kept for performance.
```diff
-    let d0 = index / uniforms.updatesShape[1];
-    let d1 = index - d0 * uniforms.updatesShape[1];
+    let d0 = index / uniforms.outShape[1];
+    let d1 = index - d0 * uniforms.outShape[1];
```
Why change it from `uniforms.updatesShape[1]` to `uniforms.outShape[1]`?
This is a hack. When updates is a scalar tensor, coords can no longer be derived from its shape as the tensor is "compressed", but outShape[1] happens to provide the slice size here.
How do you ensure that uniforms.updatesShape[1] is equal to uniforms.outShape[1]?
It seems tensors are reshaped this way by design:
| | ScatterNd | SparseToDense |
|---|---|---|
| indices shape | [numUpdates, sliceRank] | [numUpdates, sliceRank] |
| updates shape | [numUpdates, sliceSize] | [numUpdates, sliceSize] (or [ ]) |
| result shape | [outputSize / sliceSize, sliceSize] | [outputSize / sliceSize, sliceSize] |
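A tiny sketch of the table above (hypothetical helper, just illustrating the contract): because the result is reshaped to `[outputSize / sliceSize, sliceSize]`, `outShape[1]` coincides with the updates slice size whenever `updates` has rank 2.

```typescript
// Shape contract shared by ScatterNd and SparseToDense after reshaping.
function scatterShapes(numUpdates: number, sliceRank: number,
                       sliceSize: number, outputSize: number) {
  return {
    indicesShape: [numUpdates, sliceRank],
    updatesShape: [numUpdates, sliceSize],  // may be [] for a scalar update
    resultShape: [outputSize / sliceSize, sliceSize],
  };
}

const s = scatterShapes(3, 1, 4, 20);
console.log(s.resultShape);  // [5, 4]; resultShape[1] === updatesShape[1]
```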
It's weird to use `uniforms.outShape[1]` instead of `uniforms.updatesShape[1]` in `fn getUpdatesCoordsFromFlatIndex`.
When `updates` is a scalar tensor, you probably need to do it like below. In your hack, when `updates` is a scalar tensor, `getUpdatesCoordsFromFlatIndex`'s result is not even needed.
```ts
switch (this.updatesRank) {
  case 0:
    outCoordsString = 'flattenedIndex';
    getUpdatesCoordsFromFlatIndex = '';
    break;
  case 1:
    outCoordsString = 'flattenedIndex';
    getUpdatesCoordsFromFlatIndex = `
      fn getUpdatesCoordsFromFlatIndex(index : i32) -> i32 {
        return index;
      }
    `;
    break;
  case 2:
    outCoordsString = 'vec2<i32>(flattenedIndex, coords[1])';
    getUpdatesCoordsFromFlatIndex = `
      fn getUpdatesCoordsFromFlatIndex(index : i32) -> vec2<i32> {
        let d0 = index / uniforms.updatesShape[1];
        let d1 = index - d0 * uniforms.updatesShape[1];
        return vec2<i32>(d0, d1);
      }
    `;
    break;
  default:
    throw new Error(`Unsupported updates rank: ${this.updatesRank}`);
}
```
flattenedIndex also depends on coords[0]. Can you clarify how to handle this?
Please change it as we discussed offline.
```ts
  }
}
`;
const atomicStoreSnippet = `atomicStore(${ptr}, bitcast<i32>(${val}));`;
```
Why not just store 0 so that we don't need to upload a zero tensor as input?
This fits my taste better: the kernel does something (accumulation or overwriting) according to indices and values given by the user.
```ts
    program, [sparseValues, sparseIndices, defaultValue], sparseValues.dtype,
    uniformData);
{
  // First replace the default value with 0 at indices.
```
In the API doc, it says:

> If `sparseIndices` is scalar, `dense[i] = (i == sparseIndices ? sparseValues : defaultValue)`.

In this case, we don't need three trips to finish sparseToDense. Instead, we can iterate over the output dense tensor, check whether the current index is equal to `sparseIndices`, and store the right value to the output. Can you add a comment that there is an optimization opportunity for this kind of situation, in case someone wants to do it in the future?
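The suggested single-pass formula can be sketched on the CPU like this (a hypothetical illustration of the API-doc semantics, not the kernel code; the function name is made up):

```typescript
// Scalar-sparseIndices case from the API doc:
// dense[i] = (i == sparseIndices ? sparseValues : defaultValue).
function sparseToDenseScalar(sparseIndices: number, sparseValues: number,
                             defaultValue: number,
                             outputSize: number): number[] {
  const dense = new Array<number>(outputSize);
  for (let i = 0; i < outputSize; ++i) {
    // One trip over the output; no scatter or zeroing pass needed.
    dense[i] = i === sparseIndices ? sparseValues : defaultValue;
  }
  return dense;
}

console.log(sparseToDenseScalar(2, 7, 0, 5));  // [0, 0, 7, 0, 0]
```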
I would discourage making things too fragmented, but this reminds me of a change like this:

- Fill the output tensor with the default value.
- Replace the default value with ~~zero~~ `sparseValues` at indices.
```diff
-    let d0 = index / uniforms.updatesShape[1];
-    let d1 = index - d0 * uniforms.updatesShape[1];
+    let d0 = index / uniforms.outShape[1];
+    let d1 = index - d0 * uniforms.outShape[1];
```
Please change it as we discussed offline.
qjia7
left a comment
LGTM with one nit.
```ts
case 0:
  break;
case 1:
  if (true) {
```
nit: Can you remove `if (true)` here and in some other similar places?
Auto format looks bad to me without this.
gyagp
left a comment
LGTM
PERF
The sparseToDense op takes an optional default value. Unlike scatterNd, the output cannot be initialized with fill(), since the default value is a scalar tensor (which could be the result of a previous op) rather than a scalar number. The (horrible!) workaround here is to broadcast the value with tile().
The other challenge is whether the kernel should discard the original value at an index or accumulate onto it. The magic is performed by splitting the op into two "scatter" steps: 1) replace the default value with 0, and 2) add the input sparse values to 0 or whatever is there. This avoids a bitmap for recording whether the output element at an index has been updated by another invocation.
Closes #6525
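The two "scatter" steps above can be sketched sequentially on the CPU (a hypothetical illustration for the accumulating case; the function name and plain-array types are made up, and `fill()` here stands in for the tile() broadcast):

```typescript
// Two-step sparseToDense: zero out the defaults at the touched indices,
// then add the sparse values, so duplicates accumulate from 0 rather
// than from defaultValue.
function sparseToDenseTwoStep(indices: number[], values: number[],
                              defaultValue: number,
                              outputSize: number): number[] {
  // Broadcast the default value across the output.
  const out = new Array<number>(outputSize).fill(defaultValue);
  // Step 1: replace the default value with 0 at every index.
  for (const i of indices) out[i] = 0;
  // Step 2: add the sparse values (atomicAdd in the real kernel).
  indices.forEach((idx, k) => out[idx] += values[k]);
  return out;
}

console.log(sparseToDenseTwoStep([1, 1, 3], [2, 5, 4], 9, 5));
// [9, 7, 9, 4, 9] — index 1 accumulated 2 + 5, untouched slots keep 9
```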