Replies: 1 comment
If the sparse tensor is compressed, tuning with the backward pass also requires a combined search space: a sparse linear operator, for example, contains both forward and backward kernels, and the latency fed back to the tuner should be defined over all of them (a sketch follows below).
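A minimal sketch of such a combined space, assuming (this is not from the original comment) that the sparse linear operator is made of one forward kernel and two backward kernels (grad-input and grad-weight), each with its own tile-size parameters, and that the objective latency is the sum of the three kernels' latencies:

```python
import itertools
import random

# Hypothetical per-kernel search spaces; the BM/BK/BN names follow the
# convention used elsewhere in this discussion.
SPACE = {
    "forward":          {"BM": [32, 64], "BK": [32, 64], "BN": [32, 64]},
    "backward_dinput":  {"BM": [32, 64], "BN": [32, 64]},
    "backward_dweight": {"BK": [32, 64], "BN": [32, 64]},
}

def configs(space):
    """Enumerate the combined (cross-product) search space of all kernels."""
    flat = [(kernel, name, values)
            for kernel, params in space.items()
            for name, values in params.items()]
    for combo in itertools.product(*[values for _, _, values in flat]):
        cfg = {}
        for (kernel, name, _), value in zip(flat, combo):
            cfg.setdefault(kernel, {})[name] = value
        yield cfg

def measure_latency(kernel, params):
    """Placeholder benchmark; a real tuner would time the compiled kernel."""
    return random.random()

def total_latency(cfg):
    # Objective = forward latency + backward latencies.
    return sum(measure_latency(kernel, params) for kernel, params in cfg.items())

best = min(configs(SPACE), key=total_latency)
print(best)
```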
Motivation
I'd like to list some requirements for tuners that are necessary for finding the best config during code specialization.
Nested (Conditional) search space support
It's common that there are multiple possible implementations for a specific function (kernel) with different tunable parameters.
Note: below is conceptual, not real code
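One way such a conceptual, nested space could be written down is sketched below. The `choice` helper is an illustrative placeholder, not any real tuner API; the implementation names and parameters are the ones described in the next paragraph.

```python
def choice(*options):
    """Illustrative placeholder for a tuner's categorical sampling primitive."""
    return {"_type": "choice", "_options": list(options)}

# Sparse linear: two implementations. A has no tunable parameters (its BM and
# BN are constants consumed by later stages); B exposes six tile sizes that
# only exist when implementation B is selected (the nested/conditional part).
sparse_linear_space = {
    "impl": choice(
        {"name": "A"},  # no tunable parameters
        {"name": "B",
         "BM": choice(32, 64, 128),
         "BK": choice(32, 64, 128),
         "BN": choice(32, 64, 128),
         "TM": choice(4, 8),
         "TK": choice(4, 8),
         "TN": choice(4, 8)},
    )
}

# Sparse softmax: a single implementation A with three tunable parameters.
sparse_softmax_space = {
    "impl": choice(
        {"name": "A",
         "BH": choice(16, 32, 64),
         "BW": choice(16, 32, 64),
         "RT": choice(1, 2, 4)},
    )
}
```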
As shown above, the sparse linear operator may have two implementations (`A` and `B`), where `B` has tunable parameters (`BM`, `BK`, `BN`, `TM`, `TK`, `TN`) and `A` has no tunable parameters (its `BM` and `BN` are constant values for succeeding usage). The sparse softmax operator has only one implementation (`A`) with tunable parameters (`BH`, `BW`, `RT`).

Reference (Share) random variables
Users can construct advanced functions with simple operators and then tune them together. For example, users could implement a sparse attention with two sparse linear operators (`linear_qk` and `linear_v`) and one softmax operator. Then the search space of the sparse attention is a combined search space of the three individual search spaces.
(Advanced) Exploiting the additive structure of latency feedback

Following the above sparse attention example, after sharing the common variables, the three sparse operators (`linear_qk`, `softmax`, and `linear_v`) are still largely independent: for example, the choice of `linear_qk.TM` is not relevant to the choice of `softmax.RT` or `linear_v.TN`. Besides reporting the total sparse attention latency, it is therefore possible to report the three operators' individual latencies and tune each sub-space independently.
Here is an illustration of such a task. Let $L$ be the total objective function (latency), and let $L_1$, $L_2$, and $L_3$ be the latencies of the three operators. Let $a$ be the shared parameters (e.g., the shared block size) of the three operators, while $b$, $c$, and $d$ are the non-shared tunable parameters of the three operators. The additive structure is then

$$L(a, b, c, d) = L_1(a, b) + L_2(a, c) + L_3(a, d)$$
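A minimal sketch of how a trial could exploit this additive structure; the `run_trial` and `benchmark` names are hypothetical, not an existing tuner API:

```python
# Hypothetical sketch: measure each operator separately and report the
# per-operator latencies alongside the total, so a tuner that understands
# the additive structure L(a,b,c,d) = L1(a,b) + L2(a,c) + L3(a,d) can
# update the sub-spaces for b, c, and d independently.
def run_trial(a, b, c, d, benchmark):
    # `benchmark` maps an operator name and its parameters to a measured
    # latency; its implementation is out of scope here.
    l1 = benchmark("linear_qk", {**a, **b})
    l2 = benchmark("softmax",   {**a, **c})
    l3 = benchmark("linear_v",  {**a, **d})
    return {
        "total": l1 + l2 + l3,   # overall sparse attention latency
        "linear_qk": l1,         # feedback for the b sub-space
        "softmax": l2,           # feedback for the c sub-space
        "linear_v": l3,          # feedback for the d sub-space
    }
```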