Replies: 1 comment
If the sparse tensor is compressed, tuning with the backward pass also requires a combined search space: a sparse linear operator, for example, contains both forward and backward kernels, and the latency fed back to the tuner should be defined over all of them (a sketch follows below).
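A minimal sketch of such a combined space, assuming (this is not from the original comment) that the sparse linear operator is made of one forward kernel and two backward kernels (grad-input and grad-weight), each with its own tile-size parameters, and that the objective latency is the sum of the three kernels' latencies:

```python
import itertools
import random

# Hypothetical per-kernel search spaces; the BM/BK/BN names follow the
# convention used elsewhere in this discussion.
SPACE = {
    "forward":          {"BM": [32, 64], "BK": [32, 64], "BN": [32, 64]},
    "backward_dinput":  {"BM": [32, 64], "BN": [32, 64]},
    "backward_dweight": {"BK": [32, 64], "BN": [32, 64]},
}

def configs(space):
    """Enumerate the combined (cross-product) search space of all kernels."""
    flat = [(kernel, name, values)
            for kernel, params in space.items()
            for name, values in params.items()]
    for combo in itertools.product(*[values for _, _, values in flat]):
        cfg = {}
        for (kernel, name, _), value in zip(flat, combo):
            cfg.setdefault(kernel, {})[name] = value
        yield cfg

def measure_latency(kernel, params):
    """Placeholder benchmark; a real tuner would time the compiled kernel."""
    return random.random()

def total_latency(cfg):
    # Objective = forward latency + backward latencies.
    return sum(measure_latency(kernel, params) for kernel, params in cfg.items())

best = min(configs(SPACE), key=total_latency)
print(best)
```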
Motivation
I'd like to list some requirements for tuners that are necessary for finding the best config during code specialization.
Nested (Conditional) search space support
It's common that there are multiple possible implementations for a specific function (kernel) with different tunable parameters.
Note: below is conceptual, not real code
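One way such a conceptual, nested space could be written down is sketched below. The `choice` helper is an illustrative placeholder, not any real tuner API; the implementation names and parameters are the ones described in the next paragraph.

```python
def choice(*options):
    """Illustrative placeholder for a tuner's categorical sampling primitive."""
    return {"_type": "choice", "_options": list(options)}

# Sparse linear: two implementations. A has no tunable parameters (its BM and
# BN are constants consumed by later stages); B exposes six tile sizes that
# only exist when implementation B is selected (the nested/conditional part).
sparse_linear_space = {
    "impl": choice(
        {"name": "A"},  # no tunable parameters
        {"name": "B",
         "BM": choice(32, 64, 128),
         "BK": choice(32, 64, 128),
         "BN": choice(32, 64, 128),
         "TM": choice(4, 8),
         "TK": choice(4, 8),
         "TN": choice(4, 8)},
    )
}

# Sparse softmax: a single implementation A with three tunable parameters.
sparse_softmax_space = {
    "impl": choice(
        {"name": "A",
         "BH": choice(16, 32, 64),
         "BW": choice(16, 32, 64),
         "RT": choice(1, 2, 4)},
    )
}
```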
As shown above, the sparse linear operator may have two implementations (`A` and `B`), where `B` has tunable parameters (`BM`, `BK`, `BN`, `TM`, `TK`, `TN`) and `A` has no tunable parameters (its `BM` and `BN` are constant values for succeeding usage). The sparse softmax operator has only one implementation (`A`) with tunable parameters (`BH`, `BW`, `RT`).

Reference (Share) random variables
Users can construct advanced functions with simple operators and then tune them together. For example, users could implement a sparse attention with two sparse linear operators (`linear_qk` and `linear_v`) and one softmax operator. Then the search space of the sparse attention is a combined search space of the three individual search spaces.
(Advanced) Exploiting the additive structure of latency feedback

Following the above sparse attention example, after sharing the common variables, the three sparse operators (`linear_qk`, `softmax`, and `linear_v`) are still largely independent: for example, the choice of `linear_qk.TM` is not relevant to the choice of `softmax.RT` or `linear_v.TN`. Besides reporting the total sparse attention latency, it is therefore possible to report the three operators' individual latencies and tune each sub-space independently.
Here is an illustration of such a task. Let $L$ be the total objective function (latency), and let $L_1$, $L_2$, and $L_3$ be the latencies of the three operators. Let $a$ be the shared parameters (e.g., the shared block size) of the three operators, while $b$, $c$, and $d$ are the non-shared tunable parameters of the three operators. The additive structure is then

$$L(a, b, c, d) = L_1(a, b) + L_2(a, c) + L_3(a, d)$$
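A minimal sketch of how a trial could exploit this additive structure; the `run_trial` and `benchmark` names are hypothetical, not an existing tuner API:

```python
# Hypothetical sketch: measure each operator separately and report the
# per-operator latencies alongside the total, so a tuner that understands
# the additive structure L(a,b,c,d) = L1(a,b) + L2(a,c) + L3(a,d) can
# update the sub-spaces for b, c, and d independently.
def run_trial(a, b, c, d, benchmark):
    # `benchmark` maps an operator name and its parameters to a measured
    # latency; its implementation is out of scope here.
    l1 = benchmark("linear_qk", {**a, **b})
    l2 = benchmark("softmax",   {**a, **c})
    l3 = benchmark("linear_v",  {**a, **d})
    return {
        "total": l1 + l2 + l3,   # overall sparse attention latency
        "linear_qk": l1,         # feedback for the b sub-space
        "softmax": l2,           # feedback for the c sub-space
        "linear_v": l3,          # feedback for the d sub-space
    }
```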