Description
Currently, the following two single-layer MLPs perform worse than GC v1.
dtype | batch size | hidden list | GC v1 | 8c55a05 (remove brgemm read lock) |
---|---|---|---|---|
bf16 | 128 | 1024x1024 | 0.0286 | 0.0828 |
bf16 | 128 | 1024x512 | 0.0204 | 0.0670 |
We performed a detailed breakdown as follows:
128x1024x1024 | GC v1 | 8c55a05 |
---|---|---|
matmul only | 0.01766 | 0.01989 |
tiled pack (or reorder) | 0.02634 | 0.04632 |
total | 0.04418 | 0.077969 |
and
128x1024x512 | GC v1 | 8c55a05 |
---|---|---|
matmul only | 0.01587 | 0.01591 |
tiled pack (or reorder) | 0.01278 | 0.0398 |
total | 0.02881 | 0.06917 |
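To make the gap easier to read, the per-stage slowdown can be computed directly from the two breakdown tables. This is only a small sketch: the dictionary layout and the `slowdown` helper are illustrative names, not part of GC, and the timings are copied as reported above.

```python
# Timings copied from the two breakdown tables: (GC v1, 8c55a05).
# Units are whatever the issue's measurements use; only ratios matter here.
breakdown = {
    "128x1024x1024": {
        "matmul only": (0.01766, 0.01989),
        "tiled pack (or reorder)": (0.02634, 0.04632),
    },
    "128x1024x512": {
        "matmul only": (0.01587, 0.01591),
        "tiled pack (or reorder)": (0.01278, 0.0398),
    },
}


def slowdown(gc_v1_time, new_time):
    """Return how many times slower the new build is than GC v1."""
    return new_time / gc_v1_time


# Hypothetical helper structure: shape -> stage -> slowdown factor.
ratios = {
    shape: {stage: slowdown(*times) for stage, times in stages.items()}
    for shape, stages in breakdown.items()
}

for shape, stages in ratios.items():
    for stage, ratio in stages.items():
        print(f"{shape:>14} | {stage:<24} | {ratio:.2f}x")
```

On these numbers the matmul kernel is within about 13% of GC v1 for both shapes, while the pack step is roughly 1.8x and 3.1x slower, so the regression is concentrated in the pack.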
Is there any further optimization opportunity for the VNNI pack?
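For reference, the reorder in question can be pictured as follows. This is a minimal sketch, assuming the usual bf16 VNNI layout in which pairs of consecutive K elements are interleaved, so a `K x N` weight becomes `K/2 x N x 2`. The function name `vnni_pack_bf16` is hypothetical, the outer tiling that GC applies is omitted, and plain Python numbers stand in for bf16 values.

```python
def vnni_pack_bf16(weight):
    """Reorder a K x N matrix (list of rows) into a K/2 x N x 2 layout.

    Assumption: bf16 VNNI block size of 2 along K. Outer blocking/tiling,
    as used by a real tiled pack, is intentionally left out.
    """
    k = len(weight)
    n = len(weight[0])
    if k % 2 != 0:
        raise ValueError("K must be even for the 2-element bf16 VNNI block")
    # For each pair of rows (2*kb, 2*kb + 1), interleave them column by column.
    return [
        [[weight[2 * kb][j], weight[2 * kb + 1][j]] for j in range(n)]
        for kb in range(k // 2)
    ]


# Tiny example: K = 4, N = 3, element value encodes (row, col) as 10*row + col.
weight = [[10 * row + col for col in range(3)] for row in range(4)]
packed = vnni_pack_bf16(weight)
print(packed)
```

In this picture the pack is a pure data movement over the whole weight, so its cost scales with `K * N` regardless of batch size, which may help explain why it dominates for a batch of 128.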