bf16 matmul's corresponding tensor.pack not properly optimized #320

Description

@yifeizh2

Currently, the following two single-layer MLP configurations show worse performance than GC v1.

| dtype | batch size | hidden sizes | GC v1 | 8c55a05 (remove brgemm read lock) |
| --- | --- | --- | --- | --- |
| bf16 | 128 | 1024x1024 | 0.0286 | 0.0828 |
| bf16 | 128 | 1024x512 | 0.0204 | 0.0670 |

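For reference, the workload behind these numbers is a single-layer MLP whose core is a bf16 matmul. Below is a minimal sketch of the 128x1024x1024 case, assuming f32 accumulation and omitting bias/activation; the function name and exact types are illustrative, not the IR the pipeline actually produces.

```mlir
// Illustrative sketch of the 128x1024x1024 case: bf16 inputs, f32 accumulator.
// Bias/activation omitted; not the exact IR generated by the pipeline.
func.func @mlp_128x1024x1024(%A: tensor<128x1024xbf16>,
                             %B: tensor<1024x1024xbf16>) -> tensor<128x1024xf32> {
  %c0 = arith.constant 0.0 : f32
  %empty = tensor.empty() : tensor<128x1024xf32>
  %init = linalg.fill ins(%c0 : f32) outs(%empty : tensor<128x1024xf32>) -> tensor<128x1024xf32>
  %C = linalg.matmul ins(%A, %B : tensor<128x1024xbf16>, tensor<1024x1024xbf16>)
                     outs(%init : tensor<128x1024xf32>) -> tensor<128x1024xf32>
  return %C : tensor<128x1024xf32>
}
```
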
We performed a detailed breakdown as follows:

| 128x1024x1024 | GC v1 | 8c55a05 |
| --- | --- | --- |
| matmul only | 0.01766 | 0.01989 |
| tiled pack (or reorder) | 0.02634 | 0.04632 |
| total | 0.04418 | 0.077969 |

and

| 128x1024x512 | GC v1 | 8c55a05 |
| --- | --- | --- |
| matmul only | 0.01587 | 0.01591 |
| tiled pack (or reorder) | 0.01278 | 0.0398 |
| total | 0.02881 | 0.06917 |

Are there any further optimization opportunities for the VNNI pack?
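
For context, the "tiled pack (or reorder)" measured above is the tensor.pack that reorders the bf16 weight into a blocked VNNI layout before the brgemm. A minimal sketch of one possible form of that pack for the 1024x1024 weight is shown below, assuming the weight is stored K x N, a VNNI factor of 2, an N tile of 32, and no outer_dims_perm; the tile sizes and dimension order the compiler actually picks may differ.

```mlir
// Illustrative VNNI-style pack of a 1024x1024 bf16 weight (K x N):
// K (dim 0) is tiled by 2 -- the bf16 VNNI factor -- and N (dim 1) by 32,
// so the innermost dimension holds the 2 K-adjacent elements one VNNI FMA consumes.
// Actual tile sizes / outer_dims_perm used by the pipeline may differ.
func.func @vnni_pack_weight(%weight: tensor<1024x1024xbf16>) -> tensor<512x32x32x2xbf16> {
  %dest = tensor.empty() : tensor<512x32x32x2xbf16>
  %packed = tensor.pack %weight
              inner_dims_pos = [1, 0] inner_tiles = [32, 2]
              into %dest : tensor<1024x1024xbf16> -> tensor<512x32x32x2xbf16>
  return %packed : tensor<512x32x32x2xbf16>
}
```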
