Binary add became slower than in PyTorch 2.6 #1423

@jianyizh

Run the script below with: unitrace -d --conditional-collection python test_add.py
On a Max 1550, current PyTorch main takes 2342624 ns on average, while PyTorch 2.6 takes 951728 ns.
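
To put the two timings above in perspective, here is a rough back-of-the-envelope sketch in plain Python. The slowdown ratio comes directly from the reported numbers. The bandwidth estimate is only an assumption: it supposes the kernel reads the bfloat16 tensor once, reads the small broadcast tensor once, and writes a float32 result (per PyTorch type promotion), and that the reported time covers that single kernel. The variable and function names are made up for illustration.

```python
# Timings reported in this issue, in nanoseconds.
MAIN_NS = 2_342_624  # current PyTorch main
V26_NS = 951_728     # PyTorch 2.6

# How many times slower main is than 2.6.
slowdown = MAIN_NS / V26_NS

# ASSUMPTION: minimal memory traffic for one broadcast add.
numel = 16 * 12 * 512 * 512          # elements in the large tensor
bytes_read_bf16 = numel * 2          # bfloat16 input, 2 bytes per element
bytes_read_fp32 = 16 * 1 * 1 * 512 * 4  # small float32 input, read once
bytes_written = numel * 4            # ASSUMPTION: output is float32
total_bytes = bytes_read_bf16 + bytes_read_fp32 + bytes_written


def effective_gb_per_s(n_bytes: int, time_ns: int) -> float:
    """Bytes per nanosecond is numerically equal to GB/s (1e9 bytes/s)."""
    return n_bytes / time_ns


print(f"slowdown: {slowdown:.2f}x")
print(f"main: {effective_gb_per_s(total_bytes, MAIN_NS):.0f} GB/s (estimated)")
print(f"2.6:  {effective_gb_per_s(total_bytes, V26_NS):.0f} GB/s (estimated)")
```

Under these assumptions the regression is roughly 2.46x, and both estimates are only as good as the traffic model above.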

I noticed that the Register File Size Per Thread increased from 128 to 256.

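import os

import torch

# bfloat16 operand and a float32 operand that broadcasts against it
input_tensor_16 = torch.randn(16, 12, 512, 512, dtype=torch.bfloat16, device="xpu")
input_tensor_32 = torch.randn(16, 1, 1, 512, device="xpu")

# Warm-up iterations, run before collection is enabled
for _ in range(10):
    _ = input_tensor_16 + input_tensor_32

torch.xpu.synchronize()

# Start collection for unitrace's --conditional-collection mode
os.environ["PTI_ENABLE_COLLECTION"] = "1"
for _ in range(10):
    _ = input_tensor_16 + input_tensor_32
torch.xpu.synchronize()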
