Description
Run this with unitrace -d --conditional-collection python test_add.py.
On a Max 1550, current PyTorch main gets 2342624 ns on average for this add, while PyTorch 2.6 gets 951728 ns, roughly a 2.5x regression.
I also noticed that Register File Size Per Thread increased from 128 to 256.
import torch
import os

# Broadcasted elementwise add: bfloat16 4D tensor + float32 tensor broadcast over dims 1 and 2.
input_tensor_16 = torch.randn(16, 12, 512, 512, dtype=torch.bfloat16, device="xpu")
input_tensor_32 = torch.randn(16, 1, 1, 512, device="xpu")

# Warm-up iterations, not collected by unitrace.
for _ in range(10):
    _ = input_tensor_16 + input_tensor_32
torch.xpu.synchronize()

# Enable PTI conditional collection for the measured iterations.
os.environ['PTI_ENABLE_COLLECTION'] = "1"
for _ in range(10):
    _ = input_tensor_16 + input_tensor_32
torch.xpu.synchronize()
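
For a rough cross-check without unitrace, the same add can also be timed from the host with wall-clock timing around a synchronized loop. This is a minimal sketch, assuming the same shapes as above and an arbitrary iteration count of 100; it only approximates the kernel durations that unitrace reports, since it includes launch overhead.

import time
import torch

input_tensor_16 = torch.randn(16, 12, 512, 512, dtype=torch.bfloat16, device="xpu")
input_tensor_32 = torch.randn(16, 1, 1, 512, device="xpu")

# Warm up so one-time setup is excluded from the measurement.
for _ in range(10):
    _ = input_tensor_16 + input_tensor_32
torch.xpu.synchronize()

iters = 100
start = time.perf_counter()
for _ in range(iters):
    _ = input_tensor_16 + input_tensor_32
torch.xpu.synchronize()
elapsed = time.perf_counter() - start

# Average host-side time per add, in nanoseconds.
print(f"avg per add: {elapsed / iters * 1e9:.0f} ns")

Because the kernels are launched back-to-back before a single synchronize, the per-iteration average should mostly reflect device time once launch overhead is amortized.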