Closed
❓ Question
I am comparing the performance of a TorchScript model and a Torch-TensorRT compiled model. When I run in float32 mode, the average batch time is higher for the Torch-TensorRT model. Is this expected?
1. I am running the code below to compare the TorchScript and Torch-TensorRT compiled models:
import time

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch_tensorrt


class LeNetFeatExtractor(nn.Module):
    def __init__(self):
        super(LeNetFeatExtractor, self).__init__()
        self.conv1 = nn.Conv2d(1, 128, 3)
        self.conv2 = nn.Conv2d(128, 16, 3)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        return x


class LeNetClassifier(nn.Module):
    def __init__(self):
        super(LeNetClassifier, self).__init__()
        self.fc1 = nn.Linear(16 * 6 * 6, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x


class LeNet(nn.Module):
    def __init__(self):
        super(LeNet, self).__init__()
        self.feat = LeNetFeatExtractor()
        self.classifer = LeNetClassifier()

    def forward(self, x):
        x = self.feat(x)
        x = self.classifer(x)
        return x


def benchmark(model, input_shape=(1024, 1, 32, 32), dtype='fp32', nwarmup=50, nruns=100):
    input_data = torch.randn(input_shape)
    input_data = input_data.to("cuda")
    if dtype == 'fp16':
        input_data = input_data.half()

    print("Warm up ...")
    with torch.no_grad():
        for _ in range(nwarmup):
            features = model(input_data)
    # Wait for all warm-up kernels to finish before timing starts
    torch.cuda.synchronize()

    print("Start timing ...")
    timings = []
    with torch.no_grad():
        for i in range(1, nruns + 1):
            start_time = time.time()
            features = model(input_data)
            # CUDA launches are asynchronous, so synchronize before stopping the clock
            torch.cuda.synchronize()
            end_time = time.time()
            timings.append(end_time - start_time)
            if i % 100 == 0:
                print('Iteration %d/%d, ave batch time %.2f ms' % (i, nruns, np.mean(timings) * 1000))

    print("Input shape:", input_data.size())
    print("Output features size:", features.size())
    print('Average batch time: %.2f ms' % (np.mean(timings) * 1000))


# 1) Eager PyTorch model
model = LeNet()
model.to("cuda").eval()
benchmark(model, dtype="fp32")

# 2) TorchScript model obtained by tracing
inpt = torch.empty([1, 1, 32, 32]).to("cuda")
traced_model = torch.jit.trace(model, inpt)
benchmark(traced_model, dtype="fp32")

# 3) TorchScript model obtained by scripting
script_model = torch.jit.script(model)
benchmark(script_model, dtype="fp32")

# 4) Torch-TensorRT model compiled from the traced module with a dynamic-shape profile
compile_settings = {
    "inputs": [torch_tensorrt.Input(
        min_shape=[1024, 1, 32, 32],
        opt_shape=[1024, 1, 33, 33],
        max_shape=[1024, 1, 34, 34],
        dtype=torch.float
    )],
    "enabled_precisions": {torch.float}  # Run with FP32 only
}
trt_ts_module = torch_tensorrt.compile(traced_model, **compile_settings)
benchmark(trt_ts_module, input_shape=(1024, 1, 32, 32), dtype="fp32")
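A single mean over 100 runs can hide outliers, so a comparison like this one may be easier to read with a median and a tail percentile next to it. The following is only a minimal sketch under stated assumptions: the helper names time_callable and summarize are hypothetical (they are not part of PyTorch or Torch-TensorRT), and the sync argument is a hook where torch.cuda.synchronize would be passed on a GPU.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_callable(fn: Callable[[], object],
                  sync: Callable[[], None] = lambda: None,
                  nwarmup: int = 5,
                  nruns: int = 20) -> List[float]:
    """Return per-call wall-clock latencies in milliseconds (hypothetical helper)."""
    for _ in range(nwarmup):
        fn()
    sync()  # drain any queued asynchronous work before timing starts
    timings_ms = []
    for _ in range(nruns):
        start = time.perf_counter()
        fn()
        sync()  # wait for this call's work to finish before stopping the clock
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return timings_ms


def summarize(timings_ms: List[float]) -> Dict[str, float]:
    """Summarize latencies; p95 uses a simple nearest-rank index."""
    ordered = sorted(timings_ms)
    p95_index = min(len(ordered) - 1, int(round(0.95 * (len(ordered) - 1))))
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[p95_index],
        "min": ordered[0],
        "max": ordered[-1],
    }
```

With the models above, this could be called as summarize(time_callable(lambda: trt_ts_module(input_data), sync=torch.cuda.synchronize, nwarmup=50, nruns=100)) inside a torch.no_grad() block, where input_data is a CUDA tensor of the benchmarked shape.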
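- My performance comparison results are below. The four runs appear in the order of the benchmark calls above: eager model, traced model, scripted model, and Torch-TensorRT compiled model.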
Warm up ...
Start timing ...
Iteration 100/100, ave batch time 39.72 ms
Input shape: torch.Size([1024, 1, 32, 32])
Output features size: torch.Size([1024, 10])
Average batch time: 39.72 ms
Warm up ...
Start timing ...
Iteration 100/100, ave batch time 39.74 ms
Input shape: torch.Size([1024, 1, 32, 32])
Output features size: torch.Size([1024, 10])
Average batch time: 39.74 ms
Warm up ...
Start timing ...
Iteration 100/100, ave batch time 39.77 ms
Input shape: torch.Size([1024, 1, 32, 32])
Output features size: torch.Size([1024, 10])
Average batch time: 39.77 ms
WARNING: [Torch-TensorRT] - Dilation not used in Max pooling converter
WARNING: [Torch-TensorRT] - Dilation not used in Max pooling converter
WARNING: [Torch-TensorRT TorchScript Conversion Context] - TensorRT was linked against cuBLAS/cuBLAS LT 11.5.1 but loaded cuBLAS/cuBLAS LT 10.2.2
WARNING: [Torch-TensorRT TorchScript Conversion Context] - Detected invalid timing cache, setup a local cache instead
WARNING: [Torch-TensorRT TorchScript Conversion Context] - Max value of this profile is not valid
WARNING: [Torch-TensorRT TorchScript Conversion Context] - TensorRT was linked against cuBLAS/cuBLAS LT 11.5.1 but loaded cuBLAS/cuBLAS LT 10.2.2
WARNING: [Torch-TensorRT] - TensorRT was linked against cuBLAS/cuBLAS LT 11.5.1 but loaded cuBLAS/cuBLAS LT 10.2.2
WARNING: [Torch-TensorRT] - TensorRT was linked against cuBLAS/cuBLAS LT 11.5.1 but loaded cuBLAS/cuBLAS LT 10.2.2
Warm up ...
Start timing ...
Iteration 100/100, ave batch time 57.29 ms
Input shape: torch.Size([1024, 1, 32, 32])
Output features size: torch.Size([1024, 10])
Average batch time: 57.29 ms
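One thing that may be worth checking, offered as an assumption rather than a confirmed diagnosis, is whether every shape in the dynamic profile fits the network. fc1 expects 16 * 6 * 6 flattened features, and the "Max value of this profile is not valid" warning could be related to the 34x34 max_shape. The sketch below only redoes the shape arithmetic for the unpadded 3x3 convolutions and 2x2 max pools in LeNetFeatExtractor; the helper names are hypothetical.

```python
def conv_out(size: int, kernel: int = 3) -> int:
    """Output size of an unpadded, stride-1 convolution."""
    return size - kernel + 1


def pool_out(size: int, kernel: int = 2) -> int:
    """Output size of max pooling with stride equal to the kernel (floor mode)."""
    return size // kernel


def flattened_features(side: int, channels: int = 16) -> int:
    """Number of features reaching fc1 for a square input of the given side."""
    side = pool_out(conv_out(side))  # conv1 followed by 2x2 max pool
    side = pool_out(conv_out(side))  # conv2 followed by 2x2 max pool
    return channels * side * side


expected = 16 * 6 * 6  # in_features of fc1
for side in (32, 33, 34):
    n = flattened_features(side)
    print(side, n, n == expected)
```

Under this arithmetic, 32x32 and 33x33 inputs both yield 576 features, while 34x34 yields 784, which does not match fc1. If that is the cause of the warning, a static shape of [1024, 1, 32, 32] in the compile settings would be a simpler configuration to benchmark.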
Environment
Build information about Torch-TensorRT can be found by turning on debug messages
- PyTorch Version (e.g., 1.0): 1.10
- CPU Architecture: x86_64
- OS (e.g., Linux): Ubuntu 18.04
- How you installed PyTorch (conda, pip, libtorch, source): pip
- Build command you used (if compiling from source): python3 setup.py install
- Are you using local sources or building from archives: local
- Python version: python 3.6.8
- CUDA version: 11.3
- Any other relevant information:
Additional context
Are the above results expected? Does a Torch-TensorRT compiled model perform better only in fp16 mode?