❓ [Question] More average batch time for torch-tensorrt compiled model than torchscript model (fp32 mode).

## ❓ Question
I am comparing the performance of a TorchScript model and a Torch-TensorRT compiled model. When running in float32 mode, the average batch time is higher for the Torch-TensorRT model. Is this expected?

1. I am running the code below to compare the TorchScript and Torch-TensorRT compiled models:

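```
import time

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch_tensorrt

class LeNetFeatExtractor(nn.Module):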
    def __init__(self):
        super(LeNetFeatExtractor, self).__init__()
        self.conv1 = nn.Conv2d(1, 128, 3)
        self.conv2 = nn.Conv2d(128, 16, 3)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        return x

class LeNetClassifier(nn.Module):
    def __init__(self):
        super(LeNetClassifier, self).__init__()
        self.fc1 = nn.Linear(16 * 6 * 6, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = torch.flatten(x,1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

class LeNet(nn.Module):
    def __init__(self):
        super(LeNet, self).__init__()
        self.feat = LeNetFeatExtractor()
        self.classifer = LeNetClassifier()

    def forward(self, x):
        x = self.feat(x)
        x = self.classifer(x)
        return x

def benchmark(model, input_shape=(1024, 1, 32, 32), dtype='fp32', nwarmup=50, nruns=100):
    input_data = torch.randn(input_shape)
    input_data = input_data.to("cuda")
    if dtype=='fp16':
        input_data = input_data.half()
        
    print("Warm up ...")
    with torch.no_grad():
        for _ in range(nwarmup):
            features = model(input_data)
    torch.cuda.synchronize()
    print("Start timing ...")
    timings = []
    with torch.no_grad():
        for i in range(1, nruns+1):
            start_time = time.time()
            features = model(input_data)
            torch.cuda.synchronize()
            end_time = time.time()
            timings.append(end_time - start_time)
            if i%100==0:
                print('Iteration %d/%d, ave batch time %.2f ms'%(i, nruns, np.mean(timings)*1000))

    print("Input shape:", input_data.size())
    print("Output features size:", features.size())
    
    print('Average batch time: %.2f ms'%(np.mean(timings)*1000))
    
model = LeNet()
model.to("cuda").eval()
benchmark(model, dtype="fp32")
inpt = torch.empty([1,1,32,32]).to("cuda")
traced_model = torch.jit.trace(model, inpt)
benchmark(traced_model, dtype="fp32")
script_model = torch.jit.script(model)
benchmark(script_model, dtype="fp32")

compile_settings = {
    "inputs": [torch_tensorrt.Input(
            min_shape=[1024, 1, 32, 32],
            opt_shape=[1024, 1, 33, 33],
            max_shape=[1024, 1, 34, 34],
            dtype=torch.float
        )],
    "enabled_precisions": {torch.float}  # Run with FP32
}

trt_ts_module = torch_tensorrt.compile(traced_model, **compile_settings)
benchmark(trt_ts_module, input_shape=(1024, 1, 32, 32), dtype="fp32")
```
2. My performance comparison results are below:
```
Warm up ...
Start timing ...
Iteration 100/100, ave batch time 39.72 ms
Input shape: torch.Size([1024, 1, 32, 32])
Output features size: torch.Size([1024, 10])
Average batch time: 39.72 ms
Warm up ...
Start timing ...
Iteration 100/100, ave batch time 39.74 ms
Input shape: torch.Size([1024, 1, 32, 32])
Output features size: torch.Size([1024, 10])
Average batch time: 39.74 ms
Warm up ...
Start timing ...
Iteration 100/100, ave batch time 39.77 ms
Input shape: torch.Size([1024, 1, 32, 32])
Output features size: torch.Size([1024, 10])
Average batch time: 39.77 ms
WARNING: [Torch-TensorRT] - Dilation not used in Max pooling converter
WARNING: [Torch-TensorRT] - Dilation not used in Max pooling converter
WARNING: [Torch-TensorRT TorchScript Conversion Context] - TensorRT was linked against cuBLAS/cuBLAS LT 11.5.1 but loaded cuBLAS/cuBLAS LT 10.2.2
WARNING: [Torch-TensorRT TorchScript Conversion Context] - Detected invalid timing cache, setup a local cache instead
WARNING: [Torch-TensorRT TorchScript Conversion Context] - Max value of this profile is not valid
WARNING: [Torch-TensorRT TorchScript Conversion Context] - TensorRT was linked against cuBLAS/cuBLAS LT 11.5.1 but loaded cuBLAS/cuBLAS LT 10.2.2
WARNING: [Torch-TensorRT] - TensorRT was linked against cuBLAS/cuBLAS LT 11.5.1 but loaded cuBLAS/cuBLAS LT 10.2.2
WARNING: [Torch-TensorRT] - TensorRT was linked against cuBLAS/cuBLAS LT 11.5.1 but loaded cuBLAS/cuBLAS LT 10.2.2
Warm up ...
Start timing ...
Iteration 100/100, ave batch time 57.29 ms
Input shape: torch.Size([1024, 1, 32, 32])
Output features size: torch.Size([1024, 10])
Average batch time: 57.29 ms

```
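As a side check on the shape range used above, here is a minimal sketch in plain Python (no GPU needed) that walks the spatial size through the two conv/pool stages of `LeNetFeatExtractor` for the `min_shape`, `opt_shape`, and `max_shape` side lengths. The helper names (`conv2d_out`, `max_pool2d_out`, `flattened_features`) are hypothetical and introduced only for this illustration. It assumes the PyTorch defaults used in the code: stride 1 and no padding for `nn.Conv2d`, and stride equal to kernel size with floor rounding for `F.max_pool2d`. Whether this is related to the "Max value of this profile is not valid" warning or to the slower batch time is an assumption, not something the logs confirm.

```python
def conv2d_out(size, kernel, stride=1, padding=0):
    """Output side length of a square Conv2d (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def max_pool2d_out(size, kernel):
    """Output side length of max_pool2d with stride == kernel, no padding, floor mode."""
    return (size - kernel) // kernel + 1


def flattened_features(size, channels=16):
    """Features reaching fc1 after (conv 3x3 -> pool 2x2) applied twice, then flatten."""
    size = max_pool2d_out(conv2d_out(size, 3), 2)
    size = max_pool2d_out(conv2d_out(size, 3), 2)
    return channels * size * size


FC1_IN_FEATURES = 16 * 6 * 6  # from nn.Linear(16 * 6 * 6, 120) in LeNetClassifier

# Side lengths taken from min_shape, opt_shape, and max_shape in compile_settings.
for side in (32, 33, 34):
    n = flattened_features(side)
    print(f"input {side}x{side}: {n} features, matches fc1: {n == FC1_IN_FEATURES}")
```

Under these assumptions, the 32x32 and 33x33 inputs both flatten to the 576 features `fc1` expects, while 34x34 flattens to 784. It may therefore be worth repeating the comparison with a single static input shape of `[1024, 1, 32, 32]`, which is the shape the benchmark actually feeds.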

## Environment

> Build information about Torch-TensorRT can be found by turning on debug messages

 - PyTorch Version (e.g., 1.0): 1.10
 - CPU Architecture: x86_64
 - OS (e.g., Linux): Ubuntu 18.04
 - How you installed PyTorch (`conda`, `pip`, `libtorch`, source): pip 
 - Build command you used (if compiling from source): python3 setup.py install
 - Are you using local sources or building from archives: local
 - Python version: python 3.6.8
 - CUDA version: 11.3
 - Any other relevant information:

## Additional context
Are the above results expected? Does a Torch-TensorRT compiled model perform better only in fp16 mode?

