❓ [Question] More average batch time for torch-tensorrt compiled model than torchscript model (fp32 mode). #732

@harishkool

❓ Question

I am comparing the performance of a TorchScript model and a Torch-TensorRT compiled model. When running in float32 mode, the average batch time is higher for the Torch-TensorRT model. Is this expected?

1. I am running the code below to compare the TorchScript and Torch-TensorRT compiled models:

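import time

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch_tensorrt

class LeNetFeatExtractor(nn.Module):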
    def __init__(self):
        super(LeNetFeatExtractor, self).__init__()
        self.conv1 = nn.Conv2d(1, 128, 3)
        self.conv2 = nn.Conv2d(128, 16, 3)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        return x

class LeNetClassifier(nn.Module):
    def __init__(self):
        super(LeNetClassifier, self).__init__()
        self.fc1 = nn.Linear(16 * 6 * 6, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = torch.flatten(x,1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

class LeNet(nn.Module):
    def __init__(self):
        super(LeNet, self).__init__()
        self.feat = LeNetFeatExtractor()
        self.classifer = LeNetClassifier()

    def forward(self, x):
        x = self.feat(x)
        x = self.classifer(x)
        return x

def benchmark(model, input_shape=(1024, 1, 32, 32), dtype='fp32', nwarmup=50, nruns=100):
    input_data = torch.randn(input_shape)
    input_data = input_data.to("cuda")
    if dtype=='fp16':
        input_data = input_data.half()
        
    print("Warm up ...")
    with torch.no_grad():
        for _ in range(nwarmup):
            features = model(input_data)
    torch.cuda.synchronize()
    print("Start timing ...")
    timings = []
    with torch.no_grad():
        for i in range(1, nruns+1):
            start_time = time.time()
            features = model(input_data)
            torch.cuda.synchronize()
            end_time = time.time()
            timings.append(end_time - start_time)
            if i%100==0:
                print('Iteration %d/%d, ave batch time %.2f ms'%(i, nruns, np.mean(timings)*1000))

    print("Input shape:", input_data.size())
    print("Output features size:", features.size())
    
    print('Average batch time: %.2f ms'%(np.mean(timings)*1000))
    
model = LeNet()
model.to("cuda").eval()
benchmark(model, dtype="fp32")
inpt = torch.empty([1,1,32,32]).to("cuda")
traced_model = torch.jit.trace(model, inpt)
benchmark(traced_model, dtype="fp32")
script_model = torch.jit.script(model)
benchmark(script_model, dtype="fp32")

compile_settings = {
    "inputs": [torch_tensorrt.Input(
            min_shape=[1024, 1, 32, 32],
            opt_shape=[1024, 1, 33, 33],
            max_shape=[1024, 1, 34, 34],
            dtype=torch.float
        )],
    "enabled_precisions": {torch.float}  # Run with FP32
}

trt_ts_module = torch_tensorrt.compile(traced_model, **compile_settings)
benchmark(trt_ts_module, input_shape=(1024, 1, 32, 32), dtype="fp32")
2. My performance comparison results are below:
Warm up ...
Start timing ...
Iteration 100/100, ave batch time 39.72 ms
Input shape: torch.Size([1024, 1, 32, 32])
Output features size: torch.Size([1024, 10])
Average batch time: 39.72 ms
Warm up ...
Start timing ...
Iteration 100/100, ave batch time 39.74 ms
Input shape: torch.Size([1024, 1, 32, 32])
Output features size: torch.Size([1024, 10])
Average batch time: 39.74 ms
Warm up ...
Start timing ...
Iteration 100/100, ave batch time 39.77 ms
Input shape: torch.Size([1024, 1, 32, 32])
Output features size: torch.Size([1024, 10])
Average batch time: 39.77 ms
WARNING: [Torch-TensorRT] - Dilation not used in Max pooling converter
WARNING: [Torch-TensorRT] - Dilation not used in Max pooling converter
WARNING: [Torch-TensorRT TorchScript Conversion Context] - TensorRT was linked against cuBLAS/cuBLAS LT 11.5.1 but loaded cuBLAS/cuBLAS LT 10.2.2
WARNING: [Torch-TensorRT TorchScript Conversion Context] - Detected invalid timing cache, setup a local cache instead
WARNING: [Torch-TensorRT TorchScript Conversion Context] - Max value of this profile is not valid
WARNING: [Torch-TensorRT TorchScript Conversion Context] - TensorRT was linked against cuBLAS/cuBLAS LT 11.5.1 but loaded cuBLAS/cuBLAS LT 10.2.2
WARNING: [Torch-TensorRT] - TensorRT was linked against cuBLAS/cuBLAS LT 11.5.1 but loaded cuBLAS/cuBLAS LT 10.2.2
WARNING: [Torch-TensorRT] - TensorRT was linked against cuBLAS/cuBLAS LT 11.5.1 but loaded cuBLAS/cuBLAS LT 10.2.2
Warm up ...
Start timing ...
Iteration 100/100, ave batch time 57.29 ms
Input shape: torch.Size([1024, 1, 32, 32])
Output features size: torch.Size([1024, 10])
Average batch time: 57.29 ms
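
The compile settings above declare a dynamic range of spatial sizes (32 to 34), while `fc1` expects exactly 16 * 6 * 6 input features. As a minimal sketch, the arithmetic below works out the flattened feature count for each declared size, assuming the default stride 1 and no padding for `nn.Conv2d` and a 2x2 max pool with stride 2. The helper names `conv_out`, `pool_out`, and `flattened_features` are hypothetical and not part of the original code. Whether this is connected to the "Max value of this profile is not valid" warning or to the slower timing is an assumption that would need checking.

```python
def conv_out(size, kernel=3, stride=1, padding=0):
    """Output side length of a square convolution (PyTorch Conv2d formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel=2):
    """Output side length of max pooling with stride == kernel and no padding."""
    return size // kernel


def flattened_features(side, out_channels=16):
    """Feature count reaching fc1 for a square input of the given side length."""
    side = pool_out(conv_out(side))  # conv1 + max pool
    side = pool_out(conv_out(side))  # conv2 + max pool
    return out_channels * side * side


FC1_IN_FEATURES = 16 * 6 * 6  # what nn.Linear(16 * 6 * 6, 120) expects

# Sizes taken from min_shape, opt_shape, and max_shape in compile_settings.
for side in (32, 33, 34):
    n = flattened_features(side)
    status = "matches fc1" if n == FC1_IN_FEATURES else "does not match fc1"
    print("input %dx%d -> %d features (%s)" % (side, side, n, status))
```

Under these assumptions, 32x32 and 33x33 inputs both reach `fc1` with 576 features, while a 34x34 input produces 784, so the declared maximum shape would not pass through the classifier as written.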

Environment

Build information about Torch-TensorRT can be found by turning on debug messages

  • PyTorch Version (e.g., 1.0): 1.10
  • CPU Architecture: x86_64
  • OS (e.g., Linux): Ubuntu 18.04
  • How you installed PyTorch (conda, pip, libtorch, source): pip
  • Build command you used (if compiling from source): python3 setup.py install
  • Are you using local sources or building from archives: local
  • Python version: python 3.6.8
  • CUDA version: 11.3
  • Any other relevant information:

Additional context

Are the above results expected? Does a Torch-TensorRT compiled model perform better only in fp16 mode?
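
The `benchmark` function above reports only the mean batch time, which a few slow iterations can inflate. As a rough sketch of one way to check for that, the hypothetical helper `summarize_timings` below also reports the median, minimum, and maximum of the collected `timings` list. The example values are made up for illustration and are not measurements from this issue.

```python
import statistics


def summarize_timings(timings_s):
    """Summarize per-batch timings (given in seconds) in milliseconds."""
    if not timings_s:
        raise ValueError("timings_s must contain at least one measurement")
    ms = [t * 1000.0 for t in timings_s]
    return {
        "mean_ms": statistics.mean(ms),
        "median_ms": statistics.median(ms),
        "min_ms": min(ms),
        "max_ms": max(ms),
    }


# Made-up timings in seconds: nine fast batches and one slow outlier.
example_timings = [0.040] * 9 + [0.200]
stats = summarize_timings(example_timings)
print("mean %.2f ms, median %.2f ms, min %.2f ms, max %.2f ms"
      % (stats["mean_ms"], stats["median_ms"], stats["min_ms"], stats["max_ms"]))
```

If the mean and median are close, as they would be for a steady 57 ms per batch, outliers are unlikely to explain the gap between the two models.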
