
Thread control for the PyTorch backend to fix very slow inference on multi-core CPUs #6896

@yongbinfeng

Description


Is your feature request related to a problem? Please describe.

Currently the TensorFlow and ONNX backends in Triton support thread controls (here and here). We would like to have a similar feature for the PyTorch backend as well.

This is useful because in several cases we have seen PyTorch inference run (very) slowly on multi-core CPU machines. On machines with O(100) cores we have even seen a single inference take several minutes, despite the model being small. This might be due to a PyTorch-internal problem, but a temporary workaround is to set the intra-op parallelism to 1.

See examples here, here, and also previously in the Triton issues here. In our case we have found that setting the number of model instances is NOT enough to fix the problem; we need to set both the number of model instances and the intra-op parallelism to 1 (see the config sketch below). We have tested this with some examples and confirmed it fixes the slow CPU inference problem for PyTorch.
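For reference, a minimal model-config sketch of the setup we use. The `instance_group` field is standard Triton model config; the `parameters` entry is only illustrative of the kind of per-model thread knob this issue is requesting, since the PyTorch backend does not expose one yet:

```
# config.pbtxt (sketch): a single CPU model instance.
instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]

# Illustrative only -- not an existing PyTorch backend option; shows the
# style of thread control the TensorFlow/ONNX backends already provide.
parameters: {
  key: "INTRA_OP_THREAD_COUNT"
  value: { string_value: "1" }
}
```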

This can be done by calling at::set_num_threads(1) when loading the models; see https://pytorch.org/docs/stable/notes/cpu_threading_torchscript_inference.html#runtime-api
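As a rough sketch (a standalone LibTorch program rather than the actual backend change, with a placeholder model path and input shape), the calls would look like this; in the backend the same call would presumably be made once at model-load time, gated by a model-config parameter:

```cpp
// Sketch: cap PyTorch CPU threading before loading/running a TorchScript model.
// Model path and input shape below are placeholders.
#include <ATen/Parallel.h>
#include <torch/script.h>
#include <vector>

int main() {
  at::set_num_threads(1);          // intra-op thread pool size
  at::set_num_interop_threads(1);  // inter-op thread pool size (optional)

  torch::jit::script::Module module = torch::jit::load("model.pt");
  module.eval();

  std::vector<torch::jit::IValue> inputs;
  inputs.push_back(torch::ones({1, 16}));  // placeholder input
  at::Tensor output = module.forward(inputs).toTensor();
  return 0;
}
```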

We have an implementation that fixes this problem here. If this solution sounds good to you, we can open a pull request for it.
