Description
Is your feature request related to a problem? Please describe.
Currently the TensorFlow and ONNX backends in Triton support thread controls (here and here). We would like to have a similar feature for the PyTorch backend as well.
This is useful because in several cases we have seen PyTorch inference run (super) slowly on multi-core CPU machines. On machines with O(100) cores we have even seen a single inference take several minutes, despite the model being small. This might be due to an internal PyTorch problem, but a temporary workaround is to set the number of intra-op parallelism threads to 1.
See examples here, here, and also a previous Triton issue here. In our case we found that setting the number of model instances is NOT enough to fix the problem; we need to set both the number of model instances and the number of intra-op threads to 1. We tested this with some examples and confirmed it fixes the slow CPU inference problem for PyTorch.
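For reference, the model-instance half of the workaround is just standard Triton model configuration; a minimal config.pbtxt sketch (the model name is a placeholder, and the required input/output sections are omitted for brevity):

```
name: "my_pytorch_model"
platform: "pytorch_libtorch"
max_batch_size: 1
# Run exactly one instance of the model on CPU.
instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]
```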
The intra-op half can be done with at::set_num_threads(1) when loading the model: https://pytorch.org/docs/stable/notes/cpu_threading_torchscript_inference.html#runtime-api
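For illustration, a minimal standalone sketch (not the actual backend code; the model path and tensor shape are placeholders) of calling at::set_num_threads(1) before loading and running a TorchScript model:

```cpp
#include <ATen/Parallel.h>   // at::set_num_threads / at::get_num_threads
#include <torch/script.h>    // torch::jit::load and tensor utilities

#include <iostream>
#include <vector>

int main() {
  // Restrict intra-op parallelism to a single thread. This must happen
  // before the first parallel region is executed.
  at::set_num_threads(1);

  // Load the TorchScript model ("model.pt" is a placeholder path).
  auto module = torch::jit::load("model.pt");

  // Run a dummy inference to illustrate usage (placeholder input shape).
  std::vector<torch::jit::IValue> inputs;
  inputs.push_back(torch::ones({1, 3, 224, 224}));
  at::Tensor output = module.forward(inputs).toTensor();

  std::cout << "intra-op threads: " << at::get_num_threads() << std::endl;
  return 0;
}
```

In the backend, the same call would presumably go in the model-load path so that it takes effect before the first inference request is handled.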
We have one implementation fixing this problem here. If this solution sounds good to you, we can open a pull request for it.