Conversation

@yongbinfeng
Contributor

As noted in the issue here: triton-inference-server/server#6896, we have found that the number of threads can significantly affect PyTorch inference performance. In some cases PyTorch inference runs extremely slowly on multi-core CPU machines, and setting the instance count alone is not enough to address the problem. We have tested at::set_num_threads(1) and confirmed that it fixes the slow inference.
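
For context, this is roughly what the workaround we tested looks like in a standalone LibTorch program (a minimal sketch only, not the backend change itself; the model path is just a placeholder):

#include <torch/script.h>
#include <ATen/Parallel.h>

int main() {
  // Cap intra-op parallelism at a single thread; this is the call that was
  // confirmed to fix the slow inference on multi-core CPU machines.
  at::set_num_threads(1);
  // Inter-op parallelism can be capped the same way (must be set before any
  // inter-op parallel work has started).
  at::set_num_interop_threads(1);

  // Placeholder model path, for illustration only.
  torch::jit::Module module = torch::jit::load("model.pt");
  // ... run inference as usual ...
  return 0;
}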

This PR makes intra_op_thread_count and inter_op_thread_count configurable for PyTorch models, similar to other backends such as TF and ONNX, with syntax such as:

parameters { key: "INTRA_OP_THREAD_COUNT" value: { string_value: "1" } }
parameters { key: "INTER_OP_THREAD_COUNT" value: { string_value: "1" } }

@Pascualex

Thank you for this. We are experiencing the same problem and have been forced to convert our models to ONNX, which in turn causes other issues.

I'm not a maintainer, so I can only confirm that this is a real issue for us too.

tanmayv25 self-assigned this on Apr 15, 2024
@tanmayv25
Contributor

@yongbinfeng Can you submit the Triton CLA?

@yongbinfeng
Contributor Author

> Triton CLA

I think I've already done that, through my affiliation (Fermilab) and my affiliation email. (The other PR, #120, has already been merged, so hopefully it should be fine.)

tanmayv25 self-requested a review on April 18, 2024
tanmayv25 merged commit c50d65b into triton-inference-server:main on Apr 18, 2024
@tanmayv25
Contributor

Thanks for your contribution!
