[Clang] Add timeout for GPU detection utilities #94751
@llvm/pr-subscribers-clang @llvm/pr-subscribers-clang-driver
Author: Joseph Huber (jhuber6)
Changes
Summary: This patch adds a ten second timeout period for these utilities before it kills the job and errors out.
Full diff: https://github.com/llvm/llvm-project/pull/94751.diff
4 Files Affected:
diff --git a/clang/include/clang/Driver/ToolChain.h b/clang/include/clang/Driver/ToolChain.h
index a4f9cad98aa8b..87a5034dfd78b 100644
--- a/clang/include/clang/Driver/ToolChain.h
+++ b/clang/include/clang/Driver/ToolChain.h
@@ -205,7 +205,7 @@ class ToolChain {
/// Executes the given \p Executable and returns the stdout.
llvm::Expected<std::unique_ptr<llvm::MemoryBuffer>>
- executeToolChainProgram(StringRef Executable) const;
+ executeToolChainProgram(StringRef Executable, unsigned Timeout = 0) const;
void setTripleEnvironment(llvm::Triple::EnvironmentType Env);
diff --git a/clang/lib/Driver/ToolChain.cpp b/clang/lib/Driver/ToolChain.cpp
index 0e86bc07e0ea2..8c746ac8066cb 100644
--- a/clang/lib/Driver/ToolChain.cpp
+++ b/clang/lib/Driver/ToolChain.cpp
@@ -104,7 +104,8 @@ ToolChain::ToolChain(const Driver &D, const llvm::Triple &T,
}
llvm::Expected<std::unique_ptr<llvm::MemoryBuffer>>
-ToolChain::executeToolChainProgram(StringRef Executable) const {
+ToolChain::executeToolChainProgram(StringRef Executable,
+ unsigned Timeout) const {
llvm::SmallString<64> OutputFile;
llvm::sys::fs::createTemporaryFile("toolchain-program", "txt", OutputFile);
llvm::FileRemover OutputRemover(OutputFile.c_str());
@@ -115,9 +116,8 @@ ToolChain::executeToolChainProgram(StringRef Executable) const {
};
std::string ErrorMessage;
- if (llvm::sys::ExecuteAndWait(Executable, {}, {}, Redirects,
- /* SecondsToWait */ 0,
- /*MemoryLimit*/ 0, &ErrorMessage))
+ if (llvm::sys::ExecuteAndWait(Executable, {}, {}, Redirects, Timeout,
+ /*MemoryLimit=*/0, &ErrorMessage))
return llvm::createStringError(std::error_code(),
Executable + ": " + ErrorMessage);
diff --git a/clang/lib/Driver/ToolChains/AMDGPU.cpp b/clang/lib/Driver/ToolChains/AMDGPU.cpp
index 9ffea57b005de..92895d8186e83 100644
--- a/clang/lib/Driver/ToolChains/AMDGPU.cpp
+++ b/clang/lib/Driver/ToolChains/AMDGPU.cpp
@@ -877,7 +877,7 @@ AMDGPUToolChain::getSystemGPUArchs(const ArgList &Args) const {
else
Program = GetProgramPath("amdgpu-arch");
- auto StdoutOrErr = executeToolChainProgram(Program);
+ auto StdoutOrErr = executeToolChainProgram(Program, /*Timeout=*/10);
if (!StdoutOrErr)
return StdoutOrErr.takeError();
diff --git a/clang/lib/Driver/ToolChains/Cuda.cpp b/clang/lib/Driver/ToolChains/Cuda.cpp
index bbc8be91fd70b..47dac0e439f10 100644
--- a/clang/lib/Driver/ToolChains/Cuda.cpp
+++ b/clang/lib/Driver/ToolChains/Cuda.cpp
@@ -826,7 +826,7 @@ NVPTXToolChain::getSystemGPUArchs(const ArgList &Args) const {
else
Program = GetProgramPath("nvptx-arch");
- auto StdoutOrErr = executeToolChainProgram(Program);
+ auto StdoutOrErr = executeToolChainProgram(Program, /*Timeout=*/10);
if (!StdoutOrErr)
return StdoutOrErr.takeError();
No test is included because I have no clue how you would write one, but I intentionally made it time out and it returns a 'Child timed out' error as expected.
Summary: The utilities `nvptx-arch` and `amdgpu-arch` are used to support `--offload-arch=native`, among other features in clang. However, these rely on the GPU drivers to query the features. In certain cases these drivers can become locked up, which leads to indefinite hangs on any compiler jobs running in the meantime. This patch adds a ten second timeout period for these utilities, after which the job is killed and an error is returned.
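For readers unfamiliar with the "run a child, kill it after N seconds" pattern, here is a minimal standalone sketch using plain POSIX calls. It is not the code in this patch (the patch simply forwards `Timeout` to `llvm::sys::ExecuteAndWait`), and `runWithTimeout`, `RunStatus`, and `RunResult` are hypothetical names made up for illustration. As in the patch, a timeout of 0 is assumed to mean "wait forever".

```cpp
#include <cassert>
#include <cerrno>
#include <chrono>
#include <string>
#include <thread>
#include <vector>

#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical types for this sketch; not part of clang or LLVM.
enum class RunStatus { Exited, TimedOut, Failed };

struct RunResult {
  RunStatus Status;
  int ExitCode; // Meaningful only when Status == RunStatus::Exited.
};

// Runs Argv[0] (looked up on PATH) with the given arguments. If TimeoutSecs
// is non-zero and the child is still running after that many seconds, the
// child is killed with SIGKILL. TimeoutSecs == 0 waits indefinitely.
RunResult runWithTimeout(const std::vector<std::string> &Argv,
                         unsigned TimeoutSecs) {
  assert(!Argv.empty() && "need at least the program name");

  // Build the argv array before forking so the child does no allocation.
  std::vector<char *> CArgv;
  for (const std::string &A : Argv)
    CArgv.push_back(const_cast<char *>(A.c_str()));
  CArgv.push_back(nullptr);

  pid_t Pid = fork();
  if (Pid < 0)
    return {RunStatus::Failed, -1};
  if (Pid == 0) {
    execvp(CArgv[0], CArgv.data());
    _exit(127); // Only reached if exec failed.
  }

  using Clock = std::chrono::steady_clock;
  const bool Block = TimeoutSecs == 0;
  const auto Deadline = Clock::now() + std::chrono::seconds(TimeoutSecs);
  int Status = 0;
  for (;;) {
    pid_t Done = waitpid(Pid, &Status, Block ? 0 : WNOHANG);
    if (Done == Pid)
      break;
    if (Done < 0) {
      if (errno == EINTR)
        continue; // Interrupted by a signal; retry.
      return {RunStatus::Failed, -1};
    }
    // Done == 0: the child is still running (polling mode only).
    if (Clock::now() >= Deadline) {
      kill(Pid, SIGKILL);
      waitpid(Pid, &Status, 0); // Reap the child so it is not left a zombie.
      return {RunStatus::TimedOut, -1};
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(20));
  }

  if (WIFEXITED(Status))
    return {RunStatus::Exited, WEXITSTATUS(Status)};
  return {RunStatus::Failed, -1}; // Terminated by a signal.
}
```

A caller in the spirit of this patch would pass the detection tool as `Argv` with a timeout of 10 and report `RunStatus::TimedOut` as an error instead of blocking the compile job.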
Ooh... I think I know exactly what may be causing this. On machines where NVIDIA GPUs are used for compute only (e.g. a headless server machine), NVIDIA drivers are not always loaded by default and may not have driver persistence enabled. The drivers get loaded when the GPU is accessed, and are then released and unloaded when there are no GPU users remaining. A parallel compilation with ... Adding a timeout here would help, sort of, but it would be much better if we could figure out a way to either detect that GPU probing takes too long (and likely causes the driver to load/unload), or cache the probing results somehow, so we do not have to run the same detection over and over again. This is a point in favor of pushing the detection out of clang and into the build system, which would be the better place to do it. For the GPU detection, we may be able to work around the issue by leaving the detection app running for the duration of the compilation to prevent driver unloading, but it's a rather gross hack.
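To make the "cache probing results" idea concrete, here is one possible sketch, assuming a small file-based cache with a maximum age that several compiler processes could share. Nothing like this exists in the patch; `cachedProbe`, the cache file location, and the `Probe` callback are all hypothetical, and a real implementation would need a per-process temporary name or file locking to cope with concurrent writers.

```cpp
#include <cassert>
#include <chrono>
#include <filesystem>
#include <fstream>
#include <functional>
#include <iterator>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper, not part of clang: returns the contents of CacheFile
// if it was written less than MaxAge ago, otherwise calls Probe(), stores
// its result in CacheFile, and returns it.
std::string cachedProbe(const fs::path &CacheFile, std::chrono::seconds MaxAge,
                        const std::function<std::string()> &Probe) {
  assert(Probe && "a probe callback is required");

  std::error_code EC;
  const auto MTime = fs::last_write_time(CacheFile, EC);
  if (!EC) {
    const auto Age = fs::file_time_type::clock::now() - MTime;
    if (Age < MaxAge) {
      std::ifstream In(CacheFile);
      if (In)
        return std::string(std::istreambuf_iterator<char>(In),
                           std::istreambuf_iterator<char>());
    }
  }

  // Cache miss or stale entry: run the (possibly slow) probe once.
  std::string Result = Probe();

  // Write to a temporary file and rename it into place, so a concurrent
  // reader never sees a partially written cache file. Assumption: a single
  // shared ".tmp" name is good enough for a sketch.
  fs::path Tmp = CacheFile;
  Tmp += ".tmp";
  {
    std::ofstream Out(Tmp, std::ios::trunc);
    Out << Result;
  }
  fs::rename(Tmp, CacheFile, EC); // On failure we simply stay uncached.
  return Result;
}
```

Here `Probe` would be whatever actually runs the detection tool. With a maximum age of a few minutes, a parallel build would trigger one real probe instead of one per compile job, which is the effect the comment above is asking for.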
I've observed this a few times. In my case it's usually when some application hangs on the GPU and no one notices; then these tools hang forever and it takes a while to notice. I figured an error is friendlier, since I highly doubt these tools will take over ten seconds to run even in the worst case.
What's the config to set this by default without any graphics? Would be nice to not need to worry about it on my dev machine.
I know for AMD stuff we used to just probe the PCI connections, but that leaked a lot of information so this is the easier way to do it. I wonder what ...
https://docs.nvidia.com/deploy/driver-persistence/index.html I usually use "nvidia-smi -i -pm ENABLED" to force the driver to be loaded permanently. As for