A command-line diagnostic tool for GPU health monitoring and troubleshooting. This tool helps identify and diagnose common GPU issues, including memory leaks, hardware failures, and performance degradation.
- Real-time GPU health monitoring
- Memory leak detection
- Hardware failure diagnosis
- Performance metrics analysis
- Mock testing capabilities for development
Run tests in a Docker container:
make test
Run tests locally and generate a coverage report:
make test-local
If you are developing on macOS, consider using a Docker container for compilation.
Taking the ubuntu:22.04 image as an example: mount the project into the container, install the following dependencies there, and then compile.
- Start the container
docker run --platform=linux/amd64 -itd -v ./ai-accelerator-tool:/git/src/github.com/aibrix/ai-accelerator-tool/ ubuntu:22.04
- Install dependencies in the container
apt update
apt install -y vim cmake clang libnvidia-ml-dev git wget
wget https://go.dev/dl/go1.23.2.linux-amd64.tar.gz
tar xvf go1.23.2.linux-amd64.tar.gz
echo "export PATH=$PATH:/go/bin" >> ~/.bashrc
source ~/.bashrc
- Compile the project in the container
cd /git/src/github.com/aibrix/ai-accelerator-tool
git submodule update --init --recursive
make lib-injection
cp lib/build/lib/libdevso-injection.so pkg/mock/resources/injectiond.so
GOOS=linux GOARCH=amd64 make build
The binary will be generated in bin/.
# Set the number of GPU cards in the machine, for example, 4.
export GPU_CARD_COUNT=4
# Run the diagnosis.
ai-accelerator-tool diagnose
Note:
- This tool requires the nvidia-smi command to be installed.
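Because the diagnosis depends on both nvidia-smi and GPU_CARD_COUNT, it can help to check them before running the tool. The following is a minimal sketch, not part of the project: the helper names require_cmd, is_positive_int, and preflight are hypothetical, and the requirement that GPU_CARD_COUNT be a positive integer is an assumption based on the example value above.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the named command is found on PATH.
require_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Hypothetical helper: succeed only if the argument is a positive integer.
is_positive_int() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit
    esac
    [ "$1" -gt 0 ]
}

# Hypothetical wrapper: check the prerequisites described in this README.
preflight() {
    if ! require_cmd nvidia-smi; then
        echo "nvidia-smi not found in PATH" >&2
        return 1
    fi
    # Assumption: the tool expects GPU_CARD_COUNT to be a positive integer.
    if ! is_positive_int "${GPU_CARD_COUNT:-}"; then
        echo "GPU_CARD_COUNT must be a positive integer" >&2
        return 1
    fi
}

# Example use (left commented out so that sourcing this file has no side effects):
# preflight && ai-accelerator-tool diagnose
```

Keeping the checks in small functions makes each one easy to try in isolation, for example by calling is_positive_int with different values in an interactive shell.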
You can refer to the comments in hack/gpu_mock_conf.toml to configure the fault scenario.
ai-accelerator-tool mock --config /PATH/TO/gpu_mock_conf.toml
mkdir -p /opt/gpu_mock && cd /opt/gpu_mock/
cp /PATH/TO/nvml_injectiond.so /opt/gpu_mock/
cp /PATH/TO/gpu_mock_conf.toml /opt/gpu_mock/
echo "/opt/gpu_mock/nvml_injectiond.so" >> /etc/ld.so.preload
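Entries in /etc/ld.so.preload are loaded into every dynamically linked process on the machine, and running the echo command above twice leaves a duplicate entry. The sketch below shows one way to add the entry only once and to remove it again when mock testing is finished. The helper names preload_add and preload_remove and the PRELOAD_FILE variable are hypothetical and not part of the project. The sketch also assumes one library path per line, so it can be tried on a scratch file first.

```shell
#!/bin/sh
# Hypothetical helpers. PRELOAD_FILE can be pointed at a scratch file for a dry run.
PRELOAD_FILE="${PRELOAD_FILE:-/etc/ld.so.preload}"

# Append the library path only if no identical line exists yet.
preload_add() {
    touch "$PRELOAD_FILE" || return 1
    grep -qxF -- "$1" "$PRELOAD_FILE" || printf '%s\n' "$1" >> "$PRELOAD_FILE"
}

# Remove every line that exactly matches the library path.
preload_remove() {
    [ -f "$PRELOAD_FILE" ] || return 0
    tmp="$(mktemp)" || return 1
    # grep exits with 1 when nothing is left to print; only higher codes are errors.
    grep -vxF -- "$1" "$PRELOAD_FILE" > "$tmp" || [ $? -eq 1 ] || { rm -f "$tmp"; return 1; }
    # Overwrite in place so the file keeps its owner and permissions.
    cat "$tmp" > "$PRELOAD_FILE" && rm -f "$tmp"
}

# Example use as root (left commented out so that sourcing this file changes nothing):
# preload_add /opt/gpu_mock/nvml_injectiond.so
# preload_remove /opt/gpu_mock/nvml_injectiond.so
```

Removing the entry after testing matters because a preload line that points to a missing library makes the dynamic loader print an error for every program that starts.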