CUDA not detected when running without --gpus all #1882

Closed
4 tasks done
nandhiniramanan5 opened this issue Dec 25, 2024 · 3 comments

Comments

@nandhiniramanan5

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

I am trying to run a llama-cpp-python model within a Docker container based on the nvidia/cuda:12.5.0-devel-ubuntu22.04 image. I expect CUDA to be detected and the model to utilize the GPU for inference without needing to specify --gpus all when running the container.

WARNING: The NVIDIA Driver was not detected.  GPU functionality will not be available.
   Use the NVIDIA Container Toolkit to start this container with GPU support; see
   https://docs.nvidia.com/datacenter/cloud-native/ .

RuntimeError: Failed to load shared library '/usr/local/lib/python3.10/dist-packages/llama-cpp/lib/libllama.so': libcuda.so.1: cannot open shared object file: No such file or directory

$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) CPU @ 2.20GHz
Stepping: 7
CPU MHz: 2200.186
BogoMIPS: 4400.37
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 384 KiB
L1i cache: 384 KiB
L2 cache: 12 MiB
L3 cache: 38.5 MiB
NUMA node0 CPU(s): 0-23
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat avx512_vnni md_clear arch_capabilities

$ uname -a
Linux dl-big-poc 5.10.0-33-cloud-amd64 #1 SMP Debian 5.10.226-1 (2024-10-03) x86_64 GNU/Linux

  • SDK version:
    $ python3 --version
    Python 3.10.14

$ make --version
GNU Make 4.3
Built for x86_64-pc-linux-gnu
Copyright (C) 1988-2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  

$ g++ --version
g++ (Debian 10.2.1-6) 10.2.1 20210110
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Failure Information (for bugs)

This appears to be a bug related to CUDA detection within the Docker container when not using --gpus all.
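
One way to confirm what the error suggests is to check whether the host driver library is mounted into the container at all. A minimal diagnostic sketch (my own, assuming the image built in the steps below is tagged local-llama-fastapi):

docker run --rm local-llama-fastapi bash -c "ldconfig -p | grep libcuda"
# prints nothing when run without --gpus all: libcuda.so.1 is not present in the container
docker run --rm --gpus all local-llama-fastapi bash -c "ldconfig -p | grep libcuda"
# should list libcuda.so.1 once the NVIDIA Container Toolkit mounts the host driver libraries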

Steps to Reproduce

  1. Build the Docker image using the provided Dockerfile:
FROM nvidia/cuda:12.5.0-devel-ubuntu22.04

SHELL ["/bin/bash", "-c"]

# Set the working directory *before* copying files
WORKDIR /workspace 

# Install necessary build tools
RUN apt-get update && apt-get upgrade -y \
    && apt-get install -y git build-essential \
    python3 python3-pip gcc wget \
    ocl-icd-opencl-dev opencl-headers clinfo \
    libclblast-dev libopenblas-dev \
    && mkdir -p /etc/OpenCL/vendors && echo "libnvidia-opencl.so.1" > /etc/OpenCL/vendors/nvidia.icd


ENV CUDA_DOCKER_ARCH=all
ENV GGML_CUDA=1

RUN python3 -m pip install --upgrade pip pytest cmake pydantic uvicorn fastapi

# Install llama-cpp-python with CUDA support
RUN CMAKE_ARGS="-DGGML_CUDA=on -DCMAKE_CUDA_ARCHITECTURES=all-major" FORCE_CMAKE=1 \
    pip install llama-cpp-python==0.3.2 --no-cache-dir --force-reinstall --upgrade

# Set Gunicorn timeout
ENV GUNICORN_CMD_ARGS="--workers 1 --timeout 300"

# Set default environment variables
ENV MODEL_PATH="./model/test-llama-8B-abliterated.Q6_K.gguf"
ENV N_CTX="8192"
ENV N_GPU_LAYERS="-1"
ENV MAIN_GPU="1"
ENV N_THREADS="4"
ENV MAX_TOKENS="512"
ENV TEMPERATURE="0.0"

# Copy files
COPY main.py ./
COPY inference.py ./
COPY ./model ./model

# Run your FastAPI app on container startup
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"] 
  2. Run the container:

docker run -p 8000:8000 -e GUNICORN_CMD_ARGS="--workers 1 --timeout 300" -e MODEL_PATH="./model/test-llama-8B-abliterated.Q6_K.gguf" -e N_CTX=8192 -e N_GPU_LAYERS=-1 -e MAIN_GPU=1 -e TEMPERATURE=0.0 -e N_THREADS=20 -e MAX_TOKENS=512 local-llama-fastapi 


Failure Logs
docker run   -p 8000:8000   -e GUNICORN_CMD_ARGS="--workers 1 --timeout 300"   -e MODEL_PATH="./model/test-llama-8B-abliterated.Q6_K.gguf"   -e N_CTX=8192   -e N_GPU_LAYERS=-1   -e MAIN_GPU=1   -e TEMPERATURE=0.0   -e N_THREADS=20   -e MAX_TOKENS=512   local-llama-fastapi

==========
== CUDA ==
==========

CUDA Version 12.5.0

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

WARNING: The NVIDIA Driver was not detected.  GPU functionality will not be available.
   Use the NVIDIA Container Toolkit to start this container with GPU support; see
   https://docs.nvidia.com/datacenter/cloud-native/ .

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/llama_cpp/_ctypes_extensions.py", line 67, in load_shared_library
    return ctypes.CDLL(str(lib_path), **cdll_args)  # type: ignore
  File "/usr/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcuda.so.1: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/uvicorn", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1161, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1082, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1443, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 788, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/main.py", line 412, in main
    run(
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/main.py", line 579, in run
    server.run()
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/server.py", line 66, in run
    return asyncio.run(self.serve(sockets=sockets))
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/server.py", line 70, in serve
    await self._serve(sockets)
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/server.py", line 77, in _serve
    config.load()
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/config.py", line 435, in load
    self.loaded_app = import_from_string(self.app)
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/importer.py", line 19, in import_from_string
    module = importlib.import_module(module_str)
  File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/workspace/main.py", line 8, in <module>
    from llama_cpp import Llama
  File "/usr/local/lib/python3.10/dist-packages/llama_cpp/__init__.py", line 1, in <module>
    from .llama_cpp import *
  File "/usr/local/lib/python3.10/dist-packages/llama_cpp/llama_cpp.py", line 38, in <module>
    _lib = load_shared_library(_lib_base_name, _base_path)
  File "/usr/local/lib/python3.10/dist-packages/llama_cpp/_ctypes_extensions.py", line 69, in load_shared_library
    raise RuntimeError(f"Failed to load shared library '{lib_path}': {e}")
RuntimeError: Failed to load shared library '/usr/local/lib/python3.10/dist-packages/llama_cpp/lib/libllama.so': libcuda.so.1: cannot open shared object file: No such file or directory
@sergey21000

I expect CUDA to be detected and the model to utilize the GPU for inference without needing to specify --gpus all when running the container.

The --gpus all flag is required to expose GPU devices to the container, even when using NVIDIA CUDA base images; without it, the container has no access to the GPU hardware or to the host driver libraries, which is why libcuda.so.1 cannot be found.
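
For reference, a minimal sketch of the same run command with the GPU exposed (assuming the NVIDIA Container Toolkit is installed on the host; the only change is adding --gpus all):

docker run --gpus all -p 8000:8000 \
  -e GUNICORN_CMD_ARGS="--workers 1 --timeout 300" \
  -e MODEL_PATH="./model/test-llama-8B-abliterated.Q6_K.gguf" \
  -e N_CTX=8192 -e N_GPU_LAYERS=-1 -e MAIN_GPU=1 \
  -e N_THREADS=20 -e MAX_TOKENS=512 -e TEMPERATURE=0.0 \
  local-llama-fastapi

# quick sanity check that the devices and driver are visible inside the container
docker run --rm --gpus all local-llama-fastapi nvidia-smi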

@nandhiniramanan5
Author

@sergey21000 thanks for your prompt response. I am new to this. If I must specify the --gpus all flag when running Docker, how do I specify it when I upload this to Vertex AI? I don't see an option to specify it in the parameters when uploading a model with gcloud. How does it ensure it picks up all the GPUs?

@AleefBilal

@nandhiniramanan5 I haven't used Vertex AI myself, but locally --gpus all is essential for Docker to access CUDA. However, on servers like RunPod you can just host your Docker container and it will utilize the GPUs by default.
Hope this helps a little; if not, let me know how you fixed it.
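
As for Vertex AI, my (unverified) understanding is that you never pass --gpus all yourself: the GPU is requested when the model is deployed to an endpoint, not at image build or model upload, and the serving environment exposes the devices to the container for you. A rough sketch with gcloud, where the endpoint, model, region, machine type, and accelerator values are all placeholders:

gcloud ai endpoints deploy-model ENDPOINT_ID \
  --region=us-central1 \
  --model=MODEL_ID \
  --display-name=llama-cpp-gpu \
  --machine-type=n1-standard-8 \
  --accelerator=type=nvidia-tesla-t4,count=1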
