CUDA not detected when running without --gpus all #1882

Closed
4 tasks done
nandhiniramanan5 opened this issue Dec 25, 2024 · 3 comments

Comments

@nandhiniramanan5

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

I am trying to run a llama-cpp-python model within a Docker container based on the nvidia/cuda:12.5.0-devel-ubuntu22.04 image. I expect CUDA to be detected and the model to utilize the GPU for inference without needing to specify --gpus all when running the container.

WARNING: The NVIDIA Driver was not detected.  GPU functionality will not be available.
   Use the NVIDIA Container Toolkit to start this container with GPU support; see
   https://docs.nvidia.com/datacenter/cloud-native/ .

RuntimeError: Failed to load shared library '/usr/local/lib/python3.10/dist-packages/llama-cpp/lib/libllama.so': libcuda.so.1: cannot open shared object file: No such file or directory

$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) CPU @ 2.20GHz
Stepping: 7
CPU MHz: 2200.186
BogoMIPS: 4400.37
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 384 KiB
L1i cache: 384 KiB
L2 cache: 12 MiB
L3 cache: 38.5 MiB
NUMA node0 CPU(s): 0-23
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat avx512_vnni md_clear arch_capabilities

$ uname -a
Linux dl-big-poc 5.10.0-33-cloud-amd64 #1 SMP Debian 5.10.226-1 (2024-10-03) x86_64 GNU/Linux

  • SDK version:
    $ python3 --version
    Python 3.10.14

$ make --version
GNU Make 4.3
Built for x86_64-pc-linux-gnu
Copyright (C) 1988-2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  

$ g++ --version
g++ (Debian 10.2.1-6) 10.2.1 20210110
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Failure Information (for bugs)

This appears to be a bug related to CUDA detection within the Docker container when not using --gpus all.
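
One way to confirm what the error suggests is to check whether the host driver library is mounted into the container at all. A minimal diagnostic sketch (my own, assuming the image built in the steps below is tagged local-llama-fastapi):

docker run --rm local-llama-fastapi bash -c "ldconfig -p | grep libcuda"
# prints nothing when run without --gpus all: libcuda.so.1 is not present in the container
docker run --rm --gpus all local-llama-fastapi bash -c "ldconfig -p | grep libcuda"
# should list libcuda.so.1 once the NVIDIA Container Toolkit mounts the host driver libraries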

Steps to Reproduce

  1. Build the Docker image using the provided Dockerfile:
FROM nvidia/cuda:12.5.0-devel-ubuntu22.04

SHELL ["/bin/bash", "-c"]

# Set the working directory *before* copying files
WORKDIR /workspace 

# Install necessary build tools
RUN apt-get update && apt-get upgrade -y \
    && apt-get install -y git build-essential \
    python3 python3-pip gcc wget \
    ocl-icd-opencl-dev opencl-headers clinfo \
    libclblast-dev libopenblas-dev \
    && mkdir -p /etc/OpenCL/vendors && echo "libnvidia-opencl.so.1" > /etc/OpenCL/vendors/nvidia.icd


ENV CUDA_DOCKER_ARCH=all
ENV GGML_CUDA=1

RUN python3 -m pip install --upgrade pip pytest cmake pydantic uvicorn fastapi

# Install llama-cpp-python with CUDA support
RUN CMAKE_ARGS="-DGGML_CUDA=on -DCMAKE_CUDA_ARCHITECTURES=all-major" FORCE_CMAKE=1 \
    pip install llama-cpp-python==0.3.2 --no-cache-dir --force-reinstall --upgrade

# Set Gunicorn timeout
ENV GUNICORN_CMD_ARGS="--workers 1 --timeout 300"

# Set default environment variables
ENV MODEL_PATH="./model/test-llama-8B-abliterated.Q6_K.gguf"
ENV N_CTX="8192"
ENV N_GPU_LAYERS="-1"
ENV MAIN_GPU="1"
ENV N_THREADS="4"
ENV MAX_TOKENS="512"
ENV TEMPERATURE="0.0"

# Copy files
COPY main.py ./
COPY inference.py ./
COPY ./model ./model

# Run your FastAPI app on container startup
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"] 
  2. Run the container:

docker run -p 8000:8000 -e GUNICORN_CMD_ARGS="--workers 1 --timeout 300" -e MODEL_PATH="./model/test-llama-8B-abliterated.Q6_K.gguf" -e N_CTX=8192 -e N_GPU_LAYERS=-1 -e MAIN_GPU=1 -e TEMPERATURE=0.0 -e N_THREADS=20 -e MAX_TOKENS=512 local-llama-fastapi 


Failure Logs
docker run   -p 8000:8000   -e GUNICORN_CMD_ARGS="--workers 1 --timeout 300"   -e MODEL_PATH="./model/test-llama-8B-abliterated.Q6_K.gguf"   -e N_CTX=8192   -e N_GPU_LAYERS=-1   -e MAIN_GPU=1   -e TEMPERATURE=0.0   -e N_THREADS=20   -e MAX_TOKENS=512   local-llama-fastapi

==========
== CUDA ==
==========

CUDA Version 12.5.0

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

WARNING: The NVIDIA Driver was not detected.  GPU functionality will not be available.
   Use the NVIDIA Container Toolkit to start this container with GPU support; see
   https://docs.nvidia.com/datacenter/cloud-native/ .

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/llama_cpp/_ctypes_extensions.py", line 67, in load_shared_library
    return ctypes.CDLL(str(lib_path), **cdll_args)  # type: ignore
  File "/usr/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcuda.so.1: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/uvicorn", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1161, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1082, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1443, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 788, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/main.py", line 412, in main
    run(
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/main.py", line 579, in run
    server.run()
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/server.py", line 66, in run
    return asyncio.run(self.serve(sockets=sockets))
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/server.py", line 70, in serve
    await self._serve(sockets)
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/server.py", line 77, in _serve
    config.load()
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/config.py", line 435, in load
    self.loaded_app = import_from_string(self.app)
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/importer.py", line 19, in import_from_string
    module = importlib.import_module(module_str)
  File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/workspace/main.py", line 8, in <module>
    from llama_cpp import Llama
  File "/usr/local/lib/python3.10/dist-packages/llama_cpp/__init__.py", line 1, in <module>
    from .llama_cpp import *
  File "/usr/local/lib/python3.10/dist-packages/llama_cpp/llama_cpp.py", line 38, in <module>
    _lib = load_shared_library(_lib_base_name, _base_path)
  File "/usr/local/lib/python3.10/dist-packages/llama_cpp/_ctypes_extensions.py", line 69, in load_shared_library
    raise RuntimeError(f"Failed to load shared library '{lib_path}': {e}")
RuntimeError: Failed to load shared library '/usr/local/lib/python3.10/dist-packages/llama_cpp/lib/libllama.so': libcuda.so.1: cannot open shared object file: No such file or directory
@sergey21000

I expect CUDA to be detected and the model to utilize the GPU for inference without needing to specify --gpus all when running the container.

The --gpus all flag is required to expose GPU devices to the container, even when using NVIDIA CUDA base images; without it, the container has no access to the GPU hardware or to the host driver libraries, which is why libcuda.so.1 cannot be found.
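
For reference, a minimal sketch of the same run command with the GPU exposed (assuming the NVIDIA Container Toolkit is installed on the host; the only change is adding --gpus all):

docker run --gpus all -p 8000:8000 \
  -e GUNICORN_CMD_ARGS="--workers 1 --timeout 300" \
  -e MODEL_PATH="./model/test-llama-8B-abliterated.Q6_K.gguf" \
  -e N_CTX=8192 -e N_GPU_LAYERS=-1 -e MAIN_GPU=1 \
  -e N_THREADS=20 -e MAX_TOKENS=512 -e TEMPERATURE=0.0 \
  local-llama-fastapi

# quick sanity check that the devices and driver are visible inside the container
docker run --rm --gpus all local-llama-fastapi nvidia-smi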

@nandhiniramanan5
Author

@sergey21000 thanks for your prompt response. I am new to this. If I must specify the --gpus all flag when running Docker, how do I specify it when I upload this to Vertex AI? I don't see an option to specify it in the parameters when uploading a model with gcloud. How does it ensure it picks up all the GPUs?

@AleefBilal

@nandhiniramanan5 I haven't used Vertex AI myself, but locally --gpus all is essential for Docker to access CUDA. However, on servers like RunPod you can just host your Docker container and it will utilize the GPUs by default.
Hope this helps a little; if not, let me know how you fixed it.
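
As for Vertex AI, my (unverified) understanding is that you never pass --gpus all yourself: the GPU is requested when the model is deployed to an endpoint, not at image build or model upload, and the serving environment exposes the devices to the container for you. A rough sketch with gcloud, where the endpoint, model, region, machine type, and accelerator values are all placeholders:

gcloud ai endpoints deploy-model ENDPOINT_ID \
  --region=us-central1 \
  --model=MODEL_ID \
  --display-name=llama-cpp-gpu \
  --machine-type=n1-standard-8 \
  --accelerator=type=nvidia-tesla-t4,count=1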
