- 09/08/2023: First version of the llama.cpp Android tutorial is available
- (NEW!) 04/27/2025: 2025 version of the tutorial is available
  - Also includes a `llama-cpp-python` + custom-built `llama.cpp` tutorial

This tutorial covers:
- Deploying `llama.cpp` on an Android device and running it using the Adreno GPU.
- Utilizing `llama-cpp-python` with a custom-built `llama.cpp` version that supports the Adreno GPU with OpenCL:
  - Enables large-scale inference evaluation directly on Android.
  - Provides a solid foundation for developing your own Android LLM applications.

Requirements:
- An Android device with a Qualcomm Snapdragon SoC (Snapdragon 8 Gen 1, Gen 2, Gen 3, or 8 Elite)
- Recommended (but not required): more than 12 GB of RAM
- Optional: an Ubuntu or Mac computer for building
This tutorial is based on:
- Android Phone: OnePlus 13
- SoC: Qualcomm Snapdragon 8 Elite
- RAM: 16 GB
- Ubuntu Laptop: Kubuntu 24.04
- Note #1: this guide is based on Ubuntu, but the steps are similar on MacOS and have been tested on a MacBook Pro.
- Note #2: If you don't have a Linux laptop, a virtual machine also works.
Install Termux via F-Droid:
(Optional) To improve download speeds, you can change the repository:
termux-change-repo
Important: Avoid installing Termux from the Google Play Store.
Required software:
- `python`, `cmake`, `make`, and `ninja`
  - Recommended: Install Python via `miniconda` (a minimal install sketch follows the command below)
    - Guide: Installing Miniconda
- A C/C++ compiler (e.g., `clang`)
- Java Development Kit (in this case, `openjdk-21-jdk`)

Install them using the following command:
sudo apt install cmake make ninja-build clang openjdk-21-jdk
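If you go the Miniconda route, a minimal install sketch (the URL is Anaconda's standard Linux x86_64 installer; adjust it for your platform, and the environment name and Python version are just examples):

# Download and run the Miniconda installer (Linux x86_64 example)
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
# Restart the shell, then create and activate an environment for this tutorial
conda create -n llama-android python=3.11
conda activate llama-android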
First, update termux packages:
pkg update && pkg upgrade
Then, download required libraries:
pkg install clinfo ocl-icd opencl-headers
Add the path of the dynamic library `libOpenCL.so` to the `LD_LIBRARY_PATH` variable:
vim .bashrc
# In .bashrc file, add:
export LD_LIBRARY_PATH=/vendor/lib64:$PREFIX/lib
For some older phones, the location of `libOpenCL.so` might be:
export LD_LIBRARY_PATH=/system/vendor/lib64:$PREFIX/lib
After that:
source ~/.bashrc
Test if it works:
clinfo
If it returns:
Number of platforms 1
Platform Name QUALCOMM Snapdragon(TM)
...
It means the OpenCL driver is loaded successfully.
If it returns:
Number of platforms 0
...
It means the OpenCL driver is not detected; in that case, you may follow Option 2 below.
Copy both `libOpenCL.so` and `libOpenCL_adreno.so` into your Termux filesystem:
# The destination can be any other directory in your termux
cp /vendor/lib64/{libOpenCL.so,libOpenCL_adreno.so} ~/
Add the directory containing these two libraries to the `LD_LIBRARY_PATH` environment variable:
vim ~/.bashrc
# Here, $HOME is the directory containing the two .so files
# In .bashrc file, add:
export LD_LIBRARY_PATH=$HOME:$LD_LIBRARY_PATH
source ~/.bashrc
Now, test it again:
clinfo
If it returns:
Number of platforms 1
Platform Name QUALCOMM Snapdragon(TM)
...
It means the OpenCL driver is loaded successfully.
To build llama.cpp, there are two options available:
- Option 1: Build on a laptop and send the binaries to the Android phone
- Option 2: Build on the Android phone directly

- As of April 27, 2025, `llama-cpp-python` does not natively support building `llama.cpp` with OpenCL for Android platforms. This means you'll have to compile `llama.cpp` separately on the Android phone and then integrate it with `llama-cpp-python`.
- It's important to note that `llama-cpp-python` serves as a Python wrapper around the `llama.cpp` library. This allows you to customize and replace the underlying `llama.cpp` implementation to suit your specific needs.

Option 1: Build on a laptop and send the binaries to the Android phone

Run these commands first:
cd ~ && \
wget https://dl.google.com/android/repository/commandlinetools-linux-8512546_latest.zip && \
unzip commandlinetools-linux-8512546_latest.zip && \
mkdir -p ~/android-sdk/cmdline-tools && \
mv cmdline-tools latest && \
mv latest ~/android-sdk/cmdline-tools/ && \
rm -rf commandlinetools-linux-8512546_latest.zip
Check your OpenJDK location (please make sure you have installed OpenJDK 21):
readlink -f $(which java)
Set `JAVA_HOME`:
vim ~/.bashrc
export JAVA_HOME=/your/openjdk/directory
source ~/.bashrc
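For reference, `readlink -f $(which java)` prints a path ending in /bin/java; JAVA_HOME should be that path without the trailing /bin/java. A sketch, assuming the default openjdk-21-jdk location on Ubuntu x86_64 (your path may differ):

# Example: readlink -f $(which java) -> /usr/lib/jvm/java-21-openjdk-amd64/bin/java
export JAVA_HOME=/usr/lib/jvm/java-21-openjdk-amd64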
Finally, run this:
yes | ~/android-sdk/cmdline-tools/latest/bin/sdkmanager "ndk;26.3.11579264"
# You can choose your own directory, here's an example from Qualcomm's official tutorial:
mkdir -p ~/dev/llm && cd ~/dev/llm
git clone https://github.com/KhronosGroup/OpenCL-Headers
cd OpenCL-Headers && cp -r CL ~/android-sdk/ndk/26.3.11579264/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/include
For MacOS, it will be:
~/android-sdk/ndk/26.3.11579264/toolchains/llvm/prebuilt/darwin-x86_64/sysroot/usr/include
Then:
cd ~/dev/llm
# If you are using MacOS, you may need to change your OPENCL_ICD_LOADER_HEADERS_DIR
git clone https://github.com/KhronosGroup/OpenCL-ICD-Loader && \
cd OpenCL-ICD-Loader && \
mkdir build_ndk26 && cd build_ndk26 && \
cmake .. -G Ninja -DCMAKE_BUILD_TYPE=Release \
-DCMAKE_TOOLCHAIN_FILE=~/android-sdk/ndk/26.3.11579264/build/cmake/android.toolchain.cmake \
-DOPENCL_ICD_LOADER_HEADERS_DIR=~/android-sdk/ndk/26.3.11579264/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/include \
-DANDROID_ABI=arm64-v8a \
-DANDROID_PLATFORM=24 \
-DANDROID_STL=c++_shared
ninja && cp libOpenCL.so ~/android-sdk/ndk/26.3.11579264/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/lib/aarch64-linux-android
cd ~/dev/llm
git clone https://github.com/ggerganov/llama.cpp && \
cd llama.cpp && \
mkdir build-android && cd build-android
cmake .. -G Ninja \
-DCMAKE_TOOLCHAIN_FILE=$HOME/android-sdk/ndk/26.3.11579264/build/cmake/android.toolchain.cmake \
-DANDROID_ABI=arm64-v8a \
-DANDROID_PLATFORM=android-28 \
-DBUILD_SHARED_LIBS=OFF \
-DGGML_OPENCL=ON
ninja
Finally, copy the `dev` folder to the phone and into Termux (one possible way is sketched below).
- Make sure you have run `termux-setup-storage` in Termux first.
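A minimal sketch using adb and Termux's shared-storage mount (paths are examples):

# On the laptop: push the dev folder to the phone's shared storage
adb push ~/dev /sdcard/Download/dev
# In Termux on the phone: copy it into the Termux home and restore the executable bit
cp -r ~/storage/downloads/dev ~/
chmod +x ~/dev/llm/llama.cpp/build-android/bin/*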
Now, the executable files are available under `llama.cpp/build-android/bin`.
Run `llama-bench` to test:
./llama-bench -m /your/model/directory -ngl 99
The result should look like this:
ggml_opencl: selecting platform: 'QUALCOMM Snapdragon(TM)'
ggml_opencl: selecting device: 'QUALCOMM Adreno(TM) 830'
ggml_opencl: OpenCL driver: OpenCL 3.0 QUALCOMM build: commit unknown Compiler E031.47.18.21
ggml_opencl: vector subgroup broadcast support: true
ggml_opencl: device FP16 support: true
ggml_opencl: mem base addr align: 128
ggml_opencl: max mem alloc size: 1024 MB
ggml_opencl: SVM coarse grain buffer support: true
ggml_opencl: SVM fine grain buffer support: true
ggml_opencl: SVM fine grain system support: false
ggml_opencl: SVM atomics support: true
ggml_opencl: flattening quantized weights representation as struct of arrays (GGML_OPENCL_SOA_Q)
ggml_opencl: using kernels optimized for Adreno (GGML_OPENCL_USE_ADRENO_KERNELS)
...
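Beyond benchmarking, the same binaries can run generation directly; a minimal sketch with `llama-cli` (the model path is an example):

# One-shot generation on the phone; -ngl 99 offloads all layers to the Adreno GPU
./llama-cli -m ~/models/llama-3.2-3b-q4_0.gguf -ngl 99 -p "Hello from my phone" -n 64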
Option 2: Build on the Android phone directly

Note: As of 04/27/2025, building from the newest version of llama.cpp causes a segmentation fault. Please select a version before b5028. For more details, please visit: ggml-org/llama.cpp#9289 (comment)

Clone the llama.cpp source code:
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
If the segmentation fault has not been fixed yet, switch to an older version:
git reset --hard b5026
Note: set `BUILD_SHARED_LIBS` to `ON` if you want to use `llama-cpp-python` later:
cmake -B build-android \
-DBUILD_SHARED_LIBS=ON \
-DGGML_OPENCL=ON \
-DGGML_OPENCL_EMBED_KERNELS=ON \
-DGGML_OPENCL_USE_ADRENO_KERNELS=ON
Build llama.cpp:
cmake --build build-android --config Release
As in the previous section, test with:
./llama-bench -m /your/model/directory -ngl 99
Note: Make sure you have set `BUILD_SHARED_LIBS=ON` when building llama.cpp.

First, you need to set the environment variable:
vim ~/.bashrc
Export the following path:
# The path of libllama.so is under llama.cpp/build-android/bin
export LLAMA_CPP_LIB_PATH=/home/.../llama.cpp/build-android/bin
source ~/.bashrc
Now, install `llama-cpp-python` with `LLAMA_BUILD=OFF`:
CMAKE_ARGS="-DLLAMA_BUILD=OFF" pip install llama-cpp-python --force-reinstall
To test it, run:
python -c "from llama_cpp import Llama; llm = Llama(model_path='your/model/path', n_gpu_layers=99, verbose=True)"
If it prints verbose information like this:
...
load_tensors: offloading 28 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors: CPU_Mapped model buffer size = 434.29 MiB
load_tensors: OpenCL model buffer size = 1780.40 MiB
...
This means that `llama-cpp-python` is using the custom-built `llama.cpp`.
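From here you can use the normal llama-cpp-python API on top of the custom backend; a minimal sketch (the model path is an example):

# minimal_chat.py - runs a single chat turn on the Adreno-accelerated backend
from llama_cpp import Llama

llm = Llama(
    model_path="/data/data/com.termux/files/home/models/llama-3.2-3b-q4_0.gguf",  # example path
    n_gpu_layers=99,  # offload all layers to the GPU via OpenCL
    n_ctx=2048,
    verbose=False,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hello from my phone."}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])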
Device: OnePlus 13
model | size | params | backend | ngl | test | t/s |
---|---|---|---|---|---|---|
llama 3.2 3B Q4_0 | 1.78 GiB | 3.21 B | OpenCL | 99 | pp512 | 168.47 ± 0.98 |
llama 3.2 3B Q4_0 | 1.78 GiB | 3.21 B | OpenCL | 99 | tg128 | 17.57 ± 1.40 |
llama 3.2 3B Q4_0 | 1.78 GiB | 3.21 B | OpenCL | 99 | tg256 | 15.45 ± 0.59 |
llama 3.2 3B Q4_0 | 1.78 GiB | 3.21 B | OpenCL | 99 | tg512 | 13.60 ± 0.79 |
gemma 2 2B Q4_0 | 1.51 GiB | 2.61 B | OpenCL | 99 | pp512 | 187.26 ± 0.49 |
gemma 2 2B Q4_0 | 1.51 GiB | 2.61 B | OpenCL | 99 | tg128 | 8.11 ± 0.13 |
gemma 2 2B Q4_0 | 1.51 GiB | 2.61 B | OpenCL | 99 | tg256 | 8.06 ± 0.07 |
gemma 2 2B Q4_0 | 1.51 GiB | 2.61 B | OpenCL | 99 | tg512 | 8.02 ± 0.11 |
phi 3 3.5 mini Q4_0 | 2.03 GiB | 3.82 B | OpenCL | 99 | pp512 | 110.60 ± 11.96 |
phi 3 3.5 mini Q4_0 | 2.03 GiB | 3.82 B | OpenCL | 99 | tg128 | 16.76 ± 0.37 |
phi 3 3.5 mini Q4_0 | 2.03 GiB | 3.82 B | OpenCL | 99 | tg256 | 16.10 ± 0.53 |
phi 3 3.5 mini Q4_0 | 2.03 GiB | 3.82 B | OpenCL | 99 | tg512 | 14.89 ± 0.16 |
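For reference, numbers like these come from llama-bench runs of the following form; a sketch (model paths are examples, and -p/-n select the pp512 and tg128/tg256/tg512 tests):

./llama-bench -m ~/models/llama-3.2-3b-q4_0.gguf -ngl 99 -p 512 -n 128,256,512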