
Commit 3977eea

Merge pull request #310 from gjmulder/auto-docker
Auto docker v2 - dockerised Open Llama 3B image w/OpenBLAS enabled server
2 parents 71f4582 + 30d32e9 commit 3977eea

9 files changed: 128 additions & 40 deletions

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -164,3 +164,6 @@ cython_debug/
 # and can be added to the global gitignore or merged into this file. For a more nuclear
 # option (not recommended) you can uncomment the following to ignore the entire idea folder.
 .idea/
+
+# downloaded model .bin files
+docker/open_llama/*.bin

docker/README.md

Lines changed: 45 additions & 25 deletions
@@ -1,46 +1,66 @@
-# Dockerfiles for building the llama-cpp-python server
-- `Dockerfile.openblas_simple` - a simple Dockerfile for non-GPU OpenBLAS
-- `Dockerfile.cuda_simple` - a simple Dockerfile for CUDA accelerated CuBLAS
-- `hug_model.py` - a Python utility for interactively choosing and downloading the latest `5_1` quantized models from [huggingface.co/TheBloke](https://huggingface.co/TheBloke)
-- `Dockerfile` - a single OpenBLAS and CuBLAS combined Dockerfile that automatically installs a previously downloaded model `model.bin`
-
-# Get model from Hugging Face
-`python3 ./hug_model.py`
+# Install Docker Server
+
+**Note #1:** This was tested with Docker running on Linux. If you can get it working on Windows or MacOS, please update this `README.md` with a PR!
+
+[Install Docker Engine](https://docs.docker.com/engine/install)
+
+**Note #2:** NVidia GPU CuBLAS support requires a NVidia GPU with sufficient VRAM (approximately as much as the size in the table below) and Docker NVidia support (see [container-toolkit/install-guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html))
 
+# Simple Dockerfiles for building the llama-cpp-python server with external model bin files
+## openblas_simple - a simple Dockerfile for non-GPU OpenBLAS, where the model is located outside the Docker image
+```
+cd ./openblas_simple
+docker build -t openblas_simple .
+docker run -e USE_MLOCK=0 -e MODEL=/var/model/<model-path> -v <model-root-path>:/var/model -t openblas_simple
+```
+where `<model-root-path>/<model-path>` is the full path to the model file on the Docker host system.
+
+## cuda_simple - a simple Dockerfile for CUDA accelerated CuBLAS, where the model is located outside the Docker image
+```
+cd ./cuda_simple
+docker build -t cuda_simple .
+docker run -e USE_MLOCK=0 -e MODEL=/var/model/<model-path> -v <model-root-path>:/var/model -t cuda_simple
+```
+where `<model-root-path>/<model-path>` is the full path to the model file on the Docker host system.
+
+# "Open-Llama-in-a-box"
+## Download an Apache V2.0 licensed 3B parameter Open Llama model and install it into a Docker image that runs an OpenBLAS-enabled llama-cpp-python server
+```
+cd ./open_llama
+./build.sh
+./start.sh
+```
+
+# Manually choose your own Llama model from Hugging Face
+`python3 ./hug_model.py -a TheBloke -t llama`
 You should now have a model in the current directory and `model.bin` symlinked to it for the subsequent Docker build and copy step. e.g.
 ```
 docker $ ls -lh *.bin
--rw-rw-r-- 1 user user 4.8G May 23 18:30 <downloaded-model-file>.q5_1.bin
-lrwxrwxrwx 1 user user 24 May 23 18:30 model.bin -> <downloaded-model-file>.q5_1.bin
+-rw-rw-r-- 1 user user 4.8G May 23 18:30 <downloaded-model-file>q5_1.bin
+lrwxrwxrwx 1 user user 24 May 23 18:30 model.bin -> <downloaded-model-file>q5_1.bin
 ```
 **Note #1:** Make sure you have enough disk space to download the model. As the model is then copied into the image you will need at least
 **TWICE** as much disk space as the size of the model:
 
 | Model | Quantized size |
 |------:|----------------:|
+| 3B | 3 GB |
 | 7B | 5 GB |
 | 13B | 10 GB |
-| 30B | 25 GB |
+| 33B | 25 GB |
 | 65B | 50 GB |
 
 **Note #2:** If you want to pass or tune additional parameters, customise `./start_server.sh` before running `docker build ...`
 
-# Install Docker Server
-
-**Note #3:** This was tested with Docker running on Linux. If you can get it working on Windows or MacOS, please update this `README.md` with a PR!
-
-[Install Docker Engine](https://docs.docker.com/engine/install)
-
-# Use OpenBLAS
+## Use OpenBLAS
 Use if you don't have a NVidia GPU. Defaults to `python:3-slim-bullseye` Docker base image and OpenBLAS:
-## Build:
-`docker build --build-arg -t openblas .`
-## Run:
+### Build:
+`docker build -t openblas .`
+### Run:
 `docker run --cap-add SYS_RESOURCE -t openblas`
 
-# Use CuBLAS
-Requires a NVidia GPU with sufficient VRAM (approximately as much as the size above) and Docker NVidia support (see [container-toolkit/install-guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html))
-## Build:
+## Use CuBLAS
+### Build:
 `docker build --build-arg IMAGE=nvidia/cuda:12.1.1-devel-ubuntu22.04 -t cublas .`
-## Run:
+### Run:
 `docker run --cap-add SYS_RESOURCE -t cublas`
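Whichever image is built, the resulting container serves an OpenAI-style completions endpoint on port 8000 (the port `start.sh` publishes). As a minimal client-side sketch, assuming the container is running with `-p 8000:8000` and that the `requests` package is installed (it is not part of this commit), the endpoint can be exercised like this:

```python
# Minimal sketch of a client for the llama-cpp-python server started above.
# Assumes the container was run with -p 8000:8000 and that `requests` is installed.
import requests

payload = {
    # Same prompt and stop sequences as the test in docker/open_llama/start.sh
    "prompt": "\n\n### Instructions:\nWhat is the capital of France?\n\n### Response:\n",
    "stop": ["\n", "###"],
}

# POST to the OpenAI-compatible completions endpoint exposed by llama_cpp.server
resp = requests.post("http://localhost:8000/v1/completions", json=payload, timeout=60)
resp.raise_for_status()

# The response follows the OpenAI completions shape: generated text is in choices[0]["text"]
print(resp.json()["choices"][0]["text"])
```

This is the same check `start.sh` performs with `curl`, just expressed in Python.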

docker/Dockerfile.cuda_simple renamed to docker/cuda_simple/Dockerfile

Lines changed: 2 additions & 2 deletions
@@ -1,5 +1,5 @@
 ARG CUDA_IMAGE="12.1.1-devel-ubuntu22.04"
-FROM ${CUDA_IMAGE}
+FROM nvidia/cuda:${CUDA_IMAGE}
 
 # We need to set the host to 0.0.0.0 to allow outside access
 ENV HOST 0.0.0.0
@@ -10,7 +10,7 @@ COPY . .
 RUN apt update && apt install -y python3 python3-pip
 RUN python3 -m pip install --upgrade pip pytest cmake scikit-build setuptools fastapi uvicorn sse-starlette
 
-RUN LLAMA_CUBLAS=1 python3 setup.py develop
+RUN LLAMA_CUBLAS=1 pip install llama-cpp-python
 
 # Run the server
 CMD python3 -m llama_cpp.server
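Both the `cuda_simple` and `openblas_simple` images end with `CMD python3 -m llama_cpp.server`, which serves whatever model the `MODEL` environment variable points at. For context, here is a rough sketch of using the underlying `llama_cpp` bindings directly instead of the server; the model path is only illustrative (it mirrors the `/var/model` mount shown in the README):

```python
# Sketch of what llama_cpp.server wraps: load a GGML model with the llama_cpp
# bindings and generate a completion. The model path is an assumption for
# illustration, matching the volume mount used by the simple Dockerfiles.
from llama_cpp import Llama

llm = Llama(model_path="/var/model/model.bin")

# Prompt and stop sequences mirror the start.sh smoke test
output = llm(
    "\n\n### Instructions:\nWhat is the capital of France?\n\n### Response:\n",
    max_tokens=32,
    stop=["\n", "###"],
)
print(output["choices"][0]["text"])
```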
File renamed without changes.

docker/open_llama/build.sh

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+#!/bin/sh
+
+MODEL="open_llama_3b"
+# Get open_llama_3b_ggml q5_1 quantization
+python3 ./hug_model.py -a SlyEcho -s ${MODEL} -f "q5_1"
+ls -lh *.bin
+
+# Build the default OpenBLAS image
+docker build -t $MODEL .
+docker images | egrep "^(REPOSITORY|$MODEL)"
+
+echo
+echo "To start the docker container run:"
+echo "docker run -t -p 8000:8000 $MODEL"

docker/hug_model.py renamed to docker/open_llama/hug_model.py

Lines changed: 34 additions & 11 deletions
@@ -2,6 +2,7 @@
 import json
 import os
 import struct
+import argparse
 
 def make_request(url, params=None):
     print(f"Making request to {url}...")
@@ -69,21 +70,30 @@ def get_user_choice(model_list):
 
     return None
 
-import argparse
-
 def main():
     # Create an argument parser
-    parser = argparse.ArgumentParser(description='Process the model version.')
+    parser = argparse.ArgumentParser(description='Process some parameters.')
+
+    # Arguments
     parser.add_argument('-v', '--version', type=int, default=0x0003,
-                        help='an integer for the version to be used')
+                        help='hexadecimal version number of ggml file')
+    parser.add_argument('-a', '--author', type=str, default='TheBloke',
+                        help='HuggingFace author filter')
+    parser.add_argument('-t', '--tag', type=str, default='llama',
+                        help='HuggingFace tag filter')
+    parser.add_argument('-s', '--search', type=str, default='',
+                        help='HuggingFace search filter')
+    parser.add_argument('-f', '--filename', type=str, default='q5_1',
+                        help='HuggingFace model repository filename substring match')
 
     # Parse the arguments
     args = parser.parse_args()
 
     # Define the parameters
     params = {
-        "author": "TheBloke", # Filter by author
-        "tags": "llama"
+        "author": args.author,
+        "tags": args.tag,
+        "search": args.search
     }
 
     models = make_request('https://huggingface.co/api/models', params=params)
@@ -100,17 +110,30 @@ def main():
 
         for sibling in model_info.get('siblings', []):
             rfilename = sibling.get('rfilename')
-            if rfilename and 'q5_1' in rfilename:
+            if rfilename and args.filename in rfilename:
                 model_list.append((model_id, rfilename))
 
-    model_choice = get_user_choice(model_list)
+    # Choose the model
+    model_list.sort(key=lambda x: x[0])
+    if len(model_list) == 0:
+        print("No models found")
+        exit(1)
+    elif len(model_list) == 1:
+        model_choice = model_list[0]
+    else:
+        model_choice = get_user_choice(model_list)
+
     if model_choice is not None:
         model_id, rfilename = model_choice
        url = f"https://huggingface.co/{model_id}/resolve/main/{rfilename}"
-        download_file(url, rfilename)
-        _, version = check_magic_and_version(rfilename)
+        dest = f"{model_id.replace('/', '_')}_{rfilename}"
+        download_file(url, dest)
+        _, version = check_magic_and_version(dest)
        if version != args.version:
-            print(f"Warning: Expected version {args.version}, but found different version in the file.")
+            print(f"Warning: Expected version {args.version}, but found different version in the file.")
+    else:
+        print("Error - model choice was None")
+        exit(2)
 
 if __name__ == '__main__':
     main()
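For reference, the Hugging Face queries that `hug_model.py` drives with these new filters can be sketched standalone. The sketch below uses `requests` rather than the script's own `make_request` helper, the author/search values are just the defaults `build.sh` passes, and the `siblings`/`rfilename` field names follow the usage visible in the diff above:

```python
# Standalone sketch of the Hugging Face model lookup performed by hug_model.py.
# Assumptions: `requests` is installed; filter values mirror build.sh's defaults.
import requests

params = {"author": "SlyEcho", "search": "open_llama_3b"}
models = requests.get("https://huggingface.co/api/models", params=params, timeout=30).json()

model_list = []
for model in models:
    # Field name for the repo id varies across HF API versions
    model_id = model.get("modelId") or model.get("id")
    # Fetch per-model metadata to list its files ("siblings")
    info = requests.get(f"https://huggingface.co/api/models/{model_id}", timeout=30).json()
    for sibling in info.get("siblings", []):
        rfilename = sibling.get("rfilename")
        if rfilename and "q5_1" in rfilename:  # same substring match as --filename
            model_list.append((model_id, rfilename))

# Print the resolvable download URLs, sorted the same way the script sorts its list
for model_id, rfilename in sorted(model_list):
    print(f"https://huggingface.co/{model_id}/resolve/main/{rfilename}")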

docker/open_llama/start.sh

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
+#!/bin/sh
+
+MODEL="open_llama_3b"
+
+# Start Docker container
+docker run --cap-add SYS_RESOURCE -p 8000:8000 -t $MODEL &
+sleep 10
+echo
+docker ps | egrep "(^CONTAINER|$MODEL)"
+
+# Test the model works
+echo
+curl -X 'POST' 'http://localhost:8000/v1/completions' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{
+    "prompt": "\n\n### Instructions:\nWhat is the capital of France?\n\n### Response:\n",
+    "stop": [
+        "\n",
+        "###"
+    ]
+}' | grep Paris
+if [ $? -eq 0 ]
+then
+    echo
+    echo "$MODEL is working!!"
+else
+    echo
+    echo "ERROR: $MODEL not replying."
+    exit 1
+fi

docker/start_server.sh renamed to docker/open_llama/start_server.sh

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 #!/bin/sh
 
-# For mmap support
+# For mlock support
 ulimit -l unlimited
 
 if [ "$IMAGE" = "python:3-slim-bullseye" ]; then

docker/Dockerfile.openblas_simple renamed to docker/openblas_simple/Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ COPY . .
 RUN apt update && apt install -y libopenblas-dev ninja-build build-essential
 RUN python -m pip install --upgrade pip pytest cmake scikit-build setuptools fastapi uvicorn sse-starlette
 
-RUN LLAMA_OPENBLAS=1 python3 setup.py develop
+RUN LLAMA_OPENBLAS=1 pip install llama_cpp_python --verbose
 
 # Run the server
 CMD python3 -m llama_cpp.server
