
[Feat] Multi model support #931


Merged

merged 53 commits into abetlen:main on Dec 22, 2023

Conversation

@D4ve-R (Contributor) commented Nov 21, 2023

Added multi model support

Changes

  • Llama class handles chat_handler initialization internally in __init__(), based on chat_format
  • Llama class handles cache initialization internally in __init__()
  • Change the model request parameter in the API routes from optional to mandatory
  • Add a MultiLlama class that handles multiple models
  • Move settings into their own file

Description

The MultiLlama class handles multiple models in the given directory.
It loads models lazily based on the model request parameter, so the first API call (or a call after a model change) will take some time.
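
A minimal sketch of that lazy-loading idea (the class layout and parameter names here are illustrative only, not the exact code in this PR):

import os
import llama_cpp

class MultiLlama:
    """Lazily load one Llama model at a time from a directory of .gguf files."""

    def __init__(self, model_dir: str, **llama_kwargs):
        # Map model names (file stems) to their .gguf paths.
        self._paths = {
            os.path.splitext(name)[0]: os.path.join(model_dir, name)
            for name in os.listdir(model_dir)
            if name.endswith(".gguf")
        }
        self._llama_kwargs = llama_kwargs
        self._current_name = None
        self._current_model = None

    def __call__(self, name: str) -> "llama_cpp.Llama":
        # Only instantiate Llama when a different model is requested,
        # which is why the first call (or a model switch) takes a while.
        if name != self._current_name:
            self._current_model = llama_cpp.Llama(
                model_path=self._paths[name], **self._llama_kwargs
            )
            self._current_name = name
        return self._current_model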

Usage

/models
|__mistral-orca.gguf
|__mistral-instruct-v0.1.gguf
$ python3 -m llama_cpp.server --model /models

# works too, but the model request param is still mandatory
$ python3 -m llama_cpp.server --model /models/mistral-orca.gguf

from openai import OpenAI
client = OpenAI(api_key="sk-oss4win")
chat_completion = client.chat.completions.create(
    model="mistral-orca", # or mistral-instruct-v0.1,
    messages=[...],
)

@D4ve-R (Contributor, Author) commented Nov 21, 2023

Closes #906

@D4ve-R D4ve-R marked this pull request as draft November 21, 2023 17:34
@D4ve-R D4ve-R marked this pull request as ready for review November 21, 2023 17:55
@limoncc commented Nov 22, 2023

This is what I want.

@abetlen (Owner) commented Nov 22, 2023

Hi @D4ve-R thanks for the PR. Just a couple of notes:

  • Avoid the llama.py changes that add the clip path parameter, I think the chat handler approach is best right now while the api stabilises.
  • Don't change model to mandatory, this will cause the server to break for many users, it's better to have a default model set by the user or default to the first model.
  • For loading multiple models I think it would make sense to use a json file with parameters and 1 settings object per model

Something like:

{
   "host": "...",
   "port": "...",
   "models": [{
       "model": "models/mistral-7b/...",
       "chat_format": "mistral"
   }]
}

This way we can re-use the pydantic settings object inside some kind of config file settings object that just parses the json.
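
A rough sketch of what that could look like, assuming plain pydantic models; the field names shown are only a small, illustrative subset of the real settings:

import json
from typing import List, Optional

from pydantic import BaseModel

class ModelSettings(BaseModel):
    # Per-model settings; the real settings object has many more fields.
    model: str
    model_alias: Optional[str] = None
    chat_format: Optional[str] = None

class ConfigFileSettings(BaseModel):
    # Server-level options at the root, plus one settings object per model.
    host: str = "localhost"
    port: int = 8000
    models: List[ModelSettings] = []

# Parsing the JSON config then becomes a one-liner:
with open("config.json") as f:
    config = ConfigFileSettings(**json.load(f))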

@D4ve-R D4ve-R marked this pull request as draft November 22, 2023 15:16
@D4ve-R (Contributor, Author) commented Nov 22, 2023

Hi @abetlen, thanks for the feedback & awesome project!

>   • Avoid the llama.py changes that add the clip path parameter, I think the chat handler approach is best right now while the api stabilises.
>   • Don't change model to mandatory, this will cause the server to break for many users, it's better to have a default model set by the user or default to the first model.

I will revert the changes concerning the clip_model & model param.

>   • For loading multiple models I think it would make sense to use a json file with parameters and 1 settings object per model

> Something like:
>
> {
>    "host": "...",
>    "port": "...",
>    "models": [{
>        "model": "models/mistral-7b/...",
>        "chat_format": "mistral"
>    }]
> }
>
> This way we can re-use the pydantic settings object inside some kind of config file settings object that just parses the json.

I was thinking about something like this too; I will try to make something work.

@D4ve-R (Contributor, Author) commented Nov 22, 2023

Alternatively/additionally, the app could expose a new "/settings" endpoint to set settings via the API. @abetlen, what are your thoughts on that?

@D4ve-R (Contributor, Author) commented Nov 22, 2023

Okay, so loading individual settings as JSON works now.
If loading [model].json fails, it will fall back to using the global settings.

Usage:

/models
|__mistral-orca.gguf
|__mistral-orca.json
|__mistral-instruct-v0.1.gguf

[model].json

{
  "n_gpu_layers": 0,
  "main_gpu": 0,
  "tensor_split": null,
  "vocab_only": false,
  "use_mmap": true,
  "use_mlock": true,
  "seed": 4294967295,
  "n_ctx": 2048,
  "n_batch": 512,
  "n_threads": 4,
  "n_threads_batch": 4,
  "rope_scaling_type": -1,
  "rope_freq_base": 0.0,
  "rope_freq_scale": 0.0,
  "yarn_ext_factor": -1.0,
  "yarn_attn_factor": 1.0,
  "yarn_beta_fast": 32.0,
  "yarn_beta_slow": 1.0,
  "yarn_orig_ctx": 0,
  "mul_mat_q": true,
  "f16_kv": true,
  "logits_all": true,
  "embedding": true,
  "last_n_tokens_size": 64,
  "lora_base": null,
  "lora_path": null,
  "numa": false,
  "chat_format": "llama-2",
  "clip_model_path": null,
  "cache": false,
  "cache_type": "ram",
  "cache_size": 2147483648,
  "verbose": true
}
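
A minimal sketch of that fallback behaviour, treating settings as plain dicts for brevity (the helper name is hypothetical):

import json
import os

def load_model_settings(model_path: str, global_settings: dict) -> dict:
    # Look for a sidecar JSON next to the model file,
    # e.g. mistral-orca.json alongside mistral-orca.gguf.
    json_path = os.path.splitext(model_path)[0] + ".json"
    try:
        with open(json_path) as f:
            overrides = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        # Fall back to the global settings if there is no per-model
        # file or it cannot be parsed.
        return dict(global_settings)
    return {**global_settings, **overrides}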

@abetlen (Owner) commented Nov 22, 2023

Great work.

> Alternatively/additionally, the app could expose a new "/settings" endpoint to set settings via the API. @abetlen, what are your thoughts on that?

Admin routes should be easy to set up given that all the settings are just pydantic models but probably out of scope for this PR.

> Okay, so loading individual settings as JSON works now. If loading [model].json fails, it will fall back to using the global settings.
>
> Usage:
>
> /models
> |__mistral-orca.gguf
> |__mistral-orca.json
> |__mistral-instruct-v0.1.gguf
>
> [model].json
>
> {
>   "n_gpu_layers": 0,
>   "main_gpu": 0,
>   "tensor_split": null,
>   "vocab_only": false,
>   "use_mmap": true,
>   "use_mlock": true,
>   "seed": 4294967295,
>   "n_ctx": 2048,
>   "n_batch": 512,
>   "n_threads": 4,
>   "n_threads_batch": 4,
>   "rope_scaling_type": -1,
>   "rope_freq_base": 0.0,
>   "rope_freq_scale": 0.0,
>   "yarn_ext_factor": -1.0,
>   "yarn_attn_factor": 1.0,
>   "yarn_beta_fast": 32.0,
>   "yarn_beta_slow": 1.0,
>   "yarn_orig_ctx": 0,
>   "mul_mat_q": true,
>   "f16_kv": true,
>   "logits_all": true,
>   "embedding": true,
>   "last_n_tokens_size": 64,
>   "lora_base": null,
>   "lora_path": null,
>   "numa": false,
>   "chat_format": "llama-2",
>   "clip_model_path": null,
>   "cache": false,
>   "cache_type": "ram",
>   "cache_size": 2147483648,
>   "verbose": true
> }

I think we should stick to a single config file that can act as a replacement for the server's CLI arguments, i.e. you either specify settings for a single model on the CLI, or you pass a --config <path> that points to a single config.json.

So something more like:

> /models
> |__mistral-7b.gguf
> |__open-hermes-2.5-7b.gguf

config.json

{
    "models": [{
         "model": "models/mistral-7b.gguf",
         "model_alias": "text-davinci-003"
   }, {
         "model": "models/open-hermes-2.5-7b.gguf",
         "model_alias": "gpt-3.5-turbo"
   }]
}

The advantage is that you only need to manage/change a single config file, so you can swap configs easily, and we can also start to add server-global options at the root of this document.

@bioshazard (Contributor) commented Nov 30, 2023

Thank you for working on this!!

@abetlen (Owner) commented Dec 22, 2023

Okay I've got all the merge conflicts resolved and I did some general refactoring of the server submodule. It's a lot lighter than the original PR but that was necessary to avoid any breaking changes and I plan to reintegrate the additional features one at a time.

Overview

You can now pass a --config_file argument or use the CONFIG_FILE environment variable to load a server config file that can include multiple models. Here's an example of the config file:

{
    "host": "0.0.0.0",
    "port": 8080,
    "models": [
        {
            "model": "models/OpenHermes-2.5-Mistral-7B-GGUF/openhermes-2.5-mistral-7b.Q4_K_M.gguf",
            "model_alias": "gpt-3.5-turbo",
            "chat_format": "chatml",
            "n_gpu_layers": -1,
            "offload_kqv": true,
            "n_threads": 12,
            "n_batch": 512,
            "n_ctx": 2048
        },
        {
            "model": "models/OpenHermes-2.5-Mistral-7B-GGUF/openhermes-2.5-mistral-7b.Q4_K_M.gguf",
            "model_alias": "gpt-4",
            "chat_format": "chatml",
            "n_gpu_layers": -1,
            "offload_kqv": true,
            "n_threads": 12,
            "n_batch": 512,
            "n_ctx": 2048
        },
        {
            "model": "models/ggml_llava-v1.5-7b/ggml-model-q4_k.gguf",
            "model_alias": "gpt-4-vision-preview",
            "chat_format": "llava-1-5",
            "clip_model_path": "models/ggml_llava-v1.5-7b/mmproj-model-f16.gguf",
            "n_gpu_layers": -1,
            "offload_kqv": true,
            "n_threads": 12,
            "n_batch": 512,
            "n_ctx": 2048
        },
        {
            "model": "models/mistral-7b-v0.1-GGUF/ggml-model-Q4_K.gguf",
            "model_alias": "text-davinci-003",
            "n_gpu_layers": -1,
            "offload_kqv": true,
            "n_threads": 12,
            "n_batch": 512,
            "n_ctx": 2048
        },
        {
            "model": "models/replit-code-v1_5-3b-GGUF/replit-code-v1_5-3b.Q4_0.gguf",
            "model_alias": "copilot-codex",
            "n_gpu_layers": -1,
            "offload_kqv": true,
            "n_threads": 12,
            "n_batch": 1024,
            "n_ctx": 9216
        }
    ]
}
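
Assuming the file above is saved as config.json, the server can then be started via either of the two options mentioned above:

$ python3 -m llama_cpp.server --config_file config.json

# or via the environment variable
$ CONFIG_FILE=config.json python3 -m llama_cpp.server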

The server selects the appropriate model based on the model key passed in the request and the model_alias defined in the config file.
If the selected model is not already loaded, it is loaded on the fly.
The server currently keeps only a single model loaded at a time and automatically swaps to the requested model when a different one is selected.
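
For example, with the config above, an OpenAI client selects a model purely by its alias (a sketch; the api_key value is a placeholder):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")
chat_completion = client.chat.completions.create(
    model="gpt-3.5-turbo",  # served by openhermes-2.5-mistral-7b per the config above
    messages=[{"role": "user", "content": "Hello!"}],
)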

Next steps

  • Support for setting default model fields to make the config file less verbose.
  • Support for additional config file formats (yaml / toml / ini) so the config can support comments and multiline strings.
  • Support a models_dir or auto-generating a config file from a script to make it easier to get up and running with the config files.
  • Automatically unload a model after some period of time.
  • Documentation and examples.

@abetlen (Owner) commented Dec 22, 2023

Here's a handy bash one-liner to traverse all of the .gguf files in a directory and generate a skeleton config.

find path/to/directory -name '*.gguf' -size +1k -printf '%p\n' | jq -R -s -c '{"models": [split("\n")[:-1][] | {"model": .}]}'
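
Run against the /models directory from the earlier example (mistral-7b.gguf and open-hermes-2.5-7b.gguf), it should produce a compact skeleton along these lines, with the exact paths and ordering depending on the directory passed to find:

{"models":[{"model":"models/mistral-7b.gguf"},{"model":"models/open-hermes-2.5-7b.gguf"}]}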

abetlen merged commit 12b7f2f into abetlen:main on Dec 22, 2023
domdomegg added a commit to domdomegg/llama-cpp-python that referenced this pull request on Oct 5, 2024:

The 'model' parameter has been supported since abetlen#931. Its placement in this section was copied from an older version of the file, and hasn't been corrected since.

Correcting this will make it clearer what parameters are supported by llama-cpp-python.