
[Feat] Multi model support #931


Merged

merged 53 commits into abetlen:main on Dec 22, 2023

Conversation

@D4ve-R (Contributor) commented Nov 21, 2023

Added multi model support

Changes

  • Llama class handles chat_handler initialization internally in __init__(), based on chat_format
  • Llama class handles cache initialization internally in __init__()
  • Change the model request parameter in the API routes from optional to mandatory
  • Add a MultiLlama class that handles multiple models
  • Move settings into their own file

Description

The MultiLlama class handles multiple models in the given directory.
It loads models lazily based on the model request parameter, so the first API call (or a call after a model change) will take some time.
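
A minimal sketch of that lazy-loading idea (the class layout and parameter names here are illustrative only, not the exact code in this PR):

import os
import llama_cpp

class MultiLlama:
    """Lazily load one Llama model at a time from a directory of .gguf files."""

    def __init__(self, model_dir: str, **llama_kwargs):
        # Map model names (file stems) to their .gguf paths.
        self._paths = {
            os.path.splitext(name)[0]: os.path.join(model_dir, name)
            for name in os.listdir(model_dir)
            if name.endswith(".gguf")
        }
        self._llama_kwargs = llama_kwargs
        self._current_name = None
        self._current_model = None

    def __call__(self, name: str) -> "llama_cpp.Llama":
        # Only instantiate Llama when a different model is requested,
        # which is why the first call (or a model switch) takes a while.
        if name != self._current_name:
            self._current_model = llama_cpp.Llama(
                model_path=self._paths[name], **self._llama_kwargs
            )
            self._current_name = name
        return self._current_model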

Usage

/models
|__mistral-orca.gguf
|__mistral-instruct-v0.1.gguf
$ python3 -m llama_cpp.server --model /models

# works too, but the model request param is still mandatory
$ python3 -m llama_cpp.server --model /models/mistral-orca.gguf

from openai import OpenAI
client = OpenAI(api_key="sk-oss4win")
chat_completion = client.chat.completions.create(
    model="mistral-orca", # or mistral-instruct-v0.1,
    messages=[...],
)

@D4ve-R (Contributor, Author) commented Nov 21, 2023

Closes #906

@D4ve-R D4ve-R marked this pull request as draft November 21, 2023 17:34
@D4ve-R D4ve-R marked this pull request as ready for review November 21, 2023 17:55
@limoncc commented Nov 22, 2023

This is what I want.

@abetlen (Owner) commented Nov 22, 2023

Hi @D4ve-R thanks for the PR. Just a couple of notes:

  • Avoid the llama.py changes that add the clip path parameter, I think the chat handler approach is best right now while the api stabilises.
  • Don't change model to mandatory, this will cause the server to break for many users, it's better to have a default model set by the user or default to the first model.
  • For loading multiple models I think it would make sense to use a json file with parameters and 1 settings object per model

Something like:

{
   "host": "...",
   "port": "...",
   "models": [{
       "model": "models/mistral-7b/...",
       "chat_format": "mistral"
   }]
}

This way we can re-use the pydantic settings object inside some kind of config file settings object that just parses the json.
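
A rough sketch of what that could look like, assuming plain pydantic models; the field names shown are only a small, illustrative subset of the real settings:

import json
from typing import List, Optional

from pydantic import BaseModel

class ModelSettings(BaseModel):
    # Per-model settings; the real settings object has many more fields.
    model: str
    model_alias: Optional[str] = None
    chat_format: Optional[str] = None

class ConfigFileSettings(BaseModel):
    # Server-level options at the root, plus one settings object per model.
    host: str = "localhost"
    port: int = 8000
    models: List[ModelSettings] = []

# Parsing the JSON config then becomes a one-liner:
with open("config.json") as f:
    config = ConfigFileSettings(**json.load(f))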

@D4ve-R D4ve-R marked this pull request as draft November 22, 2023 15:16
@D4ve-R (Contributor, Author) commented Nov 22, 2023

Hi @abetlen, thanks for the feedback & awesome project!

>   • Avoid the llama.py changes that add the clip path parameter, I think the chat handler approach is best right now while the api stabilises.
>   • Don't change model to mandatory, this will cause the server to break for many users, it's better to have a default model set by the user or default to the first model.

I will revert the changes concerning the clip_model & model param.

>   • For loading multiple models I think it would make sense to use a json file with parameters and 1 settings object per model

> Something like:
>
> {
>    "host": "...",
>    "port": "...",
>    "models": [{
>        "model": "models/mistral-7b/...",
>        "chat_format": "mistral"
>    }]
> }
>
> This way we can re-use the pydantic settings object inside some kind of config file settings object that just parses the json.

I was thinking about something like this too; I will try to make something work.

@D4ve-R (Contributor, Author) commented Nov 22, 2023

Alternatively/additionally, the app could expose a new "/settings" endpoint to set settings via the API. @abetlen, what are your thoughts on that?

@D4ve-R (Contributor, Author) commented Nov 22, 2023

Okay, so loading individual settings as JSON works now.
If loading [model].json fails, it will fall back to using the global settings.

Usage:

/models
|__mistral-orca.gguf
|__mistral-orca.json
|__mistral-instruct-v0.1.gguf

[model].json

{
  "n_gpu_layers": 0,
  "main_gpu": 0,
  "tensor_split": null,
  "vocab_only": false,
  "use_mmap": true,
  "use_mlock": true,
  "seed": 4294967295,
  "n_ctx": 2048,
  "n_batch": 512,
  "n_threads": 4,
  "n_threads_batch": 4,
  "rope_scaling_type": -1,
  "rope_freq_base": 0.0,
  "rope_freq_scale": 0.0,
  "yarn_ext_factor": -1.0,
  "yarn_attn_factor": 1.0,
  "yarn_beta_fast": 32.0,
  "yarn_beta_slow": 1.0,
  "yarn_orig_ctx": 0,
  "mul_mat_q": true,
  "f16_kv": true,
  "logits_all": true,
  "embedding": true,
  "last_n_tokens_size": 64,
  "lora_base": null,
  "lora_path": null,
  "numa": false,
  "chat_format": "llama-2",
  "clip_model_path": null,
  "cache": false,
  "cache_type": "ram",
  "cache_size": 2147483648,
  "verbose": true
}
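
A minimal sketch of that fallback behaviour, treating settings as plain dicts for brevity (the helper name is hypothetical):

import json
import os

def load_model_settings(model_path: str, global_settings: dict) -> dict:
    # Look for a sidecar JSON next to the model file,
    # e.g. mistral-orca.json alongside mistral-orca.gguf.
    json_path = os.path.splitext(model_path)[0] + ".json"
    try:
        with open(json_path) as f:
            overrides = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        # Fall back to the global settings if there is no per-model
        # file or it cannot be parsed.
        return dict(global_settings)
    return {**global_settings, **overrides}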

@abetlen (Owner) commented Nov 22, 2023

Great work.

> Alternatively/additionally, the app could expose a new "/settings" endpoint to set settings via the API. @abetlen, what are your thoughts on that?

Admin routes should be easy to set up given that all the settings are just pydantic models but probably out of scope for this PR.

> Okay, so loading individual settings as JSON works now. If loading [model].json fails, it will fall back to using the global settings.
>
> Usage:
>
> /models
> |__mistral-orca.gguf
> |__mistral-orca.json
> |__mistral-instruct-v0.1.gguf
>
> [model].json
>
> {
>   "n_gpu_layers": 0,
>   "main_gpu": 0,
>   "tensor_split": null,
>   "vocab_only": false,
>   "use_mmap": true,
>   "use_mlock": true,
>   "seed": 4294967295,
>   "n_ctx": 2048,
>   "n_batch": 512,
>   "n_threads": 4,
>   "n_threads_batch": 4,
>   "rope_scaling_type": -1,
>   "rope_freq_base": 0.0,
>   "rope_freq_scale": 0.0,
>   "yarn_ext_factor": -1.0,
>   "yarn_attn_factor": 1.0,
>   "yarn_beta_fast": 32.0,
>   "yarn_beta_slow": 1.0,
>   "yarn_orig_ctx": 0,
>   "mul_mat_q": true,
>   "f16_kv": true,
>   "logits_all": true,
>   "embedding": true,
>   "last_n_tokens_size": 64,
>   "lora_base": null,
>   "lora_path": null,
>   "numa": false,
>   "chat_format": "llama-2",
>   "clip_model_path": null,
>   "cache": false,
>   "cache_type": "ram",
>   "cache_size": 2147483648,
>   "verbose": true
> }

I think we should stick to a single config file that can act as a replacement for the server's CLI arguments, i.e. you either specify settings for a single model on the CLI, or you pass a --config <path> that points to a single config.json.

So something more like:

> /models
> |__mistral-7b.gguf
> |__open-hermes-2.5-7b.gguf

config.json

{
    "models": [{
         "model": "models/mistral-7b.gguf",
         "model_alias": "text-davinci-003"
   }, {
         "model": "models/open-hermes-2.5-7b.gguf",
         "model_alias": "gpt-3.5-turbo"
   }]
}

The advantage is that you only need to manage/change a single config file, so you can swap configs easily, and we can also start to add server-global options at the root of this document.

@bioshazard (Contributor) commented Nov 30, 2023

Thank you for working on this!!

@abetlen (Owner) commented Dec 22, 2023

Okay I've got all the merge conflicts resolved and I did some general refactoring of the server submodule. It's a lot lighter than the original PR but that was necessary to avoid any breaking changes and I plan to reintegrate the additional features one at a time.

Overview

You can now pass a --config_file argument or use the CONFIG_FILE environment variable to load a server config file that can include multiple models. Here's an example of the config file:

{
    "host": "0.0.0.0",
    "port": 8080,
    "models": [
        {
            "model": "models/OpenHermes-2.5-Mistral-7B-GGUF/openhermes-2.5-mistral-7b.Q4_K_M.gguf",
            "model_alias": "gpt-3.5-turbo",
            "chat_format": "chatml",
            "n_gpu_layers": -1,
            "offload_kqv": true,
            "n_threads": 12,
            "n_batch": 512,
            "n_ctx": 2048
        },
        {
            "model": "models/OpenHermes-2.5-Mistral-7B-GGUF/openhermes-2.5-mistral-7b.Q4_K_M.gguf",
            "model_alias": "gpt-4",
            "chat_format": "chatml",
            "n_gpu_layers": -1,
            "offload_kqv": true,
            "n_threads": 12,
            "n_batch": 512,
            "n_ctx": 2048
        },
        {
            "model": "models/ggml_llava-v1.5-7b/ggml-model-q4_k.gguf",
            "model_alias": "gpt-4-vision-preview",
            "chat_format": "llava-1-5",
            "clip_model_path": "models/ggml_llava-v1.5-7b/mmproj-model-f16.gguf",
            "n_gpu_layers": -1,
            "offload_kqv": true,
            "n_threads": 12,
            "n_batch": 512,
            "n_ctx": 2048
        },
        {
            "model": "models/mistral-7b-v0.1-GGUF/ggml-model-Q4_K.gguf",
            "model_alias": "text-davinci-003",
            "n_gpu_layers": -1,
            "offload_kqv": true,
            "n_threads": 12,
            "n_batch": 512,
            "n_ctx": 2048
        },
        {
            "model": "models/replit-code-v1_5-3b-GGUF/replit-code-v1_5-3b.Q4_0.gguf",
            "model_alias": "copilot-codex",
            "n_gpu_layers": -1,
            "offload_kqv": true,
            "n_threads": 12,
            "n_batch": 1024,
            "n_ctx": 9216
        }
    ]
}
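
Assuming the file above is saved as config.json, the server can then be started via either of the two options mentioned above:

$ python3 -m llama_cpp.server --config_file config.json

# or via the environment variable
$ CONFIG_FILE=config.json python3 -m llama_cpp.server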

The server selects the appropriate model based on the model key passed in the request and the model_alias defined in the config file.
If the selected model is not already loaded, it is loaded on the fly.
The server currently keeps only a single model loaded at a time and automatically swaps to the requested model when a different one is selected.
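
For example, with the config above, an OpenAI client selects a model purely by its alias (a sketch; the api_key value is a placeholder):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")
chat_completion = client.chat.completions.create(
    model="gpt-3.5-turbo",  # served by openhermes-2.5-mistral-7b per the config above
    messages=[{"role": "user", "content": "Hello!"}],
)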

Next steps

  • Support for setting default model fields to make the config file less verbose.
  • Support for additional config file formats (yaml / toml / ini) so the config can support comments and multiline strings.
  • Support a models_dir or auto-generating a config file from a script to make it easier to get up and running with the config files.
  • Automatically unload a model after some period of time.
  • Documentation and examples.

@abetlen (Owner) commented Dec 22, 2023

Here's a handy bash one-liner to traverse all of the .gguf files in a directory and generate a skeleton config.

find path/to/directory -name '*.gguf' -size +1k -printf '%p\n' | jq -R -s -c '{"models": [split("\n")[:-1][] | {"model": .}]}'
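
Run against the /models directory from the earlier example (mistral-7b.gguf and open-hermes-2.5-7b.gguf), it should produce a compact skeleton along these lines, with the exact paths and ordering depending on the directory passed to find:

{"models":[{"model":"models/mistral-7b.gguf"},{"model":"models/open-hermes-2.5-7b.gguf"}]}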

abetlen merged commit 12b7f2f into abetlen:main on Dec 22, 2023
domdomegg added a commit to domdomegg/llama-cpp-python that referenced this pull request on Oct 5, 2024:

The 'model' parameter has been supported since abetlen#931. Its placement in this section was copied from an older version of the file, and hasn't been corrected since.

Correcting this will make it clearer what parameters are supported by llama-cpp-python.