[Feat] Multi model support #931
Conversation
Closes #906
This is what I want.
Hi @D4ve-R, thanks for the PR. Just a couple of notes:

Something like:

{
"host": "...",
"port": "...",
"models": [{
"model": "models/mistral-7b/...",
"chat_format": "mistral"
}]
}

This way we can re-use the pydantic settings object inside some kind of config file settings object that just parses the JSON.
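
For illustration, here is a minimal sketch of how such a config file settings object could look, assuming pydantic models; the class and field names below are placeholders, not the project's actual implementation.

# Illustrative sketch only: class names, fields, and defaults are assumptions.
import json
from typing import List, Optional

from pydantic import BaseModel


class ModelSettings(BaseModel):
    # Per-model options, mirroring the existing server settings.
    model: str
    chat_format: Optional[str] = None


class ConfigFileSettings(BaseModel):
    # Top-level config: server options plus a list of models.
    host: str = "localhost"
    port: int = 8000
    models: List[ModelSettings] = []


def load_config(path: str) -> ConfigFileSettings:
    # Re-use the pydantic settings objects by simply parsing the JSON file.
    with open(path) as f:
        return ConfigFileSettings(**json.load(f))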
Hi @abetlen, thanks for the feedback & awesome project!
I will revert the changes concerning the clip_model & model param.
I was thinking about something like this too; I will try to make something work.
Alternatively/additionally, the app could expose a new "/settings" endpoint to change settings via the API. @abetlen, what are your thoughts on that?
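
As a purely hypothetical sketch (the route, payload model, and in-memory store below are not part of this PR), such an endpoint could look roughly like this with FastAPI:

# Hypothetical "/settings" admin endpoint; not the actual server code.
from typing import Optional

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Pretend runtime settings store; a real server would update its
# pydantic settings objects instead.
current_settings: dict = {}


class SettingsUpdate(BaseModel):
    # Example runtime-tunable fields; real fields would mirror the server settings.
    chat_format: Optional[str] = None
    cache: Optional[bool] = None


@app.post("/settings")
def update_settings(update: SettingsUpdate):
    # Merge only the fields that were explicitly provided in the request.
    current_settings.update(update.dict(exclude_unset=True))
    return current_settings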
Okay, so loading individual settings as JSON works now. Usage:

[model].json:

{
"n_gpu_layers": 0,
"main_gpu": 0,
"tensor_split": null,
"vocab_only": false,
"use_mmap": true,
"use_mlock": true,
"seed": 4294967295,
"n_ctx": 2048,
"n_batch": 512,
"n_threads": 4,
"n_threads_batch": 4,
"rope_scaling_type": -1,
"rope_freq_base": 0.0,
"rope_freq_scale": 0.0,
"yarn_ext_factor": -1.0,
"yarn_attn_factor": 1.0,
"yarn_beta_fast": 32.0,
"yarn_beta_slow": 1.0,
"yarn_orig_ctx": 0,
"mul_mat_q": true,
"f16_kv": true,
"logits_all": true,
"embedding": true,
"last_n_tokens_size": 64,
"lora_base": null,
"lora_path": null,
"numa": false,
"chat_format": "llama-2",
"clip_model_path": null,
"cache": false,
"cache_type": "ram",
"cache_size": 2147483648,
"verbose": true
}
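
To illustrate the per-model settings idea, here is a small sketch under the assumption that each model's JSON file sits next to it in the models directory; this is not the PR's code.

# Assumed layout: models/<name>.gguf with a matching models/<name>.json.
import json
import os
from typing import Dict


def load_model_settings(models_dir: str) -> Dict[str, dict]:
    # Map each model name to the settings parsed from its JSON file.
    settings: Dict[str, dict] = {}
    for name in os.listdir(models_dir):
        if name.endswith(".json"):
            model_name = name[: -len(".json")]
            with open(os.path.join(models_dir, name)) as f:
                settings[model_name] = json.load(f)
    return settings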
Great work.
Admin routes should be easy to set up given that all the settings are just pydantic models but probably out of scope for this PR.
I think we should stick to a single config file that can act as a replacement for the server's CLI arguments, i.e. you either specify settings for a single model on the CLI or you pass a config file. So, something more like:

{
"models": [{
"model": "models/mistral-7b.gguf",
"model_alias": "text-davinci-003"
}, {
"model": "models/open-hermes-2.5-7b.gguf",
"model_alias": "gpt-3.5-turbo"
}]
}

The advantage is that you only need to manage / change a single config file, so you can swap configs easily, and we can also start to add server-global options at the root of this document.
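
To make the alias idea concrete (a minimal sketch, not the server's actual routing code): each entry's model_alias is the name clients send in the "model" field, and the server resolves it to the local GGUF path.

# Sketch of resolving a requested model name via model_alias (illustrative only).
import json


def resolve_model(config_path: str, requested: str) -> str:
    # Return the local model path for a requested alias or model name.
    with open(config_path) as f:
        config = json.load(f)
    for entry in config["models"]:
        if requested in (entry.get("model_alias"), entry["model"]):
            return entry["model"]
    raise KeyError(f"No model configured for '{requested}'")


# e.g. resolve_model("config.json", "gpt-3.5-turbo") -> "models/open-hermes-2.5-7b.gguf"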
This reverts commit bc5cf51.
Thank you for working on this!!
Okay, I've got all the merge conflicts resolved and I did some general refactoring of the server submodule. It's a lot lighter than the original PR, but that was necessary to avoid any breaking changes, and I plan to reintegrate the additional features one at a time.

Overview

You can now pass a config file to the server:

{
"host": "0.0.0.0",
"port": 8080,
"models": [
{
"model": "models/OpenHermes-2.5-Mistral-7B-GGUF/openhermes-2.5-mistral-7b.Q4_K_M.gguf",
"model_alias": "gpt-3.5-turbo",
"chat_format": "chatml",
"n_gpu_layers": -1,
"offload_kqv": true,
"n_threads": 12,
"n_batch": 512,
"n_ctx": 2048
},
{
"model": "models/OpenHermes-2.5-Mistral-7B-GGUF/openhermes-2.5-mistral-7b.Q4_K_M.gguf",
"model_alias": "gpt-4",
"chat_format": "chatml",
"n_gpu_layers": -1,
"offload_kqv": true,
"n_threads": 12,
"n_batch": 512,
"n_ctx": 2048
},
{
"model": "models/ggml_llava-v1.5-7b/ggml-model-q4_k.gguf",
"model_alias": "gpt-4-vision-preview",
"chat_format": "llava-1-5",
"clip_model_path": "models/ggml_llava-v1.5-7b/mmproj-model-f16.gguf",
"n_gpu_layers": -1,
"offload_kqv": true,
"n_threads": 12,
"n_batch": 512,
"n_ctx": 2048
},
{
"model": "models/mistral-7b-v0.1-GGUF/ggml-model-Q4_K.gguf",
"model_alias": "text-davinci-003",
"n_gpu_layers": -1,
"offload_kqv": true,
"n_threads": 12,
"n_batch": 512,
"n_ctx": 2048
},
{
"model": "models/replit-code-v1_5-3b-GGUF/replit-code-v1_5-3b.Q4_0.gguf",
"model_alias": "copilot-codex",
"n_gpu_layers": -1,
"offload_kqv": true,
"n_threads": 12,
"n_batch": 1024,
"n_ctx": 9216
}
]
}

The server selects the appropriate model based on the 'model' parameter of the request.
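
For context, a hedged usage sketch: assuming the server configured above is listening on localhost:8080, a client picks a model purely via the "model" field. The OpenAI Python client (v1.x) is used here for illustration.

# Querying the multi-model server; the "model" field selects which
# configured alias handles the request.
from openai import OpenAI

# The API key is a placeholder; the OpenAI client requires a non-empty value.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

chat = client.chat.completions.create(
    model="gpt-3.5-turbo",  # routed to the OpenHermes 2.5 entry above
    messages=[{"role": "user", "content": "Hello!"}],
)
print(chat.choices[0].message.content)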
Next steps
Here's a handy bash one-liner to traverse all of the .gguf files in a directory and build a models config:

find path/to/directory -name '*.gguf' -size +1k -printf '%p\n' | jq -R -s -c '{"models": [split("\n")[:-1][] | {"model": .}]}'
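
For reference, with two hypothetical files model-a.gguf and model-b.gguf under path/to/directory, the one-liner emits compact JSON along these lines, ready to merge into the config file:

{"models":[{"model":"path/to/directory/model-a.gguf"},{"model":"path/to/directory/model-b.gguf"}]}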
The 'model' parameter has been supported since abetlen#931. Its placement in this section was copied from an older version of the file, and hasn't been corrected since. Correcting this will make it clearer what parameters are supported by llama-cpp-python.
Added multi model support
Changes
Description
The MultiLlama class handles multiple models in the given directory.
It loads models lazily based on the 'model' request parameter, so the first API call (and the first call after a model change) will take some time.
Usage
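
The original usage details are not shown above; purely as an illustration of the lazy-loading behaviour described in the Description, a first request that names a model might look like this (the port and model name are assumptions):

# Illustrative request only; the first call naming a model triggers its load,
# so expect it to take noticeably longer than subsequent calls.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "mistral-7b",  # hypothetical model name from the models directory
        "prompt": "Hello",
        "max_tokens": 16,
    },
)
print(resp.json())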