Description
The server returns "500 Internal Server Error\nvector::_M_default_append" for certain models when trying to use the model's chat template with the Docker CUDA image.
Steps to Reproduce
I'm calling the server through the OpenAI Python client from a Streamlit app:
import streamlit as st
from openai import OpenAI

# Client pointed at the llama.cpp server; base URL assumed from the compose file below.
openai_client = OpenAI(base_url="http://localhost:8080/v1", api_key="key")

def api_openai(placeholder, system_prompt, user_prompt, temperature, logit_bias):
    full_response = ""
    # Stream the chat completion and render partial output into the placeholder.
    for response in openai_client.chat.completions.create(
            model=st.session_state["openai_model"],
            messages=[{"role": "system", "content": system_prompt},
                      {"role": "user", "content": user_prompt}],
            stream=True, temperature=temperature,
            frequency_penalty=1, logit_bias=logit_bias):
        full_response += (response.choices[0].delta.content or "")
        placeholder.info(full_response + "▌")
    return full_response
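If the failure is server-side, it should reproduce without Streamlit or streaming as well. A minimal sketch reusing the client above (the model name is a placeholder; llama.cpp answers with whatever model it was started with):

import openai

try:
    completion = openai_client.chat.completions.create(
        model="alphamonarch-7b",  # placeholder; the server uses its loaded model
        messages=[{"role": "user", "content": "Hello"}],
        temperature=0.7,
    )
    print(completion.choices[0].message.content)
except openai.InternalServerError as exc:
    # For the affected models, the 500 body contains "vector::_M_default_append".
    print(exc)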
Actual Behavior
"500 Internal Server Error\nvector::_M_default_append"
Environment
- Operating System: Docker
- Docker compose:

  api-server:
    container_name: api-server
    image: ghcr.io/ggerganov/llama.cpp:server-cuda
    command: >
      -m models/alphamonarch-7b.Q5_K_M.gguf
      --ctx-size 8192
      --host 0.0.0.0
      --port 8080
      --n-gpu-layers 1000
      -np 1
      -cb
      --grp-attn-n 4
      --grp-attn-w 2048
      --api-key key
      --verbose
    ports:
      - "8080:8080"

- Models that failed: alphamonarch-7b.Q5_K_M.gguf (see the compose file above)
Additional Information
Models I've tried that work:
https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF
https://huggingface.co/brittlewis12/NeuralDaredevil-7B-GGUF
Related Issues
I used #5593.
Proposed Solution
I think the problem could be related to the extracted chat_template. Hugging Face uses "tokenizer.apply_chat_template" without problems, but I don't know whether the llama.cpp implementation works the same way.
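For illustration, a minimal sketch of the Hugging Face side; the model id is an assumption matching the alphamonarch-7b GGUF above, and the snippet only shows the templating call:

from transformers import AutoTokenizer

# Assumed model id matching the alphamonarch-7b GGUF above; for illustration only.
tokenizer = AutoTokenizer.from_pretrained("mlabonne/AlphaMonarch-7B")

messages = [{"role": "user", "content": "Hello"}]

# apply_chat_template renders the Jinja chat template shipped in the tokenizer
# config; if llama.cpp extracts or interprets that template differently, the
# request could break server-side.
prompt = tokenizer.apply_chat_template(messages, tokenize=False,
                                       add_generation_prompt=True)
print(prompt)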