
Support Accept text/event-stream in chat and completion endpoints #1088


Merged: 2 commits merged into abetlen:main on Jan 16, 2024

Conversation

aniljava (Contributor)

Addresses: #1083

This allows the endpoint to accept both Accept headers: application/json and text/event-stream.

The response model for the SSE response is left as str. I don't think OpenAPI currently has a mechanism to specify a model for individual events, and using a list of the chunk type might conflict with generated client code.

OpenAI accepts Accept: text/event-stream but does not use it as a flag for streaming; streaming must be requested explicitly via the stream parameter in the POST body.
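
For illustration, a minimal client sketch of the behavior described above (the server address and prompt are assumptions, not part of this PR): with this change the request is accepted with either Accept header, and streaming is still enabled only by the explicit stream field in the request body.

    # Hypothetical sketch against a locally running llama-cpp-python server
    # (assumes the default http://localhost:8000 address; adjust as needed).
    import requests

    resp = requests.post(
        "http://localhost:8000/v1/chat/completions",
        headers={"Accept": "text/event-stream"},  # application/json is accepted too
        json={
            "messages": [{"role": "user", "content": "Hello"}],
            "stream": True,  # streaming is controlled by this field, not by the Accept header
        },
        stream=True,
    )
    for line in resp.iter_lines():
        if line:
            print(line.decode("utf-8"))  # raw SSE lines, e.g. 'data: {...}'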

@aniljava mentioned this pull request on Jan 15, 2024
thiner commented Jan 16, 2024

I tried to build this PR into a Docker image, but when I ran the container it failed to start up with the error below:

 Traceback (most recent call last):
   File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
     return _run_code(code, main_globals, None,
   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
     exec(code, run_globals)
   File "/llama_cpp/server/__main__.py", line 88, in <module>
     main()
   File "/llama_cpp/server/__main__.py", line 74, in main
     app = create_app(
   File "/llama_cpp/server/app.py", line 133, in create_app
     set_llama_proxy(model_settings=model_settings)
   File "/llama_cpp/server/app.py", line 70, in set_llama_proxy
     _llama_proxy = LlamaProxy(models=model_settings)
   File "/llama_cpp/server/model.py", line 27, in __init__
     self._current_model = self.load_llama_from_model_settings(
   File "/llama_cpp/server/model.py", line 92, in load_llama_from_model_settings
     _model = llama_cpp.Llama(
   File "/llama_cpp/llama.py", line 861, in __init__
     raise ValueError(
 ValueError: Attempt to split tensors that exceed maximum supported devices. Current LLAMA_MAX_DEVICES=1

I was trying to load the model TheBloke/openbuddy-mixtral-7bx8-v16.3-32k.Q5_K_M.gguf, with the environment variable LLAMA_MAX_DEVICES set to 2 and tensor_split: 0.5 0.5.
The same settings work fine with v0.2.28.
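
For context, a hedged sketch of the load call that trips this check, using the tensor split reported above (the local file path and n_gpu_layers value are assumptions):

    # Hypothetical repro: constructing the Llama object directly with the
    # reported tensor split. This raises the ValueError above when the
    # installed build only supports a single device (LLAMA_MAX_DEVICES=1).
    from llama_cpp import Llama

    llm = Llama(
        model_path="./openbuddy-mixtral-7bx8-v16.3-32k.Q5_K_M.gguf",  # assumed local path
        tensor_split=[0.5, 0.5],  # requires a build that supports at least 2 devices
        n_gpu_layers=-1,          # assumption: offload all layers to GPU
    )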

abetlen (Owner) commented Jan 16, 2024

@aniljava thanks for catching this; it looks good to me. Hopefully it fixes the issue in #1083.

@thiner I think that's separate, do you mind opening a new issue?

abetlen merged commit cfb7da9 into abetlen:main on Jan 16, 2024