server: enable token array inputs for OAI API #15001
Merged
According to the OpenAI documentation, formatting the prompt as an array of tokens is supported. However, the llama.cpp server raises an error if you provide such input. I assume the reason is that the interpretation of tokens depends on the model, so this would not be "OpenAI compatible" either way. Still, I have a use case where I need such inputs. This PR simply removes the error in the llama.cpp server. I don't think this will cause issues, but my understanding of the server code is also relatively limited.
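For context, a minimal sketch of what such a request could look like against the OAI-compatible completions endpoint, assuming a llama.cpp server listening locally on port 8080; the model name and token IDs are placeholders whose textual meaning depends entirely on the model's tokenizer:

```python
import requests

# Send the prompt as an array of token IDs instead of a string.
# The IDs below are arbitrary placeholders.
response = requests.post(
    "http://localhost:8080/v1/completions",
    json={
        "model": "placeholder-model",
        "prompt": [1, 3087, 4320, 278, 1234],
        "max_tokens": 16,
    },
)
print(response.json())
```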
I'm currently working on benchmarking llama.cpp vs. vllm. Both projects provide an OAI-compatible API, so I want to make `scripts/server-bench.py` use the OAI-compatible API instead of the llama.cpp-specific API in order to run the exact same code when benchmarking either project. Under these circumstances I want to be able to send prompts of an exact length (in tokens), while the interpretation of those prompts as text is irrelevant. A sketch of how such prompts could be generated is shown below.