Input Types Compatibility with OpenAI's API #112
Conversation
OlivierDehaene left a comment
Nice
core/src/tokenization.rs
Outdated
    strategy: TruncationStrategy::LongestFirst,
    stride: 0,
});
if inputs.is_encoded() {
Could this be merged with the match below? Since is_encoded is basically a match on EncodingInput::Vector.
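For reference, a minimal sketch of what the merged match could look like. The EncodingInput variants come from the surrounding diff; encoding_from_ids is a hypothetical helper for the pre-tokenized arm, and the tokenizer.encode calls assume the tokenizers crate's (input, add_special_tokens) signature:

```rust
// Sketch only: fold the is_encoded() check into the match on EncodingInput,
// so pre-tokenized ids and raw text share one dispatch point.
let encoding = match inputs {
    // Ids supplied directly by the client: bypass the tokenizer
    // (`encoding_from_ids` is a hypothetical helper).
    EncodingInput::Vector(ids) => encoding_from_ids(ids)?,
    // Raw text goes through the regular tokenizer path.
    EncodingInput::Single(s) => tokenizer.encode(s, true)?,
    EncodingInput::Dual(s1, s2) => tokenizer.encode((s1, s2), true)?,
};
```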
Done
core/src/tokenization.rs
Outdated
let inputs: EncodeInput = match inputs {
    EncodingInput::Single(s) => s.into(),
    EncodingInput::Dual(s1, s2) => (s1, s2).into(),
    _ => Err(TextEmbeddingsError::Validation(
Right now, this branch cannot be reached. Can we merge the logic above here?
I'll give it a try.
Merged the logic above per your recommendation, which made this branch irrelevant.
core/src/tokenization.rs
Outdated
}
}

fn try_into_encoding(&self, position_offset: usize) -> Result<Encoding, TextEmbeddingsError> {
I'm not sure if this needs to be a separate function. You can just take the logic here and add it to the match directly.
core/src/tokenization.rs
Outdated
_ => Err(TextEmbeddingsError::Validation(
    "`inputs` must be a vector of input_ids".to_string(),
This is a logic error on our part and should not be a concern to the client.
I removed this.
core/src/tokenization.rs
Outdated
match self {
    EncodingInput::Vector(v) => Ok(Encoding {
        input_ids: v.clone(),
        token_type_ids: vec![0; v.len()],
This is a bit brittle. In the future this assumption (all-zero token_type_ids) could be false.
I refactored in favor of building a tokenizers::Encoding; see:
https://github.com/nlaanait/text-embeddings-inference/blob/2ee30448465239aefc6f212b18f9d51baac6b611/core/src/tokenization.rs#L143-L148
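For reference, a sketch of constructing the encoding through the tokenizers crate rather than hand-filling a local struct. The exact Encoding::new argument list depends on the tokenizers version, so treat the signature below as an assumption to verify against Cargo.toml:

```rust
use std::collections::HashMap;
use tokenizers::Encoding;

// Build a tokenizers::Encoding directly from raw input ids
// (assumed constructor signature; verify against the pinned crate version).
fn encoding_from_ids(ids: Vec<u32>) -> Encoding {
    let n = ids.len();
    Encoding::new(
        ids,                    // input ids supplied by the client
        vec![0; n],             // token type ids
        vec![String::new(); n], // tokens (unknown without decoding)
        vec![None; n],          // word ids
        vec![(0, 0); n],        // offsets
        vec![0; n],             // special tokens mask
        vec![1; n],             // attention mask
        vec![],                 // overflowing encodings
        HashMap::new(),         // sequence ranges
    )
}
```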
core/src/tokenization.rs
Outdated
fn try_into_encoding(&self, position_offset: usize) -> Result<Encoding, TextEmbeddingsError> {
    match self {
        EncodingInput::Vector(v) => Ok(Encoding {
            input_ids: v.clone(),
There needs to be validation on whether v contains invalid ids, e.g. values that are outside of the vocab.
The validation is now performed via a call to tokenizer.decode; see:
https://github.com/nlaanait/text-embeddings-inference/blob/2ee30448465239aefc6f212b18f9d51baac6b611/core/src/tokenization.rs#L168-L195
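A sketch of that decode-based validation, assuming the tokenizers crate's Tokenizer::decode(&[u32], skip_special_tokens) signature and reusing this repo's TextEmbeddingsError::Validation variant; the error message is illustrative:

```rust
use tokenizers::Tokenizer;

// Validate client-supplied ids by round-tripping them through the
// tokenizer's decoder, so out-of-vocab ids surface as a validation error.
fn validate_ids(tokenizer: &Tokenizer, ids: &[u32]) -> Result<(), TextEmbeddingsError> {
    tokenizer
        .decode(ids, false)
        .map(|_| ())
        .map_err(|e| TextEmbeddingsError::Validation(format!("invalid `input_ids`: {e}")))
}
```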
InputType::SingleInt(_) => 1,
InputType::VectorInt(v) => v.len(),
Is this how OpenAI counts when ids are given to the API? Or do they still count the chars by decoding the ids?
I'll look into it and modify this per my findings.
@OlivierDehaene I looked through the source of openai-python v1.7.1 and couldn't find a reference to character counting in the embeddings API.
Should count_chars return 0 for InputType::SingleInt and InputType::VectorInt for correctness?
LMK how you want to proceed.
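If character counting does turn out to be required, a speculative sketch of counting by decoding first. The SingleInt/VectorInt variants come from the diff above; the String variant name and the fallback values are assumptions:

```rust
// Count characters for pre-tokenized inputs by decoding them first;
// falls back to the id count if decoding fails.
fn count_chars(input: &InputType, tokenizer: &tokenizers::Tokenizer) -> usize {
    match input {
        InputType::String(s) => s.chars().count(),
        InputType::SingleInt(id) => tokenizer
            .decode(&[*id], false)
            .map(|s| s.chars().count())
            .unwrap_or(1),
        InputType::VectorInt(v) => tokenizer
            .decode(v, false)
            .map(|s| s.chars().count())
            .unwrap_or(v.len()),
    }
}
```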
Is it possible to also add the correct routing "/v1/embeddings" as per the API? https://platform.openai.com/docs/api-reference/embeddings. Without this change it's not possible to use the environment variable OPENAI_BASE_URL in a consistent way with this service. Ex:

For text-embeddings-inference: OPENAI_BASE_URL=http://text-embeddings-inference:80
Normally: OPENAI_BASE_URL=http://<host>:<port>/v1

Also, for reference, the text-generation-webui has a more compatible implementation, including base64 json encoding (undocumented, but in the openai python package).
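A sketch of what the route alias could look like, assuming the router is built with axum and an OpenAI-compatible handler already exists (openai_embed is a hypothetical name):

```rust
use axum::{routing::post, Router};

// Expose the OpenAI-compatible handler under the /v1 prefix as well,
// so OPENAI_BASE_URL=http://<host>:<port> resolves /v1/embeddings correctly.
let app = Router::new()
    .route("/embeddings", post(openai_embed))
    .route("/v1/embeddings", post(openai_embed));
```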
What does this PR do?
Fixes #106
More specifically, this PR brings TEI's /embeddings into full compatibility with all the input types supported by openai's embeddings API (v1 API reference). I attempted to add support for these additional types within the current design and with minimal changes to the existing logic. Main changes are (see the sketch after this list):
- Added enum InputType to support the different input data types: String, u32, Vec<u32> (router/src/http/types.rs).
- enum Input is now strictly a container type (i.e. Single/Batch) using InputType as its type.
- Corresponding changes to EncodingInput (core/src/tokenization.rs).
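A self-contained sketch of the types described above, using serde's untagged representation so JSON strings, integers, and arrays deserialize into the matching variant. SingleInt/VectorInt follow the diff in this PR; the String variant name is an assumption:

```rust
use serde::Deserialize;

// One element of an embeddings request: raw text, a single token id,
// or a pre-tokenized sequence of ids.
#[derive(Deserialize)]
#[serde(untagged)]
pub enum InputType {
    String(String),
    SingleInt(u32),
    VectorInt(Vec<u32>),
}

// Request-level container: one input or a batch. With `untagged`, variants
// are tried in order, so a bare JSON array of ints deserializes as a single
// pre-tokenized input (matching OpenAI semantics) rather than a batch.
#[derive(Deserialize)]
#[serde(untagged)]
pub enum Input {
    Single(InputType),
    Batch(Vec<InputType>),
}
```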
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.