Conversation

@nlaanait nlaanait commented Dec 20, 2023

What does this PR do?

Fixes #106

More specifically, this PR brings TEI's /embeddings into full compatibility with all the input types supported by OpenAI's embeddings API (v1 API reference).
I attempted to add support for these additional types within the current design and with minimal changes to the existing logic. Main changes are:

  1. Introduced enum InputType to support the different input data types: String, u32, Vec<u32> (router/src/http/types.rs).
  2. The existing enum Input is now strictly a container type (i.e. Single/Batch) using InputType as its element type.
  3. Necessary methods to handle encoding of inputs in the form of int/array[int] were added as implementations to EncodingInput (core/src/tokenization.rs).
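The shape of the changes above can be sketched roughly as follows. This is only an illustration: the variant names (`SingleInt`, `VectorInt`) and the `batch_size` helper are assumptions for the sketch, not the exact identifiers merged in router/src/http/types.rs.

```rust
// Illustrative sketch of the types described above; variant and helper
// names are assumptions, not the exact identifiers merged in the PR.
#[derive(Debug, PartialEq)]
enum InputType {
    String(String),      // plain text
    SingleInt(u32),      // a single token id
    VectorInt(Vec<u32>), // a pre-tokenized sequence of ids
}

// Input is now strictly a container: one item or a batch of items.
#[derive(Debug, PartialEq)]
enum Input {
    Single(InputType),
    Batch(Vec<InputType>),
}

fn batch_size(input: &Input) -> usize {
    match input {
        Input::Single(_) => 1,
        Input::Batch(v) => v.len(),
    }
}

fn main() {
    let text = Input::Single(InputType::String("hello world".to_string()));
    let ids = Input::Batch(vec![
        InputType::VectorInt(vec![101, 2023, 102]),
        InputType::SingleInt(42),
    ]);
    assert_eq!(batch_size(&text), 1);
    assert_eq!(batch_size(&ids), 2);
    println!("ok");
}
```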

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@nlaanait nlaanait changed the title implement InputType and encoding conditionals Implement Input Type Compatibility with OpenAI's API Dec 20, 2023
@nlaanait nlaanait changed the title Implement Input Type Compatibility with OpenAI's API Implement Input Types Compatibility with OpenAI's API Dec 20, 2023
@nlaanait nlaanait changed the title Implement Input Types Compatibility with OpenAI's API Input Types Compatibility with OpenAI's API Dec 20, 2023
@OlivierDehaene OlivierDehaene left a comment

Nice

strategy: TruncationStrategy::LongestFirst,
stride: 0,
});
if inputs.is_encoded() {
@OlivierDehaene (Contributor):

Could this be merged with the match below? Since is_encoded is basically a match on EncodingInput::Vector.
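The suggestion could look roughly like this, using a simplified stand-in for EncodingInput (the real enum lives in core/src/tokenization.rs and carries more machinery):

```rust
// Simplified stand-in for EncodingInput; the real type lives in
// core/src/tokenization.rs.
enum EncodingInput {
    Single(String),
    Dual(String, String),
    Vector(Vec<u32>),
}

// Instead of branching on is_encoded() first and matching afterwards,
// the already-encoded case becomes one more arm of the same match,
// which also keeps the match exhaustive without a wildcard arm.
fn route(input: &EncodingInput) -> &'static str {
    match input {
        EncodingInput::Single(_) => "tokenize single text",
        EncodingInput::Dual(_, _) => "tokenize text pair",
        EncodingInput::Vector(_) => "already encoded: build Encoding directly",
    }
}

fn main() {
    assert_eq!(route(&EncodingInput::Single("hi".to_string())), "tokenize single text");
    assert_eq!(route(&EncodingInput::Dual("a".to_string(), "b".to_string())), "tokenize text pair");
    assert_eq!(route(&EncodingInput::Vector(vec![101, 102])), "already encoded: build Encoding directly");
    println!("ok");
}
```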

@nlaanait (Contributor, author):

Done

let inputs: EncodeInput = match inputs {
EncodingInput::Single(s) => s.into(),
EncodingInput::Dual(s1, s2) => (s1, s2).into(),
_ => Err(TextEmbeddingsError::Validation(
@OlivierDehaene (Contributor):

Right now, this branch cannot be reached. Can we merge the logic above here?

@nlaanait (Contributor, author):

I'll give it a try.

@nlaanait (Contributor, author):

Merged the logic above per your recommendation, which made this branch irrelevant.

}
}

fn try_into_encoding(&self, position_offset: usize) -> Result<Encoding, TextEmbeddingsError> {
@OlivierDehaene (Contributor):

I'm not sure if this needs to be a separate function. You can just take the logic here and add it to the match directly.

Comment on lines 212 to 213
_ => Err(TextEmbeddingsError::Validation(
"`inputs` must be a vector of input_ids".to_string(),
@OlivierDehaene (Contributor):

This is a logic error on our part and should not be a concern to the client.

@nlaanait (Contributor, author):

I removed this.

match self {
EncodingInput::Vector(v) => Ok(Encoding {
input_ids: v.clone(),
token_type_ids: vec![0; v.len()],
@OlivierDehaene (Contributor):

This is a bit brittle. In the future this could be false.


fn try_into_encoding(&self, position_offset: usize) -> Result<Encoding, TextEmbeddingsError> {
match self {
EncodingInput::Vector(v) => Ok(Encoding {
input_ids: v.clone(),
@OlivierDehaene (Contributor):

There needs to be validation on whether v contains invalid ids, e.g. values that are outside of the vocab.
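A minimal sketch of such a check, assuming the vocabulary size is available at validation time (`vocab_size` is a hypothetical parameter here, not the actual tokenizer API TEI exposes):

```rust
// Hypothetical validation helper; `vocab_size` is an assumed input,
// not the actual tokenizer API used by TEI.
fn validate_input_ids(ids: &[u32], vocab_size: u32) -> Result<(), String> {
    // Reject the first id that falls outside the vocabulary range.
    match ids.iter().find(|&&id| id >= vocab_size) {
        Some(&id) => Err(format!(
            "input_id {id} is outside of the vocabulary (size {vocab_size})"
        )),
        None => Ok(()),
    }
}

fn main() {
    assert!(validate_input_ids(&[1, 5, 9], 10).is_ok());
    assert!(validate_input_ids(&[1, 5, 10], 10).is_err());
    println!("ok");
}
```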


Comment on lines +264 to +265
InputType::SingleInt(_) => 1,
InputType::VectorInt(v) => v.len(),
@OlivierDehaene (Contributor):

Is this how OpenAI counts when ids are given to the API? Or do they still count the chars by decoding the ids?

@nlaanait (Contributor, author):

I'll look into it and modify this per my findings.

@nlaanait (Contributor, author):

@OlivierDehaene I looked through the source of openai-python v1.7.1 and couldn't find a reference to character counting in the embeddings API.
Should count_chars return 0 for InputType::SingleInt and InputType::VectorInt for correctness?
LMK how you want to proceed.
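One possible resolution of the open question above, under the assumption that character counts are only meaningful for text inputs. This is a sketch, not confirmed to match OpenAI's accounting; variant names mirror the snippet quoted above.

```rust
enum InputType {
    String(String),
    SingleInt(u32),
    VectorInt(Vec<u32>),
}

// Sketch only: count chars for text, report 0 for id-based inputs,
// per the open question above. Not confirmed against OpenAI's behavior.
fn count_chars(input: &InputType) -> usize {
    match input {
        InputType::String(s) => s.chars().count(),
        InputType::SingleInt(_) | InputType::VectorInt(_) => 0,
    }
}

fn main() {
    assert_eq!(count_chars(&InputType::String("héllo".to_string())), 5);
    assert_eq!(count_chars(&InputType::SingleInt(42)), 0);
    assert_eq!(count_chars(&InputType::VectorInt(vec![101, 102])), 0);
    println!("ok");
}
```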

@nlaanait nlaanait marked this pull request as ready for review December 28, 2023 01:34
@matatonic commented:

Is it possible to also add the correct routing "/v1/embeddings" as per the API? https://platform.openai.com/docs/api-reference/embeddings. Without this change it's not possible to use the environment variable OPENAI_BASE_URL in a consistent way with this service. Ex:

For text-embeddings-inference:

OPENAI_BASE_URL=http://text-embeddings-inference:80

Normally:

OPENAI_BASE_URL=http://<host>:<port>/v1

Also, for reference, the text-generation-webui has a more compatible implementation, including base64 json encoding (undocumented, but in the openai python package):
https://github.com/oobabooga/text-generation-webui/tree/main/extensions/openai
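The aliasing being requested can be illustrated with a plain dispatch function: both the existing path and the OpenAI-style "/v1/embeddings" path resolve to the same handler. TEI's real HTTP router is built on axum; the function and handler names below are purely illustrative.

```rust
// Minimal stand-in for the HTTP router: the existing path and the
// OpenAI-style "/v1/embeddings" alias dispatch to the same handler.
// TEI's real router uses axum; names here are illustrative only.
fn dispatch(path: &str) -> &'static str {
    match path {
        "/embeddings" | "/v1/embeddings" => "openai_embed_handler",
        _ => "not_found",
    }
}

fn main() {
    assert_eq!(dispatch("/embeddings"), "openai_embed_handler");
    assert_eq!(dispatch("/v1/embeddings"), "openai_embed_handler");
    assert_eq!(dispatch("/v2/embeddings"), "not_found");
    println!("ok");
}
```

With an alias like this in place, pointing OPENAI_BASE_URL at the service plus /v1 would work the same way it does against OpenAI's own endpoint.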

@OlivierDehaene OlivierDehaene changed the base branch from main to dev March 22, 2024 14:20
@OlivierDehaene OlivierDehaene merged commit 4c26cb3 into huggingface:dev Mar 22, 2024
OlivierDehaene added a commit that referenced this pull request Mar 22, 2024
@nlaanait nlaanait deleted the feat/openai-type-compatibilty branch April 21, 2024 09:28
MasakiMu319 pushed a commit to MasakiMu319/text-embeddings-inference that referenced this pull request Nov 27, 2024
aagnone3 pushed a commit to StratisLLC/hf-text-embeddings-inference that referenced this pull request Dec 11, 2024
Development

Successfully merging this pull request may close these issues.

OpenAI compatible embedding API is not complete

3 participants