Input Types Compatibility with OpenAI's API #112
Conversation
OlivierDehaene left a comment
Nice
core/src/tokenization.rs
Outdated
    strategy: TruncationStrategy::LongestFirst,
    stride: 0,
});
if inputs.is_encoded() {
Could this be merged with the match below? Since is_encoded is basically a match on EncodingInput::Vector.
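For reference, a minimal sketch of what the merged match could look like. The EncodingInput variants come from the surrounding diff; encoding_from_ids is a hypothetical helper for the pre-tokenized arm, and the tokenizer.encode calls assume the tokenizers crate's (input, add_special_tokens) signature:

```rust
// Sketch only: fold the is_encoded() check into the match on EncodingInput,
// so pre-tokenized ids and raw text share one dispatch point.
let encoding = match inputs {
    // Ids supplied directly by the client: bypass the tokenizer
    // (`encoding_from_ids` is a hypothetical helper).
    EncodingInput::Vector(ids) => encoding_from_ids(ids)?,
    // Raw text goes through the regular tokenizer path.
    EncodingInput::Single(s) => tokenizer.encode(s, true)?,
    EncodingInput::Dual(s1, s2) => tokenizer.encode((s1, s2), true)?,
};
```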
Done
core/src/tokenization.rs
Outdated
let inputs: EncodeInput = match inputs {
    EncodingInput::Single(s) => s.into(),
    EncodingInput::Dual(s1, s2) => (s1, s2).into(),
    _ => Err(TextEmbeddingsError::Validation(
Right now, this branch cannot be reached. Can we merge the logic above here?
I'll give it a try.
Merged the logic above per your recommendation, which made this branch irrelevant.
core/src/tokenization.rs
Outdated
}
}

fn try_into_encoding(&self, position_offset: usize) -> Result<Encoding, TextEmbeddingsError> {
I'm not sure if this needs to be a separate function. You can just take the logic here and add it to the match directly.
core/src/tokenization.rs
Outdated
_ => Err(TextEmbeddingsError::Validation(
    "`inputs` must be a vector of input_ids".to_string(),
This is a logic error on our part and should not be a concern to the client.
I removed this.
core/src/tokenization.rs
Outdated
match self {
    EncodingInput::Vector(v) => Ok(Encoding {
        input_ids: v.clone(),
        token_type_ids: vec![0; v.len()],
This is a bit brittle. In the future this assumption (all-zero token_type_ids) could be false.
I refactored in favor of building a tokenizers::Encoding; see:
https://github.com/nlaanait/text-embeddings-inference/blob/2ee30448465239aefc6f212b18f9d51baac6b611/core/src/tokenization.rs#L143-L148
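For reference, a sketch of constructing the encoding through the tokenizers crate rather than hand-filling a local struct. The exact Encoding::new argument list depends on the tokenizers version, so treat the signature below as an assumption to verify against Cargo.toml:

```rust
use std::collections::HashMap;
use tokenizers::Encoding;

// Build a tokenizers::Encoding directly from raw input ids
// (assumed constructor signature; verify against the pinned crate version).
fn encoding_from_ids(ids: Vec<u32>) -> Encoding {
    let n = ids.len();
    Encoding::new(
        ids,                    // input ids supplied by the client
        vec![0; n],             // token type ids
        vec![String::new(); n], // tokens (unknown without decoding)
        vec![None; n],          // word ids
        vec![(0, 0); n],        // offsets
        vec![0; n],             // special tokens mask
        vec![1; n],             // attention mask
        vec![],                 // overflowing encodings
        HashMap::new(),         // sequence ranges
    )
}
```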
core/src/tokenization.rs
Outdated
fn try_into_encoding(&self, position_offset: usize) -> Result<Encoding, TextEmbeddingsError> {
    match self {
        EncodingInput::Vector(v) => Ok(Encoding {
            input_ids: v.clone(),
There needs to be validation on whether v contains invalid ids, e.g. values that are outside of the vocab.
The validation is now performed via a call to tokenizer.decode; see:
https://github.com/nlaanait/text-embeddings-inference/blob/2ee30448465239aefc6f212b18f9d51baac6b611/core/src/tokenization.rs#L168-L195
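A sketch of that decode-based validation, assuming the tokenizers crate's Tokenizer::decode(&[u32], skip_special_tokens) signature and reusing this repo's TextEmbeddingsError::Validation variant; the error message is illustrative:

```rust
use tokenizers::Tokenizer;

// Validate client-supplied ids by round-tripping them through the
// tokenizer's decoder, so out-of-vocab ids surface as a validation error.
fn validate_ids(tokenizer: &Tokenizer, ids: &[u32]) -> Result<(), TextEmbeddingsError> {
    tokenizer
        .decode(ids, false)
        .map(|_| ())
        .map_err(|e| TextEmbeddingsError::Validation(format!("invalid `input_ids`: {e}")))
}
```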
InputType::SingleInt(_) => 1,
InputType::VectorInt(v) => v.len(),
Is this how OpenAI counts when ids are given to the API? Or do they still count the chars by decoding the ids?
I'll look into it and modify this per my findings.
@OlivierDehaene I looked through the source of openai-python v1.7.1 and couldn't find a reference to character counting in the embeddings API.
Should count_chars return 0 for InputType::SingleInt and InputType::VectorInt for correctness?
LMK how you want to proceed.
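If character counting does turn out to be required, a speculative sketch of counting by decoding first. The SingleInt/VectorInt variants come from the diff above; the String variant name and the fallback values are assumptions:

```rust
// Count characters for pre-tokenized inputs by decoding them first;
// falls back to the id count if decoding fails.
fn count_chars(input: &InputType, tokenizer: &tokenizers::Tokenizer) -> usize {
    match input {
        InputType::String(s) => s.chars().count(),
        InputType::SingleInt(id) => tokenizer
            .decode(&[*id], false)
            .map(|s| s.chars().count())
            .unwrap_or(1),
        InputType::VectorInt(v) => tokenizer
            .decode(v, false)
            .map(|s| s.chars().count())
            .unwrap_or(v.len()),
    }
}
```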
Is it possible to also add the correct routing "/v1/embeddings" as per the API? https://platform.openai.com/docs/api-reference/embeddings. Without this change it's not possible to use the environment variable OPENAI_BASE_URL in a consistent way with this service. Ex:

For text-embeddings-inference: OPENAI_BASE_URL=http://text-embeddings-inference:80
Normally: OPENAI_BASE_URL=http://<host>:<port>/v1

Also, for reference, the text-generation-webui has a more compatible implementation, including base64 json encoding (undocumented, but in the openai python package).
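A sketch of what the route alias could look like, assuming the router is built with axum and an OpenAI-compatible handler already exists (openai_embed is a hypothetical name):

```rust
use axum::{routing::post, Router};

// Expose the OpenAI-compatible handler under the /v1 prefix as well,
// so OPENAI_BASE_URL=http://<host>:<port> resolves /v1/embeddings correctly.
let app = Router::new()
    .route("/embeddings", post(openai_embed))
    .route("/v1/embeddings", post(openai_embed));
```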
What does this PR do?
Fixes #106
More specifically, this PR brings TEI's /embeddings into full compatibility with all the input types supported by openai's embeddings API (v1 API reference). I attempted to add support for these additional types within the current design and with minimal changes to the existing logic. Main changes are (see the sketch after this list):
- Added enum InputType to support the different input data types: String, u32, Vec<u32> (router/src/http/types.rs).
- enum Input is now strictly a container type (i.e. Single/Batch) using InputType as its type.
- Corresponding changes to EncodingInput (core/src/tokenization.rs).
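A self-contained sketch of the types described above, using serde's untagged representation so JSON strings, integers, and arrays deserialize into the matching variant. SingleInt/VectorInt follow the diff in this PR; the String variant name is an assumption:

```rust
use serde::Deserialize;

// One element of an embeddings request: raw text, a single token id,
// or a pre-tokenized sequence of ids.
#[derive(Deserialize)]
#[serde(untagged)]
pub enum InputType {
    String(String),
    SingleInt(u32),
    VectorInt(Vec<u32>),
}

// Request-level container: one input or a batch. With `untagged`, variants
// are tried in order, so a bare JSON array of ints deserializes as a single
// pre-tokenized input (matching OpenAI semantics) rather than a batch.
#[derive(Deserialize)]
#[serde(untagged)]
pub enum Input {
    Single(InputType),
    Batch(Vec<InputType>),
}
```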
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.