
Commit 79ef3c0

server: tests: embeddings use a real embeddings model (#5908)

1 parent: bfb121f

6 files changed: +161 -93 lines changed

.github/workflows/server.yml

Lines changed: 2 additions & 1 deletion

@@ -58,7 +58,8 @@ jobs:
             cmake \
             python3-pip \
             wget \
-            psmisc
+            psmisc \
+            language-pack-en

       - name: Build
         id: cmake_build
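The newly installed language-pack-en package supplies the en_US.UTF-8 locale data on the minimal CI image. A small sketch of the kind of check that fails without it (the exact reason the test suite needs the locale is an assumption, not stated in this diff):

```python
import locale

# On a minimal Ubuntu image without language-pack-en, selecting an
# English UTF-8 locale raises locale.Error; the CI step above installs
# the locale data so this succeeds there.
try:
    locale.setlocale(locale.LC_ALL, "en_US.UTF-8")
    locale_available = True
except locale.Error:
    locale_available = False

print(locale_available)
```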
examples/server/tests/features/embeddings.feature

Lines changed: 97 additions & 0 deletions

@@ -0,0 +1,97 @@
+@llama.cpp
+@embeddings
+Feature: llama.cpp server
+
+  Background: Server startup
+    Given a server listening on localhost:8080
+    And a model file bert-bge-small/ggml-model-f16.gguf from HF repo ggml-org/models
+    And a model alias bert-bge-small
+    And 42 as server seed
+    And 2 slots
+    And 512 as batch size
+    And 1024 KV cache size
+    And embeddings extraction
+    Then the server is starting
+    Then the server is healthy
+
+  Scenario: Embedding
+    When embeddings are computed for:
+      """
+      What is the capital of Bulgaria ?
+      """
+    Then embeddings are generated
+
+  Scenario: OAI Embeddings compatibility
+    Given a model bert-bge-small
+    When an OAI compatible embeddings computation request for:
+      """
+      What is the capital of Spain ?
+      """
+    Then embeddings are generated
+
+  Scenario: OAI Embeddings compatibility with multiple inputs
+    Given a model bert-bge-small
+    Given a prompt:
+      """
+      In which country Paris is located ?
+      """
+    And a prompt:
+      """
+      Is Madrid the capital of Spain ?
+      """
+    When an OAI compatible embeddings computation request for multiple inputs
+    Then embeddings are generated
+
+  Scenario: Multi users embeddings
+    Given a prompt:
+      """
+      Write a very long story about AI.
+      """
+    And a prompt:
+      """
+      Write another very long music lyrics.
+      """
+    And a prompt:
+      """
+      Write a very long poem.
+      """
+    And a prompt:
+      """
+      Write a very long joke.
+      """
+    Given concurrent embedding requests
+    Then the server is busy
+    Then the server is idle
+    Then all embeddings are generated
+
+  Scenario: Multi users OAI compatibility embeddings
+    Given a prompt:
+      """
+      In which country Paris is located ?
+      """
+    And a prompt:
+      """
+      Is Madrid the capital of Spain ?
+      """
+    And a prompt:
+      """
+      What is the biggest US city ?
+      """
+    And a prompt:
+      """
+      What is the capital of Bulgaria ?
+      """
+    And a model bert-bge-small
+    Given concurrent OAI embedding requests
+    Then the server is busy
+    Then the server is idle
+    Then all embeddings are generated
+
+  @wip
+  Scenario: All embeddings should be the same
+    Given 20 fixed prompts
+    And a model bert-bge-small
+    Given concurrent OAI embedding requests
+    Then the server is busy
+    Then the server is idle
+    Then all embeddings are the same
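The scenarios above drive both the server's native embeddings route and its OpenAI-compatible one. A rough client-side sketch follows (the request payload shapes are assumptions inferred from the scenarios, not taken from this diff), together with a cosine-similarity helper of the kind the "All embeddings should be the same" check implies:

```python
import math

# Hypothetical request bodies matching the scenarios above; the exact
# JSON schema of the server's embeddings endpoints is an assumption.
native_request = {"content": "What is the capital of Bulgaria ?"}
oai_request = {
    "model": "bert-bge-small",
    "input": [
        "In which country Paris is located ?",
        "Is Madrid the capital of Spain ?",
    ],
}

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Compare two embedding vectors; identical prompts served by a
    deterministic model should score ~1.0."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(round(cosine_similarity([1.0, 2.0], [1.0, 2.0]), 6))  # prints 1.0
```

Fixing the server seed to 42, as the Background does, is what makes the "same embedding for the same prompt" assertion meaningful across concurrent requests.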

examples/server/tests/features/parallel.feature

Lines changed: 0 additions & 46 deletions

@@ -9,7 +9,6 @@ Feature: Parallel
     And 512 as batch size
     And 64 KV cache size
     And 2 slots
-    And embeddings extraction
     And continuous batching
     Then the server is starting
     Then the server is healthy

@@ -99,48 +98,3 @@ Feature: Parallel
     Then the server is busy
     Then the server is idle
     Then all prompts are predicted
-
-  Scenario: Multi users embeddings
-    Given a prompt:
-      """
-      Write a very long story about AI.
-      """
-    And a prompt:
-      """
-      Write another very long music lyrics.
-      """
-    And a prompt:
-      """
-      Write a very long poem.
-      """
-    And a prompt:
-      """
-      Write a very long joke.
-      """
-    Given concurrent embedding requests
-    Then the server is busy
-    Then the server is idle
-    Then all embeddings are generated
-
-  Scenario: Multi users OAI compatibility embeddings
-    Given a prompt:
-      """
-      In which country Paris is located ?
-      """
-    And a prompt:
-      """
-      Is Madrid the capital of Spain ?
-      """
-    And a prompt:
-      """
-      What is the biggest US city ?
-      """
-    And a prompt:
-      """
-      What is the capital of Bulgaria ?
-      """
-    And a model tinyllama-2
-    Given concurrent OAI embedding requests
-    Then the server is busy
-    Then the server is idle
-    Then all embeddings are generated
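The removed multi-user scenarios fan several prompts out concurrently against the server's two slots. A self-contained sketch of that client pattern (the request function is a stand-in; the test suite's real step definitions are not shown in this diff):

```python
from concurrent.futures import ThreadPoolExecutor

prompts = [
    "Write a very long story about AI.",
    "Write another very long music lyrics.",
    "Write a very long poem.",
    "Write a very long joke.",
]

def request_embedding(prompt: str) -> list[float]:
    # Stand-in for an HTTP POST to the server's embeddings endpoint;
    # returns a dummy vector so the sketch runs without a server.
    return [float(len(prompt)), 1.0]

# Two workers mirror the "2 slots" configured in the feature Background:
# at most two requests are in flight at once.
with ThreadPoolExecutor(max_workers=2) as pool:
    embeddings = list(pool.map(request_embedding, prompts))

print(len(embeddings))  # prints 4
```

With more concurrent requests than slots, the extra requests queue, which is what the "server is busy" then "server is idle" steps observe.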

examples/server/tests/features/server.feature

Lines changed: 0 additions & 28 deletions

@@ -49,34 +49,6 @@ Feature: llama.cpp server
       | llama-2       | Book                        | What is the best book                | 8          | (Mom\|what)+           | 8          | disabled |
       | codellama70b  | You are a coding assistant. | Write the fibonacci function in c++. | 64         | (thanks\|happy\|bird)+ | 32         | enabled  |

-  Scenario: Embedding
-    When embeddings are computed for:
-      """
-      What is the capital of Bulgaria ?
-      """
-    Then embeddings are generated
-
-  Scenario: OAI Embeddings compatibility
-    Given a model tinyllama-2
-    When an OAI compatible embeddings computation request for:
-      """
-      What is the capital of Spain ?
-      """
-    Then embeddings are generated
-
-  Scenario: OAI Embeddings compatibility with multiple inputs
-    Given a model tinyllama-2
-    Given a prompt:
-      """
-      In which country Paris is located ?
-      """
-    And a prompt:
-      """
-      Is Madrid the capital of Spain ?
-      """
-    When an OAI compatible embeddings computation request for multiple inputs
-    Then embeddings are generated
-
   Scenario: Tokenize / Detokenize
     When tokenizing:
       """
