
Commit d9e2903

Author: Joan Martinez
test: add a new step test

1 parent eb5a0e1, commit d9e2903

File tree: 4 files changed, +14 −6 lines changed

examples/server/tests/features/embeddings.feature (7 additions, 0 deletions)

@@ -23,6 +23,13 @@ Feature: llama.cpp server
     """
     Then embeddings are generated
 
+  Scenario: Tokenize / Detokenize complex
+    When tokenizing:
+      """
+      España is a èspciâl café über naïve résumé cañón élite cañas Barça 例子 東京 こんにちは 你好 中国
+      """
+    Then tokens can be detokenize and is equivalent False
+
   Scenario: OAI Embeddings compatibility
     Given a model bert-bge-small
     When an OAI compatible embeddings computation request for:
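One plausible reason the complex scenario expects `equivalent False` is Unicode normalization: a tokenizer may canonically decompose precomposed characters, so the detokenized text need not match the input code point for code point even when it renders identically. A minimal sketch of that effect using only the standard library (this illustrates the concept, not the server's actual tokenizer):

```python
import unicodedata

text = "España is a èspciâl café"

# NFD splits precomposed characters such as 'ñ' (U+00F1) into
# 'n' (U+006E) + COMBINING TILDE (U+0303), so the code-point
# sequence changes even though the rendered text looks the same.
decomposed = unicodedata.normalize("NFD", text)

assert decomposed != text                  # not code-point equivalent
assert len(decomposed) > len(text)         # combining marks add code points
assert unicodedata.normalize("NFC", decomposed) == text  # recomposition restores it
```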

examples/server/tests/features/server.feature (1 addition, 1 deletion)

@@ -91,7 +91,7 @@ Feature: llama.cpp server
     """
     What is the capital of France ?
     """
-    Then tokens can be detokenize
+    Then tokens can be detokenize and is equivalent True
 
   Scenario: Models available
     Given available models

examples/server/tests/features/steps/steps.py (5 additions, 4 deletions)

@@ -670,9 +670,10 @@ async def step_tokenize(context):
     context.tokens = tokenize_json['tokens']
 
 
-@step('tokens can be detokenize')
+@step('tokens can be detokenize and is equivalent {equivalent}')
 @async_run_until_complete
-async def step_detokenize(context):
+async def step_detokenize(context, equivalent):
+    equivalent = equivalent == 'True'
     assert len(context.tokens) > 0
     async with aiohttp.ClientSession() as session:
         async with session.post(f'{context.base_url}/detokenize',
@@ -682,8 +683,8 @@ async def step_detokenize(context):
         assert response.status == 200
         detokenize_json = await response.json()
         # SPM tokenizer adds a whitespace prefix: https://github.com/google/sentencepiece/issues/15
-        assert context.tokenized_text == detokenize_json['content'].strip()
-
+        if equivalent:
+            assert context.tokenized_text == detokenize_json['content'].strip()
 
 @step('an OPTIONS request is sent from {origin}')
 @async_run_until_complete
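The `{equivalent}` placeholder in the step decorator is captured as a raw string, which is why the step body converts it with `equivalent == 'True'`. A self-contained sketch of that pattern, with hypothetical helper names (`parse_flag`, `check_detokenize`) standing in for the behave machinery and HTTP call:

```python
def parse_flag(raw: str) -> bool:
    # Step parameters arrive as strings; only the exact literal 'True'
    # enables the equivalence check ('true', 'TRUE', etc. do not).
    return raw == 'True'

def check_detokenize(tokenized_text: str, content: str, equivalent: bool) -> None:
    # SPM tokenizers prepend a whitespace, hence the strip() before comparing.
    if equivalent:
        assert tokenized_text == content.strip()

# Equivalence required: the stripped detokenized content must match.
check_detokenize("What is the capital of France ?",
                 " What is the capital of France ?",
                 parse_flag('True'))
# Equivalence not required: the comparison is skipped entirely.
check_detokenize("España", "Espan\u0303a", parse_flag('False'))
```

Note that the conversion is strict: any value other than the literal string `'True'` (including a typo in the feature file) silently disables the assertion.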

unicode.cpp (1 addition, 1 deletion)

@@ -492,7 +492,7 @@ std::vector<uint32_t> sort_by_canonical_class(std::vector<uint32_t> & cpts) {
 std::vector<uint32_t> canonical_decomposition_cpts(std::vector<uint32_t> & cpts, uint32_t starting_offset) {
     std::vector<uint32_t> result;
     for (auto i = starting_offset; i < cpts.size(); i++) {
-        auto it = unicode_map_nfd.equal_range(cpts[i]);
+        const auto& it = unicode_map_nfd.equal_range(cpts[i]);
         if (it.first != it.second) {
             uint offset = 0;
             for (auto jt = it.first; jt != it.second; jt++) {
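The C++ helper walks the code points and replaces each one with its canonical decomposition from the `unicode_map_nfd` multimap, passing undecomposable code points through. The same idea can be sketched in Python with `unicodedata` (this uses the stdlib's full recursive NFD rather than the C++ table lookup, so it is an illustration, not a port):

```python
import unicodedata

def canonical_decomposition_cpts(cpts, starting_offset=0):
    # Expand each code point into its canonical (NFD) decomposition;
    # code points with no decomposition pass through unchanged.
    result = []
    for cp in cpts[starting_offset:]:
        result.extend(ord(ch) for ch in unicodedata.normalize("NFD", chr(cp)))
    return result

# 'é' (U+00E9) decomposes into 'e' (U+0065) + COMBINING ACUTE ACCENT (U+0301)
assert canonical_decomposition_cpts([0x63, 0x61, 0x66, 0xE9]) == \
       [0x63, 0x61, 0x66, 0x65, 0x0301]
```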

0 commit comments