feat: add changes to handle jina v2 chinese code #7795

Open · wants to merge 57 commits into master
Commits (57)
86a5d96
feat: first things to do
Apr 11, 2024
747d17a
feat: create tensors for Jina architecture
Apr 12, 2024
a40156a
fix: use other tensors
Apr 12, 2024
b00d38b
feat: embedding gets results
Apr 16, 2024
cf1c144
fix: fix usage of ALIBI
Apr 22, 2024
63a1d7c
fix: clean prints
Apr 22, 2024
c229e48
fix: do some cleanup unused vars
Apr 22, 2024
e232370
fix: revert changes to Makefile and CMakeLists
Apr 22, 2024
795ff1d
fix: revert some changes
Apr 22, 2024
d6ac931
fix: fix small detail
Apr 22, 2024
db7e8ce
Merge branch 'master' into feat-jina-embeddings
JoanFM Apr 22, 2024
c1c0f4d
fix: fix convert formatting
Apr 22, 2024
64cd4b1
fix: fix linting and editor
Apr 22, 2024
71ff763
feat: set proper vocab settings
Apr 22, 2024
d7d6a4e
fix: JinaBertForMaskedLM registration
Apr 23, 2024
cde49b7
feat: support q_normalization and k_normalization in Jina arch
Apr 23, 2024
dd060a2
feat: handle gpt2 tokenizer with Jina architecture
Apr 24, 2024
dfa0676
feat: example comments in embedding
Apr 24, 2024
c3f4b1f
feat: rename Jina Bert to Jina Bert V2
Apr 24, 2024
603f18b
feat: small changes to allow jina embeddings ZH model
Apr 29, 2024
f8d1709
Merge branch 'master' into feat-jina-embeddings
JoanFM Apr 30, 2024
da96368
fix: add some changes as per review
Apr 30, 2024
2835441
Merge branch 'feat-jina-embeddings' of https://github.com/JoanFM/llam…
Apr 30, 2024
d9b8dd6
fix: add some changes as per review
Apr 30, 2024
e73ab4b
Merge branch 'feat-jina-embeddings' of https://github.com/JoanFM/llam…
Apr 30, 2024
14073a2
feat: proper KQ_pos for Jina embeddings
Apr 30, 2024
f6365b8
Merge branch 'feat-jina-embeddings' of https://github.com/JoanFM/llam…
May 2, 2024
14cd69a
feat: add pre tokenization
May 2, 2024
d5c3525
feat: first iteration NFC
May 6, 2024
76436c1
Merge branch 'master' of https://github.com/JoanFM/llama.cpp into fea…
May 6, 2024
365af24
Merge branch 'feat-jina-embeddings' of https://github.com/JoanFM/llam…
May 6, 2024
3269efe
Merge branch 'master' of https://github.com/JoanFM/llama.cpp into fea…
May 11, 2024
d0a99aa
Merge branch 'master' of https://github.com/JoanFM/llama.cpp into fea…
May 13, 2024
8957cac
refactor: rename jina tokenizers to v2
May 13, 2024
0771b17
Merge branch 'refactor-jina-rename' of https://github.com/JoanFM/llam…
May 13, 2024
22a0113
fix: fix alignment
May 13, 2024
fb83012
refactor: keep refactoring non-breaking
May 13, 2024
ea0f7df
Merge branch 'refactor-jina-rename' of https://github.com/JoanFM/llam…
May 13, 2024
22b5f6b
Merge branch 'master' of https://github.com/JoanFM/llama.cpp into fea…
May 13, 2024
cc0ac09
feat: add changes to handle jina v2 base code
May 28, 2024
21936dd
fix: do not complicate things
May 28, 2024
9a65c7a
fix: fix the usage of the code model
May 31, 2024
96a6f55
Merge branch 'master' of https://github.com/JoanFM/llama.cpp into fea…
May 31, 2024
0fc775e
Merge branch 'master' of https://github.com/JoanFM/llama.cpp into fea…
Jun 4, 2024
4bce30c
fix: fix comments
Jun 4, 2024
3b44f8f
fix: fix linting issues
Jun 5, 2024
05659d3
fix: remove ollama patches
Jun 5, 2024
7ab6023
Merge branch 'master' of https://github.com/JoanFM/llama.cpp into fea…
Jun 5, 2024
d86efa6
fix: merge with code
Jun 5, 2024
a8a64fd
fix: fix preprocessing jina v2 zh
Jun 6, 2024
605a619
fix: merge issues
Jun 6, 2024
728e1b4
fix: lowercase unicode pt by unicode pt
Jun 7, 2024
841b9a5
Merge branch 'master' into feat-jina-embeddings-v2-zh
Jun 18, 2024
175391d
merge with master
Jul 8, 2024
0699a4c
Merge branch 'feat-jina-embeddings-v2-zh' of https://github.com/JoanF…
Jul 8, 2024
afd76e6
fix: handle default
Jul 8, 2024
201559d
Merge branch 'master' of https://github.com/JoanFM/llama.cpp into fea…
Jul 26, 2024
3 changes: 3 additions & 0 deletions convert_hf_to_gguf.py
@@ -603,6 +603,9 @@ def get_vocab_base_pre(self, tokenizer) -> str:
         if chkhsh == "855059429035d75a914d1eda9f10a876752e281a054a7a3d421ef0533e5b6249":
             # ref: https://huggingface.co/HuggingFaceTB/SmolLM-135M
             res = "smollm"
+        if chkhsh == "c7699093ba4255a91e702aa38a596aa81669f3525dae06c2953267dde580f448":
+            # ref: https://huggingface.co/jinaai/jina-embeddings-v2-base-zh
+            res = "jina-v2-zh"
 
         if res is None:
             logger.warning("\n")
1 change: 1 addition & 0 deletions convert_hf_to_gguf_update.py
@@ -94,6 +94,7 @@ class TOKENIZER_TYPE(IntEnum):
     {"name": "codeshell", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/WisdomShell/CodeShell-7B", },
     {"name": "tekken", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/mistralai/Mistral-Nemo-Base-2407", },
     {"name": "smollm", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/HuggingFaceTB/SmolLM-135M", },
+    {"name": "jina-v2-zh", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/jinaai/jina-embeddings-v2-base-zh", },
 ]
 
 
1 change: 1 addition & 0 deletions include/llama.h
@@ -95,6 +95,7 @@ extern "C" {
         LLAMA_VOCAB_PRE_TYPE_TEKKEN     = 20,
         LLAMA_VOCAB_PRE_TYPE_SMOLLM     = 21,
         LLAMA_VOCAB_PRE_TYPE_CODESHELL  = 22,
+        LLAMA_VOCAB_PRE_TYPE_JINA_V2_ZH = 23,
     };
 
     // note: these values should be synchronized with ggml_rope
19 changes: 18 additions & 1 deletion src/llama-vocab.cpp
@@ -11,6 +11,7 @@
 #include <forward_list>
 #include <queue>
 #include <sstream>
+#include <regex>
 
 //
 // helpers
@@ -446,6 +447,9 @@ struct llm_tokenizer_bpe {
                     "[^\\r\\n\\p{L}\\p{N}]?((?=[\\p{L}])([^a-z]))*((?=[\\p{L}])([^A-Z]))+|[^\\r\\n\\p{L}\\p{N}]?((?=[\\p{L}])([^a-z]))+((?=[\\p{L}])([^A-Z]))*|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+",
                 };
                 break;
+            case LLAMA_VOCAB_PRE_TYPE_JINA_V2_ZH:
+                regex_exprs = {"\\w+|[^\\w\\s]+"};
+                break;
             default:
                 // default regex for BPE tokenization pre-processing
                 regex_exprs = {
@@ -498,7 +502,20 @@ struct llm_tokenizer_bpe {
     void tokenize(const std::string & text, std::vector<llama_vocab::id> & output) {
         int final_prev_index = -1;
 
-        const auto word_collection = unicode_regex_split(text, regex_exprs);
+        std::vector<std::string> word_collection;
+        if (vocab.type_pre == LLAMA_VOCAB_PRE_TYPE_JINA_V2_ZH) {
+
+            std::string lowercase_text = lowercase(text);
+            std::regex regexPattern(regex_exprs[0]);
+            std::sregex_token_iterator it(lowercase_text.begin(), lowercase_text.end(), regexPattern);
+            std::sregex_token_iterator end;
+
+            while (it != end) {
+                word_collection.push_back(*it++);
+            }
+        } else {
+            word_collection = unicode_regex_split(text, regex_exprs);
+        }
 
         symbols_final.clear();
 
25 changes: 18 additions & 7 deletions src/llama.cpp
@@ -5385,8 +5385,8 @@ static void llm_load_vocab(
                 tokenizer_pre == "jina-v2-de" ||
                 tokenizer_pre == "jina-v2-code") {
             vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_GPT2;
-        } else if (
-                tokenizer_pre == "refact") {
+
+        } else if (tokenizer_pre == "refact") {
             vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_REFACT;
         } else if (
                 tokenizer_pre == "command-r") {
@@ -5436,6 +5436,9 @@ static void llm_load_vocab(
         } else if (
             tokenizer_pre == "codeshell") {
             vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_CODESHELL;
+        } else if (
+            tokenizer_pre == "jina-v2-zh") {
+            vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_JINA_V2_ZH;
         } else {
             throw std::runtime_error(format("unknown pre-tokenizer type: '%s'", tokenizer_pre.c_str()));
         }
@@ -5486,8 +5489,7 @@ static void llm_load_vocab(
 
         for (uint32_t i = 0; i < n_vocab; i++) {
             std::string word = gguf_get_arr_str(ctx, token_idx, i);
-            GGML_ASSERT(unicode_cpts_from_utf8(word).size() > 0);
-
+            //GGML_ASSERT(unicode_cpts_from_utf8(word).size() > 0); // check removed: some vocabs (e.g. jinaai/jina-embeddings-v2-base-zh) mistakenly contain a NULL entry (worse if it happens more than once)
             vocab.token_to_id[word] = i;
             vocab.max_token_len = std::max(vocab.max_token_len, (int) word.size());
 
@@ -5560,9 +5562,18 @@ static void llm_load_vocab(
         } else if (vocab.type == LLAMA_VOCAB_TYPE_WPM) {
             vocab.linefeed_id = vocab.special_pad_id;
         } else {
-            const std::vector<int> ids = llama_tokenize_internal(vocab, "\xC4\x8A", false); // U+010A
-            GGML_ASSERT(!ids.empty() && "model vocab missing newline token");
-            vocab.linefeed_id = ids[0];
+            try {
+                const std::vector<int> ids = llama_tokenize_internal(vocab, "\xC4\x8A", false); // U+010A
+                if (ids.empty()) {
+                    LLAMA_LOG_WARN("%s: %s vocabulary, but newline token not found: %s! Using special_pad_id instead.\n", __func__, llama_model_vocab_type_name(vocab.type), "\xC4\x8A");
+                    vocab.linefeed_id = vocab.special_pad_id;
+                } else {
+                    vocab.linefeed_id = ids[0];
+                }
+            } catch (const std::exception & e) {
+                LLAMA_LOG_WARN("%s: %s vocabulary, but newline token not found: %s! Using special_pad_id instead.\n", __func__, llama_model_vocab_type_name(vocab.type), e.what());
+                vocab.linefeed_id = vocab.special_pad_id;
+            }
         }
 
         // special tokens
11 changes: 11 additions & 0 deletions src/unicode.cpp
@@ -816,3 +816,14 @@ std::vector<std::string> unicode_regex_split(const std::string & text, const std::vector<std::string> & regex_exprs) {
 
     return unicode_byte_encoding_process(bpe_words);
 }
+
+std::string lowercase(const std::string & text) {
+    std::string result;
+    const std::vector<uint32_t> cpts = unicode_cpts_from_utf8(text);
+
+    for (const char32_t cpt : cpts) {
+        result += unicode_cpt_to_utf8(unicode_tolower(cpt)); // lowercase each codepoint and append its UTF-8 bytes
+    }
+
+    return result;
+}
2 changes: 2 additions & 0 deletions src/unicode.h
@@ -65,3 +65,5 @@ uint8_t unicode_utf8_to_byte(const std::string & utf8);
 uint32_t unicode_tolower(uint32_t cp);
 
 std::vector<std::string> unicode_regex_split(const std::string & text, const std::vector<std::string> & regex_exprs);
+
+std::string lowercase(const std::string & text);