
Commit a93297c

thement authored and jxhor committed
Implement non-greedy tokenizer that tries to maximize token lengths (ggml-org#242)
* Implement non-greedy tokenizer that tries to maximize token lengths

* Insert single space in front of the prompt - this is to match original llama tokenizer behavior

---------

Co-authored-by: Jakub Horak <[email protected]>
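For context, a minimal sketch of the idea the commit title describes: a forward pass that scores every vocabulary match and prefers segmentations built from longer tokens, followed by a backward pass that recovers the chosen tokens. Only the `token_to_id` map appears in the diff below; the function name, the quadratic scoring rule, the empty-result fallback, and the placement of the prepended space are illustrative assumptions, not the file's actual code.

#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sketch of a length-maximizing tokenizer in the spirit of
// this commit; everything except the token_to_id lookup is an assumption.
std::vector<int> tokenize_longest(
        const std::unordered_map<std::string, int> & token_to_id,
        std::string text) {
    text.insert(0, 1, ' ');              // single leading space, matching
                                         // the original llama tokenizer
    const int len = (int) text.size();

    std::vector<int> score(len + 1, -1); // best score for text[0..i)
    std::vector<int> prev(len + 1, -1);  // start of the token ending at i
    score[0] = 0;

    // Forward pass: at each reachable position, try every substring and
    // keep the segmentation that favors longer tokens (quadratic reward,
    // so one long token beats several short ones over the same span).
    for (int i = 0; i < len; i++) {
        if (score[i] < 0) continue;      // position i is unreachable
        for (int sub_len = 1; sub_len <= len - i; sub_len++) {
            auto it = token_to_id.find(text.substr(i, sub_len));
            if (it == token_to_id.end()) continue;
            const int s = score[i] + sub_len * sub_len;
            if (s > score[i + sub_len]) {
                score[i + sub_len] = s;
                prev[i + sub_len] = i;
            }
        }
    }

    if (score[len] < 0) return {};       // no full segmentation found

    // Backward pass: walk prev[] from the end to recover the token ids.
    std::vector<int> ids;
    for (int i = len; i > 0; i = prev[i]) {
        ids.push_back(token_to_id.at(text.substr(prev[i], i - prev[i])));
    }
    return {ids.rbegin(), ids.rend()};
}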
1 parent 7b0382e - commit a93297c

File tree

1 file changed: +1, -1 lines changed

utils.cpp

Lines changed: 1 addition & 1 deletion
@@ -302,7 +302,7 @@ std::vector<gpt_vocab::id> llama_tokenize(const gpt_vocab & vocab, const std::st
     // Forward pass
     for (int i = 0; i < len; i++) {
         int max_len = std::min(len - i, MAX_TOKEN_LEN);
-        for (int sub_len = 1; sub_len <= max_len; sub_len++) {
+        for (int sub_len = 1; sub_len <= len - i; sub_len++) {
             auto sub = text.substr(i, sub_len);
             auto token = vocab.token_to_id.find(sub);
             if (token != vocab.token_to_id.end()) {
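Net effect of the one-line change: the inner loop previously capped candidate substrings at MAX_TOKEN_LEN characters, and now scans every substring from position i to the end of the text. A vocabulary entry longer than MAX_TOKEN_LEN can therefore still be matched, at the cost of a longer inner loop (and, within this hunk, max_len is apparently left unused).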

0 commit comments
