The new tokenizer no longer encodes spaces properly #2721

Closed
@jxy

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

Reference output from llama.tokenizer (Python):

Python 3.11.4 (main, Jul  5 2023, 13:45:01) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from llama.tokenizer import Tokenizer
>>> tokenizer = Tokenizer('tokenizer.model')
>>> tokenizer.encode('Hello', bos=False, eos=False)
[15043]
>>> tokenizer.encode(' Hello', bos=False, eos=False)
[29871, 15043]
>>> tokenizer.encode('  Hello', bos=False, eos=False)
[259, 15043]
>>> tokenizer.encode('   Hello', bos=False, eos=False)
[1678, 15043]
>>> tokenizer.encode('    Hello', bos=False, eos=False)
[268, 15043]
>>> tokenizer.encode('    Hello\n    Hello', bos=False, eos=False)
[268, 15043, 13, 1678, 15043]

The previous version (the one in PR #2306, before the GGUF merge) produces the same tokens:

$ curl http://localhost:8080/tokenize --header "Content-Type: application/json" --data '{"content":"Hello"}' 
{"tokens":[15043]}
$ curl http://localhost:8080/tokenize --header "Content-Type: application/json" --data '{"content":" Hello"}'
{"tokens":[29871,15043]}
$ curl http://localhost:8080/tokenize --header "Content-Type: application/json" --data '{"content":"  Hello"}'
{"tokens":[259,15043]}
$ curl http://localhost:8080/tokenize --header "Content-Type: application/json" --data '{"content":"   Hello"}'
{"tokens":[1678,15043]}
$ curl http://localhost:8080/tokenize --header "Content-Type: application/json" --data '{"content":"    Hello"}'
{"tokens":[268,15043]}
$ curl http://localhost:8080/tokenize --header "Content-Type: application/json" --data '{"content":"    Hello\n    Hello"}'
{"tokens":[268,15043,13,1678,15043]}

Current Behavior

$ curl http://localhost:28888/tokenize --header "Content-Type: application/json" --data '{"content":"Hello"}'
{"tokens":[15043]}
$ curl http://localhost:28888/tokenize --header "Content-Type: application/json" --data '{"content":" Hello"}'
{"tokens":[29871,15043]}
$ curl http://localhost:28888/tokenize --header "Content-Type: application/json" --data '{"content":"  Hello"}'
{"tokens":[29871,15043]}
$ curl http://localhost:28888/tokenize --header "Content-Type: application/json" --data '{"content":"   Hello"}'
{"tokens":[29871,15043]}
$ curl http://localhost:28888/tokenize --header "Content-Type: application/json" --data '{"content":"    Hello"}'
{"tokens":[29871,15043]}
$ curl http://localhost:28888/tokenize --header "Content-Type: application/json" --data '{"content":"    Hello\n    Hello"}'
{"tokens":[29871,15043,13,15043]}
