What happened?
$ ./tokenize codegemma-2b.gguf " test"
[snip]
2 -> '<bos>'
255970 -> ' '
255970 -> ' '
2121 -> ' test'
$ echo " test" | spm_encode --model codegemma-2b.model --input /dev/stdin --output_format id
255973 2195
$ echo "255970 255970 2121" | spm_decode --model codegemma-2b.model --input /dev/stdin --input_format id | jq -R .
"\t\t\t\t\t\t test"
$ echo "255973 2195" | spm_decode --model codegemma-2b.model --input /dev/stdin --input_format id | jq -R .
"\t\t\t\t\t\ttest"
Note that the input is six tabs followed by "test", i.e. "\t\t\t\t\t\ttest". Take care not to accidentally use spaces when reproducing.
Note that this is not just inserting a stray space before "test": it also breaks the tabs into two sets of 3 instead of a single set of 6.
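For reference, the expected behaviour can also be checked with the sentencepiece Python package (a sketch, assuming the package is installed; it mirrors the spm_encode/spm_decode calls above, and building the input programmatically avoids the tabs-vs-spaces pitfall):

import sentencepiece as spm

# Build the input programmatically so tabs can't silently become spaces.
text = "\t" * 6 + "test"

sp = spm.SentencePieceProcessor(model_file="codegemma-2b.model")
ids = sp.encode(text)
print(ids)                      # [255973, 2195], matching spm_encode above
assert sp.decode(ids) == text   # SentencePiece round-trips the input exactly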
Inputs like this (leading indentation followed by text) happen a lot with code.
There are three issues here (the second and third are checked in the sketch below):
- A mismatch between the tokenization the model was trained on and the one llama.cpp produces. Inserting a space is definitely out-of-distribution, particularly for languages with strong formatting opinions (Go) or significant whitespace (Python).
- The llama.cpp tokenizer doesn't round-trip (it inserts an extraneous space).
- The llama.cpp tokenizer uses more tokens to represent the input.
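To make the second and third points concrete, here is a hypothetical harness (it assumes ./tokenize prints lines of the form "ID -> 'piece'", as in the transcript above, and that id 2 is <bos>):

import re
import subprocess
import sentencepiece as spm

text = "\t" * 6 + "test"

# Reference tokenization from SentencePiece.
sp = spm.SentencePieceProcessor(model_file="codegemma-2b.model")
ref_ids = sp.encode(text)                        # [255973, 2195]

# llama.cpp tokenization, parsed from ./tokenize's "ID -> 'piece'" output.
out = subprocess.run(["./tokenize", "codegemma-2b.gguf", text],
                     capture_output=True, text=True, check=True).stdout
llama_ids = [int(m) for m in re.findall(r"^\s*(\d+) ->", out, re.M)]
llama_ids = [i for i in llama_ids if i != 2]     # drop <bos>

print("token count:", len(ref_ids), "vs", len(llama_ids))  # 2 vs 3
print("round-trips:", sp.decode(llama_ids) == text)        # False: extra space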
Thanks!
Name and Version
$ ./llama-cli --version
version: 3325 (87e25a1d)
(head as of Sat Jul 6 09:22:16 2024 +0200)
What operating system are you seeing the problem on?
Linux, Mac
Relevant log output
No response