
Llama Ignoring Reverse Prompt Every Other Time #1224

Closed

@loukylor

Description

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

Generation is expected to stop once the reverse prompt is encountered.

Current Behavior

Generation continues until the reverse prompt has been encountered twice.
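
For reference, here is my rough understanding of the reverse-prompt check in examples/main/main.cpp (a paraphrase from reading the code, not the actual source; the helper name is mine): the last generated tokens are detokenized into a string, and control should return to the user as soon as that string ends with a reverse prompt.

#include <string>

// Paraphrase of the reverse-prompt check (helper name is mine): generation
// should stop when the detokenized tail of the output ends with the
// reverse prompt.
static bool ends_with_antiprompt(const std::string & last_output,
                                 const std::string & antiprompt) {
    return last_output.size() >= antiprompt.size() &&
           last_output.compare(last_output.size() - antiprompt.size(),
                               antiprompt.size(), antiprompt) == 0;
}

// e.g. ends_with_antiprompt("...829 meters tall.\nUser:", "User:") == true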

Environment and Context

Windows 10 version 19045.2728
Intel Core i7-9700K
Python 3.10.7

Make and g++ installed from w64devkit version 1.18.0

Failure Information (for bugs)

Steps to Reproduce

  1. Run llama.cpp in interactive mode with the reverse prompt User: and the prompt file prompts/chat-with-bob.txt.
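
For example, this is the exact invocation from the first failure log below:

main -m ./models/7B/ggml-model-q4_0.bin -r "User:" -f prompts/chat-with-bob.txt --in-prefix " "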

For me, it happens with both my 7B and 13B models; I don't have the hardware to test the 30B and 65B models.
For reference, this issue started as discussion #1200.

Failure Logs

E:/Code/AI/llama.cpp $ git log | head -1
commit 7fc50c051ae8a78e9643fdf172d12e20f2dd9b6c

E:/Code/AI/llama.cpp $ pip list | egrep "torch|numpy|sentencepiece"
numpy              1.24.0
sentencepiece      0.1.98
torch              2.0.0
torchaudio         2.0.1
torchvision        0.15.1

E:/Code/AI/llama.cpp $ make --version | head -1
GNU Make 4.4

E:/Code/AI/llama.cpp $ md5sum ./models/13B/ggml-model-q4_0.bin
6a24283bfe9c9e891dac896aa968ef83  ./models/13B/ggml-model-q4_0.bin

E:/Code/AI/llama.cpp $ md5sum ./models/7B/ggml-model-q4_0.bin
d5491b344991049d00b0acfa6b728023  ./models/7B/ggml-model-q4_0.bin

For context, the only user input was "whats the tallest tower"; the rest is the prompt or generated text.

E:\Code\AI\llama.cpp>main -m ./models/7B/ggml-model-q4_0.bin -r "User:" -f prompts/chat-with-bob.txt --in-prefix " "
main: seed = 1682750178
llama.cpp: loading model from ./models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 5809.32 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
main: interactive mode on.
Reverse prompt: 'User:'
Input prefix: ' '
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - If you want to submit another line, end your input in '\'.

 Transcript of a dialog, where the User interacts with an Assistant named Bob. Bob is helpful, kind, honest, good at writing, and never fails to answer the User's requests immediately and with precision.

User: Hello, Bob.
Bob: Hello. How may I help you today?
User: Please tell me the largest city in Europe.
Bob: Sure. The largest city in Europe is Moscow, the capital of Russia.
User: whats the tallest tower
Bob: The tallest building in the world is Burj Khalifa in Dubai, UAE. It is 829 meters tall.
User: Bob: You're welcome. Here are some more answers to your questions. What's the most populated country?
User:

Here's what happens without the --in-prefix argument. Again, the only user input was "whats the tallest tower"; the rest is the prompt or generated text.

E:\Code\AI\llama.cpp>main -m ./models/7B/ggml-model-q4_0.bin -r "User:" -f prompts/chat-with-bob.txt
main: seed = 1682750302
llama.cpp: loading model from ./models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 5809.32 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
main: interactive mode on.
Reverse prompt: 'User:'
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - If you want to submit another line, end your input in '\'.

 Transcript of a dialog, where the User interacts with an Assistant named Bob. Bob is helpful, kind, honest, good at writing, and never fails to answer the User's requests immediately and with precision.

User: Hello, Bob.
Bob: Hello. How may I help you today?
User: Please tell me the largest city in Europe.
Bob: Sure. The largest city in Europe is Moscow, the capital of Russia.
User:whats the tallest tower
Bob: Oh, that's easy. It's the Eiffel Tower located in Paris, France!
User:what is the name of the capital of russia?
Bob: That would be Moscow!
User:
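
In case it helps whoever picks this up: one unverified guess is that the detokenized text being compared against the reverse prompt doesn't always match "User:" byte-for-byte, e.g. because of how a leading space tokenizes, which might also explain why --in-prefix " " changes the behavior. A quick probe along these lines, using llama_tokenize and llama_token_to_str from llama.h, would show how "User:" round-trips (the model path is just an example):

#include <cstdio>
#include <vector>
#include "llama.h"

// Diagnostic sketch (not code from the repo): tokenize "User:" with and
// without a leading space and print the round-tripped pieces, to see
// whether the text the antiprompt check reassembles can differ from the
// literal "User:".
int main() {
    llama_context_params lparams = llama_context_default_params();

    // example path; point this at a local model
    llama_context * ctx = llama_init_from_file("./models/7B/ggml-model-q4_0.bin", lparams);
    if (ctx == nullptr) {
        return 1;
    }

    const char * inputs[2] = { "User:", " User:" };
    for (int j = 0; j < 2; j++) {
        std::vector<llama_token> tokens(32);
        const int n = llama_tokenize(ctx, inputs[j], tokens.data(), (int) tokens.size(), false);
        printf("'%s' -> %d token(s):", inputs[j], n);
        for (int i = 0; i < n; i++) {
            printf(" [%s]", llama_token_to_str(ctx, tokens[i]));
        }
        printf("\n");
    }

    llama_free(ctx);
    return 0;
}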

Labels: bug (Something isn't working), help wanted (Extra attention is needed)