server : fix context shift #5195

Merged
ggerganov merged 4 commits into master from gg/server-fix-shift on Jan 30, 2024

Conversation

ggerganov
Member

This probably fixes the context shift functionality.

Also, this is a continuation of #5104 - I'm not sure the n_past_se added there is really necessary, so I removed it. Needs testing.

@Green-Sky
Collaborator

Green-Sky commented Jan 29, 2024

slot 0 is processing [task id: 283]
slot 0 : in cache: 264 tokens | to process: 0 tokens
slot 0 : kv cache rm - [264, end)
slot 0 : we have to evaluate at least 1 token to generate logits

print_timings: prompt eval time =      57.36 ms /     0 tokens (     inf ms per token,     0.00 tokens per second)
print_timings:        eval time =     106.73 ms /     5 runs   (   21.35 ms per token,    46.84 tokens per second)
print_timings:       total time =     164.10 ms
slot 0 released (269 tokens in cache)
{"timestamp":1706538015,"level":"INFO","function":"log_server_request","line":2356,"message":"request","remote_addr":"127.0.0.1","remote_port":53316,"status":200,"method":"POST","path":"/completion","params":{}}
slot 0 is processing [task id: 290]
slot 0 : in cache: 268 tokens | to process: 1 tokens
slot 0 : kv cache rm - [268, end)
slot 0: context shift - n_keep = 0, n_left = 511, n_discard = 255
slot 0: context shift - n_keep = 0, n_left = 511, n_discard = 255
slot 0: context shift - n_keep = 0, n_left = 511, n_discard = 255

print_timings: prompt eval time =      37.80 ms /     1 tokens (   37.80 ms per token,    26.45 tokens per second)
print_timings:        eval time =   19156.52 ms /  1000 runs   (   19.16 ms per token,    52.20 tokens per second)
print_timings:       total time =   19194.32 ms
slot 0 released (504 tokens in cache)
{"timestamp":1706538034,"level":"INFO","function":"log_server_request","line":2356,"message":"request","remote_addr":"127.0.0.1","remote_port":53326,"status":200,"method":"POST","path":"/completion","params":{}}
slot 0 is processing [task id: 1292]
slot 0 : in cache: 0 tokens | to process: 393 tokens
slot 0 : kv cache rm - [0, end)

print_timings: prompt eval time =     238.00 ms /   393 tokens (    0.61 ms per token,  1651.24 tokens per second)
print_timings:        eval time =     189.38 ms /     9 runs   (   21.04 ms per token,    47.52 tokens per second)
print_timings:       total time =     427.38 ms
slot 0 released (402 tokens in cache)

Not sure it's correct; will test more later.

edit: I forgot to mention the important thing: it does fix the previously observed hang!
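
For reference, the n_keep / n_left / n_discard values in these logs follow the context-shift arithmetic sketched below. This is a standalone reconstruction from the logged numbers (n_ctx = 512, n_keep = 0 in this run), not the exact server.cpp code; the same formula also matches the later 4096-context run (n_left = 4094, n_discard = 2047).

#include <cstdio>

// Standalone sketch of the context-shift arithmetic seen in the log above,
// reconstructed from the logged values; not the exact server.cpp code.
int main() {
    const int n_ctx  = 512; // slot context size in this run
    const int n_keep = 0;   // tokens at the start that are never discarded

    int n_past = n_ctx;     // the cache is full, so a context shift is triggered

    const int n_left    = n_past - n_keep - 1; // tokens eligible for shifting
    const int n_discard = n_left / 2;          // drop roughly half of them

    printf("context shift - n_keep = %d, n_left = %d, n_discard = %d\n",
           n_keep, n_left, n_discard);
    // prints: context shift - n_keep = 0, n_left = 511, n_discard = 255

    // conceptually, the server then removes KV entries
    // [n_keep + 1, n_keep + 1 + n_discard) and shifts the rest down by n_discard
    n_past -= n_discard;
    printf("n_past after shift = %d\n", n_past); // 257

    return 0;
}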

@Green-Sky (Collaborator) left a comment

style: not your fault, but it uses a mixture of x-- and x += 1...

@Green-Sky
Collaborator

Green-Sky commented Jan 29, 2024

Decided to test self-extend, and it broke.

Looks different, but still OK:

Available slots:
 -> Slot 0 - max context: 8192
 -> Slot 0 - self-extend: ga_n = 4, ga_w = 1024
{"timestamp":1706545764,"level":"INFO","function":"main","line":2543,"message":"model loaded"}
all slots are idle and system prompt is empty, clear the KV cache
{"timestamp":1706545776,"level":"INFO","function":"log_server_request","line":2356,"message":"request","remote_addr":"127.0.0.1","remote_port":41762,"status":200,"method":"GET","path":"/health","params":{}}
slot 0 is processing [task id: 0]
slot 0 : in cache: 0 tokens | to process: 4830 tokens
slot 0 : kv cache rm - [0, end)

shift: [     0,   4830] +      0 -> [     0,   4830]
div:   [     0,   1024] /      4 -> [     0,    256]
shift: [  1024,   4830] +   -768 -> [   256,   4062]

n_past_old = 4830, n_past = 4062, ga_i = 256


shift: [   256,   4062] +    768 -> [  1024,   4830]
div:   [  1024,   2048] /      4 -> [   256,    512]
shift: [  2048,   4830] +  -1536 -> [   512,   3294]

n_past_old = 4062, n_past = 3294, ga_i = 512


shift: [   512,   3294] +   1536 -> [  2048,   4830]
div:   [  2048,   3072] /      4 -> [   512,    768]
shift: [  3072,   4830] +  -2304 -> [   768,   2526]

n_past_old = 3294, n_past = 2526, ga_i = 768


shift: [   768,   2526] +   2304 -> [  3072,   4830]
div:   [  3072,   4096] /      4 -> [   768,   1024]
shift: [  4096,   4830] +  -3072 -> [  1024,   1758]

n_past_old = 2526, n_past = 1758, ga_i = 1024


shift: [  1024,   2270] +   3072 -> [  4096,   5342]
div:   [  4096,   5120] /      4 -> [  1024,   1280]
shift: [  5120,   5342] +  -3840 -> [  1280,   1502]

n_past_old = 2270, n_past = 1502, ga_i = 1280


shift: [  1280,   2526] +   3840 -> [  5120,   6366]
div:   [  5120,   6144] /      4 -> [  1280,   1536]
shift: [  6144,   6366] +  -4608 -> [  1536,   1758]

n_past_old = 2526, n_past = 1758, ga_i = 1536


shift: [  1536,   2782] +   4608 -> [  6144,   7390]
div:   [  6144,   7168] /      4 -> [  1536,   1792]
shift: [  7168,   7390] +  -5376 -> [  1792,   2014]

n_past_old = 2782, n_past = 2014, ga_i = 1792


shift: [  1792,   3038] +   5376 -> [  7168,   8414]
div:   [  7168,   8192] /      4 -> [  1792,   2048]
shift: [  8192,   8414] +  -6144 -> [  2048,   2270]

n_past_old = 3038, n_past = 2270, ga_i = 2048


shift: [  2048,   3294] +   6144 -> [  8192,   9438]
div:   [  8192,   9216] /      4 -> [  2048,   2304]
shift: [  9216,   9438] +  -6912 -> [  2304,   2526]

n_past_old = 3294, n_past = 2526, ga_i = 2304


shift: [  2304,   3550] +   6912 -> [  9216,  10462]
div:   [  9216,  10240] /      4 -> [  2304,   2560]
shift: [ 10240,  10462] +  -7680 -> [  2560,   2782]

n_past_old = 3550, n_past = 2782, ga_i = 2560


shift: [  2560,   3806] +   7680 -> [ 10240,  11486]
div:   [ 10240,  11264] /      4 -> [  2560,   2816]
shift: [ 11264,  11486] +  -8448 -> [  2816,   3038]

n_past_old = 3806, n_past = 3038, ga_i = 2816


shift: [  2816,   4062] +   8448 -> [ 11264,  12510]
div:   [ 11264,  12288] /      4 -> [  2816,   3072]
shift: [ 12288,  12510] +  -9216 -> [  3072,   3294]

n_past_old = 4062, n_past = 3294, ga_i = 3072

print_timings: prompt eval time =    6495.03 ms /  4830 tokens (    1.34 ms per token,   743.65 tokens per second)
print_timings:        eval time =     203.73 ms /     5 runs   (   40.75 ms per token,    24.54 tokens per second)
print_timings:       total time =    6698.76 ms
slot 0 released (4835 tokens in cache)

Then the next completion. Uh, big spam:

...............
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 256
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 128
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 64
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 32
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 16
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 8
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 4
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 2
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 1
update_slots : failed to decode the batch, n_batch = 1, ret = 1
................
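
For reference, the shift/div lines in the self-extend log above follow the group-attention arithmetic sketched below: a standalone reconstruction that reproduces the first four blocks of that log (ga_n = 4, ga_w = 1024, 4830 prompt tokens), with the actual KV-cache calls replaced by prints. Treat it as a sketch of the interval math, not as the exact server code; the later blocks in the log interleave with further prompt batches that advance n_past in between, so they are not reproduced here.

#include <cstdio>

// Standalone sketch of the self-extend (group attention) interval arithmetic.
int main() {
    const int ga_n = 4;
    const int ga_w = 1024;

    int ga_i   = 0;
    int n_past = 4830;

    while (n_past >= ga_i + ga_w) {
        const int ib = (ga_n * ga_i) / ga_w;           // groups already compressed
        const int bd = (ga_w / ga_n) * (ga_n - 1);     // positions saved per pass (768 here)
        const int dd = (ga_w / ga_n) - ib * bd - ga_w; // shift applied to the tail

        printf("shift: [%6d, %6d] + %6d\n", ga_i, n_past, ib * bd);
        printf("div:   [%6d, %6d] / %6d\n", ga_i + ib * bd, ga_i + ib * bd + ga_w, ga_n);
        printf("shift: [%6d, %6d] + %6d\n", ga_i + ib * bd + ga_w, n_past + ib * bd, dd);

        const int n_past_old = n_past;
        n_past -= bd;
        ga_i   += ga_w / ga_n;

        printf("n_past_old = %d, n_past = %d, ga_i = %d\n\n", n_past_old, n_past, ga_i);
    }

    return 0;
}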

@ggerganov force-pushed the gg/server-fix-shift branch from 5f62e23 to d0e10bf on January 30, 2024 at 11:57
@ggerganov
Member Author

There were some major issues still - I think I fixed those, but I'm still not sure if self-extend works now. Please let me know if you give it another try.

@Green-Sky
Copy link
Collaborator

Green-Sky commented Jan 30, 2024

normal operation seems to be working now.

slot 0 is processing [task id: 12676]
slot 0 : in cache: 4093 tokens | to process: 0 tokens
slot 0 : kv cache rm - [4093, end)
slot 0 : we have to evaluate at least 1 token to generate logits
slot 0: context shift - n_keep = 0, n_left = 4094, n_discard = 2047
slot 0 is processing [task id: 15233]
slot 0 : in cache: 4095 tokens | to process: 0 tokens
slot 0 : kv cache rm - [4095, end)
slot 0 : we have to evaluate at least 1 token to generate logits
slot 0: context shift - n_keep = 0, n_left = 4094, n_discard = 2047

There were some major issues still

Yeah, I forgot to mention that the actual predictions returned were garbled, even without getting close to a full cache/context.

@Green-Sky
Collaborator

Self-extend does not work. Either my settings are way off (not 100% sure about those values), or it's just broken.

Available slots:
 -> Slot 0 - max context: 8192
 -> Slot 0 - self-extend: ga_n = 4, ga_w = 1024
{"timestamp":1706632770,"level":"INFO","function":"main","line":2541,"message":"model loaded"}
all slots are idle and system prompt is empty, clear the KV cache
{"timestamp":1706632840,"level":"INFO","function":"log_server_request","line":2354,"message":"request","remote_addr":"127.0.0.1","remote_port":55592,"status":200,"method":"GET","path":"/health","params":{}}
slot 0 is processing [task id: 0]
slot 0 : in cache: 0 tokens | to process: 4643 tokens
slot 0 : kv cache rm - [0, end)
slot 0 : applied self-extend to prompt: 0 tokens
Segmentation fault (core dumped)

@Maximilian-Winter

@Green-Sky
Collaborator

Update: running without self-extend now works properly. I had my bot spam for hours with >20k server tasks and it was still fine (besides the obvious self-deterioration when it keeps talking to itself).

...
InstructBot: (disappear)
InstructBot: (chuckles)
InstructBot: (voice) Goodbye and have a great day, everyone!
InstructBot: (disappear)
InstructBot: (chuckles)
InstructBot: (voice) If you need any assistance or have any questions, don't hesitate to ask InstructBot or myself.
InstructBot: (smiles)
InstructBot: (disappear)
InstructBot: (chuckles)
...

@Maximilian-Winter
Contributor

@Green-Sky Are you talking about my code or the code of this PR?

@Green-Sky
Collaborator

Green-Sky commented Jan 30, 2024

@Green-Sky Are you talking about my code or the code of this PR?

This PR with caching enabled. But since master with caching is broken too...

edit: I mostly tagged you to get your eyes on this :)

@Maximilian-Winter
Contributor

n_past_se was needed because otherwise self-extend won't work: all of the prompt tokens are added to the batch at once, and n_past is the total token count.
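
A toy sketch of that point (illustrative only, not actual server.cpp code; ga_n, ga_w and the prompt size are taken from the logs above, and n_batch = 512 is an assumption): when the prompt is queued batch by batch and the group-attention pass compresses the cache, the position counter shrinks while the count of processed tokens does not, so a single n_past cannot serve both roles.

#include <algorithm>
#include <cstdio>

// Toy sketch: n_past counts prompt tokens that have been decoded;
// n_past_se tracks the next KV position after group-attention compression.
int main() {
    const int ga_n = 4, ga_w = 1024;
    const int n_prompt = 4830, n_batch = 512;

    int n_past    = 0;  // total prompt tokens processed
    int n_past_se = 0;  // next position in the compressed KV cache
    int ga_i      = 0;

    for (int done = 0; done < n_prompt; done += n_batch) {
        const int n = std::min(n_batch, n_prompt - done);

        n_past    += n;  // tokens consumed from the prompt
        n_past_se += n;  // positions written into the cache

        // the compression pass shrinks only the position counter
        while (n_past_se >= ga_i + ga_w) {
            const int bd = (ga_w / ga_n) * (ga_n - 1);
            n_past_se -= bd;
            ga_i      += ga_w / ga_n;
        }
    }

    // with these numbers: n_past = 4830, n_past_se = 1758
    printf("n_past = %d, n_past_se = %d\n", n_past, n_past_se);
    return 0;
}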

@ggerganov changed the title from "server : fix context shift + simplify self-extend" to "server : fix context shift" on Jan 30, 2024
@ggerganov
Member Author

I've reverted my changes to the self-extend for now, as it is difficult to test. But after we merge this PR to fix the context shift, I think we have to revisit and simplify the self-extend implementation. It's confusing to keep 2 values for n_past, so we should try to improve this.

@Maximilian-Winter
Contributor

Maximilian-Winter commented Jan 30, 2024

@ggerganov I don't know if it is enough, but calling it something like current_token_pos or token_pos_se might help?

@ggerganov
Member Author

That's one option, but I think there might be a way to do it with only n_past. Not sure though - need to look more into this. Let me know if you get any ideas.

@ggerganov merged commit e6f291d into master on Jan 30, 2024
@ggerganov deleted the gg/server-fix-shift branch on January 30, 2024 at 18:17
@Green-Sky
Collaborator

btw, with both caching and self-extend enabled, it does the log spam again:

update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 256
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 128
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 64
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 32
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 16
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 8
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 4
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 2
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 1
update_slots : failed to decode the batch, n_batch = 1, ret = 1

But at least each of them on its own seems to be working fine.

@Green-Sky
Collaborator

Green-Sky commented Feb 1, 2024

It is still broken, but it's way rarer now.

[1706811043] slot 0 is processing [task id: 10817]
[1706811043] slot 0 : in cache: 4063 tokens | to process: 1 tokens
[1706811043] slot 0 : kv cache rm - [4063, end)
[1706811043] sampled token:   688: 'In'
[1706811043] sampled token:  2855: 'struct'
[1706811043] sampled token: 33355: 'Bot'
[1706811043] sampled token:    27: ':'
[1706811043] Resampling because token 27: ':' does not meet grammar rules
[1706811043] sampled token:     0: ''
[1706811043] 
[1706811043] print_timings: prompt eval time =      36.89 ms /     1 tokens (   36.89 ms per token,    27.11 tokens per second)
[1706811043] print_timings:        eval time =      91.48 ms /     4 runs   (   22.87 ms per token,    43.72 tokens per second)
[1706811043] print_timings:       total time =     128.37 ms
[1706811043] slot 0 released (4068 tokens in cache)
[1706811043] slot 0 is processing [task id: 10823]
[1706811043] slot 0 : in cache: 4067 tokens | to process: 1 tokens
[1706811043] slot 0 : kv cache rm - [4067, end)
[1706811043] sampled token: 15699: ' Overall'
[1706811043] sampled token:    13: ','
[1706811043] sampled token:   309: ' I'
[1706811043] sampled token:  4035: ' continue'
[1706811043] sampled token:   281: ' to'
[1706811043] sampled token:  3037: ' learn'
[1706811043] sampled token:   285: ' and'
[1706811043] sampled token:  3157: ' improve'
[1706811043] sampled token:   619: ' my'
[1706811043] sampled token:  4685: ' understanding'
[1706811043] sampled token:   273: ' of'
[1706811043] sampled token:  3448: ' language'
[1706811043] sampled token:   285: ' and'
[1706811043] sampled token:  4466: ' culture'
[1706811043] sampled token:   281: ' to'
[1706811043] sampled token:  2085: ' provide'
[1706811043] sampled token:   253: ' the'
[1706811043] sampled token:  1682: ' best'
[1706811043] sampled token:  1896: ' possible'
[1706811043] sampled token:  6128: ' responses'
[1706811043] sampled token:   323: ' for'
[1706811043] update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 256
[1706811043] update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 128
[1706811043] update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 64
[1706811043] update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 32
[1706811043] update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 16
[1706811043] update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 8
[1706811043] update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 4
[1706811043] update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 2
[1706811043] update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 1
[1706811043] update_slots : failed to decode the batch, n_batch = 1, ret = 1
[1706811043] update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 256

(from the 6 GB log file)

$ result/bin/llama-server -ngl 99 -m models/stablelm-zephyr-3b.Q8_0.gguf -c 0

jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Feb 3, 2024
* server : fix context shift + simplify self-extend

* server : take system_tokens into account

* server : more n_past fixes

* server : rever n_past_se changes
@nelsonhurstdev

Is there a new PR for this? I am getting the same spam messages for the KV cache. It seems to happen as soon as it reaches the context limit. Unless I am using it wrong.

Is this the correct format?

/server.exe -ngl 99 --host localhost --port 5000 -c 8192 --grp-attn-n 4 --grp-attn-w 2048 -m D:/AI/LLM/Mixtral-8x7B-Instruct-v0.1.IQ3_XXS.gguf

@Green-Sky
Collaborator

I saw #5420, but did not test it yet. Or maybe it's the one making it worse (shrug).

@ristew
Contributor

ristew commented Feb 11, 2024

It's possible it broke group attention; I was only testing without it.

hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
* server : fix context shift + simplify self-extend

* server : take system_tokens into account

* server : more n_past fixes

* server : rever n_past_se changes