server : fix context shift #5195

Merged
ggerganov merged 4 commits into master from gg/server-fix-shift on Jan 30, 2024

Conversation

ggerganov
Member

This probably fixes the context shift functionality.

Also, this is a continuation of #5104 - I'm not sure the n_past_se added there is really necessary, so I removed it. Needs testing.

@Green-Sky
Collaborator

Green-Sky commented Jan 29, 2024

slot 0 is processing [task id: 283]
slot 0 : in cache: 264 tokens | to process: 0 tokens
slot 0 : kv cache rm - [264, end)
slot 0 : we have to evaluate at least 1 token to generate logits

print_timings: prompt eval time =      57.36 ms /     0 tokens (     inf ms per token,     0.00 tokens per second)
print_timings:        eval time =     106.73 ms /     5 runs   (   21.35 ms per token,    46.84 tokens per second)
print_timings:       total time =     164.10 ms
slot 0 released (269 tokens in cache)
{"timestamp":1706538015,"level":"INFO","function":"log_server_request","line":2356,"message":"request","remote_addr":"127.0.0.1","remote_port":53316,"status":200,"method":"POST","path":"/completion","params":{}}
slot 0 is processing [task id: 290]
slot 0 : in cache: 268 tokens | to process: 1 tokens
slot 0 : kv cache rm - [268, end)
slot 0: context shift - n_keep = 0, n_left = 511, n_discard = 255
slot 0: context shift - n_keep = 0, n_left = 511, n_discard = 255
slot 0: context shift - n_keep = 0, n_left = 511, n_discard = 255

print_timings: prompt eval time =      37.80 ms /     1 tokens (   37.80 ms per token,    26.45 tokens per second)
print_timings:        eval time =   19156.52 ms /  1000 runs   (   19.16 ms per token,    52.20 tokens per second)
print_timings:       total time =   19194.32 ms
slot 0 released (504 tokens in cache)
{"timestamp":1706538034,"level":"INFO","function":"log_server_request","line":2356,"message":"request","remote_addr":"127.0.0.1","remote_port":53326,"status":200,"method":"POST","path":"/completion","params":{}}
slot 0 is processing [task id: 1292]
slot 0 : in cache: 0 tokens | to process: 393 tokens
slot 0 : kv cache rm - [0, end)

print_timings: prompt eval time =     238.00 ms /   393 tokens (    0.61 ms per token,  1651.24 tokens per second)
print_timings:        eval time =     189.38 ms /     9 runs   (   21.04 ms per token,    47.52 tokens per second)
print_timings:       total time =     427.38 ms
slot 0 released (402 tokens in cache)

Not sure it's correct; will test more later.

edit: I forgot to mention the important thing: it does fix the previously observed hang!
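
For reference, the n_keep / n_left / n_discard values in these logs follow the context-shift arithmetic sketched below. This is a standalone reconstruction from the logged numbers (n_ctx = 512, n_keep = 0 in this run), not the exact server.cpp code; the same formula also matches the later 4096-context run (n_left = 4094, n_discard = 2047).

#include <cstdio>

// Standalone sketch of the context-shift arithmetic seen in the log above,
// reconstructed from the logged values; not the exact server.cpp code.
int main() {
    const int n_ctx  = 512; // slot context size in this run
    const int n_keep = 0;   // tokens at the start that are never discarded

    int n_past = n_ctx;     // the cache is full, so a context shift is triggered

    const int n_left    = n_past - n_keep - 1; // tokens eligible for shifting
    const int n_discard = n_left / 2;          // drop roughly half of them

    printf("context shift - n_keep = %d, n_left = %d, n_discard = %d\n",
           n_keep, n_left, n_discard);
    // prints: context shift - n_keep = 0, n_left = 511, n_discard = 255

    // conceptually, the server then removes KV entries
    // [n_keep + 1, n_keep + 1 + n_discard) and shifts the rest down by n_discard
    n_past -= n_discard;
    printf("n_past after shift = %d\n", n_past); // 257

    return 0;
}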

@Green-Sky (Collaborator) left a comment

style: not your fault, but it uses a mixture of x-- and x += 1...

@Green-Sky
Collaborator

Green-Sky commented Jan 29, 2024

Decided to test self-extend, and it broke.

Looks different, but still OK:

Available slots:
 -> Slot 0 - max context: 8192
 -> Slot 0 - self-extend: ga_n = 4, ga_w = 1024
{"timestamp":1706545764,"level":"INFO","function":"main","line":2543,"message":"model loaded"}
all slots are idle and system prompt is empty, clear the KV cache
{"timestamp":1706545776,"level":"INFO","function":"log_server_request","line":2356,"message":"request","remote_addr":"127.0.0.1","remote_port":41762,"status":200,"method":"GET","path":"/health","params":{}}
slot 0 is processing [task id: 0]
slot 0 : in cache: 0 tokens | to process: 4830 tokens
slot 0 : kv cache rm - [0, end)

shift: [     0,   4830] +      0 -> [     0,   4830]
div:   [     0,   1024] /      4 -> [     0,    256]
shift: [  1024,   4830] +   -768 -> [   256,   4062]

n_past_old = 4830, n_past = 4062, ga_i = 256


shift: [   256,   4062] +    768 -> [  1024,   4830]
div:   [  1024,   2048] /      4 -> [   256,    512]
shift: [  2048,   4830] +  -1536 -> [   512,   3294]

n_past_old = 4062, n_past = 3294, ga_i = 512


shift: [   512,   3294] +   1536 -> [  2048,   4830]
div:   [  2048,   3072] /      4 -> [   512,    768]
shift: [  3072,   4830] +  -2304 -> [   768,   2526]

n_past_old = 3294, n_past = 2526, ga_i = 768


shift: [   768,   2526] +   2304 -> [  3072,   4830]
div:   [  3072,   4096] /      4 -> [   768,   1024]
shift: [  4096,   4830] +  -3072 -> [  1024,   1758]

n_past_old = 2526, n_past = 1758, ga_i = 1024


shift: [  1024,   2270] +   3072 -> [  4096,   5342]
div:   [  4096,   5120] /      4 -> [  1024,   1280]
shift: [  5120,   5342] +  -3840 -> [  1280,   1502]

n_past_old = 2270, n_past = 1502, ga_i = 1280


shift: [  1280,   2526] +   3840 -> [  5120,   6366]
div:   [  5120,   6144] /      4 -> [  1280,   1536]
shift: [  6144,   6366] +  -4608 -> [  1536,   1758]

n_past_old = 2526, n_past = 1758, ga_i = 1536


shift: [  1536,   2782] +   4608 -> [  6144,   7390]
div:   [  6144,   7168] /      4 -> [  1536,   1792]
shift: [  7168,   7390] +  -5376 -> [  1792,   2014]

n_past_old = 2782, n_past = 2014, ga_i = 1792


shift: [  1792,   3038] +   5376 -> [  7168,   8414]
div:   [  7168,   8192] /      4 -> [  1792,   2048]
shift: [  8192,   8414] +  -6144 -> [  2048,   2270]

n_past_old = 3038, n_past = 2270, ga_i = 2048


shift: [  2048,   3294] +   6144 -> [  8192,   9438]
div:   [  8192,   9216] /      4 -> [  2048,   2304]
shift: [  9216,   9438] +  -6912 -> [  2304,   2526]

n_past_old = 3294, n_past = 2526, ga_i = 2304


shift: [  2304,   3550] +   6912 -> [  9216,  10462]
div:   [  9216,  10240] /      4 -> [  2304,   2560]
shift: [ 10240,  10462] +  -7680 -> [  2560,   2782]

n_past_old = 3550, n_past = 2782, ga_i = 2560


shift: [  2560,   3806] +   7680 -> [ 10240,  11486]
div:   [ 10240,  11264] /      4 -> [  2560,   2816]
shift: [ 11264,  11486] +  -8448 -> [  2816,   3038]

n_past_old = 3806, n_past = 3038, ga_i = 2816


shift: [  2816,   4062] +   8448 -> [ 11264,  12510]
div:   [ 11264,  12288] /      4 -> [  2816,   3072]
shift: [ 12288,  12510] +  -9216 -> [  3072,   3294]

n_past_old = 4062, n_past = 3294, ga_i = 3072

print_timings: prompt eval time =    6495.03 ms /  4830 tokens (    1.34 ms per token,   743.65 tokens per second)
print_timings:        eval time =     203.73 ms /     5 runs   (   40.75 ms per token,    24.54 tokens per second)
print_timings:       total time =    6698.76 ms
slot 0 released (4835 tokens in cache)

Then the next completion. Uh, big spam:

...............
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 256
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 128
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 64
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 32
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 16
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 8
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 4
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 2
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 1
update_slots : failed to decode the batch, n_batch = 1, ret = 1
................
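
For reference, the shift/div lines in the self-extend log above follow the group-attention arithmetic sketched below: a standalone reconstruction that reproduces the first four blocks of that log (ga_n = 4, ga_w = 1024, 4830 prompt tokens), with the actual KV-cache calls replaced by prints. Treat it as a sketch of the interval math, not as the exact server code; the later blocks in the log interleave with further prompt batches that advance n_past in between, so they are not reproduced here.

#include <cstdio>

// Standalone sketch of the self-extend (group attention) interval arithmetic.
int main() {
    const int ga_n = 4;
    const int ga_w = 1024;

    int ga_i   = 0;
    int n_past = 4830;

    while (n_past >= ga_i + ga_w) {
        const int ib = (ga_n * ga_i) / ga_w;           // groups already compressed
        const int bd = (ga_w / ga_n) * (ga_n - 1);     // positions saved per pass (768 here)
        const int dd = (ga_w / ga_n) - ib * bd - ga_w; // shift applied to the tail

        printf("shift: [%6d, %6d] + %6d\n", ga_i, n_past, ib * bd);
        printf("div:   [%6d, %6d] / %6d\n", ga_i + ib * bd, ga_i + ib * bd + ga_w, ga_n);
        printf("shift: [%6d, %6d] + %6d\n", ga_i + ib * bd + ga_w, n_past + ib * bd, dd);

        const int n_past_old = n_past;
        n_past -= bd;
        ga_i   += ga_w / ga_n;

        printf("n_past_old = %d, n_past = %d, ga_i = %d\n\n", n_past_old, n_past, ga_i);
    }

    return 0;
}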

@ggerganov force-pushed the gg/server-fix-shift branch from 5f62e23 to d0e10bf on January 30, 2024 at 11:57
@ggerganov
Member Author

There were some major issues still - I think I fixed those, but I'm still not sure if self-extend works now. Please let me know if you give it another try.

@Green-Sky
Copy link
Collaborator

Green-Sky commented Jan 30, 2024

normal operation seems to be working now.

slot 0 is processing [task id: 12676]
slot 0 : in cache: 4093 tokens | to process: 0 tokens
slot 0 : kv cache rm - [4093, end)
slot 0 : we have to evaluate at least 1 token to generate logits
slot 0: context shift - n_keep = 0, n_left = 4094, n_discard = 2047
slot 0 is processing [task id: 15233]
slot 0 : in cache: 4095 tokens | to process: 0 tokens
slot 0 : kv cache rm - [4095, end)
slot 0 : we have to evaluate at least 1 token to generate logits
slot 0: context shift - n_keep = 0, n_left = 4094, n_discard = 2047

There were some major issues still

Yeah, I forgot to mention that the actual predictions returned were garbled, even without getting close to a full cache/context.

@Green-Sky
Collaborator

Self-extend does not work. Either my settings are way off (not 100% sure about those values), or it's just broken.

Available slots:
 -> Slot 0 - max context: 8192
 -> Slot 0 - self-extend: ga_n = 4, ga_w = 1024
{"timestamp":1706632770,"level":"INFO","function":"main","line":2541,"message":"model loaded"}
all slots are idle and system prompt is empty, clear the KV cache
{"timestamp":1706632840,"level":"INFO","function":"log_server_request","line":2354,"message":"request","remote_addr":"127.0.0.1","remote_port":55592,"status":200,"method":"GET","path":"/health","params":{}}
slot 0 is processing [task id: 0]
slot 0 : in cache: 0 tokens | to process: 4643 tokens
slot 0 : kv cache rm - [0, end)
slot 0 : applied self-extend to prompt: 0 tokens
Segmentation fault (core dumped)

@Maximilian-Winter

@Green-Sky
Collaborator

Update: running without self-extend now works properly. I had my bot spam for hours with >20k server tasks and it was still fine (besides the obvious self-deterioration when it keeps talking to itself).

...
InstructBot: (disappear)
InstructBot: (chuckles)
InstructBot: (voice) Goodbye and have a great day, everyone!
InstructBot: (disappear)
InstructBot: (chuckles)
InstructBot: (voice) If you need any assistance or have any questions, don't hesitate to ask InstructBot or myself.
InstructBot: (smiles)
InstructBot: (disappear)
InstructBot: (chuckles)
...

@Maximilian-Winter
Contributor

@Green-Sky Are you talking about my code or the code of this PR?

@Green-Sky
Collaborator

Green-Sky commented Jan 30, 2024

@Green-Sky Are you talking about my code or the code of this PR?

This PR with caching enabled. But since master with caching is broken too...

edit: I mostly tagged you to get your eyes on this :)

@Maximilian-Winter
Contributor

n_past_se was needed because otherwise self-extend won't work: all of the prompt tokens are added to the batch at once, and n_past is the total token count.
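
A toy sketch of that point (illustrative only, not actual server.cpp code; ga_n, ga_w and the prompt size are taken from the logs above, and n_batch = 512 is an assumption): when the prompt is queued batch by batch and the group-attention pass compresses the cache, the position counter shrinks while the count of processed tokens does not, so a single n_past cannot serve both roles.

#include <algorithm>
#include <cstdio>

// Toy sketch: n_past counts prompt tokens that have been decoded;
// n_past_se tracks the next KV position after group-attention compression.
int main() {
    const int ga_n = 4, ga_w = 1024;
    const int n_prompt = 4830, n_batch = 512;

    int n_past    = 0;  // total prompt tokens processed
    int n_past_se = 0;  // next position in the compressed KV cache
    int ga_i      = 0;

    for (int done = 0; done < n_prompt; done += n_batch) {
        const int n = std::min(n_batch, n_prompt - done);

        n_past    += n;  // tokens consumed from the prompt
        n_past_se += n;  // positions written into the cache

        // the compression pass shrinks only the position counter
        while (n_past_se >= ga_i + ga_w) {
            const int bd = (ga_w / ga_n) * (ga_n - 1);
            n_past_se -= bd;
            ga_i      += ga_w / ga_n;
        }
    }

    // with these numbers: n_past = 4830, n_past_se = 1758
    printf("n_past = %d, n_past_se = %d\n", n_past, n_past_se);
    return 0;
}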

@ggerganov changed the title from "server : fix context shift + simplify self-extend" to "server : fix context shift" on Jan 30, 2024
@ggerganov
Member Author

I've reverted my changes to the self-extend for now, as it is difficult to test. But after we merge this PR to fix the context shift, I think we have to revisit and simplify the self-extend implementation. It's confusing to keep 2 values for n_past, so we should try to improve this.

@Maximilian-Winter
Contributor

Maximilian-Winter commented Jan 30, 2024

@ggerganov I don't know if it is enough, but calling it something like current_token_pos or token_pos_se might help?

@ggerganov
Member Author

That's one option, but I think there might be a way to do it with only n_past. Not sure though - need to look more into this. Let me know if you get any ideas.

@ggerganov merged commit e6f291d into master on Jan 30, 2024
@ggerganov deleted the gg/server-fix-shift branch on January 30, 2024 at 18:17
@Green-Sky
Collaborator

btw, with both caching and self-extend enabled, it does the log spam again:

update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 256
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 128
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 64
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 32
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 16
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 8
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 4
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 2
update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 1
update_slots : failed to decode the batch, n_batch = 1, ret = 1

But at least each of them on its own seems to be working fine.

@Green-Sky
Collaborator

Green-Sky commented Feb 1, 2024

It is still broken, but it's way rarer now.

[1706811043] slot 0 is processing [task id: 10817]
[1706811043] slot 0 : in cache: 4063 tokens | to process: 1 tokens
[1706811043] slot 0 : kv cache rm - [4063, end)
[1706811043] sampled token:   688: 'In'
[1706811043] sampled token:  2855: 'struct'
[1706811043] sampled token: 33355: 'Bot'
[1706811043] sampled token:    27: ':'
[1706811043] Resampling because token 27: ':' does not meet grammar rules
[1706811043] sampled token:     0: ''
[1706811043] 
[1706811043] print_timings: prompt eval time =      36.89 ms /     1 tokens (   36.89 ms per token,    27.11 tokens per second)
[1706811043] print_timings:        eval time =      91.48 ms /     4 runs   (   22.87 ms per token,    43.72 tokens per second)
[1706811043] print_timings:       total time =     128.37 ms
[1706811043] slot 0 released (4068 tokens in cache)
[1706811043] slot 0 is processing [task id: 10823]
[1706811043] slot 0 : in cache: 4067 tokens | to process: 1 tokens
[1706811043] slot 0 : kv cache rm - [4067, end)
[1706811043] sampled token: 15699: ' Overall'
[1706811043] sampled token:    13: ','
[1706811043] sampled token:   309: ' I'
[1706811043] sampled token:  4035: ' continue'
[1706811043] sampled token:   281: ' to'
[1706811043] sampled token:  3037: ' learn'
[1706811043] sampled token:   285: ' and'
[1706811043] sampled token:  3157: ' improve'
[1706811043] sampled token:   619: ' my'
[1706811043] sampled token:  4685: ' understanding'
[1706811043] sampled token:   273: ' of'
[1706811043] sampled token:  3448: ' language'
[1706811043] sampled token:   285: ' and'
[1706811043] sampled token:  4466: ' culture'
[1706811043] sampled token:   281: ' to'
[1706811043] sampled token:  2085: ' provide'
[1706811043] sampled token:   253: ' the'
[1706811043] sampled token:  1682: ' best'
[1706811043] sampled token:  1896: ' possible'
[1706811043] sampled token:  6128: ' responses'
[1706811043] sampled token:   323: ' for'
[1706811043] update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 256
[1706811043] update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 128
[1706811043] update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 64
[1706811043] update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 32
[1706811043] update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 16
[1706811043] update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 8
[1706811043] update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 4
[1706811043] update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 2
[1706811043] update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 1
[1706811043] update_slots : failed to decode the batch, n_batch = 1, ret = 1
[1706811043] update_slots : failed to find free space in the KV cache, retrying with smaller n_batch = 256

(from the 6 GB log file)

$ result/bin/llama-server -ngl 99 -m models/stablelm-zephyr-3b.Q8_0.gguf -c 0

jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Feb 3, 2024
* server : fix context shift + simplify self-extend

* server : take system_tokens into account

* server : more n_past fixes

* server : rever n_past_se changes
@nelsonhurstdev

Is there a new PR for this? I am getting the same spam messages for the KV cache. It seems to happen as soon as it reaches the context limit. Unless I am using it wrong.

Is this the correct format?

/server.exe -ngl 99 --host localhost --port 5000 -c 8192 --grp-attn-n 4 --grp-attn-w 2048 -m D:/AI/LLM/Mixtral-8x7B-Instruct-v0.1.IQ3_XXS.gguf

@Green-Sky
Collaborator

I saw #5420, but did not test it yet. Or maybe it's the one making it worse (shrug).

@ristew
Contributor

ristew commented Feb 11, 2024

It's possible it broke group attention; I was only testing without it.

hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
* server : fix context shift + simplify self-extend

* server : take system_tokens into account

* server : more n_past fixes

* server : rever n_past_se changes