
Apply min_p to unsorted tokens #5115


Merged: 1 commit merged into ggml-org:master on Jan 28, 2024

Conversation

@JohannesGaessler (Collaborator) commented on Jan 24, 2024

Complementary PR to #5109, #5101, and #5085. The formula for min_p can be rewritten to operate on token logits rather than token probabilities. This allows min_p to be applied to unsorted tokens without having to sort them or apply softmax first. This PR implements just that.
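
For illustration, here is a minimal standalone sketch of the reformulation (my own example, not the PR's actual code, and ignoring details such as the min_keep safeguard): since p_i = exp(l_i) / Z and the largest probability p_max = exp(l_max) / Z share the same normalizer Z, the condition p_i >= min_p * p_max is equivalent to l_i >= l_max + log(min_p), which only requires a single pass over the unsorted logits to find l_max.

```cpp
// Sketch: min_p filtering in logit space, no softmax and no sort required.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Returns the indices of tokens whose probability would be >= min_p times
// the largest probability, computed directly from unsorted logits.
std::vector<size_t> min_p_filter_logits(const std::vector<float> & logits, float min_p) {
    float max_logit = -INFINITY;
    for (float l : logits) {
        max_logit = std::max(max_logit, l);
    }
    const float min_logit = max_logit + std::log(min_p); // threshold in logit space

    std::vector<size_t> kept;
    for (size_t i = 0; i < logits.size(); ++i) {
        if (logits[i] >= min_logit) {
            kept.push_back(i);
        }
    }
    return kept;
}

int main() {
    const std::vector<float> logits = {2.0f, 0.5f, -1.0f, 1.9f, -3.0f};
    for (size_t i : min_p_filter_logits(logits, 0.1f)) {
        std::printf("kept token %zu (logit %.2f)\n", i, logits[i]);
    }
    return 0;
}
```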

I am using the command `./main --model models/nvme/${model_name}-${quantization}.gguf -ngl 99 --ctx-size 4096 --ignore-eos --n-predict 256 --seed 1337 --min-p 0.1 --top-k <TOP_K_VALUE> --sampling-seq <SAMPLING_SEQ>` to judge performance (I take the sampling t/s reported by llama.cpp). I have a Ryzen 3700X and an RTX 3090, and I'm compiling with GCC 13.2.1 on Linux 6.6.7-4-MANJARO. I get the following performance:

| --top-k | t/s kfypmt master | t/s mkfypt master | t/s mkfypt PR | Speedup mkfypt |
|--------:|------------------:|------------------:|--------------:|---------------:|
| 100     | 3751              | 494               | 3691          | 7.47           |
| 200     | 3442              | 488               | 3669          | 7.52           |
| 500     | 2732              | 489               | 3705          | 7.58           |
| 1000    | 2082              | 496               | 3768          | 7.60           |
| 2000    | 1417              | 485               | 3766          | 7.76           |
| 4000    | 914               | 498               | 3714          | 7.46           |
| 8000    | 584               | 489               | 3736          | 7.64           |
| 16000   | 391               | 490               | 3741          | 7.63           |
| 31780   | 312               | 491               | 3709          | 7.55           |
| 32000   | 506               | 492               | 3725          | 7.57           |

Running min_p first gives roughly constant performance both on master and with this PR. On master, min_p first always forces the full 32000 tokens to be sorted, while with this PR almost no tokens need to be sorted. Notably, running min_p first changes the results, because top_p then considers the probabilities of only the tokens that survive the min_p filter.

@ikawrakow (Contributor) left a comment

The original implementation sets the probabilities; this implementation does not. Are we confident that nothing anywhere else assumes the probabilities will be set after llama_sample_min_p?

@JohannesGaessler (Collaborator, Author)

Unless I'm misunderstanding something, the software design is that the sampling functions which need probabilities call llama_sample_softmax at the beginning to ensure that they are defined. In particular, when llama_sample_token is called after sampler_queue, there is a call to llama_sample_softmax.
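
To illustrate the contract I mean with a hypothetical mock (my own sketch, not llama.cpp's actual types or functions): a logit-space filter like the new min_p may leave the cached probabilities stale, and whatever finally needs probabilities, e.g. the token sampling step, recomputes softmax over the remaining candidates itself.

```cpp
// Hypothetical mock of the discussed contract (not llama.cpp's actual API):
// filters work on logits and may leave the p fields stale; the step that
// needs probabilities applies softmax over the remaining candidates itself.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

struct token_data {
    int   id;
    float logit;
    float p; // may be stale after a logit-space filter
};

// Recomputes probabilities over whatever candidates are left.
void apply_softmax(std::vector<token_data> & cands) {
    float max_logit = -INFINITY;
    for (const auto & c : cands) max_logit = std::max(max_logit, c.logit);
    float sum = 0.0f;
    for (auto & c : cands) { c.p = std::exp(c.logit - max_logit); sum += c.p; }
    for (auto & c : cands) c.p /= sum;
}

// Logit-space min_p filter: deliberately does not touch c.p.
void apply_min_p(std::vector<token_data> & cands, float min_p) {
    float max_logit = -INFINITY;
    for (const auto & c : cands) max_logit = std::max(max_logit, c.logit);
    const float min_logit = max_logit + std::log(min_p);
    std::vector<token_data> kept;
    for (const auto & c : cands) if (c.logit >= min_logit) kept.push_back(c);
    cands = kept;
}

int main() {
    std::vector<token_data> cands = {{0, 2.0f, 0.0f}, {1, 0.5f, 0.0f}, {2, -2.0f, 0.0f}};
    apply_min_p(cands, 0.1f); // probabilities are still stale here
    apply_softmax(cands);     // the consumer of probabilities refreshes them
    for (const auto & c : cands) std::printf("token %d: p = %.3f\n", c.id, c.p);
    return 0;
}
```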

@ikawrakow (Contributor)

> Unless I'm misunderstanding something, the software design is that the sampling functions which need probabilities call llama_sample_softmax at the beginning to ensure that they are defined. In particular, when llama_sample_token is called after sampler_queue, there is a call to llama_sample_softmax.

Yes, and because the original version of llama_sample_min_p calls llama_sample_softmax as the very first thing, someone somewhere could have assumed that they don't need to call it after having called llama_sample_min_p. No? I'm not saying this is the case, but I'm asking to make sure that it isn't.

@ggerganov (Member)

Please consider adding some tests for these sampling changes. We have some in tests/test-sampling.cpp, but they look like the bare minimum of what we need. It would make it easier to review this kind of change.

@ggerganov (Member)

I think it is okay to remove the softmax from min_p, since the API does not indicate that softmax would be applied when calling llama_sample_min_p. It would be an incorrect assumption if someone relied on having probabilities after llama_sample_min_p.

@JohannesGaessler (Collaborator, Author)

I rebased this PR onto master. The results for the tests added in #5147 do not change.

@JohannesGaessler merged commit 9241c3a into ggml-org:master on Jan 28, 2024
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Feb 3, 2024
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024