
Apply min_p to unsorted tokens #5115


Merged: 1 commit merged into ggml-org:master on Jan 28, 2024

Conversation

@JohannesGaessler (Collaborator) commented on Jan 24, 2024

Complementary PR to #5109, #5101, and #5085. The formula for min_p can be rewritten to operate on token logits rather than token probabilities. This allows min_p to be applied to unsorted tokens without having to sort them or apply softmax first. This PR implements just that.
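
For illustration, here is a minimal standalone sketch of the reformulation (my own example, not the PR's actual code, and ignoring details such as the min_keep safeguard): since p_i = exp(l_i) / Z and the largest probability p_max = exp(l_max) / Z share the same normalizer Z, the condition p_i >= min_p * p_max is equivalent to l_i >= l_max + log(min_p), which only requires a single pass over the unsorted logits to find l_max.

```cpp
// Sketch: min_p filtering in logit space, no softmax and no sort required.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Returns the indices of tokens whose probability would be >= min_p times
// the largest probability, computed directly from unsorted logits.
std::vector<size_t> min_p_filter_logits(const std::vector<float> & logits, float min_p) {
    float max_logit = -INFINITY;
    for (float l : logits) {
        max_logit = std::max(max_logit, l);
    }
    const float min_logit = max_logit + std::log(min_p); // threshold in logit space

    std::vector<size_t> kept;
    for (size_t i = 0; i < logits.size(); ++i) {
        if (logits[i] >= min_logit) {
            kept.push_back(i);
        }
    }
    return kept;
}

int main() {
    const std::vector<float> logits = {2.0f, 0.5f, -1.0f, 1.9f, -3.0f};
    for (size_t i : min_p_filter_logits(logits, 0.1f)) {
        std::printf("kept token %zu (logit %.2f)\n", i, logits[i]);
    }
    return 0;
}
```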

I am using the command `./main --model models/nvme/${model_name}-${quantization}.gguf -ngl 99 --ctx-size 4096 --ignore-eos --n-predict 256 --seed 1337 --min-p 0.1 --top-k <TOP_K_VALUE> --sampling-seq <SAMPLING_SEQ>` to judge performance (I take the sampling t/s reported by llama.cpp). I have a Ryzen 3700X and an RTX 3090, and I'm compiling with GCC 13.2.1 on Linux 6.6.7-4-MANJARO. I get the following performance:

| --top-k | t/s kfypmt master | t/s mkfypt master | t/s mkfypt PR | Speedup mkfypt |
|--------:|------------------:|------------------:|--------------:|---------------:|
| 100     | 3751              | 494               | 3691          | 7.47           |
| 200     | 3442              | 488               | 3669          | 7.52           |
| 500     | 2732              | 489               | 3705          | 7.58           |
| 1000    | 2082              | 496               | 3768          | 7.60           |
| 2000    | 1417              | 485               | 3766          | 7.76           |
| 4000    | 914               | 498               | 3714          | 7.46           |
| 8000    | 584               | 489               | 3736          | 7.64           |
| 16000   | 391               | 490               | 3741          | 7.63           |
| 31780   | 312               | 491               | 3709          | 7.55           |
| 32000   | 506               | 492               | 3725          | 7.57           |

Running min_p first gives roughly constant performance both on master and with this PR. On master, min_p first always forces the full 32000 tokens to be sorted, while with this PR almost no tokens need to be sorted. Notably, running min_p first changes the results, because top_p then considers the probabilities of only the tokens that survive the min_p filter.

@ikawrakow (Contributor) left a comment

The original implementation sets the probabilities; this implementation does not. Are we confident that nothing anywhere else assumes the probabilities will be set after llama_sample_min_p?

@JohannesGaessler (Collaborator, Author)

Unless I'm misunderstanding something, the software design is that the sampling functions which need probabilities call llama_sample_softmax at the beginning to ensure that they are defined. In particular, when llama_sample_token is called after sampler_queue, there is a call to llama_sample_softmax.
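
To illustrate the contract I mean with a hypothetical mock (my own sketch, not llama.cpp's actual types or functions): a logit-space filter like the new min_p may leave the cached probabilities stale, and whatever finally needs probabilities, e.g. the token sampling step, recomputes softmax over the remaining candidates itself.

```cpp
// Hypothetical mock of the discussed contract (not llama.cpp's actual API):
// filters work on logits and may leave the p fields stale; the step that
// needs probabilities applies softmax over the remaining candidates itself.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

struct token_data {
    int   id;
    float logit;
    float p; // may be stale after a logit-space filter
};

// Recomputes probabilities over whatever candidates are left.
void apply_softmax(std::vector<token_data> & cands) {
    float max_logit = -INFINITY;
    for (const auto & c : cands) max_logit = std::max(max_logit, c.logit);
    float sum = 0.0f;
    for (auto & c : cands) { c.p = std::exp(c.logit - max_logit); sum += c.p; }
    for (auto & c : cands) c.p /= sum;
}

// Logit-space min_p filter: deliberately does not touch c.p.
void apply_min_p(std::vector<token_data> & cands, float min_p) {
    float max_logit = -INFINITY;
    for (const auto & c : cands) max_logit = std::max(max_logit, c.logit);
    const float min_logit = max_logit + std::log(min_p);
    std::vector<token_data> kept;
    for (const auto & c : cands) if (c.logit >= min_logit) kept.push_back(c);
    cands = kept;
}

int main() {
    std::vector<token_data> cands = {{0, 2.0f, 0.0f}, {1, 0.5f, 0.0f}, {2, -2.0f, 0.0f}};
    apply_min_p(cands, 0.1f); // probabilities are still stale here
    apply_softmax(cands);     // the consumer of probabilities refreshes them
    for (const auto & c : cands) std::printf("token %d: p = %.3f\n", c.id, c.p);
    return 0;
}
```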

@ikawrakow (Contributor)

> Unless I'm misunderstanding something, the software design is that the sampling functions which need probabilities call llama_sample_softmax at the beginning to ensure that they are defined. In particular, when llama_sample_token is called after sampler_queue, there is a call to llama_sample_softmax.

Yes, and because the original version of llama_sample_min_p calls llama_sample_softmax as the very first thing, someone somewhere could have assumed that they don't need to call it after having called llama_sample_min_p. No? I'm not saying this is the case, but I'm asking to make sure that it isn't.

@ggerganov (Member)

Please consider adding some tests for these sampling changes. We have some in tests/test-sampling.cpp, but they look like the bare minimum of what we need. It would make it easier to review this kind of change.

@ggerganov (Member)

I think it is okay to remove the softmax from min_p, since the API does not indicate that softmax would be applied when calling llama_sample_min_p. It would be an incorrect assumption if someone relied on having probabilities after llama_sample_min_p.

@JohannesGaessler (Collaborator, Author)

I rebased this PR onto master. The results for the tests added in #5147 do not change.

@JohannesGaessler merged commit 9241c3a into ggml-org:master on Jan 28, 2024
jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Feb 3, 2024
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024