Apply min_p to unsorted tokens #5115
Conversation
The original implementation sets the probabilities; this implementation does not. Are we confident there isn't an assumption anywhere else that probabilities will be set after llama_sample_min_p?
Unless I'm misunderstanding something, the software design is that those sampling functions that need probabilities call …
Yes, and because the original version of …
Please consider adding some tests for these sampling changes - we have some in …
I think it is okay to remove the softmax from …
Force-pushed from f6ad32d to 3f1b793
I rebased this PR onto master. The results for the tests added in #5147 do not change.
Complementary PR to #5109, #5101, and #5085. The min_p formula can be rewritten to operate on token logits rather than token probabilities. This allows min_p to be applied to unsorted tokens, without needing to sort them or apply softmax first. This PR implements exactly that.
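For reference, the equivalence follows from the softmax definition: since p_i = exp(l_i) / Z with the same normalizer Z for every token, the min_p condition p_i >= min_p * p_max is equivalent to l_i >= l_max + log(min_p), so only the maximum logit is needed. Below is a minimal, self-contained sketch of that idea; the token struct and function name are invented for illustration and are not the actual llama.cpp API:

```cpp
// Sketch: min_p filtering in logit space, no sort and no softmax required.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <utility>
#include <vector>

struct token_data {
    int   id;
    float logit;
};

// Keep tokens whose probability is at least min_p times the top probability.
// Because p_i = exp(l_i)/Z with a shared normalizer Z, the condition
// p_i >= min_p * p_max is equivalent to l_i >= l_max + log(min_p).
static void min_p_unsorted(std::vector<token_data> & candidates, float min_p, size_t min_keep) {
    if (min_p <= 0.0f || candidates.empty()) {
        return;
    }

    // Single pass to find the maximum logit.
    float max_logit = -INFINITY;
    for (const auto & t : candidates) {
        max_logit = std::max(max_logit, t.logit);
    }
    const float min_logit = max_logit + std::log(min_p);

    // Single pass to collect the tokens above the threshold.
    std::vector<token_data> filtered;
    for (const auto & t : candidates) {
        if (t.logit >= min_logit) {
            filtered.push_back(t);
        }
    }

    // Only replace the candidate list if enough tokens survive the filter.
    if (filtered.size() >= min_keep) {
        candidates = std::move(filtered);
    }
}

int main() {
    std::vector<token_data> candidates = {
        {0, 2.0f}, {1, 1.5f}, {2, -1.0f}, {3, 0.5f},
    };
    min_p_unsorted(candidates, /*min_p=*/0.1f, /*min_keep=*/1);
    for (const auto & t : candidates) {
        std::printf("kept token %d (logit %.2f)\n", t.id, t.logit);
    }
}
```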
I am using the command
./main --model models/nvme/${model_name}-${quantization}.gguf -ngl 99 --ctx-size 4096 --ignore-eos --n-predict 256 --seed 1337 --min-p 0.1 --top-k <TOP_K_VALUE> --sampling-seq <SAMPLING_SEQ>
to judge performance (I take the sampling t/s reported by llama.cpp). I have a Ryzen 3700X and an RTX 3090, and I am compiling with GCC 13.2.1 on Linux 6.6.7-4-MANJARO.

I get the following performance: min_p first has roughly constant performance both on master and with this PR. On master, min_p first always forces the full 32000 tokens to be sorted, while with this PR almost no tokens need to be sorted. Notably, min_p first changes the results, because top_p then considers the probabilities of only the tokens that survived the min_p filter.
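To illustrate with made-up numbers: take probabilities 0.40, 0.25, 0.20, 0.10, 0.05, min_p 0.2 (threshold 0.2 * 0.40 = 0.08), and top_p 0.87. Applying top_p directly keeps four tokens (cumulative 0.40, 0.65, 0.85, 0.95). Applying min_p first removes the 0.05 token and renormalizes the rest to 0.421, 0.263, 0.211, 0.105, so top_p reaches 0.895 after three tokens and the 0.10 token is dropped.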