Input Temperature & Output Temperature #4267
Draft
The expected implementation of Temperature (as used by OpenAI's models and most inference backends) is to rescale the original distribution before anything else, so that truncation samplers such as Top P or Top K aren't strictly necessary. The current implementation in llama.cpp behaves differently: temperature is applied after the truncation samplers.
This can be confusing, because truncation changes the model's scores in a way that is very similar to lowering the temperature, except that it explicitly cuts out bad choices rather than scaling the model's confidence. That is not objectively a flawed approach; llama.cpp-style temperature lets you apply some randomization after the limited set of 'good' candidates has been selected. The problem is interpretability: we want an easy, well-understood way to control the model's output and to reason about what it is doing in response to sampler changes. The sketch below contrasts the two orderings.
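To make the difference concrete, here is a minimal sketch in plain NumPy (not llama.cpp's actual code); the logit values and thresholds are made up for illustration:

```python
# Illustrative sketch only -- not llama.cpp's implementation.
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    kept = order[: np.searchsorted(cum, p) + 1]
    out = np.zeros_like(probs)
    out[kept] = probs[kept]
    return out / out.sum()

logits = np.array([2.0, 1.6, 1.2, 0.4, -0.5])   # made-up scores for 5 tokens
temp, top_p = 1.8, 0.85

# Temperature before truncation (OpenAI-style ordering):
# a hot temperature flattens the distribution, so more tokens survive Top P.
before = top_p_filter(softmax(logits / temp), top_p)

# Temperature after truncation (current llama.cpp ordering):
# Top P picks candidates from the untempered distribution, so the candidate
# set never changes; temperature only redistributes mass among the survivors.
survivors = top_p_filter(softmax(logits), top_p)
masked = np.where(survivors > 0, logits, -np.inf)
after = softmax(masked / temp)

print(np.count_nonzero(before), before)   # 4 candidates survive
print(np.count_nonzero(after), after)     # only 3 candidates survive
```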
Basically, the current trend across FOSS LLM inference backends is to layer different samplers (and, in the case of Temperature, to choose the order of operations) haphazardly, without carefully considering how they interact. This produces a skewed and sometimes unnatural representation of what the model is actually predicting, and the same settings can mean different things across different backends.
I want to propose a new standard that is interpretable and distinguishes between the two Temperature variants: an Input Temperature that rescales the raw distribution before any truncation samplers (the OpenAI-style behavior), and an Output Temperature that rescales whatever candidates remain after truncation (the current llama.cpp behavior).
This would give users the freedom to pick either behavior, or to combine both; a sketch of what the chain could look like follows.
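As a rough illustration only (the parameter names `input_temp` and `output_temp` below are hypothetical, not existing llama.cpp options):

```python
# Hypothetical sketch of the proposed split; not an existing llama.cpp API.
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def min_p_filter(probs, min_p):
    """Drop tokens less than `min_p` times as probable as the top token."""
    out = np.where(probs >= min_p * probs.max(), probs, 0.0)
    return out / out.sum()

def sample(logits, input_temp=1.0, min_p=0.05, output_temp=1.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    logits = np.asarray(logits, dtype=float)

    probs = softmax(logits / input_temp)      # 1. reshape the raw distribution
    probs = min_p_filter(probs, min_p)        # 2. truncate (Min P used here)

    logp = np.full_like(probs, -np.inf)       # 3. rescale only the survivors
    logp[probs > 0] = np.log(probs[probs > 0])
    probs = softmax(logp / output_temp)

    return rng.choice(len(probs), p=probs)

# In this sketch, input_temp=0.7 with output_temp=1.0 roughly corresponds to
# the OpenAI-style behavior, while input_temp=1.0 with output_temp=0.7
# roughly corresponds to the current llama.cpp ordering.
token = sample([2.0, 1.6, 1.2, 0.4, -0.5], input_temp=0.7, min_p=0.05)
```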
In addition to this, Top P & Top K skew the model in ways that are also unnatural, as I've documented in the past. After a lot of research on sampler solutions, I think it makes sense to phase out their use (with the exception of Top K for debugging / deterministic tests) and to prefer Min P over them as the default, since Min P seems to be universally preferred by now.
Min P 0.1 seems to do well across most models; it only allows tokens that are at least 1/10th as probable as the highest-ranked choice, but it is a bit more deterministic. I'd say 0.05 is a solid all-rounder default, assuming an Input Temperature of 1.0, unless we want to make the defaults more 'safe'.
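For a concrete sense of what those thresholds mean, here is a tiny sketch with made-up probabilities (not taken from any real model):

```python
# Illustrative Min P cutoffs on a made-up 8-token distribution.
import numpy as np

probs = np.array([0.40, 0.24, 0.15, 0.09, 0.05, 0.03, 0.025, 0.015])

for min_p in (0.10, 0.05):
    cutoff = min_p * probs.max()      # 0.04 for Min P 0.1, 0.02 for Min P 0.05
    kept = probs[probs >= cutoff]
    print(f"Min P {min_p}: cutoff {cutoff:.3f}, keeps {kept.size} of {probs.size} tokens")
    # Min P 0.1 keeps 5 of 8 tokens here; Min P 0.05 keeps 7 of 8.
```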