
Input Temperature & Output Temperature #4267


Draft · kalomaze wants to merge 1 commit into master
Conversation

@kalomaze (Contributor) commented Nov 30, 2023

The expected implementation of Temperature (as used in OpenAI's models and most inference backends) is to modify the original distribution so that truncation samplers such as Top P or Top K aren't strictly necessary, but the current implementation in llama.cpp behaves differently: it applies temperature after the truncation samplers rather than before.

This can be confusing, because truncation changes the model's scores in a way that is very similar to lowering the temperature, except that it explicitly cuts out bad choices rather than scaling the model's confidence. That is not objectively a flawed approach; the llama.cpp-style temperature lets you apply some randomization after selecting your limited set of 'good' candidates. The problem is that we want interpretability in what the model is doing in response to sampler changes, so that there is an easy, well-understood way to control the model's output.

Basically, the current trend across FOSS LLM inference backends is to layer different samplers (and, in the case of Temperature, choose the order of operations) haphazardly, without carefully considering how they interact. This causes a skewed and sometimes unnatural representation of what the model is actually predicting, and the same settings can behave very differently across backends.

I want to propose a new standard that is interpretable and helps distinguish between the Temperature implementation variants:

  • An Input Temperature, which is run before any other sampler and reshapes the original distribution before any other changes are made; this is the 'classic' style that llama.cpp never implemented.
  • An Output Temperature, which comes last, after all the truncation samplers (such as Top K, Top P, etc.) have been run; the goal is to increase the diversity of choices by bringing the remaining token probabilities closer together. (A code sketch of this ordering follows the lists below.)

This would give users freedom because:

  • You can apply temperature after the model has selected a set of high-quality candidates (post-truncation samplers) to 'randomize' in a way that won't invite any 'low quality' token choices; instead it helps the model avoid overly predictable outputs while staying in the safe range.
  • You can apply temperature before the model has selected its list of candidates, changing the scale at which the model's scores are 'graded' in the first place and helping trim out the outliers.
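To make the proposed ordering concrete, here is a minimal C++ sketch of the two-stage pipeline. The function names and the bare `std::vector<float>` of logits are illustrative assumptions, not the llama.cpp sampling API, and the truncation step is left as a placeholder:

```cpp
#include <vector>

// Illustrative pipeline only; names are invented and this is not the
// llama.cpp sampling API.
static void apply_temperature(std::vector<float> & logits, float temp) {
    for (float & l : logits) {
        l /= temp; // temp < 1 sharpens the distribution, temp > 1 flattens it
    }
}

void sample_with_two_temperatures(std::vector<float> & logits,
                                  float temp_in, float temp_out) {
    // 1. Input temperature: reshape the raw distribution before anything else
    //    (the 'classic' behavior described above).
    apply_temperature(logits, temp_in);

    // 2. Truncation samplers (Min P, Top K, ...) run here on the reshaped
    //    scores, removing or zeroing out the rejected candidates.

    // 3. Output temperature: spread (or sharpen) only the surviving
    //    candidates, roughly what a last-in-chain temperature does today.
    apply_temperature(logits, temp_out);

    // 4. Softmax over what remains, then sample a token.
}
```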

In addition to this, Top P & Top K samplers skew the model in ways that are also unnatural, as I've documented in the past. After a lot of research on sampler solutions, I think it makes sense to phase out their use (with the exception of Top K for debugging / deterministic tests) and to stop using them as defaults in favor of Min P, which seems to be universally preferred by now.

Min P 0.1 seems to do well across most models: it only allows tokens that are at least 1/10th as probable as the highest-ranking choice, though that makes it a bit more deterministic. I'd say 0.05 is a solid all-around default, assuming 1.0 for the Input Temperature, unless we want to make the defaults more 'safe'.
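As a rough sketch of that rule (a standalone illustration, not the llama.cpp implementation), Min P keeps only the tokens whose probability is at least `min_p` times the top token's probability and renormalizes the rest:

```cpp
#include <algorithm>
#include <vector>

// Hedged sketch of the Min P rule described above: with min_p = 0.1, only
// tokens at least 1/10th as probable as the top token survive.
void min_p_filter(std::vector<float> & probs, float min_p) {
    const float p_max     = *std::max_element(probs.begin(), probs.end());
    const float threshold = min_p * p_max; // e.g. top token at 0.60 -> keep p >= 0.06

    float sum = 0.0f;
    for (float & p : probs) {
        if (p < threshold) p = 0.0f; // discard low-relative-probability candidates
        sum += p;
    }
    for (float & p : probs) {
        p /= sum; // renormalize the surviving candidates
    }
}
```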

@kalomaze marked this pull request as draft on November 30, 2023 at 13:23
@MaggotHATE (Contributor)
Sorry for barging in, but I've tried disabling top_p and top_k by default with your Noisy Sampling already added (and penalty_repeat disabled by default too), and tested several 7b models with temperature first and last. To test it properly, I set temperature to 2.0.

I'm getting great results in general; however, I've found that temperature first can make models hallucinate (neural-chat-7b-v3-1, for example, completely loses it), while they stay on topic and true to the instructions with temperature last. I didn't notice much repetition in dialogs despite penalty_repeat being turned off.

So the question is: do you plan on combining this with the Noisy variant of min_p, or should the temperatures work independently? Since this PR is focused on avoiding truncation, wouldn't replacing the repetition penalty in the future affect the result (or help, if included here)?

@crasm (Contributor) commented Dec 1, 2023

I think the root issue is that the samplers are hard-coded.

In my Dart implementation, I am able to pass in a list of Sampler objects, such as a MinP. The actual sampling just calls the sampling function of each user-provided sampler in turn.

This lets you arbitrarily order samplers, even using them multiple times. If you expose the sampler interface publicly, it's then possible for library users to write their own custom samplers without needing to add them to llama.cpp.

I think this approach could be copied over fairly easily to C++.
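Roughly, that kind of pluggable interface could look like this in C++ (an illustrative sketch with invented names, not an existing llama.cpp or Dart API):

```cpp
#include <memory>
#include <vector>

// Each sampler is an object with a single apply() step over the candidates.
struct Sampler {
    virtual ~Sampler() = default;
    virtual void apply(std::vector<float> & logits) = 0;
};

struct TemperatureSampler : Sampler {
    float temp;
    explicit TemperatureSampler(float t) : temp(t) {}
    void apply(std::vector<float> & logits) override {
        for (float & l : logits) l /= temp; // scale the scores by 1/temp
    }
};

// The caller decides the order (e.g. temperature -> min_p -> temperature) and
// can add custom samplers without modifying the library itself.
void run_chain(const std::vector<std::unique_ptr<Sampler>> & chain,
               std::vector<float> & logits) {
    for (const auto & s : chain) {
        s->apply(logits);
    }
}
```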

Btw min_p is awesome 👍

@MaggotHATE (Contributor)
After implementing configurable sampler order, I tested again with high temperature values, this time at both ends, similar to what you've proposed. I also turned off top_k and top_p, as suggested (--samplers temp;min_p;temp).

Even with a temperature of 1.5, a 7b model (neural-chat-7b-v3-2.Q4_K_S.gguf, for example) stays on topic and gives coherent output. The effect of applying the temperature sampler twice is quite noticeable: it definitely gives more interesting, longer responses, even to technical questions. I'm not sure how accurate the information in the responses is, though.

Having two separate values for temperature should help a lot, especially in cases when we want to experiment with extremely high values. I think this idea is very, very interesting.
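For reference, a semicolon-separated argument like `--samplers temp;min_p;temp` could be split into an ordered list of sampler names along these lines (a hypothetical sketch, not the actual parsing code from that change):

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split "temp;min_p;temp" into {"temp", "min_p", "temp"}; the resulting list
// would then be mapped onto the sampler chain in the given order.
std::vector<std::string> parse_sampler_order(const std::string & arg) {
    std::vector<std::string> order;
    std::stringstream ss(arg);
    std::string name;
    while (std::getline(ss, name, ';')) {
        if (!name.empty()) {
            order.push_back(name);
        }
    }
    return order;
}
```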
