Addition of DRY: A modern repetition penalty that reliably prevents looping #447


Closed

awtrisk opened this issue May 9, 2024 · 16 comments

awtrisk (Contributor) commented May 9, 2024

Would it be worth it to add DRY as an alternative to the traditional repetition penalty? Users have reported that it actually works, and the PR on the ooba repo itself seems to be solid. It also has a llama.cpp PR. There seem to be barely any downsides to it too.

If it seems good, I can make the PR and implement it here.

turboderp (Member):

As far as I can tell it's basically just an n-gram penalty, but without combining it with a beam search it doesn't really offer a way to discourage repetitions before they occur. I.e., the model is allowed to start down the path of a repetition, and it's only somewhere along that path that the penalty kicks in, at which point it's impossible to turn back.

So I'm not too sure about it. Are there any thorough comparisons to other methods like increased temperature, skew, frequency penalty, etc.?

awtrisk (Contributor, Author) commented May 10, 2024

AFAIK this wasn't meant to discourage repetition before it starts, but rather, once a pattern of repetition occurs, to quickly cull it by biasing against the repeated tokens. IMO this is better than the current ways we have of preventing repetition.

@p-e-w may be able to shed more light on things like comparisons, although I will be testing it with other samplers.

p-e-w (Contributor) commented May 12, 2024

DRY is indeed an n-gram/sequence penalty, but it works a little differently from no_repeat_ngram_size and other proposals I've seen. The differences can be summarized as follows (a code sketch of the core idea follows the list):

  • The penalty grows smoothly with the length of the repeated sequence, preventing garbage from being generated in situations where extending a repetition is mandated by the context and where no_repeat_ngram_size and its ilk just slam the door.
  • The penalty grows exponentially with the length of the repeated sequence, guaranteeing that the model's tendency to loop is eventually overcome. Many models, when presented with a partially repeated sequence, will overwhelmingly predict continuing the repetition, so slower-growing penalties can be insufficient.
  • The "sequence breakers" mechanism protects the structure of chat/instruction templates from being penalized, allowing much stronger penalties to be used without negative effects. I have extensively tested this in chat scenarios.
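Here is a minimal Python sketch of the core computation as described above. The parameter names (multiplier, base, allowed_length) follow the original PR; the quadratic scan and the omission of sequence breakers are simplifications for illustration, not the actual implementation:

```python
def dry_penalty(token_ids: list[int], multiplier: float = 0.8,
                base: float = 1.75,
                allowed_length: int = 2) -> dict[int, float]:
    """Sketch: compute DRY logit penalties for the next token.

    For each earlier position i, measure how long the text ending just
    before i matches the end of the current context. If the match is at
    least `allowed_length` tokens, penalize token_ids[i] (the token that
    would extend the repetition) by multiplier * base ** (match_len -
    allowed_length), so the penalty grows exponentially with length.
    """
    penalties: dict[int, float] = {}
    n = len(token_ids)
    for i in range(1, n):
        match_len = 0
        while (match_len < i and
               token_ids[i - 1 - match_len] == token_ids[n - 1 - match_len]):
            match_len += 1
        if match_len >= allowed_length:
            tok = token_ids[i]
            pen = multiplier * base ** (match_len - allowed_length)
            penalties[tok] = max(penalties.get(tok, 0.0), pen)
    return penalties
```

The returned penalties would be subtracted from the corresponding logits before sampling; sequence breakers would additionally stop the match from extending across template boundary tokens.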

Simply put, it works. I and others have been running DRY for over two months now, and it's such a massive improvement over traditional repetition penalties that I can't imagine going back. Looping is a scourge, and the existing penalties are a cure that's almost worse than the disease, being noticeably detrimental to output quality. DRY is far better than the three flavors of RepPen at actually preventing repetition, while leaving standard sentence structure completely unaffected.

All samplers are hacks by definition (we should be able to just use the distribution from the model as-is). DRY was developed not primarily from theoretical considerations, but guided by constant real-world experimentation. Having generated and examined probably in excess of 200k tokens in well over 100 contexts by now using DRY, I can confidently say that it works, and enables results that cannot be replicated using any combination of the widely available samplers of today.

yamosin commented May 15, 2024

Really looking forward to seeing it implemented on TabbyAPI

AgeOfAlgorithms:

bump

Vhallo commented Jun 10, 2024

The performance issues have since been solved thanks to belladoreai, so it might be worthwhile to integrate this now.

AgeOfAlgorithms commented Jun 15, 2024

I just wanted to bring this comment by @belladoreai here for everyone's convenience. It gives another good reason why no_repeat_ngram_size is unsuitable for stopping repetition. It was from their discussion with @p-e-w:

For what it's worth, I've done a lot of experimentation with no_repeat_ngram_size in the past and I can confirm it's fairly useless in a chat context. It might be useful in other contexts, especially in contexts where the input is relatively small. But when a chat message history grows, using no_repeat_ngram_size typically leads to situations where the model is intentionally writing broken English (like writing "engglish" instead of "english"), where the brokenness of the language just grows more and more absurd over time. This seems to happen because in many cases (especially with smaller models) the model perceives repetitive output to be extremely likely - so likely that even broken versions of the repetitive output appear more likely than some other alternative continuation of the text. So when we prevent the model from generating the exact same repetitive continuation to the text, it chooses to use a broken alternative version of the same repetitive text instead of choosing some more natural text.

I do not recommend using no_repeat_ngram_size except at very high values, if no other "circuit breaker" for repetition exists.
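To make the contrast concrete, the hard cutoff that no_repeat_ngram_size applies can be sketched as follows (a simplified illustration of the idea behind Hugging Face's NoRepeatNGramLogitsProcessor, not the library code itself):

```python
import math

def ban_repeated_ngrams(token_ids: list[int], logits: list[float],
                        n: int = 3) -> list[float]:
    """Hard-ban any token that would complete an n-gram already seen.

    Unlike a graded penalty, this is all-or-nothing: the exact
    continuation gets logit -inf, so a model that strongly "wants" the
    repetition is free to fall back on a near-duplicate spelling like
    "engglish" instead.
    """
    if len(token_ids) < n:
        return logits
    prefix = tuple(token_ids[-(n - 1):])
    for i in range(len(token_ids) - n + 1):
        if tuple(token_ids[i:i + n - 1]) == prefix:
            logits[token_ids[i + n - 1]] = -math.inf
    return logits
```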

Vibecoder9000:

What's the status on this? Sorry if I'm missing something on GitHub, but it just seems to have stalled. DRY is great, but moving from KoboldCPP to TabbyAPI leaves my models significantly dumber.

turboderp (Member):

What settings are you using for Kobold and ExLlama, respectively? And how are you defining dumber?

The short answer to your question is that it's been suggested someone PR it, I've agreed that it may be worth adding at some point (though I have a long, long list of other things to add as well so I'm not sure about the priorities), and I'm still waiting on concrete examples of what DRY achieves in practice, and how it does so without degrading the output.

kingbri1 (Collaborator) commented Sep 2, 2024

DRY is a sampler that's meant for breaking loops, so if your outputs are "dumber", I'd look into prompting, parameters, character cards, the model itself, etc. Those are more likely points for regressions to occur. DRY may have been masking that since a single sampler isn't a magic bullet.

I agree with turbo: DRY is on the timeline to be added in exl2 eventually, but our time is limited and there are a bunch of features outside of sampling to tackle.

If someone does make a PR (which is how every other backend added it), that will make it much easier to get the sampler in.

p-e-w (Contributor) commented Sep 2, 2024

DRY is a sampler that's meant for breaking loops, so if your outputs are "dumber", I'd look into prompting, parameters, character cards, the model itself, etc.

It's not that simple. If DRY is unavailable, users are often forced to enable standard presence/frequency repetition penalties to combat looping. And those "established" samplers absolutely do make models dumber. That's because they penalize tokens that form the backbone of standard language: articles, prepositions, punctuation, etc. In doing so, they can significantly distort the probability distribution predicted by the model, affecting output quality.
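For comparison, the standard presence/frequency penalty (a sketch of the OpenAI-style formula, not any particular backend's code) hits every token that has appeared, no matter how structurally necessary it is:

```python
def freq_presence_penalty(logits: list[float], counts: dict[int, int],
                          frequency_penalty: float = 0.5,
                          presence_penalty: float = 0.5) -> list[float]:
    """Sketch of the OpenAI-style penalty: logit -= count * freq + pres.

    counts maps token_id -> number of occurrences in the context.
    Backbone tokens like "the", ",", and "." rack up counts fastest,
    so their logits are pushed down hardest.
    """
    for tok, count in counts.items():
        logits[tok] -= count * frequency_penalty + presence_penalty
    return logits
```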

With the default parameter values, DRY only penalizes repeated sequences of 3 tokens or more. This leaves the distributions for the vast majority of token positions completely untouched, and prevents many of the issues caused by traditional penalties. Therefore, when substituting standard penalties with DRY, it is quite possible for a model to feel smarter.
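Using the dry_penalty sketch from earlier in the thread, this threshold behavior is easy to see (token IDs invented for illustration):

```python
# Context ends in "9 7", which previously appeared followed by 3.
# Generating 3 now would complete a repeated 3-token sequence (9 7 3),
# so only token 3 is penalized; everything else is untouched.
ctx = [9, 7, 3, 1, 9, 7]
print(dry_penalty(ctx))  # {3: 0.8} with the default parameters
```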

baronrabban:

But when a chat message history grows, using no_repeat_ngram_size typically leads to situations where the model is intentionally writing broken English (like writing "engglish" instead of "english"), where the brokenness of the language just grows more and more absurd over time. This seems to happen because in many cases (especially with smaller models) the model perceives repetitive output to be extremely likely - so likely that even broken versions of the repetitive output appear more likely than some other alternative continuation of the text.

I have experienced a version of this, but in my case it began concatenating words together. I also encountered a situation where it just started liberally inserting newlines every couple of words or so. I think the main thing is that it's not obvious where this behavior is coming from. Nothing says "I'm doing this crazy thing because you set DRY a few hours ago", so it can be confusing until you turn DRY off and it stops doing it.

As was stated in the quoted comment, the model really wants to write this text, and I think it's going to find a way no matter what restrictions you try to place on it.

turboderp (Member):

Concatenation is probably explained by the fact that a lot of tokens are duplicated in the vocabulary, with and without a leading space. So if the model tries to say United States of America but it isn't allowed to because it's already said it twice or whatever, it could easily reach a point where it's already sampled United States of and then, being suddenly barred from sampling " America" (with the leading space), it will choose "America" (without it) instead, since that's going to have a very similar embedding vector. Then the result is United States ofAmerica and nobody's happy.

You'd want to hope that a good model doesn't have " America" and "America" as its top two choices if the latter would break the language's grammar, but at the same time a good model needs to understand that the two tokens convey the same meaning otherwise, so that e.g. tokenizing a string like "America" produces tokens that encode "the word 'America', in quotes". When dealing with merges and questionable finetunes, you can't take it for granted that the model understands/retains those nuances, let alone under the influence of too many other sampling rules. And there's a vaguely defined noise floor from quantization that you have to take into account as well.
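This duplication is easy to verify for yourself; for example, assuming the transformers library and the GPT-2 tokenizer (exact token IDs and splits depend on the vocabulary):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
# Many BPE vocabularies carry both a leading-space and a bare variant
# of common words, so these two encodings typically differ:
print(tok.encode(" America"))  # leading-space variant
print(tok.encode("America"))   # bare variant (may even split into pieces)
```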

Vibecoder9000 commented Sep 2, 2024

  • Midnight Miqu 70B 2bpw, no DRY: perfect results. (screenshot)
  • Mistral Dory v2 12B 6bpw, no DRY: some repetition; nothing serious, but it will degrade further with continued use. (screenshot)
  • Gemma 2 2B 8bpw, no DRY: this model might just suck. (screenshot)
  • Gemma 2 2B q4ks, DRY with multiplier 1.3 and base 2: some issues with Markdown formatting. (screenshot)
  • Gemma 2 27B q4ks, same DRY settings: forced repetition in every response up to the last one. Rerolling the non-prefilled response changes "smiled" to something else about 1/3 of the time; the screenshot shows that 1/3 happening.

Overall, I think it's good, even if disabled by default. For some models and long chats, it can completely save the chat, as all of the issues detailed here become more intense over time.

Vibecoder9000:

How was this added without merging here? It's in the latest build with SillyTavern.

turboderp (Member):

Not sure I understand the question. DRY is implemented here: affdc0d...c1fed2e
