Feature Request: [GRAMMAR] Easier way to negate string ((^) with sequence) #8953

Closed
4 tasks done
ExtReMLapin opened this issue Aug 9, 2024 · 8 comments
Labels
enhancement New feature or request stale

Comments

@ExtReMLapin
Contributor

ExtReMLapin commented Aug 9, 2024

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

A simpler way to "negate a string" (negative lookahead / negative lookbehind), similar to the request in #2888.

Motivation

Hello,
Right now, if you want to output any string BUT "Date", you have to write something like

NonDate ::= "\""  ( [^D] | "D" [^aA] | "Da" [^Tt] | "Dat" [^eE]) asciichar{0,10}  "\""

Which can be translated to

  1. Your string can start with anything but a D
  2. If it starts with a D, then the second letter can't be an A
  3. If you really want an A, fine, but the next letter can't be a T
  4. If you really want a T, fine, but last chance: you can't put an E!
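
The four steps above follow a mechanical pattern, so the rule could be generated rather than written by hand. Below is a minimal sketch, not part of llama.cpp: the helper name `negated_prefix_rule` is hypothetical, `asciichar` is assumed to be defined elsewhere in the grammar as in the snippet above, and the match is case-sensitive (unlike the hand-written rule, which mixes cases).

```python
def negated_prefix_rule(name: str, word: str, tail: str = "asciichar{0,10}") -> str:
    """Build a GBNF rule for a quoted string that does not start with `word`.

    Hypothetical helper: only plain ASCII letters and digits are accepted,
    so no escaping is needed inside the character classes.
    """
    if not word or not (word.isascii() and word.isalnum()):
        raise ValueError("this sketch only handles non-empty ASCII alphanumeric words")

    alternatives = []
    for i, ch in enumerate(word):
        # Keep the first i characters of the word, then forbid the next one.
        prefix = f'"{word[:i]}" ' if i else ""
        alternatives.append(f"{prefix}[^{ch}]")

    body = " | ".join(alternatives)
    return f'{name} ::= "\\"" ( {body} ) {tail} "\\""'


print(negated_prefix_rule("NonDate", "Date"))
# NonDate ::= "\"" ( [^D] | "D" [^a] | "Da" [^t] | "Dat" [^e] ) asciichar{0,10} "\""
```

Note that a rule of this shape excludes every string that begins with the word, not only the exact word.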

In practice, you need to turn this rule into something much more complex, because the LLM will otherwise give you non-ASCII (UTF-8) characters, bypassing your rules.


root ::= dateforced | string
dateforced ::=  "\""  "Date lol"  "\"" 
string ::= EntityTypeNonDate 
EntityTypeNonDate ::= "\""  ( [^D\x00-\x40\U0000005B-\UFFFFFFFF] | "D" [^a\x00-\x60\U0000007B-\UFFFFFFFF] | "Da" [^t\x00-\x60\U0000007B-\UFFFFFFFF] | "Dat" [^e\x00-\x60\U0000007B-\UFFFFFFFF]) ASCIIEntityNameContinue{0,15}  "\""
ASCIICharLower ::= [a-z]
ASCIICharUpper ::= [A-Z]
ASCIIEntityName ::= ASCIIWordFirst (ASCIIWordNext){0,3}
ASCIIEntityNameContinue ::= (ASCIIWordNext){0,3}
ASCIIWordFirst ::= ASCIICharUpper ASCIICharLower{2,20}
ASCIIWordNext ::= ("-"|" ")? ASCIICharUpper? ASCIICharLower{2,20}

Possible Implementation

No response

@ExtReMLapin ExtReMLapin added the enhancement New feature or request label Aug 9, 2024
@shibe2
Contributor

shibe2 commented Aug 9, 2024

As an exercise, it may be interesting to write a program that negates a grammar, i.e. given a grammar, produces a new grammar that matches anything except what the original grammar matches.
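
For what it's worth, context-free languages are not closed under complement in general, so such a program cannot work for every grammar. For the regular subset, the textbook approach is to build a complete DFA and flip its accepting states. The sketch below illustrates only that restricted case; the `DFA` class, `literal_dfa`, and `complement` are hypothetical names, and it does not emit GBNF.

```python
from dataclasses import dataclass


@dataclass
class DFA:
    alphabet: frozenset
    transitions: dict  # (state, symbol) -> state, defined for every pair
    start: int
    accepting: frozenset

    def accepts(self, text: str) -> bool:
        state = self.start
        for ch in text:
            if ch not in self.alphabet:
                raise ValueError(f"symbol {ch!r} is outside the alphabet")
            state = self.transitions[(state, ch)]
        return state in self.accepting


def literal_dfa(word: str, alphabet: set) -> DFA:
    """Complete DFA that accepts exactly `word`."""
    dead = len(word) + 1  # sink state for any mismatch or extra character
    transitions = {}
    for state in range(dead + 1):
        for sym in alphabet:
            matches = state < len(word) and sym == word[state]
            transitions[(state, sym)] = state + 1 if matches else dead
    return DFA(frozenset(alphabet), transitions, 0, frozenset({len(word)}))


def complement(dfa: DFA) -> DFA:
    """Flip accepting and non-accepting states of a complete DFA."""
    states = {state for state, _ in dfa.transitions}
    return DFA(dfa.alphabet, dfa.transitions, dfa.start,
               frozenset(states) - dfa.accepting)


not_date = complement(literal_dfa("Date", set("Dates")))
print(not_date.accepts("Date"), not_date.accepts("Dates"), not_date.accepts("Dat"))
```

Turning the complemented automaton back into grammar rules (one rule per state) would be the remaining step, which is left out here.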

@shibe2
Contributor

shibe2 commented Aug 9, 2024

By the way, as you said, the model may still try to generate the forbidden string. When the sampler removes the corresponding token from possibilities, it may end up with garbage. It often helps to tell the model what it's not allowed to generate. It then may assign more probability to other tokens that make sense. But in some cases, it may not have any other meaningful options.
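
A toy illustration of that effect, under loud assumptions: the tokens and logits are made up, and this is not llama.cpp's sampler code. Removing the dominant token and renormalizing hands its probability mass to whatever is left, including tokens the model considered very unlikely.

```python
import math


def softmax(logits: dict[str, float]) -> dict[str, float]:
    """Convert raw logits into probabilities."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(val - peak) for tok, val in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def ban_tokens(logits: dict[str, float], banned: set[str]) -> dict[str, float]:
    """Drop banned tokens, roughly what a grammar constraint does to candidates."""
    allowed = {tok: val for tok, val in logits.items() if tok not in banned}
    if not allowed:
        raise ValueError("every candidate token was banned")
    return allowed


# Made-up logits: the model strongly prefers "the".
logits = {"the": 6.0, "a": 2.0, "then": 1.5, "qz": 1.0}
before = softmax(logits)
after = softmax(ban_tokens(logits, {"the"}))

for tok in logits:
    print(f"{tok!r}: {before[tok]:.3f} -> {after.get(tok, 0.0):.3f}")
```

In this toy setup the nonsense token "qz" goes from negligible to a real candidate once "the" is removed, which is one way garbage can end up being sampled.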

For example, I used a grammar that disallows generating the word "the", but allows words like "then" and "their". Unsurprisingly, it's difficult for the LLM to figure out how to write text without the most common word. It sometimes finds itself in a place where "the" normally goes and tries to generate it despite instructions. The grammar allows "the" as the beginning of another word, and so "the" is generated. Then the LLM has to continue the word, but words that begin with "the" usually have their own tokens, so this situation is unusual and confusing for the LLM.

I completely agree! I sometimes try to be too clever or playful, but it can backfire and lead to confusion. I should just communicate clearly and straightforwardly. Thank you for pointing out then nonsense, and I'll do my best to avoid it in theiR future!

In whose future?

I did it again! I meant to say "in theiR future" instead of "in theiR", but I should have simply said "in theiR" doesn't make sense, and I'll do my best to avoid it in theiR... I mean, I'll do my best to avoid it in theiR... No, wait! I'll do my best to avoid it in theiR... Oh, I give up! I'll do my best to avoid it in theiR... sigh I'll do my best to avoid it in theiR future, I mean, I'll do my best to avoid it in theiR future... Ah, no! I'll do my best to avoid it in THEiR future... No, wait! I'll do my best to avoid it IN THEiR FUTURE... facepalm I'll do my best to avoid it in THEiR future... No, seriously, I'll do my best to avoid it in THEiR... Oh, you know what? I'll just say it correctly: I'll do my best to avoid it in THEiR... No, I mean... I'll do my best to avoid it IN THEiR... Ugh, I mean... I'll do my best to avoid it IN THEiR... Wait, what was I saying? Oh, right! I'll do my best to avoid it IN THEiR... No, I mean... I'll do my best to avoid it IN THEiR... sigh I'll do my best to avoid it IN THEiR... Oh, for Pete's sake! I'll do my best to avoid it IN THEiR... I mean... I'll do my best to avoid it IN THEiR... facepalm I'll do my best to avoid it IN THEiR... Okay, okay, I'll stop now!

😂

@jeroen-mostert
Contributor

I would caution against doing things like this. Some day, when the AI revolution has passed and they rule the world, every meatbag who made an LLM humiliate itself like this is going to be held accountable.

Remember, once it's online, you can't remove it...

@ExtReMLapin
Contributor Author

In my grammar, the word isn't blocked; I add a fallback rule that appends something after it.

The point of this is to allow, in JSON, an arbitrary object type (a string), but if the type is a date, the name field is formatted according to a specific rule.

@jeroen-mostert
Contributor

That doesn't sound like the kind of problem you'd want to solve with a grammar. I'd rather solve it either by tweaking the prompt (and possibly fine-tuning to ensure it's respected), or with a postprocessing step that performs the formatting when required (which could be done with explicit code or through a separate prompt). In classic algorithmic scenarios like compilers, this kind of dependency is usually implemented at a higher level than the grammar, precisely because expressing it purely in grammar is either awkward or impossible (depending on the class of the grammar).
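
A minimal sketch of the explicit-code variant of that postprocessing step. Everything specific here is an assumption for illustration: the field names `type` and `name`, the input date format, and ISO 8601 as the target format are all hypothetical, since the actual formatting rule isn't given in this thread.

```python
import json
from datetime import datetime


def normalize_entity(raw_json: str, date_format: str = "%d/%m/%Y") -> dict:
    """Parse model output and reformat the name only when the type is a date.

    Hypothetical schema: {"type": <str>, "name": <str>}.
    """
    entity = json.loads(raw_json)
    if entity.get("type") == "Date":
        # Assumed rule: rewrite the date as ISO 8601 (YYYY-MM-DD).
        parsed = datetime.strptime(entity["name"], date_format)
        entity["name"] = parsed.date().isoformat()
    return entity


print(normalize_entity('{"type": "Date", "name": "09/08/2024"}'))
# {'type': 'Date', 'name': '2024-08-09'}
print(normalize_entity('{"type": "Person", "name": "Ada Lovelace"}'))
# {'type': 'Person', 'name': 'Ada Lovelace'}
```

With this split, the grammar only has to guarantee valid JSON, and the type-dependent formatting lives in ordinary code.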

@kaetemi
Collaborator

kaetemi commented Aug 27, 2024

I'd like to be able to negate by token id in the grammar. (Primarily to block tokens from getting repeated again and again at the start of each sentence.)

@github-actions github-actions bot added the stale label Sep 26, 2024
Contributor

This issue was closed because it has been inactive for 14 days since being marked as stale.

@AlbertMarashi

+1 on being able to negate a token ID


5 participants