Extra characters erroneously matched when using possessive quantifier with negative lookahead #100061
Comments
This should fix the bug.

--- a/Modules/_sre/sre_lib.h
+++ b/Modules/_sre/sre_lib.h
@@ -1333,7 +1333,7 @@ SRE(match)(SRE_STATE* state, const SRE_CODE* pattern, int toplevel)
state. */
MARK_POP(ctx->lastmark);
LASTMARK_RESTORE();
-
+ state->ptr = ptr;
/* We have sufficient matches, so exit loop. */
break;
}
Could you tell me whether it is valid to use a lookahead rather than a lookbehind (?<!) for backward checks, and if so, why?
Still, while looking into this issue, I found that moving (?!C) to the front still shows a similar problem, unlike Java, which gives
I also agree with @animalize's solution. Please review :)
IIRC, |
…fiers (GH-102612) Restore the global Input Stream pointer after trying to match a sub-pattern. Co-authored-by: Ma Lin <[email protected]>
…essive quantifiers (pythonGH-102612) Restore the global Input Stream pointer after trying to match a sub-pattern. Co-authored-by: Ma Lin <[email protected]>. (cherry picked from commit abd9cc5) Co-authored-by: SKO <[email protected]>
… quantifiers (GH-102612) (GH-108004) Restore the global Input Stream pointer after trying to match a sub-pattern. Co-authored-by: Ma Lin <[email protected]> (cherry picked from commit abd9cc5) Co-authored-by: SKO <[email protected]>
… quantifiers (GH-102612) (#108003) Restore the global Input Stream pointer after trying to match a sub-pattern. (cherry picked from commit abd9cc5) Co-authored-by: SKO <[email protected]>
Thank you for the fix! It works perfectly. To belatedly answer @uyw4687's question, as @animalize mentioned, I use a lookahead because the lookbehind in sre is fixed width only.
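As a small illustration of the fixed-width restriction mentioned above, the sketch below checks which assertions the re module accepts. The helper name `compiles` and the sample patterns are made up for this example; they are not taken from the thread.

```python
import re

def compiles(pattern):
    """Return True if the re module accepts the pattern, False if it raises re.error."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Hypothetical sample patterns, chosen only to show the restriction.
print(compiles(r"(?<!ab)c"))  # lookbehind with a fixed-width body
print(compiles(r"(?<!a+)c"))  # lookbehind with a variable-width body
print(compiles(r"(?!a+)c"))   # lookahead with a variable-width body
```

The variable-width lookbehind is rejected at compile time, while the equivalent lookahead is accepted, which is why a lookahead is the more flexible choice here.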
Bug report
Regular expressions that combine a possessive quantifier with a negative lookahead match extra, erroneous characters in version 2.2.1 of the re module shipped with Python 3.11. (Tested on Windows 10 with the official distribution of Python 3.11.0.)
For example, the following regular expression aims to match consecutive characters that are not 'C' in string 'ABC'. (There are simpler ways to do this, but this is just an example to illustrate the problem.)
Output:
The first subgroup of the match is the entire match, while the second subgroup is the last character that was matched. They should be 'AB' and 'B', respectively. While the last matched character is correctly identified as 'B', the complete match is erroneously set to 'ABC'.
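A minimal sketch of a reproduction follows. The pattern `(((?!C).)++)` is an assumption reconstructed from the description (outer group for the whole run, inner group for the last character); it is not necessarily the exact snippet from the report, and the function name is hypothetical.

```python
import re
import sys

def possessive_negative_lookahead(text):
    """Return (whole_match, last_char) for the assumed pattern, or None if unsupported."""
    if sys.version_info < (3, 11):
        return None  # possessive quantifiers were added in Python 3.11
    # Assumed pattern: a negative lookahead inside a possessively repeated group.
    match = re.search(r"(((?!C).)++)", text)
    return match.groups() if match else None

print(possessive_negative_lookahead("ABC"))
```

Per the report, an affected interpreter yields 'ABC' for the first group, whereas the intended result is 'AB' with 'B' as the second group.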
Replacing the negative lookahead with a positive lookahead eliminates the problem:
Output:
Alternatively, keeping the negative lookahead but replacing the possessive quantifier with a greedy quantifier also eliminates the problem:
Output:
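The two workarounds described above might look like the following sketch. Both patterns are assumptions built from the prose (a positive lookahead on `[^C]`, and a greedy `+` in place of `++`), not the reporter's exact code.

```python
import re
import sys

TEXT = "ABC"

# Workaround 1 (assumed form): possessive quantifier with a positive lookahead.
# Guarded because possessive quantifiers need Python 3.11+.
if sys.version_info >= (3, 11):
    positive = re.search(r"(((?=[^C]).)++)", TEXT).groups()
else:
    positive = None

# Workaround 2 (assumed form): negative lookahead with a greedy quantifier.
greedy = re.search(r"(((?!C).)+)", TEXT).groups()

print(positive)
print(greedy)
```

The greedy form runs on any Python 3 version, at the cost of allowing backtracking into the repeated group.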
While this example uses the ++ quantifier, the *+ and ?+ quantifiers exhibit similar behaviour. Also, using a longer pattern in the negative lookahead leads to even more characters being erroneously matched.
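To probe the variants mentioned here, one could try something like the sketch below. The `*+` pattern, the longer lookahead `(?!CD)`, and the helper `first_groups` are illustrative assumptions rather than the patterns the reporter used.

```python
import re
import sys

def first_groups(pattern, text):
    """Return the captured groups of the first match, or None when unsupported."""
    if sys.version_info < (3, 11):
        return None  # possessive quantifiers require Python 3.11+
    match = re.search(pattern, text)
    return match.groups() if match else None

# Assumed variants, not the reporter's original patterns.
star_variant = first_groups(r"(((?!C).)*+)", "ABC")
long_lookahead = first_groups(r"(((?!CD).)++)", "ABCD")

print(star_variant)
print(long_lookahead)
```

In both cases the intended first group is 'AB'; anything longer indicates that the interpreter is affected by the bug.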
Thank you for adding possessive quantifiers to the re module! It is a very useful feature!
Environment