Skip to content

Extra characters erroneously matched when using possessive quantifier with negative lookahead #100061

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
steve-guillouzic-gc opened this issue Dec 6, 2022 · 6 comments · Fixed by #108003
Assignees
Labels
topic-regex type-bug An unexpected behavior, bug, or error

Comments

@steve-guillouzic-gc
Copy link

steve-guillouzic-gc commented Dec 6, 2022

Bug report

Regular expressions that combine a possessive quantifier with a negative lookahead match extra erroneous characters in re module 2.2.1 of Python 3.11. (The test was run on Windows 10 using the official distribution of Python 3.11.0.)

For example, the following regular expression aims to match consecutive characters that are not 'C' in string 'ABC'. (There are simpler ways to do this, but this is just an example to illustrate the problem.)

import re

text = 'ABC'
print('Possessive quantifier, negative lookahead:', 
      re.findall('(((?!C).)++)', text))

Output:

Possessive quantifier, negative lookahead: [('ABC', 'B')]

The first subgroup of the match is the entire match, while the second subgroup is the last character that was matched. They should be 'AB' and 'B', respectively. While the last matched character is correctly identified as 'B', the complete match is erroneously set to 'ABC'.

Replacing the negative lookahead with a positive lookahead eliminates the problem:

print('Possessive quantifier, positive lookahead:',
      re.findall('(((?=[^C]).)++)', text))

Output:

Possessive quantifier, positive lookahead: [('AB', 'B')]

Alternately, keeping the negative lookahead but replacing the possessive quantifier with a greedy quantifier also eliminates the problem:

print('Greedy quantifier, negative lookahead:',
      re.findall('(((?!C).)+)', text))

Output:

Greedy quantifier, negative lookahead: [('AB', 'B')]

While this example uses the ++ quantifier, the *+ and ?+ quantifiers exhibit similar behaviour. Also, using a longer pattern in the negative lookahead leads to even more characters being erroneously matched.

Thank you for adding possessive quantifiers to the re module! It is a very useful feature!

Environment

  • re module 2.2.1 in standard library
  • CPython versions tested on: 3.11.0
  • Operating system and architecture: Windows 10

Linked PRs

@steve-guillouzic-gc steve-guillouzic-gc added the type-bug An unexpected behavior, bug, or error label Dec 6, 2022
@ghost
Copy link

ghost commented Dec 30, 2022

This should fix the bug.
I don't have time. Welcome others to investigate and submit a PR.

--- a/Modules/_sre/sre_lib.h
+++ b/Modules/_sre/sre_lib.h
@@ -1333,7 +1333,7 @@ SRE(match)(SRE_STATE* state, const SRE_CODE* pattern, int toplevel)
                        state. */
                     MARK_POP(ctx->lastmark);
                     LASTMARK_RESTORE();
-
+                    state->ptr = ptr;
                     /* We have sufficient matches, so exit loop. */
                     break;
                 }

@uyw4687
Copy link
Contributor

uyw4687 commented Mar 11, 2023

Can you kindly tell me why if it's valid to use lookahead not lookbehind(?<!) for backward checks?
I'm referring to https://www.regular-expressions.info/lookaround.html and I thought we might need to use lookbehind since it says Lookbehind has the same effect, but works backwards. It tells the regex engine to temporarily step backwards in the string.

@uyw4687
Copy link
Contributor

uyw4687 commented Mar 11, 2023

But still, while looking into this issue, moving (?!C) up front still have similar problem,
giving following result for the command re.match('((.(?!C))++)', 'ABC')
<re.Match object; span=(0, 3), match='ABC'>

unlike java giving A

uyw4687 added a commit to uyw4687/cpython that referenced this issue Mar 11, 2023
uyw4687 added a commit to uyw4687/cpython that referenced this issue Mar 11, 2023
uyw4687 added a commit to uyw4687/cpython that referenced this issue Mar 11, 2023
@uyw4687
Copy link
Contributor

uyw4687 commented Mar 11, 2023

I also agree with @animalize's solution. Please review :)

@ghost
Copy link

ghost commented Mar 12, 2023

Can you kindly tell me why if it's valid to use lookahead not lookbehind(?<!) for backward checks? I'm referring to https://www.regular-expressions.info/lookaround.html and I thought we might need to use lookbehind since it says Lookbehind has the same effect, but works backwards. It tells the regex engine to temporarily step backwards in the string.

IIRC, sre's lookbehind can only match fixed length pattern, so it can use lookahead with an offset to handle it.

uyw4687 added a commit to uyw4687/cpython that referenced this issue Mar 13, 2023
uyw4687 added a commit to uyw4687/cpython that referenced this issue Mar 13, 2023
uyw4687 added a commit to uyw4687/cpython that referenced this issue Mar 13, 2023
serhiy-storchaka added a commit to uyw4687/cpython that referenced this issue Aug 16, 2023
serhiy-storchaka pushed a commit that referenced this issue Aug 16, 2023
…fiers (GH-102612)

Restore the global Input Stream pointer after trying to match a sub-pattern.

Co-authored-by: Ma Lin <[email protected]>
serhiy-storchaka pushed a commit to serhiy-storchaka/cpython that referenced this issue Aug 16, 2023
…essive quantifiers (pythonGH-102612)

Restore the global Input Stream pointer after trying to match a sub-pattern.

Co-authored-by: Ma Lin <[email protected]>.
(cherry picked from commit abd9cc5)

Co-authored-by: SKO <[email protected]>
serhiy-storchaka pushed a commit to serhiy-storchaka/cpython that referenced this issue Aug 16, 2023
…essive quantifiers (pythonGH-102612)

Restore the global Input Stream pointer after trying to match a sub-pattern.

Co-authored-by: Ma Lin <[email protected]>.
(cherry picked from commit abd9cc5)

Co-authored-by: SKO <[email protected]>
serhiy-storchaka added a commit that referenced this issue Aug 16, 2023
… quantifiers (GH-102612) (GH-108004)

Restore the global Input Stream pointer after trying to match a sub-pattern.

Co-authored-by: Ma Lin <[email protected]>

(cherry picked from commit abd9cc5)
    
Co-authored-by: SKO <[email protected]>
Yhg1s pushed a commit that referenced this issue Aug 16, 2023
… quantifiers (GH-102612) (#108003)

Restore the global Input Stream pointer after trying to match a sub-pattern.

.
(cherry picked from commit abd9cc5)

Co-authored-by: SKO <[email protected]>
@steve-guillouzic-gc
Copy link
Author

Thank you for the fix! It works perfectly. To belatedly answer @uyw4687's question, as @animalize mentioned, I use a lookahead because the lookbehind in sre is fixed width only.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-regex type-bug An unexpected behavior, bug, or error
Projects
None yet
4 participants