gh-126505: Do not use Unicode case folding in ASCII regexes #126544
Closed
When a pattern is being compiled, `_compiler.py`'s `optimize_charset` translates the `RANGE` opcode into the `RANGE_UNI_IGNORE` opcode. This should be done only in regexes which set the Unicode flag; otherwise we get Unicode case folding behavior in regexes which set the ASCII or Locale mode flags.

The correct way to check for Unicode mode in `optimize_charset` would be to check `if fixes:`, because the `fixes` argument is `None` in ASCII and Locale modes and a `dict` in Unicode mode. The code currently uses the condition `if fixup:`, but `fixup` is `None` only in Locale mode and it is a function in both ASCII and Unicode mode. This means that this replacement is used in ASCII mode too, and the `RANGE` opcode is translated to a `RANGE_UNI_IGNORE` opcode for character sets which include characters outside of the Basic Multilingual Plane (the second time an `IndexError` is thrown in `optimize_charset`).
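
To make the difference between the two conditions concrete, here is a toy model of the three modes. It is only a sketch under stated assumptions, not the code in `_compiler.py`: `case_helpers`, `uses_unicode_range`, `TOY_FIXES`, and the two lowercasing functions are hypothetical names invented for illustration. They reproduce only the `None` / function / `dict` pattern described above.

```python
import re

# Hypothetical stand-in for the Unicode-only table of extra case equivalences.
TOY_FIXES = {ord("k"): (0x212A,)}  # toy entry: 'k' and KELVIN SIGN


def ascii_lower(ch: int) -> int:
    """Toy ASCII-only lowercasing of a code point."""
    return ch + 32 if ord("A") <= ch <= ord("Z") else ch


def unicode_lower(ch: int) -> int:
    """Toy Unicode lowercasing; leaves the code point alone if it expands."""
    low = chr(ch).lower()
    return ord(low) if len(low) == 1 else ch


def case_helpers(flags):
    """Return (fixup, fixes) following the pattern described in the text."""
    if flags & re.LOCALE:
        return None, None          # Locale mode: both are None
    if flags & re.ASCII:
        return ascii_lower, None   # ASCII mode: fixup is a function, fixes is None
    return unicode_lower, TOY_FIXES  # Unicode mode: function and dict


def uses_unicode_range(flags, check: str) -> bool:
    """Would a RANGE be rewritten to the Unicode variant under this check?"""
    fixup, fixes = case_helpers(flags)
    return bool(fixup) if check == "fixup" else bool(fixes)


for name, mode in [("ASCII", re.ASCII), ("LOCALE", re.LOCALE), ("UNICODE", re.UNICODE)]:
    flags = mode | re.IGNORECASE
    print(name, uses_unicode_range(flags, "fixup"), uses_unicode_range(flags, "fixes"))
# ASCII True False
# LOCALE False False
# UNICODE True True
```

In this model, only the `fixes` check separates ASCII mode from Unicode mode; the `fixup` check treats them the same, which is the mismatch this PR describes.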