Incorrect case-insensitive matching of character ranges #76

Closed
@SimonSapin

Description

Character range matching is conceptually (range_start..range_end).any(|c| c == input_char), but as an optimization is implemented as range_start <= input_char && input_char <= range_end. This is fine.

Case-insensitive matching is implemented as uppercase(c) == uppercase(input_char). This is fine (modulo #55).

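So case-insensitive range matching is conceptually (range_start..range_end).any(|c| uppercase(c) == uppercase(input_char)). It is currently implemented as uppercase(range_start) <= uppercase(input_char) && uppercase(input_char) <= uppercase(range_end), which is not equivalent: uppercasing does not preserve code point order, so the uppercased endpoints do not necessarily bound the uppercased members of the range.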
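To make the difference concrete, here is a minimal sketch of both formulations. The names upper, matches_conceptual and matches_bounds_only are hypothetical and not taken from the library; upper approximates a simple one-to-one uppercase mapping on top of Rust's char::to_uppercase, and the range is treated as inclusive. The range Z-a is used because its uppercased endpoints end up in the wrong order.

```rust
/// Simple one-to-one uppercase mapping (an assumption for this sketch):
/// characters whose uppercase form is not exactly one char map to themselves.
fn upper(c: char) -> char {
    let mut it = c.to_uppercase();
    match (it.next(), it.next()) {
        (Some(u), None) => u,
        _ => c,
    }
}

/// The conceptual definition: some member of the range equals the input
/// once both are uppercased.
fn matches_conceptual(start: char, end: char, input: char) -> bool {
    (start..=end).any(|c| upper(c) == upper(input))
}

/// The shortcut described above: only the endpoints are uppercased.
fn matches_bounds_only(start: char, end: char, input: char) -> bool {
    upper(start) <= upper(input) && upper(input) <= upper(end)
}

fn main() {
    // [Z-a] contains 'Z', 'a' and the six punctuation characters between them.
    // Its uppercased endpoints are 'Z' and 'A', which form an empty interval.
    for input in ['Z', 'a', 'A', 'z'] {
        println!(
            "{:?}: conceptual = {}, bounds only = {}",
            input,
            matches_conceptual('Z', 'a', input),
            matches_bounds_only('Z', 'a', input)
        );
    }
}
```

With these definitions the bounds-only check rejects every input for Z-a, including the endpoints themselves, while the conceptual check accepts all four.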

One of the tests currently passing is that (?i)\p{Lu}+ matches ΛΘΓΔα entirely. That is, Greek letters (both upper case and lower case) all match the category of upper case letters when matched case-insensitively. But the same test with \p{Ll} (category of lower case letters) instead of \p{Lu} currently fails because of this issue. (\p{Lu} and \p{Ll} expand to large unions of character ranges.)
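One plausible way the \p{Ll} case can fail, sketched under assumptions: suppose the expansion of \p{Ll} contains the range U+03AC to U+03CE (ά to ώ), which covers the lowercase Greek letters. The library's actual range table is not shown in this issue, so that range, the constant LOWER_GREEK and the function names below are all hypothetical. The uppercased endpoints are Ά (U+0386) and Ώ (U+038F), and every plain Greek capital sorts above that interval.

```rust
/// Simple one-to-one uppercase mapping (an assumption for this sketch):
/// characters whose uppercase form is not exactly one char map to themselves.
fn upper(c: char) -> char {
    let mut it = c.to_uppercase();
    match (it.next(), it.next()) {
        (Some(u), None) => u,
        _ => c,
    }
}

/// Hypothetical single range taken from an expansion of \p{Ll}: ά ..= ώ.
const LOWER_GREEK: (char, char) = ('\u{03AC}', '\u{03CE}');

/// Conceptual case-insensitive match against the range.
fn conceptual(input: char) -> bool {
    let (start, end) = LOWER_GREEK;
    (start..=end).any(|c| upper(c) == upper(input))
}

/// Endpoint-only comparison, as in the implementation described above.
fn bounds_only(input: char) -> bool {
    let (start, end) = LOWER_GREEK;
    upper(start) <= upper(input) && upper(input) <= upper(end)
}

fn main() {
    for input in "ΛΘΓΔα".chars() {
        println!(
            "{}: conceptual = {}, bounds only = {}",
            input,
            conceptual(input),
            bounds_only(input)
        );
    }
}
```

Under these assumptions the conceptual check accepts all five characters of ΛΘΓΔα, while the endpoint-only check rejects all of them, including the lowercase α that is literally inside the range.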
