diff --git a/source/specifications/glob-patterns.rst b/source/specifications/glob-patterns.rst
new file mode 100644
index 000000000..abdb15b0f
--- /dev/null
+++ b/source/specifications/glob-patterns.rst
@@ -0,0 +1,115 @@
+=================
+``glob`` patterns
+=================
+
+Some PyPA specifications, e.g. :ref:`pyproject.toml's license-files
+<pyproject-toml-license-files>`, accept certain types of *glob patterns*
+to match a given string containing wildcards and character ranges against
+files and directories. This specification defines which patterns are acceptable
+and how they should be handled.
+
+
+Valid glob patterns
+===================
+
+For PyPA purposes, a *valid glob pattern* MUST be a string matched against
+filesystem entries as specified below:
+
+- Alphanumeric characters, underscores (``_``), hyphens (``-``) and dots (``.``)
+  MUST be matched verbatim.
+
+- Special glob characters: ``*``, ``?``, ``**`` and character ranges: ``[]``
+  containing only the verbatim matched characters MUST be supported.
+  Within ``[...]``, the hyphen indicates a locale-agnostic range (e.g. ``a-z``,
+  order based on Unicode code points).
+  Hyphens at the start or end are matched literally.
+
+- Path delimiters MUST be the forward slash character (``/``).
+
+- Patterns always refer to *relative paths*,
+  e.g., when used in :file:`pyproject.toml`, patterns should always be
+  relative to the directory containing that file.
+  Therefore the leading slash character MUST NOT be used.
+
+- Parent directory indicators (``..``) MUST NOT be used.
+
+Any characters or character sequences not covered by this specification are
+invalid. Projects MUST NOT use such values.
+Tools consuming glob patterns SHOULD reject invalid values with an error.
+
+Literal paths (e.g. :file:`LICENSE`) are valid globs which means they
+can also be defined.
+
+Tools consuming glob patterns:
+
+- MUST treat each value as a glob pattern, and MUST raise an error if the
+  pattern contains invalid glob syntax.
+- MUST raise an error if any individual user-specified pattern does not match
+  at least one file.
+
+Examples of valid glob patterns:
+
+.. code-block:: python
+
+   "LICEN[CS]E*"
+   "AUTHORS*"
+   "licenses/LICENSE.MIT"
+   "licenses/LICENSE.CC0"
+   "LICENSE.txt"
+   "licenses/*"
+
+Examples of invalid glob patterns:
+
+.. code-block:: python
+
+   "..\LICENSE.MIT"
+   # .. must not be used.
+   # \ is an invalid path delimiter, / must be used.
+
+   "LICEN{CSE*"
+   # the { character is not allowed
+
+
+Reference implementation in Python
+==================================
+
+It is possible to defer the majority of the pattern matching against the file
+system to the :mod:`glob` module in Python's standard library. It is necessary
+however to perform additional validations.
+
+The code below is a simple reference implementation:
+
+.. code-block:: python
+
+   import os
+   import re
+   from glob import glob
+
+
+   def find_pattern(pattern: str) -> list[str]:
+       """
+       >>> find_pattern("/LICENSE.MIT")
+       Traceback (most recent call last):
+       ...
+       ValueError: Pattern '/LICENSE.MIT' should be relative...
+       >>> find_pattern("../LICENSE.MIT")
+       Traceback (most recent call last):
+       ...
+       ValueError: Pattern '../LICENSE.MIT' cannot contain '..'...
+       >>> find_pattern("LICEN{CSE*")
+       Traceback (most recent call last):
+       ...
+       ValueError: Pattern 'LICEN{CSE*' contains invalid characters...
+       """
+       if ".." in pattern:
+           raise ValueError(f"Pattern {pattern!r} cannot contain '..'")
+       if pattern.startswith((os.sep, "/")) or ":\\" in pattern:
+           raise ValueError(
+               f"Pattern {pattern!r} should be relative and must not start with '/'"
+           )
+       if re.match(r'^[\w\-\.\/\*\?\[\]]+$', pattern) is None:
+           raise ValueError(f"Pattern '{pattern}' contains invalid characters.")
+       found = glob(pattern, recursive=True)
+       if not found:
+           raise ValueError(f"Pattern '{pattern}' did not match any files.")
+       return found
diff --git a/source/specifications/pyproject-toml.rst b/source/specifications/pyproject-toml.rst
index 802f50959..25cf75bc8 100644
--- a/source/specifications/pyproject-toml.rst
+++ b/source/specifications/pyproject-toml.rst
@@ -247,6 +247,8 @@ Tools SHOULD validate and perform case normalization of the expression.
 
 The table subkeys of the ``license`` key are deprecated.
 
+.. _pyproject-toml-license-files:
+
 ``license-files``
 -----------------
 
@@ -260,43 +262,20 @@ configuration files, e.g. :file:`setup.py`, :file:`setup.cfg`, etc.) to file(s)
 containing licenses and other legal notices to be distributed with the
 package.
 
-The strings MUST contain valid glob patterns, as specified below:
-
-- Alphanumeric characters, underscores (``_``), hyphens (``-``) and dots (``.``)
-  MUST be matched verbatim.
-
-- Special glob characters: ``*``, ``?``, ``**`` and character ranges: ``[]``
-  containing only the verbatim matched characters MUST be supported.
-  Within ``[...]``, the hyphen indicates a locale-agnostic range (e.g. ``a-z``,
-  order based on Unicode code points).
-  Hyphens at the start or end are matched literally.
+The strings MUST contain valid glob patterns, as specified in
+:doc:`/specifications/glob-patterns`.
 
-- Path delimiters MUST be the forward slash character (``/``).
-  Patterns are relative to the directory containing :file:`pyproject.toml`,
-  therefore the leading slash character MUST NOT be used.
-
-- Parent directory indicators (``..``) MUST NOT be used.
-
-Any characters or character sequences not covered by this specification are
-invalid. Projects MUST NOT use such values.
-Tools consuming this field SHOULD reject invalid values with an error.
+Patterns are relative to the directory containing :file:`pyproject.toml`.
 
 Tools MUST assume that license file content is valid UTF-8 encoded text, and
 SHOULD validate this and raise an error if it is not.
 
-Literal paths (e.g. :file:`LICENSE`) are valid globs which means they
-can also be defined.
-
 Build tools:
 
-- MUST treat each value as a glob pattern, and MUST raise an error if the
-  pattern contains invalid glob syntax.
 - MUST include all files matched by a listed pattern in all distribution
   archives.
 - MUST list each matched file path under a License-File field in the
   Core Metadata.
-- MUST raise an error if any individual user-specified pattern does not match
-  at least one file.
 
 If the ``license-files`` key is present and
 is set to a value of an empty array, then tools MUST NOT include any
diff --git a/source/specifications/section-distribution-metadata.rst b/source/specifications/section-distribution-metadata.rst
index af7c1c3e6..4f3a67777 100644
--- a/source/specifications/section-distribution-metadata.rst
+++ b/source/specifications/section-distribution-metadata.rst
@@ -14,3 +14,4 @@ Package Distribution Metadata
    inline-script-metadata
    platform-compatibility-tags
    well-known-project-urls
+   glob-patterns
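
Note on the validity rules in glob-patterns.rst: they can also be checked without touching the filesystem. The following is only a minimal sketch, not part of this change. The helper name ``is_valid_glob`` is hypothetical, and the bracket check is one possible reading of "character ranges containing only the verbatim matched characters" (it additionally rejects unbalanced, nested, or empty ``[]``, which the reference implementation's character-class regex does not attempt).

```python
import re

# Characters permitted anywhere in a pattern (same set as the reference
# implementation's regular expression).
_ALLOWED = re.compile(r"[\w\-./*?\[\]]+")

# A well-formed character range: brackets around one or more verbatim
# characters (alphanumerics, underscore, hyphen, dot).
# Assumption: this is an interpretation of the spec text, not quoted from it.
_RANGE = re.compile(r"\[[\w.\-]+\]")


def is_valid_glob(pattern: str) -> bool:
    """Hypothetical syntax-only check; performs no filesystem access."""
    if pattern.startswith("/"):
        return False  # patterns must be relative
    if ".." in pattern:
        return False  # same substring test as the reference implementation
    if _ALLOWED.fullmatch(pattern) is None:
        return False  # empty pattern or a character outside the allowed set
    # Remove every well-formed range; any bracket left over is unbalanced,
    # nested, empty, or wraps a character such as "/" or "*".
    remainder = _RANGE.sub("", pattern)
    return "[" not in remainder and "]" not in remainder


if __name__ == "__main__":
    for candidate in ["LICEN[CS]E*", "licenses/*", "..\\LICENSE.MIT", "LICEN{CSE*"]:
        print(candidate, is_valid_glob(candidate))
```

A tool could run such a check before calling ``glob`` so that syntax errors are reported separately from "pattern did not match any files" errors.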