Skip to content

GH-127456: pathlib ABCs: add protocol for path parser #127494

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Dec 9, 2024

Conversation

barneygale
Copy link
Contributor

@barneygale barneygale commented Dec 1, 2024

Change the default value of PurePathBase.parser from ParserBase() to posixpath. As a result, user subclasses of PurePathBase and PathBase use POSIX path syntax by default, which is very often desirable.

Move pathlib._abc.ParserBase to pathlib._types.Parser, and convert it to a runtime-checkable protocol.

No user-facing changes as the pathlib ABCs are still private.

`posixpath`. As a result, user subclasses of `PurePathBase` and `PathBase`
use POSIX path syntax by default, which is very often desirable.
…ies.

Objects of this type provide a subset of the `os.DirEntry` API,
specifically those methods and attributes needed to implement `glob()` and
`walk()`.
Objects of this type provide a small subset of the `os.stat_result` API,
specifically attributes for the file type, permissions and location/offset.
@encukou
Copy link
Member

encukou commented Dec 3, 2024

Penny for your thoughts on this @encukou :)

This looks reasonable, but StatResult looks rather useless with its three mandatory but unreliable members:

  • st_mode has UNIX permission bits and a platform-specific representation of file type (since AFAIK, POSIX doesn't specify the values of S_IFREG et.al.)
  • st_ino & st_dev, which can be 0 if the info is not available -- if it weren't for the tuple API, the attributes could just as well be missing in this case.

Also, I guess typing has no way to say that e.g. if st_birthtime is present, it needs to be a time in seconds.

It seems that for your goal of supporting virtual filesystems, it might be better to rely on DirEntry, with stat remaining as a backend for filesystems where it's useful.
I wonder if we need a FS-agnostic version of st_ino/st_dev, like:

def key(self) -> Hashable:
    """Return a comparison key. Entries with the same key refer to the same file."""
    # For a Unix-like FS:
    ...
    return (stat_result.st_dev, stat_result.st_ino)

@barneygale
Copy link
Contributor Author

barneygale commented Dec 3, 2024

Thank you for the feedback and reviews.

It seems that for your goal of supporting virtual filesystems, it might be better to rely on DirEntry, with stat remaining as a backend for filesystems where it's useful. I wonder if we need a FS-agnostic version of st_ino/st_dev, like:

def key(self) -> Hashable:
    """Return a comparison key. Entries with the same key refer to the same file."""
    # For a Unix-like FS:
    ...
    return (stat_result.st_dev, stat_result.st_ino)

I've occasionally thought about adding a PathBase.status() abstract method, which could return a DirEntry-like object with caching methods. Methods like PathBase.is_dir() could then be implemented via self.status().is_dir(). If we did that, we could either delete PathBase.stat(), or implement it in terms of self.status(). Would that be better?

I don't think DirEntry is an appropriate return type for status(), because we're not scanning the parent directory, rather we're querying the path directly. Perhaps it's just a naming problem.

@barneygale barneygale marked this pull request as ready for review December 3, 2024 19:21
@encukou
Copy link
Member

encukou commented Dec 4, 2024

I don't think DirEntry is an appropriate return type for status(), because we're not scanning the parent directory, rather we're querying the path directly. Perhaps it's just a naming problem.

Status, then? Like stat, but not a C-style abbr.
Or FileInfo?

@zooba
Copy link
Member

zooba commented Dec 4, 2024

I'm not a fan of the stat() model at all. It's the best of a bad lot, I know, but I'd rather have some kind of API that lets us use cached information for is_file()/is_dir()/etc. rather than having to guarantee that it's fresh for each call.

I'm not sure what such an API should look like, apart from I want it to have the Path object interface, and be using cached responses about the state of the path (or fetching them and caching on demand). There are more than enough properties that might as well be cached that it's worth it, and the TOCTOU issues basically all exist even with our current model.

@encukou
Copy link
Member

encukou commented Dec 5, 2024

Brainstorming:
Add a FileInfo protocol, implemented by classes that hold file info for a path. All the information in it is cached; the cache(s) might be initialized when an attribute is accessed (this can any attribute -- e.g. accessing the creation time might cache all of stat()), or when the FileInfo is created. The cache(s) can't be cleared (externally, FileInfo is immutable; but you can make a fresh instance).
Paths would get a cached_property-like info attribute, initialized to a FileInfo either on first access or when the Path is created (e.g. to hold extra info from iterdir).
As with cached_property, the cache can be cleared using del path.info.
We'd need a new_info() method as well, to hand out fresh info but leave the cache alone -- mainly as a backend for the current Path methods.

A downside is that this breaks the rule that touching the filesystem is done by methods, not attributes. (Can we say rule was for getting “fresh” information, and we can use something else here?)

@barneygale
Copy link
Contributor Author

This all sounds great to me.

A downside is that this breaks the rule that touching the filesystem is done by methods, not attributes.

Why would accessing Path.info touch the filesystem? Couldn't make it so that only FileInfo.is_*() methods can perform FS access?

@barneygale
Copy link
Contributor Author

barneygale commented Dec 5, 2024

Oh, I suppose you might want to lstat() or stat() so that FileInfo always represents something that exists/existed. Otherwise we'd need a FileInfo.exists() method, and if we're intending to store os.DirEntry objects as Path.info when paths are generated from iterdir(), then we'd need to add an os.DirEntry.exists() method too.

@barneygale
Copy link
Contributor Author

Courtesy CCing @ncoghlan, who proposed something similar here. I shot it down then because it wasn't clear to me when to call [l]stat(), how to clear caches, and how it interacts with Path.is_*(). I'm also a tiny bit uneasy making Path stateful.

@zooba
Copy link
Member

zooba commented Dec 5, 2024

(Process note - we should probably move the API design discussion off this PR)

exists() shouldn't be an issue, because we can cache that as True or False at creation time (e.g. returned from glob() implies that exists() == True at some point in the recent past; created by a user implies that exists() is unknown and needs to make a filesystem call). The same could potentially be true for any other state, most obviously is_dir() and is_file(), but also any of the regular stat attributes (where a subclass could well return their own type from .stat()).

What we need to design is the way for a user to discover that they've got cached values, and reset it. A CachedInfo mixin (just for type check purposes) and a dedicated subclass (so that e.g. Path(p) would recreate and uncache all members) would seem to suffice, though I can also see the reasoning for adding a new member to manage it.

@barneygale
Copy link
Contributor Author

@barneygale barneygale changed the title GH-127456: pathlib ABCs: add protocols for supporting types GH-127456: pathlib ABCs: add protocol for path parser Dec 7, 2024
@barneygale
Copy link
Contributor Author

I've removed DirEntry and StatResult from this PR, so it now covers only only Parser.

@raymondiiii

This comment was marked as spam.

Copy link
Member

@zooba zooba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reduced PR to just the parser changes LGTM

@barneygale
Copy link
Contributor Author

Thanks both, very much appreciate the feedback :-)

@barneygale barneygale enabled auto-merge (squash) December 9, 2024 18:08
@barneygale barneygale merged commit 5c89adf into python:main Dec 9, 2024
35 of 36 checks passed
srinivasreddy pushed a commit to srinivasreddy/cpython that referenced this pull request Jan 8, 2025
…27494)

Change the default value of `PurePathBase.parser` from `ParserBase()` to
`posixpath`. As a result, user subclasses of `PurePathBase` and `PathBase`
use POSIX path syntax by default, which is very often desirable.

Move `pathlib._abc.ParserBase` to `pathlib._types.Parser`, and convert it
to a runtime-checkable protocol.

Co-authored-by: Bénédikt Tran <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants