GH-125413: Add pathlib.Path.dir_entry attribute #125419

Closed
wants to merge 3 commits into from

@barneygale barneygale commented Oct 13, 2024

Add a Path.dir_entry attribute. In any path object generated by Path.iterdir(), it stores an os.DirEntry object corresponding to the path; in other cases it is None.

This can be used to retrieve the file type and attributes of directory children without necessarily incurring further system calls.

Under the hood, we use dir_entry in our implementations of PathBase.glob(), PathBase.walk() and PathBase.copy(), the last of which also provides the implementation of Path.copy(), resulting in a modest speedup when copying local directory trees.
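
As a rough illustration of the "file type without further system calls" point, here is a minimal sketch that uses only the existing `os.scandir()` API rather than the proposed `Path.dir_entry` attribute. The names `classify_children` and `demo` are made up for this example.

```python
import os
import tempfile


def classify_children(directory):
    """Return {name: kind} for each child, using os.scandir() entries."""
    kinds = {}
    with os.scandir(directory) as entries:
        for entry in entries:
            # These checks usually answer from information gathered by
            # scandir() itself; an extra stat call is only needed on some
            # platforms and filesystems.
            if entry.is_symlink():
                kinds[entry.name] = "symlink"
            elif entry.is_dir(follow_symlinks=False):
                kinds[entry.name] = "dir"
            elif entry.is_file(follow_symlinks=False):
                kinds[entry.name] = "file"
            else:
                kinds[entry.name] = "other"
    return kinds


def demo():
    """Build a tiny directory tree and classify its children."""
    with tempfile.TemporaryDirectory() as tmp:
        os.mkdir(os.path.join(tmp, "sub"))
        with open(os.path.join(tmp, "a.txt"), "w") as f:
            f.write("hello")
        return classify_children(tmp)
```

With the attribute proposed here, the same checks would presumably be made on the `os.DirEntry` stored on each path yielded by `Path.iterdir()`.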


📚 Documentation preview 📚: https://cpython-previews--125419.org.readthedocs.build/

@barneygale
Contributor Author

Copying is a little faster:

$ ./python -m timeit -s "from pathlib import Path" "Path('Doc').copy('Doc2', dirs_exist_ok=True, preserve_metadata=True)"
5 loops, best of 5: 70.7 msec per loop  # before
5 loops, best of 5: 68.7 msec per loop  # after

Member

@picnixz picnixz left a comment

I'll review tests when I'm not sleepy.

Contributor

@ncoghlan ncoghlan left a comment

Code that accesses dir_entry is explicitly saying "potentially stale values are OK", so what if we defined it as lazily populated, rather than as None unless it was set externally before being accessed?

This would have the added benefit that the required-for-technical-reasons slot on PurePathBase would be called _dir_entry, and we could define the public read-only property on PathBase like:

@property
def dir_entry(self):
    if self._dir_entry is not None:
        return self._dir_entry
    # os.DirEntry.from_path() is the proposed helper described below; it does not exist today.
    self._dir_entry = dir_entry = os.DirEntry.from_path(self)
    return dir_entry

It would need a new helper in os.DirEntry that accepted an os.PathLike parameter and created a populated directory entry instance for it, but that seems like a potentially useful feature anyway.
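
For readers who want to try the lazy-population idea today, the following is a sketch under stated assumptions, not pathlib's implementation. `LazyEntryPath` and `_entry_from_path` are hypothetical names, and because `os.DirEntry.from_path()` does not exist, the helper stands in for it by scanning the parent directory with `os.scandir()`.

```python
import os


def _entry_from_path(path):
    """Hypothetical substitute for the proposed os.DirEntry.from_path().

    Scans the parent directory and returns the matching os.DirEntry.
    """
    parent, name = os.path.split(os.path.abspath(path))
    with os.scandir(parent) as entries:
        for entry in entries:
            if entry.name == name:
                return entry
    raise FileNotFoundError(path)


class LazyEntryPath:
    """Hypothetical stand-in for a path class with a lazy dir_entry."""

    __slots__ = ("_path", "_dir_entry")

    def __init__(self, path, dir_entry=None):
        self._path = os.fspath(path)
        # May be supplied up front, e.g. by a directory-iteration method.
        self._dir_entry = dir_entry

    def __fspath__(self):
        return self._path

    @property
    def dir_entry(self):
        # Populate on first access, then reuse the (possibly stale) entry.
        if self._dir_entry is None:
            self._dir_entry = _entry_from_path(self._path)
        return self._dir_entry
```

Note that this stand-in has to read the parent directory to find one entry, which is one cost a real `from_path()` helper would need to avoid or justify.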

@bedevere-app

bedevere-app bot commented Oct 23, 2024

When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.

@barneygale
Contributor Author

I played around with that idea, and I haven't completely ruled it out, but it's a bit of a rabbit hole.

On naming and re-using DirEntry: I don't think os.DirEntry.from_path() makes sense. The purpose of DirEntry is that it stores information from calling os.scandir() on the parent directory. I think we'd need a new class with name, is_dir() and is_symlink() attributes. We'd lazily generate an instance of this class from Path.last_status (or .status, or something), assuming there's not already a DirEntry stored. The new class could be called pathlib.PathStatus or something along those lines.

Then we need to define when os.stat() is called and when exceptions are raised. A DirEntry object is initially populated with some information from the os.scandir() call, so we might want our PathStatus object to perform a stat() on creation. But should it call os.stat() or os.lstat()? And doesn't that imply that our Path attribute should be a method rather than a property, given it may perform serious work? Maybe Path.cached_status()?
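
To make the open questions concrete, here is one possible shape for such a class, sketched under assumptions. Only the name `PathStatus` and the `name`, `is_dir()` and `is_symlink()` members come from the comment above; the eager `os.lstat()` call is just one of the choices being debated, picked here for illustration.

```python
import os
import stat


class PathStatus:
    """Hypothetical status object; the design choices here are assumed."""

    __slots__ = ("name", "_lstat")

    def __init__(self, path):
        path = os.fspath(path)
        self.name = os.path.basename(path)
        # Assumption: call os.lstat() eagerly, so a missing path raises
        # here rather than on a later is_dir() or is_symlink() call.
        self._lstat = os.lstat(path)

    def is_symlink(self):
        return stat.S_ISLNK(self._lstat.st_mode)

    def is_dir(self):
        # Only lstat() data is cached, so symlinks are not followed.
        return stat.S_ISDIR(self._lstat.st_mode)
```

This choice already differs from `os.DirEntry.is_dir()`, which follows symlinks by default, so it shows why the `os.stat()` versus `os.lstat()` question matters.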

Then we need to figure out how this interacts with the rest of the Path methods. Should Path.stat() and Path.lstat() automatically update the status object? Should it replace an existing DirEntry object with a PathStatus object? Should Path.is_dir() call self.stat(); return self.cached_status().is_dir()?

None of this is insurmountable, mind :)

@barneygale
Contributor Author

Perhaps I'm overthinking this, and all we really need is a Path.scandir() method.
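
A free function can approximate what such a method might offer. The sketch below is an assumption about its behavior, not a proposed implementation; `scandir_paths` is a hypothetical name, and it wraps the existing `os.scandir()`.

```python
import os
from pathlib import Path


def scandir_paths(path):
    """Yield (Path, os.DirEntry) pairs for the children of path."""
    path = Path(path)
    with os.scandir(path) as entries:
        for entry in entries:
            # Pair each child path with the entry scandir() produced for
            # it, so callers can check the type without touching the path.
            yield path / entry.name, entry
```

A caller could then write `for child, entry in scandir_paths(root): ...` and branch on `entry.is_dir()` without the path object needing to store any state.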

@barneygale barneygale marked this pull request as draft October 25, 2024 20:41
@picnixz
Member

picnixz commented Oct 28, 2024

Once you've decided on whether to continue on this work or not, please ping me again (sorry, I missed this one)
