Skip to content

bpo-46227: Add pathlib.Path.walk method #30340

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

zmievsa
Copy link
Contributor

@zmievsa zmievsa commented Jan 2, 2022

@the-knights-who-say-ni
Copy link

Hello, and thanks for your contribution!

I'm a bot set up to make sure that the project can legally accept this contribution by verifying everyone involved has signed the PSF contributor agreement (CLA).

CLA Missing

Our records indicate the following people have not signed the CLA:

@Ovsyanka83

For legal reasons we need all the people listed to sign the CLA before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue.

If you have recently signed the CLA, please wait at least one business day
before our records are updated.

You can check yourself to see if the CLA has been received.

Thanks again for the contribution, we look forward to reviewing it!

@AlexWaygood AlexWaygood added the type-feature A feature request or enhancement label Jan 2, 2022
Copy link
Member

@sobolevn sobolevn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for submitting a PR!

Please, note that testing newly added code is probably a must. Adding tests will increase your chances to get this new feature accepted.

If you have any problems with CLA (which is also required), please, feel free to ping me - I will try to help you 🙂

@AlexWaygood
Copy link
Member

+1 to what @sobolevn said — there's a great guide to writing tests here. New tests for this should probably be added to this file.

@barneygale
Copy link
Contributor

barneygale commented Jan 3, 2022

I'd support adding some way to walk a directory hierarchy in pathlib.

But I'm not a huge fan of the os.walk() API, specifically the fact that it's a Generator[Tuple[str,list[str],list[str]], which is clunky if you haven't already memorized it. Pathlib usually smooths over stuff like that.

Here's a possible API for discussion, still using os.walk() under the hood:

diff --git a/Lib/pathlib.py b/Lib/pathlib.py
index f1a33178e2..23de30abc6 100644
--- a/Lib/pathlib.py
+++ b/Lib/pathlib.py
@@ -1008,15 +1008,19 @@ def samefile(self, other_path):
             other_st = self._accessor.stat(other_path)
         return os.path.samestat(st, other_st)
 
-    def iterdir(self):
+    def iterdir(self, recurse=False, dirs=True, files=True):
         """Iterate over the files in this directory.  Does not yield any
         result for the special paths '.' and '..'.
         """
-        for name in self._accessor.listdir(self):
-            if name in {'.', '..'}:
-                # Yielding a path object for these makes little sense
-                continue
-            yield self._make_child_relpath(name)
+        for root, dirnames, filenames in os.walk(self):
+            if dirs:
+                for name in dirnames:
+                    yield type(self)(root, name)
+            if files:
+                for name in filenames:
+                    yield type(self)(root, name)
+            if not recurse:
+                break
 
     def glob(self, pattern):
         """Iterate over this subtree and yield all existing files (of any

@zmievsa
Copy link
Contributor Author

zmievsa commented Jan 4, 2022

@barneygale I agree that os.walk API seems pretty clunky. I do, however, believe that the one you're offering is basically Path.glob("**/*") or [p for p in Path.glob("**/*") if p.is_file()]. Your approach would also require us to add 6 arguments to iterdir (the ones you offered + the ones os.walk already has), some of which (like followlinks=True and recurse=False, or recurse=False and onerrors=None) contradict each other.

So I believe that we should not add walk's logic into iterdir. We could make something like iterwalk that does all the same things as os.walk with a bit of a simpler api. However, I am not sure how to simplify os.walk API without losing its core features.

There's the issue that os.walk provides you with dir-by-dir iteration by yielding root which is pretty useful in many situations while your solution does not.

Also, os.walk distinguishes between dirs and files so you don't have to and provides you with a simple way to prevent recursion into certain subtrees by removing elements from dirs list, which is pretty hard to get with any simpler APIs. I will commit the changes that add this feature into my implementation tomorrow.

@barneygale
Copy link
Contributor

barneygale commented Jan 4, 2022

Thanks. I didn't know you could edit the list of directories to prevent recursion - neat!

One more thing to consider: in part for performance reasons, pathlib internally represents a path as a drive, root, and list of string parts. This avoids some costly splitting and joining on directory separators -- for example, Path.parent doesn't parse anything or create any new string objects, whereas os.path.dirname() does.

Pathlib has historically re-implemented algorithms like os.path.realpath(), os.path.expanduser() etc, and modified them to work with the (drive, root, parts) representation. The idea that this is faster than round-tripping via strings (and hence re-parsing / re-normalising).

In the last couple releases I've removed pathlib's own realpath() and expanduser() implementations, and made it call through to os.path instead. This solved a couple minor bugs and removed a bit of duplication in the CPython codebase, but might have cost some performance. More recently I've been wondering where to draw the line on that sort of thing.

In that case of your patch, the key line is root_path = Path(root). By using the main Path initialiser you're asking pathlib to re-parse and re-normalise a path. Compare this to the Path.glob() implementation, which does everything it can to avoid re-parsing paths.

I suspect this is all fine, and that practicality beats purity here.

@zmievsa
Copy link
Contributor Author

zmievsa commented Jan 4, 2022

My implementation certainly requires quite a bit of review -- I've never commited to pathlib or to CPython in general. Implementation-wise, I could just reimplement os.walk or just fix the issues in my implementation -- either way is fine with me. But I think the most important issue at hand is the API for the Path.walk.

@zmievsa
Copy link
Contributor Author

zmievsa commented Jan 8, 2022

Added the changes I've mentioned. All the features of os.walk are supported now. Let's move our discussion here:
https://discuss.python.org/t/add-pathlib-path-walk-method/

@zmievsa
Copy link
Contributor Author

zmievsa commented Jan 8, 2022

I will add the tests soon. If anyone has any ideas about what tests we need specifically, I'd love to hear them.

@github-actions
Copy link

github-actions bot commented Feb 8, 2022

This PR is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale Stale PR or inactive for long period of time. label Feb 8, 2022
Copy link
Contributor

@MaxwellDupre MaxwellDupre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add some text to docs, please?
Otherwise how will users know this is available and how to use.

@zmievsa zmievsa closed this May 8, 2022
@zmievsa zmievsa force-pushed the bpo-46227/add-pathlib.Path.walk-method branch from ebf6811 to a82baed Compare May 8, 2022 14:39
@bedevere-bot
Copy link

Every change to Python requires a NEWS entry.

Please, add it using the blurb_it Web app or the blurb command-line tool.

@Ovsyanka83 Ovsyanka83 mannequin mentioned this pull request May 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting core review stale Stale PR or inactive for long period of time. type-feature A feature request or enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants