Recursive directory list with pathlib.Path.iterdir #80783
Currently, 'pathlib.Path.iterdir' can only list the contents of the instance directory. It is common to also want the contents of subdirectories recursively. The proposal is for 'pathlib.Path.iterdir' to accept an argument 'recursive' which, when 'True', causes 'iterdir' to yield the contents of subdirectories recursively. This would be trivial to implement, as 'iterdir' can simply yield from each subdirectory's 'iterdir'. A decision would have to be made whether to continue to yield the subdirectories themselves or to skip them. Another decision is whether each path should be resolved before checking if it is a directory to be recursed into. |
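A minimal sketch of the "yield from subdirectories" idea, written as a standalone function rather than as a change to 'iterdir'. The name 'iterdir_recursive' and the 'yield_dirs' flag are hypothetical and only illustrate the first open decision; symlinks are listed but never descended into, which is an assumption made here to keep the sketch free of cycles.

```python
from pathlib import Path
from typing import Iterator


def iterdir_recursive(root: Path, yield_dirs: bool = True) -> Iterator[Path]:
    """Hypothetical helper (not part of pathlib): depth-first recursive listing."""
    for entry in root.iterdir():
        # is_dir() follows symlinks, so symlinks are excluded explicitly
        # to avoid descending into a cycle.
        is_real_dir = entry.is_dir() and not entry.is_symlink()
        if not is_real_dir or yield_dirs:
            yield entry
        if is_real_dir:
            yield from iterdir_recursive(entry, yield_dirs)
```

With 'yield_dirs=False' only non-directory entries (and symlinks) are produced, so the caller never has to filter directories out.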
Is the behaviour you're proposing any different from using Path.rglob('*')? |
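For comparison, a small runnable example of what 'Path.rglob' already provides. The directory layout is made up for the demonstration.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "pkg" / "sub").mkdir(parents=True)
    (root / "pkg" / "mod.py").touch()
    (root / "pkg" / "sub" / "data.txt").touch()

    # rglob('*') yields files and directories at every depth below root,
    # but not root itself.
    found = sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))

print(found)  # ['pkg', 'pkg/mod.py', 'pkg/sub', 'pkg/sub/data.txt']
```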
I believe @Epic_Wink:
One thing you may need to worry about here is the fact that symlinks can have cycles, so you may need to do some cycle detection to avoid creating the dangerous possibility of infinite loops. There's also the question of whether you want this to be a depth-first or breadth-first traversal, and whether you would want both of these to be options. |
rglob and glob also return a generator. Slightly related: pathlib.walk was proposed in the past on python-ideas: https://mail.python.org/pipermail/python-ideas/2017-April/045398.html |
By that logic, we should remove
I agree, which is the main reason the current implementation in the pull-request is to not resolve symlinks: users can subclass and implement symlink resolving if they want
As much as I want to say that I don't see a use-case for breadth-first file listing (when I list files, I expect the next file provided to be 'next to' the current file), users currently have no standard-library functionality to perform breadth-first searches as far as I know: they'd have to implement it themselves or find it in a third-party library.
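For reference, a breadth-first listing is short to write by hand with a queue. This is only a sketch: 'iterdir_breadth_first' is a hypothetical name, and not descending into symlinks is an assumption made to sidestep cycles.

```python
from collections import deque
from pathlib import Path
from typing import Iterator


def iterdir_breadth_first(root: Path) -> Iterator[Path]:
    """Hypothetical helper: yield entries level by level, shallowest first."""
    pending = deque([root])
    while pending:
        current = pending.popleft()
        for entry in current.iterdir():
            yield entry
            # Queue real subdirectories for a later level; skip symlinks.
            if entry.is_dir() and not entry.is_symlink():
                pending.append(entry)
```

Every entry at depth n is yielded before any entry at depth n + 1; the order within a single directory is whatever 'iterdir' returns.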
I've never really liked the interface to |
My mistake, I didn't notice the
What is the case for why iterdir() is justified when Of course, removing things (which can break existing code) and failing to add them (which cannot) have two different thresholds for when they can take place, so even if we decide "iterdir() is to glob('*') as iterdir(recursive=True) is to rglob('*')", that doesn't mean that we should remove iterdir() entirely if recursive=True is not added.
I don't see that on the implementation here, but we can discuss this on the PR itself. I do think that skipping *all* symlinks automatically with no option to follow them will be counter-intuitive for people.
I kinda agree about the interface to |
Having spent more time than I'm proud of recursing through directories, I'd be happy enough with a convenience function that has sensible defaults. If I want breadth-first recursion (and I often do), I'll write it myself. I have a slight preference for getting all files in a directory before going deeper (which is not what the PR does), and I think that's most consistent with the current behaviour. I don't spend enough time dealing with symlinks to have strong opinions there, but given we have ways to resolve symlinks but not to get back to the original name (and I *have* had to deal with issues where I've needed to find the original name from the target :roll-eyes:) I'd say don't resolve anything eagerly. If there's an easy and well-known algorithm for detecting infinite symlink recursion (e.g. resolve and check if it's a parent of itself) then do that and skip it, but don't return the targets. |
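A rough sketch of the defaults described above: list everything in a directory before going deeper, never replace a path with its resolved target, and skip descending into a symlink that resolves to one of its own ancestors. Both function names are hypothetical. The ancestor check is deliberately the simple one mentioned in the comment, so it only catches a link pointing back up its own chain; two links pointing into each other's directories would need a record of visited directories instead.

```python
from pathlib import Path
from typing import Iterator, List


def points_to_ancestor(link: Path) -> bool:
    """True if the link resolves to its containing directory or a parent of it."""
    target = link.resolve()
    parent = link.parent.resolve()
    return target == parent or target in parent.parents


def list_files_first(root: Path) -> Iterator[Path]:
    """Hypothetical helper: yield a directory's entries, then descend."""
    subdirs: List[Path] = []
    for entry in root.iterdir():
        yield entry  # yielded as found, never resolved
        if entry.is_dir():
            if entry.is_symlink() and points_to_ancestor(entry):
                continue  # listed above, but descending would loop
            subdirs.append(entry)
    for subdir in subdirs:
        yield from list_files_first(subdir)
```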
You mean treating symlinks to directories like files? I suppose that's a possibility, but I do think it will end up being a source of bugs around symlinking. Admittedly, it is apparently what rglob('*') does (just tested it - apparently it won't follow symlinks to directories), though I think it might be a better interface to try to break cycles rather than not follow symlinks (particularly since |
Having I feel like not yielding directories is the way to go, but it's easy enough to check if a yielded path is a directory in application code. The current implementation of using recursion to list subdirectory contents doesn't seem to allow for the obvious implementation of symlink cycle-detection: keeping track of which (real) directories have been listed. PS: I've updated the pull-request to not follow symlinks to directories. This is not a final decision, but just updating to be in line with what I've implied up to this point |
I've updated the pull-request to list directories pointed to by listed symbolic links, preventing cyclic listing. An extra instance method We still need to decide whether to yield the directories themselves rather than just the files under those directories. Very easy to implement this. One thing I just realised is that a symlink can point to a subdirectory further down the chain. The current implementation will list the files under that directory using the symlink as prefix rather than the full path. Shouldn't matter though. |
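One way such cycle prevention could look, sketched independently of the pull request's actual code: keep a set of the real (resolved) directories already listed and refuse to list any of them twice. The function name and the private '_seen' parameter are hypothetical. Only non-directory entries are yielded here, which is one of the two options still under discussion.

```python
from pathlib import Path
from typing import Iterator, Optional, Set


def iterdir_following_links(
    root: Path, _seen: Optional[Set[Path]] = None
) -> Iterator[Path]:
    """Hypothetical helper: follow directory symlinks, list each real directory once."""
    seen = set() if _seen is None else _seen
    seen.add(root.resolve())
    for entry in root.iterdir():
        if entry.is_dir():  # true for real directories and symlinks to them
            if entry.resolve() in seen:
                continue  # this real directory has already been listed
            yield from iterdir_following_links(entry, seen)
        else:
            yield entry
```

Paths are yielded with whatever prefix they were reached through, so a file found via a symlink keeps the symlink in its path, as noted above.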
Would this change also have to be implemented in the new ZipFile pathlib API? https://bugs.python.org/issue36832 |
I think I may have broken bedevere-bot by requesting changes in reviews before the PR was assigned... |
Closing, since we now have If another core developer still wants to proceed with this, feel free to reopen! |