Recursive directory list with pathlib.Path.iterdir #80783

EpicWink · 2019-04-11T10:59:34Z

BPO	36602
Nosy	@pitrou, @zooba, @pganssle, @tirkarthi, @EpicWink
PRs	bpo-36602: Allow pathlib.Path.iterdir to list recursively #12785

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2019-04-11.10:59:34.291>
labels = ['type-feature', 'library', '3.9']
title = 'Recursive directory list with pathlib.Path.iterdir'
updated_at = <Date 2019-06-05.00:36:30.476>
user = 'https://github.com/EpicWink'

bugs.python.org fields:

activity = <Date 2019-06-05.00:36:30.476>
actor = 'Epic_Wink'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation = <Date 2019-04-11.10:59:34.291>
creator = 'Epic_Wink'
dependencies = []
files = []
hgrepos = []
issue_num = 36602
keywords = ['patch']
message_count = 12.0
messages = ['339959', '339969', '339997', '340003', '340012', '340042', '340072', '340135', '340144', '341059', '342007', '342687']
nosy_count = 6.0
nosy_names = ['pitrou', 'SilentGhost', 'steve.dower', 'p-ganssle', 'xtreak', 'Epic_Wink']
pr_nums = ['12785']
priority = 'normal'
resolution = None
stage = 'patch review'
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue36602'
versions = ['Python 3.9']

EpicWink · 2019-04-11T10:59:34Z

Currently, 'pathlib.Path.iterdir' can only list the contents of the instance directory. It is common to also want the contents of subdirectories recursively.

The proposal is for 'pathlib.Path.iterdir' to have an argument 'recursive' which when 'True' will cause 'iterdir' to yield contents of subdirectories recursively.

This would be trivial to implement as 'iterdir' can simply yield from subdirectories' 'iterdir'.

A decision would have to be made whether to continue to yield the subdirectories, or skip them. Another decision would be for whether each path should be resolved before checking if it is a directory to be recursed into.

SilentGhost · 2019-04-11T11:47:11Z

Is the behaviour you're proposing any different from using Path.rglob('*')?

pganssle · 2019-04-11T22:10:45Z

Is the behaviour you're proposing any different from using Path.rglob('*')?

I believe rglob("*") is eager, while iterdir is lazy.

@Epic_Wink:

This would be trivial to implement as 'iterdir' can simply yield from subdirectories' 'iterdir'.

One thing you may need to worry about here is the fact that symlinks can have cycles, so you may need to do some cycle detection to avoid creating the dangerous possibility of infinite loops.

There's also the question of whether you want this to be a depth-first or breadth-first traversal, and whether you would want both of these to be options.

tirkarthi · 2019-04-12T02:45:45Z

I believe rglob("*") is eager, while iterdir is lazy.

rglob and glob also return a generator.

Slightly related, pathlib.walk was proposed in the past in python-ideas : https://mail.python.org/pipermail/python-ideas/2017-April/045398.html

EpicWink · 2019-04-12T08:12:49Z

Is the behaviour you're proposing any different from using Path.rglob('*')?

By that logic, we should remove Path.iterdir() in favour of Path.glob('*'). In addition, having iterdir the way it is makes it easy for subclasses to extend its functionality (for example, one of my Path subclasses allows a callable to be passed to the iterdir which can filter paths)

One thing you may need to worry about here is the fact that symlinks can have cycles, so you may need to do some cycle detection to avoid creating the dangerous possibility of infinite loops.

I agree, which is the main reason the current implementation in the pull-request is to not resolve symlinks: users can subclass and implement symlink resolving if they want

There's also the question of whether you want this to be a depth-first or breadth-first traversal, and whether you would want both of these to be options.

As much as I want to say that I don't see a use-case for breadth-first file listing (when I list files, I expect the next file provided to be 'next to' the current file), users currently have no standard-library functionality to perform breadth-first searches as far as I know: they'd have to implement it themself or find it in a third-party library

Slightly related, pathlib.walk was proposed in the past in python-ideas...

I've never really liked the interface to walk, personal preference

pganssle · 2019-04-12T13:35:19Z

rglob and glob also return a generator.

My mistake, I didn't notice the sorted in the rglob documentation and thought it was emitting a list.

By that logic, we should remove Path.iterdir() in favour of Path.glob('*').

What is the case for why iterdir() is justified when Path.glob('*') exists? Is it just discoverability? Is there some efficiency reason to do it?

Of course, removing things (which can break existing code) and failing to add them (which cannot) have two different thresholds for when they can take place, so even if we decide "iterdir() is to glob('*') as iterdir(recursive=True) is to rglob('*')", that doesn't mean that we should remove iterdir() entirely if recursive=True is not added.

I agree, which is the main reason the current implementation in the pull-request is to not resolve symlinks: users can subclass and implement symlink resolving if they want

I don't see that on the implementation here, but we can discuss this on the PR itself. I do think that skipping *all* symlinks automatically with no option to follow them will be counter-intuitive for people.

I've never really liked the interface to walk, personal preference

I kinda agree about the interface to walk, but it is worth noting that as we've seen in this thread, there are a bunch of plausible and slightly different ways to walk a directory: breadth-first or depth-first? following symlinks, following symlinks with cycle detection, or not following symlinks at all? emit the directory itself, or only emit its contents? It's worth taking into account that having two completely different complicated interfaces for recursively walking directories would be a usability challenge.

zooba · 2019-04-12T15:51:22Z

Having spent more time than I'm proud of recursing through directories, I'd be happy enough with a convenience function that has sensible defaults.

If I want breadth-first recursion (and I often do), I'll write it myself. I have a slight preference for getting all files in a directory before going deeper (which is not what the PR does), and I think that's most consistent with the current behaviour.

I don't spend enough time dealing with symlinks to have strong opinions there, but given we have ways to resolve symlinks but not to get back to the original name (and I *have* had to deal with issues where I've needed to find the original name from the target :roll-eyes:) I'd say don't resolve anything eagerly.

If there's an easy and well-known algorithm for detecting infinite symlink recursion (e.g. resolve and check if it's a parent of itself) then do that and skip it, but don't return the targets.

pganssle · 2019-04-13T00:13:09Z

I don't spend enough time dealing with symlinks to have strong opinions there, but given we have ways to resolve symlinks but not to get back to the original name (and I *have* had to deal with issues where I've needed to find the original name from the target :roll-eyes:) I'd say don't resolve anything eagerly.

You mean treating symlinks to directories like files? I suppose that's a possibility, but I do think it will end up being a source of bugs around symlinking.

Admittedly, it is apparently what rglob('*') does (just tested it - apparently it won't follow symlinks to directories), though I think it might be a better interface to try to break cycles rather than not follow symlinks (particularly since iterdir currently treats symlinks to directories as directories).

EpicWink · 2019-04-13T03:10:34Z

Having iterdir(recursive=True) recurse into symlinks to directories would mean we would either not yield those symlinks, or we yield those symlinks and all other directories.

I feel like not yielding directories is the way to go, but it's easy enough to check if a yielded path is a directory in application code.

The current implementation of using recursion to list subdirectory contents doesn't seem to allow for the obvious implementation of symlink cycle-detection: keeping track of which (real) directories have been listed.

PS: I've updated the pull-request to not follow symlinks to directories. This is not a final decision, but just updating to be in line with what I've implied up to this point

EpicWink · 2019-04-29T09:07:16Z

I've updated the pull-request to list directories pointed to by listed symbolic links, preventing cyclic listing. An extra instance method pathlib.Path._iterdir_recursive was added for this.

We still need to decide whether to yield the directories themselves rather than just the files under those directories. Very easy to implement this.

One thing I just realised is that a symlink can point to a subdirectory further down the chain. The current implementation will list the files under that directory using the symlink as prefix rather than the full path. Shouldn't matter though.

EpicWink · 2019-05-09T23:13:12Z

Would this change also have to copy implemented in the new ZipFile Pathlib API? https://bugs.python.org/issue36832

EpicWink · 2019-05-17T03:44:07Z

I think I may have broken bedevere-bot by request change reviews before the PR was assigned...

barneygale · 2022-07-23T19:47:01Z

Hey @EpicWink, we now have a Path.walk() method that recursively walks directory trees, using os.scandir() for efficiency - see #92517. This is faster than using rglob('*') in most cases, and it supports following symlinks. Do you think that would cover your original use case?

EpicWink · 2022-07-24T00:24:43Z

walk's interface is a little bit clunky, but I've realised iterdir_recursive doesn't need to live in the standard library, as is trivial to include in a library or copy-paste between scripts.

AlexWaygood · 2023-01-08T23:23:35Z

Closing, since we now have Path.walk(), and OP no longer has interest in the idea.

If another core developer still wants to proceed with this, feel free to reopen!

EpicWink mannequin added 3.8 (EOL) end of life stdlib Python modules in the Lib dir type-feature A feature request or enhancement labels Apr 11, 2019

EpicWink mannequin added 3.9 only security fixes and removed 3.8 (EOL) end of life labels Jun 5, 2019

ezio-melotti transferred this issue from another repository Apr 10, 2022

brettcannon added the topic-pathlib label Jun 21, 2022

EpicWink mentioned this issue Aug 3, 2022

bpo-36602: Allow pathlib.Path.iterdir to list recursively #12785

Closed

AlexWaygood closed this as not planned Won't fix, can't repro, duplicate, stale Jan 8, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Recursive directory list with pathlib.Path.iterdir #80783

Recursive directory list with pathlib.Path.iterdir #80783

EpicWink mannequin commented Apr 11, 2019

EpicWink mannequin commented Apr 11, 2019

SilentGhost mannequin commented Apr 11, 2019

pganssle commented Apr 11, 2019

tirkarthi commented Apr 12, 2019

EpicWink mannequin commented Apr 12, 2019

pganssle commented Apr 12, 2019

zooba commented Apr 12, 2019

pganssle commented Apr 13, 2019

EpicWink mannequin commented Apr 13, 2019

EpicWink mannequin commented Apr 29, 2019

EpicWink mannequin commented May 9, 2019

EpicWink mannequin commented May 17, 2019

barneygale commented Jul 23, 2022

EpicWink commented Jul 24, 2022

AlexWaygood commented Jan 8, 2023

Recursive directory list with pathlib.Path.iterdir #80783

Recursive directory list with pathlib.Path.iterdir #80783

Comments

EpicWink mannequin commented Apr 11, 2019

EpicWink mannequin commented Apr 11, 2019

SilentGhost mannequin commented Apr 11, 2019

pganssle commented Apr 11, 2019

tirkarthi commented Apr 12, 2019

EpicWink mannequin commented Apr 12, 2019

pganssle commented Apr 12, 2019

zooba commented Apr 12, 2019

pganssle commented Apr 13, 2019

EpicWink mannequin commented Apr 13, 2019

EpicWink mannequin commented Apr 29, 2019

EpicWink mannequin commented May 9, 2019

EpicWink mannequin commented May 17, 2019

barneygale commented Jul 23, 2022

EpicWink commented Jul 24, 2022

AlexWaygood commented Jan 8, 2023