Skip to content

Recursive directory list with pathlib.Path.iterdir #80783

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
EpicWink mannequin opened this issue Apr 11, 2019 · 15 comments
Closed

Recursive directory list with pathlib.Path.iterdir #80783

EpicWink mannequin opened this issue Apr 11, 2019 · 15 comments
Labels
3.9 only security fixes stdlib Python modules in the Lib dir topic-pathlib type-feature A feature request or enhancement

Comments

@EpicWink
Copy link
Mannequin

EpicWink mannequin commented Apr 11, 2019

BPO 36602
Nosy @pitrou, @zooba, @pganssle, @tirkarthi, @EpicWink
PRs
  • bpo-36602: Allow pathlib.Path.iterdir to list recursively #12785
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2019-04-11.10:59:34.291>
    labels = ['type-feature', 'library', '3.9']
    title = 'Recursive directory list with pathlib.Path.iterdir'
    updated_at = <Date 2019-06-05.00:36:30.476>
    user = 'https://github.com/EpicWink'

    bugs.python.org fields:

    activity = <Date 2019-06-05.00:36:30.476>
    actor = 'Epic_Wink'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation = <Date 2019-04-11.10:59:34.291>
    creator = 'Epic_Wink'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 36602
    keywords = ['patch']
    message_count = 12.0
    messages = ['339959', '339969', '339997', '340003', '340012', '340042', '340072', '340135', '340144', '341059', '342007', '342687']
    nosy_count = 6.0
    nosy_names = ['pitrou', 'SilentGhost', 'steve.dower', 'p-ganssle', 'xtreak', 'Epic_Wink']
    pr_nums = ['12785']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue36602'
    versions = ['Python 3.9']

    @EpicWink
    Copy link
    Mannequin Author

    EpicWink mannequin commented Apr 11, 2019

    Currently, 'pathlib.Path.iterdir' can only list the contents of the instance directory. It is common to also want the contents of subdirectories recursively.

    The proposal is for 'pathlib.Path.iterdir' to have an argument 'recursive' which when 'True' will cause 'iterdir' to yield contents of subdirectories recursively.

    This would be trivial to implement as 'iterdir' can simply yield from subdirectories' 'iterdir'.

    A decision would have to be made whether to continue to yield the subdirectories, or skip them. Another decision would be for whether each path should be resolved before checking if it is a directory to be recursed into.

    @EpicWink EpicWink mannequin added 3.8 (EOL) end of life stdlib Python modules in the Lib dir type-feature A feature request or enhancement labels Apr 11, 2019
    @SilentGhost
    Copy link
    Mannequin

    SilentGhost mannequin commented Apr 11, 2019

    Is the behaviour you're proposing any different from using Path.rglob('*')?

    @pganssle
    Copy link
    Member

    Is the behaviour you're proposing any different from using Path.rglob('*')?

    I believe rglob("*") is eager, while iterdir is lazy.

    @Epic_Wink:

    This would be trivial to implement as 'iterdir' can simply yield from subdirectories' 'iterdir'.

    One thing you may need to worry about here is the fact that symlinks can have cycles, so you may need to do some cycle detection to avoid creating the dangerous possibility of infinite loops.

    There's also the question of whether you want this to be a depth-first or breadth-first traversal, and whether you would want both of these to be options.

    @tirkarthi
    Copy link
    Member

    I believe rglob("*") is eager, while iterdir is lazy.

    rglob and glob also return a generator.

    Slightly related, pathlib.walk was proposed in the past in python-ideas : https://mail.python.org/pipermail/python-ideas/2017-April/045398.html

    @EpicWink
    Copy link
    Mannequin Author

    EpicWink mannequin commented Apr 12, 2019

    Is the behaviour you're proposing any different from using Path.rglob('*')?

    By that logic, we should remove Path.iterdir() in favour of Path.glob('*'). In addition, having iterdir the way it is makes it easy for subclasses to extend its functionality (for example, one of my Path subclasses allows a callable to be passed to the iterdir which can filter paths)

    One thing you may need to worry about here is the fact that symlinks can have cycles, so you may need to do some cycle detection to avoid creating the dangerous possibility of infinite loops.

    I agree, which is the main reason the current implementation in the pull-request is to not resolve symlinks: users can subclass and implement symlink resolving if they want

    There's also the question of whether you want this to be a depth-first or breadth-first traversal, and whether you would want both of these to be options.

    As much as I want to say that I don't see a use-case for breadth-first file listing (when I list files, I expect the next file provided to be 'next to' the current file), users currently have no standard-library functionality to perform breadth-first searches as far as I know: they'd have to implement it themself or find it in a third-party library

    Slightly related, pathlib.walk was proposed in the past in python-ideas...

    I've never really liked the interface to walk, personal preference

    @pganssle
    Copy link
    Member

    rglob and glob also return a generator.

    My mistake, I didn't notice the sorted in the rglob documentation and thought it was emitting a list.

    By that logic, we should remove Path.iterdir() in favour of Path.glob('*').

    What is the case for why iterdir() is justified when Path.glob('*') exists? Is it just discoverability? Is there some efficiency reason to do it?

    Of course, removing things (which can break existing code) and failing to add them (which cannot) have two different thresholds for when they can take place, so even if we decide "iterdir() is to glob('*') as iterdir(recursive=True) is to rglob('*')", that doesn't mean that we should remove iterdir() entirely if recursive=True is not added.

    I agree, which is the main reason the current implementation in the pull-request is to not resolve symlinks: users can subclass and implement symlink resolving if they want

    I don't see that on the implementation here, but we can discuss this on the PR itself. I do think that skipping *all* symlinks automatically with no option to follow them will be counter-intuitive for people.

    I've never really liked the interface to walk, personal preference

    I kinda agree about the interface to walk, but it is worth noting that as we've seen in this thread, there are a bunch of plausible and slightly different ways to walk a directory: breadth-first or depth-first? following symlinks, following symlinks with cycle detection, or not following symlinks at all? emit the directory itself, or only emit its contents? It's worth taking into account that having two completely different complicated interfaces for recursively walking directories would be a usability challenge.

    @zooba
    Copy link
    Member

    zooba commented Apr 12, 2019

    Having spent more time than I'm proud of recursing through directories, I'd be happy enough with a convenience function that has sensible defaults.

    If I want breadth-first recursion (and I often do), I'll write it myself. I have a slight preference for getting all files in a directory before going deeper (which is not what the PR does), and I think that's most consistent with the current behaviour.

    I don't spend enough time dealing with symlinks to have strong opinions there, but given we have ways to resolve symlinks but not to get back to the original name (and I *have* had to deal with issues where I've needed to find the original name from the target :roll-eyes:) I'd say don't resolve anything eagerly.

    If there's an easy and well-known algorithm for detecting infinite symlink recursion (e.g. resolve and check if it's a parent of itself) then do that and skip it, but don't return the targets.

    @pganssle
    Copy link
    Member

    I don't spend enough time dealing with symlinks to have strong opinions there, but given we have ways to resolve symlinks but not to get back to the original name (and I *have* had to deal with issues where I've needed to find the original name from the target :roll-eyes:) I'd say don't resolve anything eagerly.

    You mean treating symlinks to directories like files? I suppose that's a possibility, but I do think it will end up being a source of bugs around symlinking.

    Admittedly, it is apparently what rglob('*') does (just tested it - apparently it won't follow symlinks to directories), though I think it might be a better interface to try to break cycles rather than not follow symlinks (particularly since iterdir currently treats symlinks to directories as directories).

    @EpicWink
    Copy link
    Mannequin Author

    EpicWink mannequin commented Apr 13, 2019

    Having iterdir(recursive=True) recurse into symlinks to directories would mean we would either not yield those symlinks, or we yield those symlinks and all other directories.

    I feel like not yielding directories is the way to go, but it's easy enough to check if a yielded path is a directory in application code.

    The current implementation of using recursion to list subdirectory contents doesn't seem to allow for the obvious implementation of symlink cycle-detection: keeping track of which (real) directories have been listed.

    PS: I've updated the pull-request to not follow symlinks to directories. This is not a final decision, but just updating to be in line with what I've implied up to this point

    @EpicWink
    Copy link
    Mannequin Author

    EpicWink mannequin commented Apr 29, 2019

    I've updated the pull-request to list directories pointed to by listed symbolic links, preventing cyclic listing. An extra instance method pathlib.Path._iterdir_recursive was added for this.

    We still need to decide whether to yield the directories themselves rather than just the files under those directories. Very easy to implement this.

    One thing I just realised is that a symlink can point to a subdirectory further down the chain. The current implementation will list the files under that directory using the symlink as prefix rather than the full path. Shouldn't matter though.

    @EpicWink
    Copy link
    Mannequin Author

    EpicWink mannequin commented May 9, 2019

    Would this change also have to copy implemented in the new ZipFile Pathlib API? https://bugs.python.org/issue36832

    @EpicWink
    Copy link
    Mannequin Author

    EpicWink mannequin commented May 17, 2019

    I think I may have broken bedevere-bot by request change reviews before the PR was assigned...

    @EpicWink EpicWink mannequin added 3.9 only security fixes and removed 3.8 (EOL) end of life labels Jun 5, 2019
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @barneygale
    Copy link
    Contributor

    Hey @EpicWink, we now have a Path.walk() method that recursively walks directory trees, using os.scandir() for efficiency - see #92517. This is faster than using rglob('*') in most cases, and it supports following symlinks. Do you think that would cover your original use case?

    @EpicWink
    Copy link
    Contributor

    walk's interface is a little bit clunky, but I've realised iterdir_recursive doesn't need to live in the standard library, as is trivial to include in a library or copy-paste between scripts.

    @AlexWaygood
    Copy link
    Member

    Closing, since we now have Path.walk(), and OP no longer has interest in the idea.

    If another core developer still wants to proceed with this, feel free to reopen!

    @AlexWaygood AlexWaygood closed this as not planned Won't fix, can't repro, duplicate, stale Jan 8, 2023
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.9 only security fixes stdlib Python modules in the Lib dir topic-pathlib type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    7 participants