Python 3.10 or later script to help walk a directory tree following symlinks without infinite recursion

ekchew/safe_follow_symlinks

Safely Following Symbolic Links

Many tools that scan directory trees offer an option to follow symbolic links, but with the caveat that the tool may get stuck in infinite recursion should a link loop back on itself.

This tool walks directory trees while detecting and preventing symlink recursion.

It can also be run in a simpler mode to resolve a single path or list only the immediate members of a directory.

Package Requirements

Python 3.10 or later.

The package has no dependencies beyond the Python Standard Library.

symlinkwalk.py Executable Script

Basic usage of the command line script would look something like this:

> python3 symlinkwalk.py --resolve=path foo
d /full/path/to/foo

The --resolve (or -r) option lets you set the resolving mode to one of:

  • path
    • prints the absolute path to foo with any symlinks resolved
  • list
    • prints a listing of the foo directory, with one path per line
    • all member paths will be absolute and fully resolved
  • tree (the default)
    • prints the entire directory tree rooted at foo
    • follows a depth-first traversal order

Output Format

Each line printed to the standard output consists of a code, a single space, and an absolute path. This code may be one of:

  • d: path is to an existing directory
  • f: path is to an existing non-directory
  • m: path is to a missing object
  • b: path is to a broken symlink
  • r: path is to a recursive symlink
  • x: path was excluded by an --exclude pattern
  • u#: the same unique path was encountered # times, where # ≥ 2
    • this only appears in conjunction with the --unique-paths option

d and f lines are printed as the item is encountered, while the rest of the codes are gathered during execution and printed at the end. While an f path would most likely be to a regular file (hence the 'f'), it could potentially be something more exotic, such as a device or fifo.

An m path is missing at some level, but it is one you should be able to complete, for example as a directory with mkdir -p /path/to/missing/foo.
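
From Python, the same completion can be done with pathlib.Path.mkdir() and parents=True. This is a minimal sketch; complete_missing_dir is a hypothetical helper name used for illustration and is not part of this package.

```python
import tempfile
from pathlib import Path


def complete_missing_dir(missing):
    """Create `missing` and any absent parent directories (like `mkdir -p`)."""
    path = Path(missing)
    path.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return path


# Try it inside a throwaway directory so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    created = complete_missing_dir(Path(tmp) / "path" / "to" / "missing" / "foo")
    print(created.is_dir())
```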

The same cannot be said for b or r paths. In --resolve=path mode, if a broken/recursive symlink is encountered anywhere along the path you supply, the printed path will be an absolute path to the symlink itself with anything beyond it discarded. For example, say your path was foo/bar/baz, of which foo turned out to be a broken symlink. The output would be something like b /path/to/broken/foo. This is the path of concern you need to address, and the bar/baz part becomes superfluous at this point.

Regarding r paths, symlink recursion is only detected after the symlink has been followed once. If the path walk eventually circles back to the same symlink, that symlink is flagged as recursive with the r code. If several symlinks form a loop, only one of them may get flagged, which is enough to stop the infinite recursion. Be aware, then, that the algorithm may not exhaustively identify every recursive symlink.
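
To make the general idea of loop detection concrete, here is a small standard-library sketch. It is not this tool's algorithm: it uses a simpler ancestor check, refusing to descend into a directory whose real path is already being walked. The function name walk_following_links and the demo layout are made up for illustration, and creating symlinks is assumed to be permitted on the platform.

```python
import os
import tempfile
from pathlib import Path


def walk_following_links(root):
    """Yield (code, path) pairs depth-first, following directory symlinks.

    Codes: 'd' directory, 'f' non-directory, 'b' symlink that cannot be
    resolved, 'r' symlink leading back into a directory still being walked.
    """

    def visit(directory, active):
        yield "d", directory
        for name in sorted(os.listdir(directory)):
            path = directory / name
            if os.path.isdir(path):  # follows symlinks; False on errors
                real = Path(os.path.realpath(path))
                if real in active:
                    yield "r", path  # descending here would never terminate
                else:
                    yield from visit(real, active | {real})
            elif os.path.exists(path):
                yield "f", Path(os.path.realpath(path))
            else:
                yield "b", path  # dangling or otherwise unresolvable link

    start = Path(os.path.realpath(root))
    yield from visit(start, frozenset({start}))


# Demo: root/sub/loop is a symlink pointing back at root.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "root"
    (root / "sub").mkdir(parents=True)
    (root / "sub" / "file.txt").write_text("hello")
    os.symlink(root, root / "sub" / "loop", target_is_directory=True)
    for code, path in walk_following_links(root):
        print(code, path)
```

Because the check only looks at directories on the current descent, a directory reached twice through unrelated links is walked twice in this sketch rather than flagged.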

Command Line

path/to/symlinkwalk.py [-h] [-r MODE] [-x PATTERN] [-u] [TARGET ...]
  • TARGET
    • zero or more target paths
    • these may be either absolute or relative to the current working directory
    • defaults to the current working directory if no targets supplied
  • -r MODE or --resolve=MODE
    • MODE = one of path, list, or tree as described earlier
    • defaults to tree
  • -x PATTERN or --exclude=PATTERN
    • PATTERN = a glob pattern that can be used to exclude certain paths
    • you may specify multiple -x arguments
  • -u or --unique-paths
    • normally, the same path may appear multiple times in a listing
    • this may happen if you provide multiple targets that overlap
    • it may also happen if multiple symlinks point to the same location
    • -u should prevent such paths appearing in f or d lines more than once
    • you can still screen for u# lines to see which would have done so
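
Since every output line is a code, a single space, and an absolute path, the script's output is easy to post-process, for instance to screen for u# lines. The parser below is a sketch under one assumption: that the # in u# is written as decimal digits directly after the u. OutputLine and parse_line are hypothetical names, and the sample lines are invented.

```python
import re
from typing import NamedTuple


class OutputLine(NamedTuple):
    code: str   # one of d, f, m, b, r, x, u
    count: int  # hit count for u lines, otherwise 1
    path: str


# A code, exactly one space, then the rest of the line as the path.
_LINE_RE = re.compile(r"(?P<code>[dfmbrx]|u(?P<count>\d+)) (?P<path>.*)")


def parse_line(line):
    """Split one line of symlinkwalk.py output into its parts."""
    match = _LINE_RE.fullmatch(line.rstrip("\n"))
    if match is None:
        raise ValueError(f"unrecognized output line: {line!r}")
    if match["count"] is not None:
        return OutputLine("u", int(match["count"]), match["path"])
    return OutputLine(match["code"], 1, match["path"])


sample = [
    "d /full/path/to/foo",
    "f /full/path/to/foo/a.txt",
    "u2 /full/path/to/foo/a.txt",
]
repeated = [p for p in map(parse_line, sample) if p.code == "u"]
print(repeated)
```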

Python Package Interface

Note: The documentation in this section is structured like a tutorial. If you are looking for an API reference, consult pydoc instead.

The safe_follow_symlinks package is built around 2 central classes, each defined within its own module:

  • symlinkwalk.SymlinkWalk
  • support.pathref.PathRef

SymlinkWalk implements the core functionality, but uses PathRef exclusively when working with paths.

class PathRef

This class unifies a number of common types used to represent paths by the Python Standard Library.

Its sole ref attribute can be any of the following path-like types:

  • str
  • bytes
  • pathlib.PurePath (and all of its subclasses including pathlib.Path)
  • os.DirEntry (the type generated by iterating os.scandir())

PathRef itself is a path-like type. It includes 2 useful properties:

  • path (pathlib.Path, read-only)
    • the ref attribute in Path form
  • path_or_entry (pathlib.Path or os.DirEntry, read-only)
    • this can be useful since those two classes have a lot of APIs in common
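
To illustrate what unifying these types involves, the sketch below converts each of them to a pathlib.Path using only the standard library. It is not how PathRef is implemented; to_path is a hypothetical helper name.

```python
import os
import tempfile
from pathlib import Path, PurePath


def to_path(ref):
    """Convert a str, bytes, PurePath, or os.DirEntry into a pathlib.Path."""
    # os.fsdecode() accepts str, bytes, and any os.PathLike object;
    # PurePath and os.DirEntry both implement __fspath__().
    return Path(os.fsdecode(ref))


print(to_path("foo/bar"))
print(to_path(b"foo/bar"))
print(to_path(PurePath("foo", "bar")))

with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "example.txt").touch()
    with os.scandir(tmp) as entries:
        for entry in entries:
            print(to_path(entry).name)
```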

The built-in exists() method works much like pathlib.Path.exists(). You can call it on any PathRef.

For those generated by SymlinkWalk, you may also call:

  • is_broken_link(): path is a symlink pointing to nothing that exists?
    • exists() will return False in this case
    • this is in keeping with Path.exists() behaviour
  • is_recursive_link(): path is a symlink flagged as recursive?
    • note that a False return value does not imply it is not recursive
    • the SymlinkWalk algorithm may not have detected the recursion
  • is_bad_link(): path is a broken or recursive symlink
    • this simply combines the above 2 calls
  • is_bad_path(): path does not exist or is a bad link
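
The notion of a broken link can be reproduced with the standard library alone, which may help when reasoning about these methods. The sketch below is an analogy rather than the package's code; looks_like_broken_link is a hypothetical name, and the demo assumes symlinks can be created on the platform.

```python
import os
import tempfile
from pathlib import Path


def looks_like_broken_link(path):
    """True if `path` is a symlink whose target cannot be reached."""
    # os.path.islink() inspects the link itself, while os.path.exists()
    # follows it, so a dangling link is "a link" that "does not exist".
    return os.path.islink(path) and not os.path.exists(path)


with tempfile.TemporaryDirectory() as tmp:
    real_file = Path(tmp, "real.txt")
    real_file.touch()
    good = Path(tmp, "good_link")
    dangling = Path(tmp, "dangling_link")
    os.symlink(real_file, good)
    os.symlink(Path(tmp, "nowhere"), dangling)
    for candidate in (real_file, good, dangling):
        print(candidate.name, looks_like_broken_link(candidate))
```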

class SymlinkWalk

As mentioned earlier, this class implements the core functionality. Its public attributes can be divided into those you can optionally supply through its __init__() method as input and those generated as output when you call one of its primary methods.

The input args include:

  • path_filter (callback): accepts/rejects paths being iterated
  • yield_unique (bool): never yield the same path twice?
    • multiple symlinks can sometimes point to the same place

These args influence the iterating algorithms implemented by iter_dir() and iter_tree(). Suppose you only wanted to list zip files? You could write a path filter like this:

def zip_file_filter(pr: PathRef) -> bool:
    return pr.path.suffix == '.zip'
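
If you need several filters of this kind, a small factory keeps them consistent. The sketch below relies only on the path property described earlier; make_suffix_filter is a hypothetical helper, and a SimpleNamespace stands in for a real PathRef so the example runs without the package.

```python
from pathlib import Path
from types import SimpleNamespace


def make_suffix_filter(*suffixes):
    """Build a path filter accepting any of the given suffixes, ignoring case."""
    wanted = {suffix.lower() for suffix in suffixes}

    def path_filter(pr):
        # Only the final suffix is compared, so "a.tar.gz" has suffix ".gz".
        return pr.path.suffix.lower() in wanted

    return path_filter


archive_filter = make_suffix_filter(".zip", ".tar")

# Stand-in for a PathRef: only its `path` property is used by the filter.
stand_in = SimpleNamespace(path=Path("backups/2024.ZIP"))
print(archive_filter(stand_in))
```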

When instantiating a SymlinkWalk, you can optionally treat it as a context manager.

with SymlinkWalk(path_filter=zip_file_filter) as slw:
    for pr in slw.iter_tree('foo'):
        print(pr)

SymlinkWalk generates a lot of state, so this ensures that it gets released when you are done with it (or at least flagged for release by the garbage collector).

SymlinkWalk Methods

There are 3 primary methods you can call to resolve paths.

resolve_path(pathRef, expand_user=True, strict=False)

This is actually a class method, so you need not instantiate a SymlinkWalk to call it. (It does so internally.)

You give it a PathRef and it returns another PathRef containing an absolute version of your input with all symlinks resolved where possible.

pr = SymlinkWalk.resolve_path(PathRef('foo'))
if pr.is_bad_path():
    print('there is a problem with the path:', pr)

It is a good idea to call some of those PathRef methods on the result to make sure nothing went wrong during path resolution. Alternatively, you can use the strict=True option to have it raise an exception in that case.

iter_dir(dirPathRef, resolved=False, expand_user=True)

This is a generator that yields all paths in a directory in fully-resolved form, provided:

  1. The path is not rejected by your custom path_filter.
  2. The path can, in fact, be properly resolved to an existing object.
  3. Either yield_unique=False or this is the first encounter of the path.

If an entry fails one of these conditions, it is recorded in the skipped, bad_paths, or path_hits output attribute, respectively.
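
The third condition amounts to counting how often each resolved path turns up and yielding it only the first time. Here is a minimal sketch of that bookkeeping, assuming a Counter as the hit record; how the package actually stores path_hits is not documented here, and iter_unique is a hypothetical name.

```python
from collections import Counter


def iter_unique(paths, path_hits):
    """Yield each path on first sight only, counting every encounter."""
    for path in paths:
        path_hits[path] += 1
        if path_hits[path] == 1:
            yield path


hits = Counter()
resolved = ["/data/a", "/data/b", "/data/a", "/data/a"]  # invented paths
print(list(iter_unique(resolved, hits)))

# Paths seen at least twice correspond to the u# lines of the script output.
repeated = {path: count for path, count in hits.items() if count >= 2}
print(repeated)
```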

Let's say you wanted to implement your own recursive directory tree-walking using iter_dir(). It might look something like this:

def print_dir(dirPathRef, symLinkWalk, resolved=False):
    print(dirPathRef)
    for pr in symLinkWalk.iter_dir(dirPathRef, resolved):
        if pr.path_or_entry.is_dir():
            print_dir(pr, symLinkWalk, resolved=True)
        else:
            print(pr)

print_dir(PathRef('foo'), SymlinkWalk())

iter_dir()'s resolved argument defaults to False, meaning that what you are passing in as your PathRef needs to be resolved before the iteration can commence. That makes sense the first time you call your print_dir() function, but in recursive calls, anything that is coming out of iter_dir() should already be resolved. There is no need to do so again.

(Note that when resolved=True, expand_user is ignored.)

iter_tree(dirPathRef, resolved=False, expand_user=True)

This works much like iter_dir() except it walks through the whole directory tree for you. You could reduce the earlier example to this:

for pr in symLinkWalk.iter_tree(PathRef('foo')):
    print(pr)
