Description
We need low-allocating high-performance extensibility points to build solutions for enumerating files.
(This is the API review for dotnet/designs#24)
Rationale and Usage
Enumerating files in .NET provides limited configurability. You can specify a simple DOS-style pattern and whether or not to search recursively. More complicated filtering requires post-filtering all results, which can introduce a significant performance drain.
Recursive enumeration is also problematic in that there is no way to handle error states such as access issues or cycles created by links.
These restrictions have a significant impact on file-system-intensive applications, a key example being MSBuild. This document proposes a new set of primitive file and directory traversal APIs that provide more flexibility while keeping overhead to a minimum, so that enumeration becomes both more powerful and more performant.
To write a wrapper that gets files with a given set of extensions you would need to write something similar to:
public static IEnumerable<string> GetFilePathsWithExtensions(string directory, bool recursive, params string[] extensions)
{
    return new DirectoryInfo(directory)
        .GetFiles("*", recursive ? SearchOption.AllDirectories : SearchOption.TopDirectoryOnly)
        .Where(f => extensions.Any(e => f.Name.EndsWith(e, StringComparison.OrdinalIgnoreCase)))
        .Select(r => r.FullName);
}
This is not complicated to write, but it can perform an enormous number of extra allocations: a full path string and a FileInfo are created for every single item in the file system. The extension point cuts this down significantly:
public static IEnumerable<string> GetFileFullPathsWithExtension(string directory,
    bool recursive, params string[] extensions)
{
    return new FileSystemEnumerable<string>(
        directory,
        (ref FileSystemEntry entry) => entry.ToFullPath(),
        new EnumerationOptions() { RecurseSubdirectories = recursive })
    {
        ShouldIncludePredicate = (ref FileSystemEntry entry) =>
        {
            if (entry.IsDirectory) return false;
            foreach (string extension in extensions)
            {
                if (Path.GetExtension(entry.FileName).EndsWith(extension, StringComparison.OrdinalIgnoreCase))
                    return true;
            }
            return false;
        }
    };
}
The reduction in allocations with the above solution is significant:
- No FileInfo allocations
- No full path string allocations for paths that don't match
- No filename allocations for paths that don't match (as the filename will still be in the native buffer at this point)
Note that while you can write a solution that doesn't allocate a FileInfo by using the string[] APIs and GetFullPath(), it would still allocate unneeded strings and introduce costly normalization overhead.
Proposed API
namespace System.IO
{
public static partial class Directory
{
public static IEnumerable<string> EnumerateDirectories(string path, string searchPattern, EnumerationOptions enumerationOptions);
public static IEnumerable<string> EnumerateFiles(string path, string searchPattern, EnumerationOptions enumerationOptions);
public static IEnumerable<string> EnumerateFileSystemEntries(string path, string searchPattern, EnumerationOptions enumerationOptions);
public static string[] GetDirectories(string path, string searchPattern, EnumerationOptions enumerationOptions);
public static string[] GetFiles(string path, string searchPattern, EnumerationOptions enumerationOptions);
public static string[] GetFileSystemEntries(string path, string searchPattern, EnumerationOptions enumerationOptions);
}
public sealed partial class DirectoryInfo
{
public IEnumerable<DirectoryInfo> EnumerateDirectories(string searchPattern, EnumerationOptions enumerationOptions);
public IEnumerable<FileSystemInfo> EnumerateFileSystemInfos(string searchPattern, EnumerationOptions enumerationOptions);
public IEnumerable<FileInfo> EnumerateFiles(string searchPattern, EnumerationOptions enumerationOptions);
public DirectoryInfo[] GetDirectories(string searchPattern, EnumerationOptions enumerationOptions);
public FileInfo[] GetFiles(string searchPattern, EnumerationOptions enumerationOptions);
public FileSystemInfo[] GetFileSystemInfos(string searchPattern, EnumerationOptions enumerationOptions);
}
public enum MatchType
{
/// <summary>
/// Match using '*' and '?' wildcards.
/// </summary>
Simple,
/// <summary>
/// Match using DOS style matching semantics. '*', '?', '<', '>', and '"'
/// are all considered wildcards.
/// </summary>
Dos
}
public enum MatchCasing
{
/// <summary>
/// Match the default casing for the given platform
/// </summary>
PlatformDefault,
/// <summary>
/// Match respecting character casing
/// </summary>
CaseSensitive,
/// <summary>
/// Match ignoring character casing
/// </summary>
CaseInsensitive
}
public class EnumerationOptions
{
/// <summary>
/// Should we recurse into subdirectories while enumerating?
/// Default is false.
/// </summary>
public bool RecurseSubdirectories { get; set; }
/// <summary>
/// Skip files/directories when access is denied (e.g. UnauthorizedAccessException/SecurityException).
/// Default is true.
/// </summary>
public bool IgnoreInaccessible { get; set; }
/// <summary>
/// Suggested buffer size, in bytes. Default is 0 (no suggestion).
/// </summary>
/// <remarks>
/// Not all platforms use user allocated buffers, and some require either fixed buffers or a
/// buffer that has enough space to return a full result. One scenario where this option is
/// useful is with remote share enumeration on Windows. Having a large buffer may result in
/// better performance as more results can be batched over the wire (e.g. over a network
/// share). A "large" buffer, for example, would be 16K. Typical is 4K.
///
/// We will not use the suggested buffer size if it has no meaning for the native APIs on the
/// current platform or if it would be too small for getting at least a single result.
/// </remarks>
public int BufferSize { get; set; }
/// <summary>
/// Skip entries with the given attributes. Default is FileAttributes.Hidden | FileAttributes.System.
/// </summary>
public FileAttributes AttributesToSkip { get; set; }
/// <summary>
/// For APIs that allow specifying a match expression this will allow you to specify how
/// to interpret the match expression.
/// </summary>
/// <remarks>
/// The default is simple matching where '*' is always 0 or more characters and '?' is a single character.
/// </remarks>
public MatchType MatchType { get; set; }
/// <summary>
/// For APIs that allow specifying a match expression this will allow you to specify case matching behavior.
/// </summary>
/// <remarks>
/// Default is to match platform defaults, which are gleaned from the case sensitivity of the temporary folder.
/// </remarks>
public MatchCasing MatchCasing { get; set; }
/// <summary>
/// Set to true to return "." and ".." directory entries. Default is false.
/// </summary>
public bool ReturnSpecialDirectories { get; set; }
}
}
namespace System.IO.Enumeration
{
public ref struct FileSystemEntry
{
/// <summary>
/// The full path of the directory this entry resides in.
/// </summary>
public ReadOnlySpan<char> Directory { get; }
/// <summary>
/// The full path of the root directory used for the enumeration.
/// </summary>
public ReadOnlySpan<char> RootDirectory { get; }
/// <summary>
/// The root directory for the enumeration as specified in the constructor.
/// </summary>
public ReadOnlySpan<char> OriginalRootDirectory { get; }
public ReadOnlySpan<char> FileName { get; }
public FileAttributes Attributes { get; }
public long Length { get; }
public DateTimeOffset CreationTimeUtc { get; }
public DateTimeOffset LastAccessTimeUtc { get; }
public DateTimeOffset LastWriteTimeUtc { get; }
public bool IsDirectory { get; }
public FileSystemInfo ToFileSystemInfo();
/// <summary>
/// Returns the full path for find results, based on the initially provided path.
/// </summary>
public string ToSpecifiedFullPath();
/// <summary>
/// Returns the full path of the find result.
/// </summary>
public string ToFullPath();
}
public abstract class FileSystemEnumerator<TResult> : CriticalFinalizerObject, IEnumerator<TResult>
{
public FileSystemEnumerator(string directory, EnumerationOptions options = null);
/// <summary>
/// Return true if the given file system entry should be included in the results.
/// </summary>
protected virtual bool ShouldIncludeEntry(ref FileSystemEntry entry);
/// <summary>
/// Return true if the directory entry given should be recursed into.
/// </summary>
protected virtual bool ShouldRecurseIntoEntry(ref FileSystemEntry entry);
/// <summary>
/// Generate the result type from the current entry.
/// </summary>
protected abstract TResult TransformEntry(ref FileSystemEntry entry);
/// <summary>
/// Called whenever the end of a directory is reached.
/// </summary>
/// <param name="directory">The path of the directory that finished.</param>
protected virtual void OnDirectoryFinished(ReadOnlySpan<char> directory);
/// <summary>
/// Called when a native API returns an error. Return true to continue, or false
/// to throw the default exception for the given error.
/// </summary>
/// <param name="error">The native error code.</param>
protected virtual bool ContinueOnError(int error);
public TResult Current { get; }
object IEnumerator.Current { get; }
public bool MoveNext();
public void Reset();
public void Dispose();
protected virtual void Dispose(bool disposing);
}
/// <summary>
/// Enumerable that allows utilizing custom filter predicates and transform delegates.
/// </summary>
public class FileSystemEnumerable<TResult> : IEnumerable<TResult>
{
public FileSystemEnumerable(string directory, FindTransform transform, EnumerationOptions options = null);
public FindPredicate ShouldRecursePredicate { get; set; }
public FindPredicate ShouldIncludePredicate { get; set; }
public IEnumerator<TResult> GetEnumerator();
IEnumerator IEnumerable.GetEnumerator();
/// <summary>
/// Delegate for filtering out find results.
/// </summary>
public delegate bool FindPredicate(ref FileSystemEntry entry);
/// <summary>
/// Delegate for transforming raw find data into a result.
/// </summary>
public delegate TResult FindTransform(ref FileSystemEntry entry);
}
public static class FileSystemName
{
/// <summary>
/// Change unescaped '*' and '?' to '<', '>' and '"' to match Win32 behavior. For compatibility, Windows
/// changes some wildcards to provide a closer match to historical DOS 8.3 filename matching.
/// </summary>
public static string TranslateDosExpression(string expression);
/// <summary>
/// This matcher uses the Windows wildcards (which includes `*`, `?`, `>`, `<`, and `"`).
/// </summary>
public static bool MatchesDosExpression(ReadOnlySpan<char> expression, ReadOnlySpan<char> name, bool ignoreCase = true);
/// <summary>
/// This matcher will only process `*` and `?`.
/// </summary>
public static bool MatchesSimpleExpression(ReadOnlySpan<char> expression, ReadOnlySpan<char> name, bool ignoreCase = true);
}
}
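As a rough illustration of how the new Directory overloads and EnumerationOptions might be combined, here is a minimal sketch written against the API as proposed above. The helper name FindLogFiles and the "*.log" pattern are assumptions made for the example, not part of the proposal.

```csharp
using System.Collections.Generic;
using System.IO;

public static class EnumerationOptionsExample
{
    // Hypothetical helper: recursively lists *.log files, including hidden
    // and system files, and ignoring the casing of the extension.
    public static IEnumerable<string> FindLogFiles(string root)
    {
        var options = new EnumerationOptions
        {
            RecurseSubdirectories = true,
            IgnoreInaccessible = true,      // skip folders we cannot open
            AttributesToSkip = 0,           // override the Hidden | System default
            MatchType = MatchType.Simple,
            MatchCasing = MatchCasing.CaseInsensitive
        };

        return Directory.EnumerateFiles(root, "*.log", options);
    }
}
```

The result is still a lazily evaluated IEnumerable<string>, so callers can stop early without paying for the rest of the tree.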
Implementation Notes
Changes to existing behavior
- Match expressions will no longer consider 8.3 filenames
  - This obscure behavior is costly and gives unexpected results
  - 8.3 filename generation is not always on, can be disabled
  - *.htm will no longer match *.html if 8.3 filenames exist
- Option defaults (when calling new APIs)
  - System & hidden files/directories are skipped by default
  - Access denied folders are skipped by default (no errors are thrown)
  - Simple matching is used by default (*.* means any file with a period, foo.* matches foo.txt, not foo)
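The simple matching rules described in the list above can be checked directly with the proposed FileSystemName.MatchesSimpleExpression. This is only a sketch; the wrapper class and method names are assumptions for the example.

```csharp
using System.IO.Enumeration;

public static class SimpleMatchingExample
{
    // Thin hypothetical wrapper so the calls below read naturally with strings.
    public static bool Matches(string expression, string name) =>
        FileSystemName.MatchesSimpleExpression(expression, name, ignoreCase: true);
}
```

Under simple matching, Matches("*.*", "foo.txt") and Matches("foo.*", "foo.txt") succeed, while Matches("*.*", "foo") and Matches("foo.*", "foo") do not, because the period in the expression must be present in the name.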
FileSystemEnumerable
- directory and transform will throw ArgumentNullException if null.
- If predicates are not specified, all entries will be accepted
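To show how the two predicates might work together, the following sketch skips whole subtrees by directory name while returning only files. The helper name FileNamesOutside and its skipName parameter are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Enumeration;

public static class RecursePredicateExample
{
    // Hypothetical helper: returns the names of all files under root, without
    // descending into any directory whose name equals skipName.
    public static IEnumerable<string> FileNamesOutside(string root, string skipName)
    {
        return new FileSystemEnumerable<string>(
            root,
            (ref FileSystemEntry entry) => entry.FileName.ToString(),
            new EnumerationOptions { RecurseSubdirectories = true })
        {
            // Directories are never returned as results...
            ShouldIncludePredicate = (ref FileSystemEntry entry) => !entry.IsDirectory,
            // ...and directories named skipName are never opened.
            ShouldRecursePredicate = (ref FileSystemEntry entry) =>
                !entry.FileName.Equals(skipName, StringComparison.OrdinalIgnoreCase)
        };
    }
}
```

Because the name comparison runs on the span, no string is allocated for directories that are skipped.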
FileSystemEnumerator
- directory and transform will throw ArgumentNullException if null.
- all directory entries will be returned before processing additional directories (e.g. subdirectories)
- order of directory entries is not guaranteed
- timing of opening subdirectories is not guaranteed
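For scenarios that need more control than the delegate-based enumerable, deriving from FileSystemEnumerator<TResult> is the other extension point. The sketch below is one possible shape, assuming the members proposed above; the LargeFileEnumerator and LargeFiles names and the size-based filter are invented for the example.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Enumeration;

// Hypothetical enumerator: yields full paths of files of at least a given size.
public sealed class LargeFileEnumerator : FileSystemEnumerator<string>
{
    private readonly long _minimumLength;

    // Native error codes seen during enumeration, recorded instead of thrown.
    public List<int> Errors { get; } = new List<int>();

    public LargeFileEnumerator(string directory, long minimumLength)
        : base(directory, new EnumerationOptions { RecurseSubdirectories = true })
    {
        _minimumLength = minimumLength;
    }

    protected override bool ShouldIncludeEntry(ref FileSystemEntry entry) =>
        !entry.IsDirectory && entry.Length >= _minimumLength;

    protected override string TransformEntry(ref FileSystemEntry entry) =>
        entry.ToFullPath();

    protected override bool ContinueOnError(int error)
    {
        Errors.Add(error);
        return true; // keep enumerating rather than throwing
    }
}

public static class LargeFiles
{
    // Drains the enumerator into a list and disposes it when done.
    public static List<string> Find(string directory, long minimumLength)
    {
        var results = new List<string>();
        using (var enumerator = new LargeFileEnumerator(directory, minimumLength))
        {
            while (enumerator.MoveNext())
                results.Add(enumerator.Current);
        }
        return results;
    }
}
```

Since the order of entries is not guaranteed, callers that need a stable order should sort the returned list themselves.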
FileSystemEntry
- translation of data that has non-trivial cost will be lazily done (applies specifically to Unix)
  - properties that require an additional OS call
  - UTF-8 to UTF-16 conversion
- initial property values can potentially be unexpected based on timing of accessing data (i.e. the underlying file could disappear)
- property values will not change after being accessed
- FileSystemEntry should not be cached
  - FileName will only contain valid data for the duration of filter/transform calls, hence the struct being passed by ref
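Because the entry is only valid during the filter and transform calls, anything that must outlive the callback has to be copied out inside the transform. A minimal sketch of that pattern follows; the Snapshot helper and the tuple shape are assumptions for the example.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Enumeration;

public static class EntrySnapshotExample
{
    // Hypothetical helper: copies the name and length of each file out of the
    // entry while the entry is still valid.
    public static IEnumerable<(string Name, long Length)> Snapshot(string directory)
    {
        return new FileSystemEnumerable<(string Name, long Length)>(
            directory,
            // ToString() copies the span into a new string, so the result no
            // longer refers to the enumerator's buffer.
            (ref FileSystemEntry entry) => (entry.FileName.ToString(), entry.Length))
        {
            ShouldIncludePredicate = (ref FileSystemEntry entry) => !entry.IsDirectory
        };
    }
}
```

The string allocation happens only for entries that pass the include predicate, which is the point of doing the copy in the transform rather than in the filter.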
Matchers
- Matchers will support escaping of supported wildcards and \ using the \ character
  - \*, \\, \? (and \>, \<, \" for MatchesDosExpression)
- Empty expression will match all
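As a small illustration of the escaping rule above, the sketch below matches a literal asterisk in a name. The class and method names are hypothetical, and the behavior shown assumes the matcher treats \* as described in these notes.

```csharp
using System.IO.Enumeration;

public static class EscapingExample
{
    // The expression a\*b escapes the '*', so it should match only the literal
    // name "a*b" rather than any name that starts with 'a' and ends with 'b'.
    public static bool IsLiteralStarName(string name) =>
        FileSystemName.MatchesSimpleExpression(@"a\*b", name);
}
```

With this expression, IsLiteralStarName("a*b") is expected to succeed while IsLiteralStarName("axb") is not, since the escaped character is compared literally.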