-
-
Notifications
You must be signed in to change notification settings - Fork 49
Feedback on new files() and Traverseable API #90
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
In GitLab by @jaraco on Mar 17, 2020, 23:06 Thanks for the feedback. I don't think it's too late to change the implementation. It'll be harder once it lands on Python 3.9. Let me quickly answer your feedback.
Feel free to add them. I'm not very good at these annotations, so someone more experienced with them would be better suited to implement them.
I can see how it would be a bit jarring when you're coming with the prior expectation, but based on my experience with
It's not strictly defined, but I'd say through some reasonable exception and consider raising the same exception that pathlib.Path does when opening a directory.
I'm not either. It came late to the party because I found it was necessary to implement
I tried to document this, but even
In the original design without I'm not opposed to limiting the scope of I may not have answered all your questions, but I think this gives some food for thought. How would you like to proceed? Would you like to draft a proposed change? |
In GitLab by @gregory.szorc on Mar 18, 2020, 21:40
I would propose something that distinguishes between "directories" (which are containers) and files/resources, where the latter cannot cross boundaries from the former. Essentially, you would have a hierarchy of virtual directories/packages. Each of these would map to an iterable of children virtual directories/packages and files/resources. If traversing a filesystem, you could resolve the available children directories/packages by iterating a directory and classifying each entry as a file or directory. In API form: class MetaPathFinder:
def get_resource_reader(self, fullname) -> ResourceReader
"""Obtain an item that can load resources for a fully qualified Python package."""
class ResourceReader:
# Holds the string name of the virtual Python package this is a resource loader for.
# For filesystems, this is effectively a reader for a directory at `fullname.replace('.', '/')`
fullname = None
def child_readers(self) -> [ResourceReader]
"""Obtain an iterable of ResourceReader for available child virtual packages of this one.
On filesystems, this essentially returns instances corresponding to immediate child directories.
"""
def resources(self) -> [str]
"""Obtain available named resources for this virtual package.
On filesystems, this essentially returns files in the current directory.
TODO consider returning a special type that exposes an `open()`, etc.
""""
def open_binary(self, resource) -> File
"""Obtain a File-like for a named resource.
On filesystems, this attempts to open os.path.join(self, resource).
Attempting to open a non-resource entity (such as a subdirectory) or a missing
resource raises NotAResourceError.
""" There are plenty of details to bikeshed (such as whether we need a special type to represent the names of resources to abstract over filesystem encoding oddities). But something like this where this is a clear line between collections of resources and resources themselves is where I would start.
For resource loaders that don't perform filesystem I/O, this possibly is a leaky abstraction, since
I would strongly advocate for the existence of an API that allows incremental resource reading via a File-like interface. Having an
New concern: allowing writable handles. Why? The interface we used to obtain a handle is named I concede that allowing text mode operations is convenient. And
I think defining the interface in terms of I think it is fine to model the interface of the returned object against what
I provided a suggested API in this comment that I'd like to bikeshed. |
In GitLab by @jaraco on Mar 19, 2020, 21:52
This approach runs in conflict with the goal to limit the number of interfaces a Python programmer needs to reason about. If you More importantly, a
Compare that with:
or
I'm far from convinced there is value in separating directories from resources, especially given that most users have a mental model and reference implementation in
You're right, writing is not a concern. That was an early consideration that's no longer a consideration.
The directory tree abstraction works very well for what resources are trying to expose. Previously, importlib demanded that all resources be "files", and the main deficiency was that it didn't support "subdirectories". That's the abstraction the users are looking for. There's no expectation that the resources are backed by a file system, but they should be represented by directories of those resources. Why not call them files? zipp.Path is one proven implementation that isn't backed by a filesystem (example of accessing contents over the network). Can you conceive of a resource provider that would provide the resources in directories where the I do sympathize. While I was working on zipp.Path, I struggled a bit with the logical mismatch between zip files and a file system (such as how to reliably represent a directory).
This is the intent of
If this were the case, then the methods that an implementer would need to implement would be Does that help? Can we focus on refining the interface definition for |
In GitLab by @gregory.szorc on Mar 22, 2020, 14:21 I'm not arguing that having a
With all due respect, custom implementations that use zip files are still effectively backed by filesystems because zip files (or tar archives) contain entry metadata similar to what a regular filesystem would expose. So you just aren't going to see the kinds of abstract problems I'm worried about if zip files are your test subject. I think a better test subject is something like PyOxidizer, which wants to back resources in memory by a Rust
From the perspective of a resource reader that isn't backed by anything resembling a traditional filesystem, to keep things simple we need:
I would encourage you to implement a custom resource reader as follows:
I think if you attempt to do this with the new |
In GitLab by @brettcannon on Mar 23, 2020, 19:27 In regards to resource name encoding, import itself doesn't take an opinion on that, so in general that doesn't have an answer in Python. |
In GitLab by @gregory.szorc on Mar 23, 2020, 21:31
To be nuanced, it is undefined behavior. And that is arguably an opinion in of itself. In reality, the importer has imposed restrictions on the use of encodings:
And in case you are wondering, Python is capable of importing non-ascii module names from the filesystem:
This is a cool feature. But the engineer in me wants module and resource names to be limited to a subset of ASCII because any code point that encodes differently depending on the filesystem encoding is potential for non-determinism and non-portable packages. But the human in me empathizes with non-English speakers (I wouldn't be surprised if non-English speakers used language/locale native module names for some projects) and there can be legitimate reasons for having non-ASCII names. e.g. even as an English speaker I have encountered many test files with filenames outside ASCII that exist to facilitate encoding testing. But this encoding ship has sailed a long time ago and probably can't be reined in. PyOxidizer will likely normalize resource names to UTF-8 and this seems reasonable to me. Any additional restrictions imposed by Python would be desirable from a simplicity and compatibility perspective, but aren't strictly required. |
In GitLab by @jaraco on Mar 24, 2020, 10:43 All good considerations. Thanks for writing them up. I do think that the zipfile defies your expectation, as it presents only a flat list of entries for the entire zip file. There's no tree structure and there's no index, only a pile of files. I believe it falls prey to all of the implementation concerns you raise, including performance concerns, many of which were addressed after the release of Python 3.8 (example). I'm warming up to the idea that
Let's defer this concern for later. It's an anti-goal for me to overly constrain resource names. In particular, I want to aim to support Unicode text wherever possible, even if that means there are edge cases where non-ASCII won't be supported. I went through a similar experience with the path package. I acknowledge that some operating systems and container formats allow creating files or file-like objects with names that don't reliably decode to Unicode text. I consider those use-cases unsupported, and I'm certainly unwilling to compromise the functionality here to accommodate those environments without an extremely compelling use-case. Practicality beats purity here.
I'm not sure this restriction is needed. It's already the case that python module files are resources of its containing package, so why would it be necessary or desirable to restrict access to packages? I suggest we start with a modest proposal to create the lower-level interface similar to your proposal above and we can deal with constraining that interface in a separate discussion/proposal. |
In GitLab by @jaraco on Mar 24, 2020, 11:36 mentioned in merge request !95 |
In GitLab by @jaraco on Mar 24, 2020, 11:37 In the feature/90-low-level-reader branch, I've taken Gregory's proposed low-level interface, implemented as |
In GitLab by @brettcannon on Mar 24, 2020, 12:53
I can solve that potential lack of surprise by telling you people do. 😄 I brought this up because there's a bug filed against importlib about how we don't attempt to normalize names to NKFD or any other specific, normalized form for comparison. So I didn't mean to imply there weren't any restrictions (sorry about that if my comment read that way), just that it isn't fully defined to the point that there's no potential for conflicts and I didn't want you caught off-guard with PyOxidizer by that. |
In GitLab by @jaraco on Mar 30, 2020, 16:47 I'm eager to get feedback on !95 if you have time to review. |
In GitLab by @gregory.szorc on Apr 5, 2020, 11:56 I left some feedback on !95. Overall it is looking good! FWIW I recently made a number of tweaks to PyOxidizer's resource loading code. You may find the following documentation useful to peruse to better understand some of the challenges PyOxidizer faces: |
In GitLab by @jaraco on Jun 13, 2020, 11:29 mentioned in merge request !103 |
@indygreg The 5.1.0 release includes the |
I do want to say, with respect to the leaky abstraction in the files protocol, I do see that as a tradeoff between simplicity and correctness. A correct implementation would indeed be to wrap or transform existing zipfile.Path and pathlib.Path objects into something that only implements the required protocol, but that added complexity is likely to introduce bugs and additional challenges. Moreover, the increasing adoption of type-checkers should help protect against the leaky abstractions. Lacking engagement on this issue, I'm going to close it without prejudice. Happy to revisit later. |
Is this API supposed to superseed the existing ¹ Based on some comments on gitlab I suspect that the |
My intention/expectation is for the Agreed, if docs are unclear or could be enhanced, I'd welcome those changes. One challenge is that the docs for Regarding |
In GitLab by @gregory.szorc on Mar 15, 2020, 18:04
I'm attempting to implement the new
files()
andTraverseable
APIs in PyOxidizer and am struggling to figure it out.First, the new classes in
abc.py
don't have type annotations. I think I can work them out. But being explicit would be much appreciated. Having type annotations would also match the rest of the file.Second, the overloading of
Traverseable
to be both a resource loader and an actual resource is a bit confusing. Elsewhere in theimportlib
world, there is a separation between the interface for loading things and the things returned by that interface. It is weird to me that a singleTraverseable
both represents the returned value fromfiles()
(a logical collection of resources or a mechanism to obtain resources) as well as a handle on an individual resource. I would like for these interfaces to be teased apart, if possible. More on this later.Third - and this is related to the last point - I just don't understand how you are supposed to handle directories in all cases. Because
Traversable
is both a loader and a handle on an individual resource, what are we supposed to do for operations like callingTraverseable.open()
on an object that represents a directory? Are we supposed to returnNone
? Raise anOsError
orIoError
or whatever the OS emits when you try to open a directory? Do we assume the caller callsis_dir()
/is_file()
and isn't dumb enough to callopen()
on a directory? But what if they do?Fourth, I don't understand why there exists an
open
,read_text
, andread_bytes
onTraverseable
.read_text
andread_bytes
can be implemented in terms ofopen()
. So why are they a) abstract b) part of the interface in the first place? If we're going to provideopen()
, I think it makes sense to expose helper functions forread_text
andread_bytes
that are implemented in terms ofopen()
.Fifth, to be honest, I'm not a huge fan of
open()
because it puts a lot of burden on custom implementations. What subset ofPath.open()
needs to be implemented?Path.open()
is supposed to acceptmode
,buffering
,encoding
,errors
, andnewline
arguments. Since these are all*args, **kwrags
-awayed in the interface and the docs only call outmode
, what is supposed to be supported? Is it really reasonable for custom implementations (which may not be able to simply call out tobuiltins.open()
because they aren't using a traditional filesystem) to have to implement all this functionality? I would highly encourage reducing the scope ofopen()
to binary file reading. Remove buffering, encoding, errors, and newline support. If people really want to buffer or read resources as text, they can useio.BufferedReader
and/orio.TextIOWrapper
. And of course, helper APIs that wrap low-level binary-only file-like objects can exist to makeread_text()
a trivial one-liner. Or consider making the text reading APIs optional and having the default resource reading code go through a polyfill that usesio.*
wrappers accordingly.I want to say the new
files()
+Traverseable
API is a net improvement and thank everybody responsible. But at this point in time, I feel like the pile of warts I described in #58 has more or less been replaced by a different pile of warts :/ I failed to see the new API being designed and feel like I missed an opportunity to influence its design towards something more reasonable for advanced use cases like PyOxidizer. For that I feel bad and apologize. If it isn't too late to make backwards-incompatible changes to the interface, I would strongly vote for reducing the scope ofopen()
and removingread_text
andread_bytes
. If we do that and clarify the interface and semantics a bit more, I think this new API will be clearly better than what came before.The text was updated successfully, but these errors were encountered: