diff --git a/docs/source/developer.rst b/docs/source/developer.rst index 6a6180138..311b23eb8 100644 --- a/docs/source/developer.rst +++ b/docs/source/developer.rst @@ -94,7 +94,28 @@ Implementing async ~~~~~~~~~~~~~~~~~~ Starting in version 0.7.5, we provide async operations for some methods -of some implementations. +of some implementations. Async support in storage implementations is +optional. Special considerations are required for async +development, see :doc:`async`. -This section will contain details on how to implement backends offering -async, once the details are ironed out on our end. +Developing the library +~~~~~~~~~~~~~~~~~~~~~~ + +The following can be used to install ``fsspec`` in development mode + +.. code-block:: + + git clone https://github.com/intake/filesystem_spec + cd filesystem_spec + pip install -e . + +A number of additional dependencies are required to run tests, see "ci/environment*.yml", as +well as Docker. Most implementation-specific tests should skip if their requirements are +not met. + +Development happens by submitting pull requests (PRs) on GitHub. +This repo adheres to flake8 and black coding conventions. You may wish to install +commit hooks if you intend to make PRs, as linting is done as part of the CI. + +Docs use sphinx and the numpy docstring style. Please add an entry to the changelog +along with any PR. diff --git a/docs/source/features.rst b/docs/source/features.rst index 4f23c97d9..0a582def4 100644 --- a/docs/source/features.rst +++ b/docs/source/features.rst @@ -1,18 +1,6 @@ Features of fsspec ================== -Consistent API to many different storage backends. The general API and functionality were -proven with the projects `s3fs`_ and `gcsfs`_ (along with `hdfs3`_ and `adlfs`_), within the -context of Dask and independently. These have been tried and tested by many users and shown their -usefulness over some years. 
``fsspec`` aims to build on these and unify their models, as well -as extract out file-system handling code from Dask which does not so comfortably fit within a -library designed for task-graph creation and their scheduling. - -.. _s3fs: https://s3fs.readthedocs.io/en/latest/ -.. _gcsfs: https://gcsfs.readthedocs.io/en/latest/ -.. _hdfs3: https://hdfs3.readthedocs.io/en/latest/ -.. _adlfs: https://docs.microsoft.com/en-us/azure/data-lake-store/ - Here follows a brief description of some features of note of ``fsspec`` that provides to make it an interesting project beyond some other file-system abstractions. @@ -50,20 +38,31 @@ the initiation of the context which actually does the work of creating file-like # f is now a real file-like object holding resources f.read(...) -Random Access and Buffering ---------------------------- - -The :func:`fsspec.spec.AbstractBufferedFile` class is provided as an easy way to build file-like -interfaces to some service which is capable of providing blocks of bytes. This class is derived -from in a number of the existing implementations. A subclass of ``AbstractBufferedFile`` provides -random access for the underlying file-like data (without downloading the whole thing) and -configurable read-ahead buffers to minimise the number of the read operations that need to be -performed on the back-end storage. +File Buffering and random access +-------------------------------- -This is also a critical feature in the big-data access model, where each sub-task of an operation +Most implementations create file objects which derive from ``fsspec.spec.AbstractBufferedFile``, and +have many behaviours in common. A subclass of ``AbstractBufferedFile`` provides +random access for the underlying file-like data (without downloading the whole thing). 
+This is a critical feature in the big-data access model, where each sub-task of an operation may need on a small part of a file, and does not, therefore want to be forced into downloading the whole thing. +These files offer buffering of both read and write operations, so that +communication with the remote resource is limited. The size of the buffer is generally configured +with the ``blocksize=`` kwarg at open time, although the implementation may have some minimum or +maximum sizes that need to be respected. + +For reading, a number of buffering schemes are available, listed in ``fsspec.caching.caches`` +(see :ref:`readbuffering`), or "none" for no buffering at all, e.g., for a simple read-ahead +buffer, you can do + +.. code-block:: python + + fs = fsspec.filesystem(...) + with fs.open(path, mode='rb', cache_type='readahead') as f: + use_for_something(f) + Transparent text-mode and compression ------------------------------------- @@ -195,25 +194,6 @@ is called, so that subsequent listing of the given paths will force a refresh. I addition, some methods like ``ls`` have a ``refresh`` parameter to force fetching the listing again. -File Buffering --------------- - -Most implementations create file objects which derive from ``fsspec.spec.AbstractBufferedFile``, and -have many behaviours in common. These files offer buffering of both read and write operations, so that -communication with the remote resource is limited. The size of the buffer is generally configured -with the ``blocksize=`` kwargs at open time, although the implementation may have some minimum or -maximum sizes that need to be respected. - -For reading, a number of buffering schemes are available, listed in ``fsspec.caching.caches`` -(see :ref:`readbuffering`), or "none" for no buffering at all, e.g., for a simple read-ahead -buffer, you can do - -.. code-block:: python - - fs = fsspec.filesystem(...) 
- with fs.open(path, mode='rb', cache_type='readahead') as f: - use_for_something(f) - URL chaining ------------ @@ -344,10 +324,10 @@ shown (or if none are selected, all files are shown). The interface provides the following outputs: -- ``.urlpath``: the currently selected item (if any) -- ``.storage_options``: the value of the kwargs box -- ``.fs``: the current filesystem instance -- ``.open_file()``: produces an ``OpenFile`` instance for the current selection +#. ``.urlpath``: the currently selected item (if any) +#. ``.storage_options``: the value of the kwargs box +#. ``.fs``: the current filesystem instance +#. ``.open_file()``: produces an ``OpenFile`` instance for the current selection Configuration ------------- @@ -388,16 +368,16 @@ the style ``FSSPEC_{protocol}_{kwargname}=value``. Configuration is determined in the following order, with later items winning: -- the contents of ini files, and json files in the config directory, sorted - alphabetically -- environment variables -- the contents of ``fsspec.config.conf``, which can be edited at runtime -- kwargs explicitly passed, whether with ``fsspec.open``, ``fsspec.filesystem`` - or directly instantiating the implementation class. +#. the contents of ini files, and json files in the config directory, sorted + alphabetically +#. environment variables +#. the contents of ``fsspec.config.conf``, which can be edited at runtime +#. kwargs explicitly passed, whether with ``fsspec.open``, ``fsspec.filesystem`` + or directly instantiating the implementation class. Asynchronous -============ +------------ Some implementations, those deriving from ``fsspec.asyn.AsyncFileSystem``, have async/coroutine implementations of some file operations. 
The async methods have diff --git a/docs/source/index.rst b/docs/source/index.rst index b986770bf..35cea83dd 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -1,58 +1,91 @@ -FSSPEC: Filesystem interfaces for Python +``fsspec``: Filesystem interfaces for Python ====================================== -Filesystem Spec (FSSPEC) is a project to unify various projects and classes to work with remote filesystems and -file-system-like abstractions using a standard pythonic interface. +Filesystem Spec (``fsspec``) is a project to provide a unified pythonic interface to +local, remote and embedded file systems and bytes storage. +Brief Overview +-------------- -.. _highlight: +There are many places to store bytes, from in memory, to the local disk, cluster +distributed storage, to the cloud. Many files also contain internal mappings of names to bytes, +maybe in a hierarchical directory-oriented tree. Working with all these different +storage media, and their associated libraries, is a pain. ``fsspec`` exists to +provide a familiar API that will work the same whatever the storage backend. +As much as possible, we iron out the quirks specific to each implementation, +so you need do no more than provide credentials for each service you access +(if needed) and thereafter not have to worry about the implementation again. 
-Highlights ----------- +Why +--- -- based on s3fs and gcsfs -- ``fsspec`` instances are serializable and can be passed between processes/machines -- the ``OpenFiles`` file-like instances are also serializable -- implementations provide random access, to enable only the part of a file required to be read; plus a template - to base other file-like classes on -- file access can use transparent compression and text-mode -- any file-system directory can be viewed as a key-value/mapping store -- if installed, all file-system classes also subclass from ``pyarrow.filesystem.FileSystem``, so - can work with any arrow function expecting such an instance -- writes can be transactional: stored in a temporary location and only moved to the final - destination when the transaction is committed -- FUSE: mount any path from any backend to a point on your file-system -- cached instances tokenised on the instance parameters +``fsspec`` provides two main concepts: a set of filesystem classes with uniform APIs +(i.e., functions such as ``cp``, ``rm``, ``cat``, ``mkdir``, ...) supplying operations on a range of +storage systems; and top-level convenience functions like :func:`fsspec.open`, to allow +you to quickly get from a URL to a file-like object that you can use with a third-party +library or your own code. -These are described further in the :doc:`features` section. +The section :doc:`background` gives motivation and history of this project, but +most users will want to skip straight to :doc:`usage` to find out how to use +the package and :doc:`features` to see the long list of added functionality +included along with the basic file-system interface. -Installation ------------- -.. code-block:: sh +Who uses ``fsspec``? +-------------------- - pip install fsspec +You can use ``fsspec``'s file objects with any python function that accepts +file objects, because of *duck typing*. -Not all included filesystems are usable by default without installing extra -dependencies. 
For example to be able to access data in S3: +You may well be using ``fsspec`` already without knowing it. +The following libraries use ``fsspec`` internally for path and file handling: -.. code-block:: sh +#. `Dask`_, the parallel, out-of-core and distributed + programming platform +#. `Intake`_, the data source cataloguing and loading + library and its plugins +#. `pandas`_, the tabular data analysis package +#. `xarray`_ and `zarr`_, multidimensional array + storage and labelled operations +#. `DVC`_, version control system + for machine learning projects + +``fsspec`` filesystems are also supported by: + +#. `pyarrow`_, the in-memory data layout engine + +... plus many more that we don't know about. + +.. _Dask: https://dask.org/ +.. _Intake: https://intake.readthedocs.io/ +.. _pandas: https://pandas.pydata.org/ +.. _xarray: http://xarray.pydata.org/ +.. _zarr: https://zarr.readthedocs.io/ +.. _DVC: https://dvc.org/ +.. _pyarrow: https://arrow.apache.org/docs/python/ - pip install fsspec[s3] -or +Installation +------------ + +``fsspec`` can be installed from PyPI or conda and has no dependencies of its own .. code-block:: sh + pip install fsspec conda install -c conda-forge fsspec -Implementations ---------------- +Not all filesystem implementations are available without installing extra +dependencies. For example, to be able to access data in GCS, you can use the optional +pip install syntax below, or install the specific package required -This repo contains several file-system implementations, see :ref:`implementations`. However, -the external projects ``s3fs`` and ``gcsfs`` depend on ``fsspec`` and share the same behaviours. -``Dask`` and ``Intake`` use ``fsspec`` internally for their IO needs. +.. code-block:: sh + pip install fsspec[gcs] + conda install -c conda-forge gcsfs + +``fsspec`` attempts to provide the right message when you attempt to use a filesystem +for which you need additional dependencies. 
The current list of known implementations can be found as follows .. code-block:: python @@ -61,12 +94,10 @@ The current list of known implementations can be found as follows known_implementations -These are only imported on request, which may fail if a required dependency is missing. The dictionary -``fsspec.registry`` contains all imported implementations, and can be mutated by user code, if necessary. .. toctree:: - :maxdepth: 2 + :maxdepth: 1 :caption: Contents: intro.rst @@ -76,11 +107,3 @@ These are only imported on request, which may fail if a required dependency is m async.rst api.rst changelog.rst - - -Indices and tables -================== - -* :ref:`genindex` -* :ref:`modindex` -* :ref:`search` diff --git a/docs/source/intro.rst b/docs/source/intro.rst index 204296706..9ebbf3fbc 100644 --- a/docs/source/intro.rst +++ b/docs/source/intro.rst @@ -1,11 +1,5 @@ -Introduction -============ - -To get stuck into using the package, rather than reading about its philosophy and history, you can -skip to :doc:`usage`. - Background ----------- +========== Python provides a standard interface for open files, so that alternate implementations of file-like object can work seamlessly with many function which rely only on the methods of that standard interface. A number of libraries @@ -21,7 +15,7 @@ other file-system implementations simpler. History ------- -I (Martin Durant) have been involved in building a number of remote-data file-system implementations, principally +We have been involved in building a number of remote-data file-system implementations, principally in the context of the `Dask`_ project. In particular, several are listed in `docs`_ with links to the specific repositories. 
With common authorship, there is much that is similar between the implementations, for example posix-like naming @@ -57,21 +51,21 @@ Influences The following places to consider, when choosing the definitions of how we would like the file-system specification to look: -- python's `os`_ module and its `path` namespace; also other file-connected - functionality in the standard library -- posix/bash method naming conventions that linux/unix/osx users are familiar with; or perhaps their Windows variants -- the existing implementations for the various backends (e.g., - `gcsfs`_ or Arrow's - `hdfs`_) -- `pyfilesystems`_, an attempt to do something similar, with a - plugin architecture. This conception has several types of local file-system, and a lot of well-thought-out - validation code. +#. python's `os`_ module and its `path` namespace; also other file-connected + functionality in the standard library +#. posix/bash method naming conventions that linux/unix/osx users are familiar with; or perhaps their Windows variants +#. the existing implementations for the various backends (e.g., + `gcsfs`_ or Arrow's + `hdfs`_) +#. `pyfilesystems`_, an attempt to do something similar, with a + plugin architecture. This conception has several types of local file-system, and a lot of well-thought-out + validation code. .. _os: https://docs.python.org/3/library/os.html .. _gcsfs: http://gcsfs.readthedocs.io/en/latest/api.html#gcsfs.core.GCSFileSystem .. _pyfilesystems: https://docs.pyfilesystem.org/en/latest/index.html -Not pyfilesystems? +Other similar work ------------------ It might have been conceivable to reuse code in ``pyfilesystems``, which has an established interface and several @@ -83,6 +77,11 @@ have an interface as close to those as possible. See a .. 
_discussion: https://github.com/intake/filesystem_spec/issues/5 +Other newer technologies such as `smart_open`_ and ``pyarrow``'s newer file-system rewrite also have some +parts of the functionality presented here, that might suit some use cases better. + +.. _smart_open: https://github.com/RaRe-Technologies/smart_open + Structure of the package ------------------------ diff --git a/docs/source/usage.rst b/docs/source/usage.rst index 2f2a08a8a..9169d8c1d 100644 --- a/docs/source/usage.rst +++ b/docs/source/usage.rst @@ -6,8 +6,8 @@ This is quick-start documentation to help people get familiar with the layout an Instantiate a file-system ------------------------- -``fsspec`` provides an abstract file-system interface as a template for other filesystems. In this context, -"interface" means an API for working with files on the given file-system, which can mean files on some +``fsspec`` provides an abstract file-system interface as a base class, to be used by other filesystems. +A file-system instance is an object for manipulating files on some remote store, local files, files within some wrapper, or anything else that is capable of producing file-like objects. @@ -30,7 +30,8 @@ Look-up via registry: fs = fsspec.filesystem('file') -Many filesystems also take extra parameters, some of which may be options - see :doc:`api`. +Many filesystems also take extra parameters, some of which may be options - see :doc:`api`, or use +:func:`fsspec.get_filesystem_class` to get the class object and inspect its docstring. .. code-block:: python @@ -48,43 +49,56 @@ full list: :class:`fsspec.spec.AbstractFileSystem`). Note that this quick-start will prefer posix-style naming, but many common operations are aliased: ``cp()`` and ``copy()`` are identical, for instance. Functionality is generally chosen to be as close to the builtin ``os`` module's working for things like -``glob`` as possible. +``glob`` as possible. The following block of operations should seem very familiar. + +.. 
code-block:: python + + fs.mkdir("/remote/output") + fs.touch("/remote/output/success") # creates empty file + assert fs.exists("/remote/output/success") + assert fs.isfile("/remote/output/success") + assert fs.cat("/remote/output/success") == b"" # get content as bytestring + fs.copy("/remote/output/success", "/remote/output/copy") + assert fs.ls("/remote/output", detail=False) == ["/remote/output/success", "/remote/output/copy"] + fs.rm("/remote/output", recursive=True) The ``open()`` method will return a file-like object which can be passed to any other library that expects -to work with python files. These will normally be binary-mode only, but may implement internal buffering +to work with python files, or used by your own code as you would a normal python file object. +These will normally be binary-mode only, but may implement internal buffering in order to limit the number of reads from a remote source. They respect the use of ``with`` contexts. If you have ``pandas`` installed, for example, you can do the following: .. code-block:: python - import fsspec - import pandas as pd + f = fs.open("/remote/path/notes.txt", "rb") + line = f.readline() # read to first b"\n" + f.seek(-10, 2) + foot = f.read() # read last 10 bytes of file + f.close() - with fsspec.open( - 'https://raw.githubusercontent.com/dask/' - 'fastparquet/master/test-data/nation.csv' - ) as f: + import pandas as pd + with fs.open('/remote/data/myfile.csv') as f: df = pd.read_csv(f, sep='|', header=None) Higher-level ------------ For many situations, the only function that will be needed is :func:`fsspec.open_files()`, which will return -:class:`fsspec.core.OpenFile` instances created from a single URL and parameters to pass to the backend. +:class:`fsspec.core.OpenFile` instances created from a single URL and parameters to pass to the backend(s). 
This supports text-mode and compression on the fly, and the objects can be serialized for passing between processes or machines (so long as each has access to the same backend file-system). The protocol (i.e., backend) is inferred from the URL passed, and glob characters are expanded in read mode (search for files) or write mode (create names). Critically, the file on the backend system is not actually opened until the -``OpenFile`` instance is used in a ``with`` context. For the example above: +``OpenFile`` instance is used in a ``with`` context. .. code-block:: python - of = fsspec.open( - 'https://raw.githubusercontent.com/dask/' - 'fastparquet/master/test-data/nation.csv', - mode='r', - ) - # files is a not-yet-open OpenFile object. The "with" context actually opens it + of = fsspec.open("github://dask:fastparquet@main/test-data/nation.csv", "rt") + # of is an OpenFile container object. The "with" context below actually opens it with of as f: # now f is a text-mode file - df = pd.read_csv(f, sep='|', header=None) + for line in f: + # iterate text lines + print(line) + if "KENYA" in line: + break diff --git a/fsspec/implementations/github.py b/fsspec/implementations/github.py index c3a07434f..6f146aca0 100644 --- a/fsspec/implementations/github.py +++ b/fsspec/implementations/github.py @@ -37,15 +37,22 @@ class GithubFileSystem(AbstractFileSystem): rurl = "https://raw.githubusercontent.com/{org}/{repo}/{sha}/{path}" protocol = "github" - def __init__(self, org, repo, sha="master", username=None, token=None, **kwargs): + def __init__(self, org, repo, sha=None, username=None, token=None, **kwargs): super().__init__(**kwargs) self.org = org self.repo = repo - self.root = sha if (username is None) ^ (token is None): raise ValueError("Auth required both username and token") self.username = username self.token = token + if sha is None: + # look up default branch (not necessarily "master") + u = "https://api.github.com/repos/{org}/{repo}" + r = 
requests.get(u.format(org=org, repo=repo), **self.kw) + r.raise_for_status() + sha = r.json()["default_branch"] + + self.root = sha self.ls("") @property