Revamp docs #674

Merged
merged 4 commits into from
Jun 17, 2021
27 changes: 24 additions & 3 deletions docs/source/developer.rst
Expand Up @@ -94,7 +94,28 @@ Implementing async
~~~~~~~~~~~~~~~~~~

Starting in version 0.7.5, we provide async operations for some methods
of some implementations.
of some implementations. Async support in storage implementations is
optional. Special considerations are required for async
development; see :doc:`async`.

This section will contain details on how to implement backends offering
async, once the details are ironed out on our end.
Developing the library
~~~~~~~~~~~~~~~~~~~~~~

The following can be used to install ``fsspec`` in development mode:

.. code-block:: sh

    git clone https://github.com/intake/filesystem_spec
    cd filesystem_spec
    pip install -e .

Running the tests requires a number of additional dependencies (see ``ci/environment*.yml``)
as well as Docker. Most implementation-specific tests should be skipped if their requirements are
not met.

Development happens by submitting pull requests (PRs) on GitHub.
This repo adheres to the flake8 and black coding conventions. You may wish to install
commit hooks if you intend to make PRs, as linting is done as part of the CI.

Docs use sphinx and the numpy docstring style. Please add an entry to the changelog
along with any PR.
84 changes: 32 additions & 52 deletions docs/source/features.rst
@@ -1,18 +1,6 @@
Features of fsspec
==================

Consistent API to many different storage backends. The general API and functionality were
proven with the projects `s3fs`_ and `gcsfs`_ (along with `hdfs3`_ and `adlfs`_), within the
context of Dask and independently. These have been tried and tested by many users and shown their
usefulness over some years. ``fsspec`` aims to build on these and unify their models, as well
as extract out file-system handling code from Dask which does not so comfortably fit within a
library designed for task-graph creation and their scheduling.

.. _s3fs: https://s3fs.readthedocs.io/en/latest/
.. _gcsfs: https://gcsfs.readthedocs.io/en/latest/
.. _hdfs3: https://hdfs3.readthedocs.io/en/latest/
.. _adlfs: https://docs.microsoft.com/en-us/azure/data-lake-store/

Here follows a brief description of some notable features that ``fsspec`` provides, which make
it an interesting project beyond some other file-system abstractions.

Expand Down Expand Up @@ -50,20 +38,31 @@ the initiation of the context which actually does the work of creating file-like
# f is now a real file-like object holding resources
f.read(...)

Random Access and Buffering
---------------------------

The :func:`fsspec.spec.AbstractBufferedFile` class is provided as an easy way to build file-like
interfaces to some service which is capable of providing blocks of bytes. This class is derived
from in a number of the existing implementations. A subclass of ``AbstractBufferedFile`` provides
random access for the underlying file-like data (without downloading the whole thing) and
configurable read-ahead buffers to minimise the number of the read operations that need to be
performed on the back-end storage.
File Buffering and random access
--------------------------------

This is also a critical feature in the big-data access model, where each sub-task of an operation
Most implementations create file objects which derive from ``fsspec.spec.AbstractBufferedFile``, and
have many behaviours in common. A subclass of ``AbstractBufferedFile`` provides
random access for the underlying file-like data (without downloading the whole thing).
This is a critical feature in the big-data access model, where each sub-task of an operation
may need only a small part of a file, and therefore does not want to be forced into downloading the
whole thing.
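
As a rough illustration of why random access matters here, the following sketch splits a file into contiguous byte ranges and lets each "sub-task" read only its own range. It uses only the standard library and a temporary local file; ``byte_ranges`` and ``read_part`` are hypothetical helper names for this example, not part of ``fsspec``.

```python
import os
import tempfile


def byte_ranges(size, n_parts):
    """Split ``size`` bytes into ``n_parts`` contiguous (offset, length) pairs."""
    base, extra = divmod(size, n_parts)
    ranges, offset = [], 0
    for i in range(n_parts):
        # the first ``extra`` parts take one additional byte each
        length = base + (1 if i < extra else 0)
        ranges.append((offset, length))
        offset += length
    return ranges


def read_part(path, offset, length):
    """Read a single byte range without touching the rest of the file."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


payload = b"0123456789" * 10  # 100 bytes of example data
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

try:
    # each tuple could be handed to a separate worker
    parts = [read_part(path, off, n) for off, n in byte_ranges(len(payload), 3)]
finally:
    os.remove(path)
```

The same ``seek``/``read`` pattern applies to any file-like object offering random access, so a remote file object with this interface could be substituted for the local file.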

These files offer buffering of both read and write operations, so that
communication with the remote resource is limited. The size of the buffer is generally configured
with the ``blocksize=`` kwarg at open time, although the implementation may have some minimum or
maximum sizes that need to be respected.

For reading, a number of buffering schemes are available, listed in ``fsspec.caching.caches``
(see :ref:`readbuffering`), or "none" for no buffering at all. For example, for a simple read-ahead
buffer, you can do:

.. code-block:: python

    fs = fsspec.filesystem(...)
    with fs.open(path, mode='rb', cache_type='readahead') as f:
        use_for_something(f)
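
To give a feel for what a read-ahead buffer buys you, here is a minimal standard-library sketch. It is not the code in ``fsspec.caching``; ``ReadAheadSketch`` and ``fetch`` are hypothetical names, and ``fetch(start, end)`` stands in for one round trip to the remote store.

```python
class ReadAheadSketch:
    """Serve sequential reads from a local buffer, refilling it on a miss."""

    def __init__(self, fetch, size, blocksize):
        self.fetch = fetch          # callable(start, end) -> bytes
        self.size = size            # total size of the remote object
        self.blocksize = blocksize  # how far to read ahead on each miss
        self.start = 0              # offset of the first buffered byte
        self.buffer = b""
        self.loc = 0                # current read position

    def read(self, length):
        end = min(self.loc + length, self.size)
        buffered_end = self.start + len(self.buffer)
        if self.loc < self.start or end > buffered_end:
            # miss: fetch the requested range plus one extra block
            self.start = self.loc
            self.buffer = self.fetch(self.loc, min(end + self.blocksize, self.size))
        offset = self.loc - self.start
        out = self.buffer[offset : offset + (end - self.loc)]
        self.loc = end
        return out


data = bytes(range(256)) * 4  # 1024 bytes standing in for a remote file
calls = []


def fetch(start, end):
    calls.append((start, end))  # record each simulated remote request
    return data[start:end]


f = ReadAheadSketch(fetch, size=len(data), blocksize=100)
chunks = [f.read(10) for _ in range(10)]
```

In this run, ten small reads are served by a single call to ``fetch``, which is the effect the ``blocksize=`` setting is meant to control.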

Transparent text-mode and compression
-------------------------------------

Expand Down Expand Up @@ -195,25 +194,6 @@ is called, so that subsequent listing of the given paths will force a refresh. I
addition, some methods like ``ls`` have a ``refresh`` parameter to force fetching
the listing again.

File Buffering
--------------

Most implementations create file objects which derive from ``fsspec.spec.AbstractBufferedFile``, and
have many behaviours in common. These files offer buffering of both read and write operations, so that
communication with the remote resource is limited. The size of the buffer is generally configured
with the ``blocksize=`` kwargs at open time, although the implementation may have some minimum or
maximum sizes that need to be respected.

For reading, a number of buffering schemes are available, listed in ``fsspec.caching.caches``
(see :ref:`readbuffering`), or "none" for no buffering at all, e.g., for a simple read-ahead
buffer, you can do

.. code-block:: python

fs = fsspec.filesystem(...)
with fs.open(path, mode='rb', cache_type='readahead') as f:
use_for_something(f)

URL chaining
------------

Expand Down Expand Up @@ -344,10 +324,10 @@ shown (or if none are selected, all files are shown).

The interface provides the following outputs:

- ``.urlpath``: the currently selected item (if any)
- ``.storage_options``: the value of the kwargs box
- ``.fs``: the current filesystem instance
- ``.open_file()``: produces an ``OpenFile`` instance for the current selection
#. ``.urlpath``: the currently selected item (if any)
#. ``.storage_options``: the value of the kwargs box
#. ``.fs``: the current filesystem instance
#. ``.open_file()``: produces an ``OpenFile`` instance for the current selection

Configuration
-------------
Expand Down Expand Up @@ -388,16 +368,16 @@ the style ``FSSPEC_{protocol}_{kwargname}=value``.

Configuration is determined in the following order, with later items winning:

- the contents of ini files, and json files in the config directory, sorted
alphabetically
- environment variables
- the contents of ``fsspec.config.conf``, which can be edited at runtime
- kwargs explicitly passed, whether with ``fsspec.open``, ``fsspec.filesystem``
or directly instantiating the implementation class.
#. the contents of ini files, and json files in the config directory, sorted
   alphabetically
#. environment variables
#. the contents of ``fsspec.config.conf``, which can be edited at runtime
#. kwargs explicitly passed, whether with ``fsspec.open``, ``fsspec.filesystem``
   or directly instantiating the implementation class.
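
The "later items winning" rule can be pictured as a series of dictionary updates. The sketch below is only an illustration of that ordering, not the code ``fsspec`` uses; ``resolve_config`` and all keys and values shown are made up for the example.

```python
def resolve_config(*layers):
    """Merge layers in order; keys in later layers override earlier ones."""
    merged = {}
    for layer in layers:
        merged.update(layer)
    return merged


# Hypothetical settings for one protocol, lowest precedence first
from_files = {"anon": True, "default_block_size": 1024}  # ini/json files
from_env = {"default_block_size": 4096}                  # environment variables
from_conf = {"anon": False}                              # runtime edits to the conf dict
explicit_kwargs = {"default_block_size": 8192}           # passed by the caller

settings = resolve_config(from_files, from_env, from_conf, explicit_kwargs)
```

Here ``anon`` ends up with the value set at runtime, and ``default_block_size`` with the explicitly passed value, because those layers come last among the ones that define each key.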


Asynchronous
============
------------

Some implementations, those deriving from ``fsspec.asyn.AsyncFileSystem``, have
async/coroutine implementations of some file operations. The async methods have
Expand Down
113 changes: 68 additions & 45 deletions docs/source/index.rst
@@ -1,58 +1,91 @@
FSSPEC: Filesystem interfaces for Python
``fsspec``: Filesystem interfaces for Python
============================================

Filesystem Spec (FSSPEC) is a project to unify various projects and classes to work with remote filesystems and
file-system-like abstractions using a standard pythonic interface.
Filesystem Spec (``fsspec``) is a project to provide a unified pythonic interface to
local, remote and embedded file systems and bytes storage.

Brief Overview
--------------

.. _highlight:
There are many places to store bytes: in memory, on the local disk, in cluster
distributed storage, or in the cloud. Many files also contain internal mappings of names to bytes,
maybe in a hierarchical directory-oriented tree. Working with all these different
storage media, and their associated libraries, is a pain. ``fsspec`` exists to
provide a familiar API that will work the same whatever the storage backend.
As much as possible, we iron out the quirks specific to each implementation,
so that you need do no more than provide credentials for each service you access
(if needed), and thereafter need not worry about the implementation again.

Highlights
----------
Why
---

- based on s3fs and gcsfs
- ``fsspec`` instances are serializable and can be passed between processes/machines
- the ``OpenFiles`` file-like instances are also serializable
- implementations provide random access, to enable only the part of a file required to be read; plus a template
to base other file-like classes on
- file access can use transparent compression and text-mode
- any file-system directory can be viewed as a key-value/mapping store
- if installed, all file-system classes also subclass from ``pyarrow.filesystem.FileSystem``, so
can work with any arrow function expecting such an instance
- writes can be transactional: stored in a temporary location and only moved to the final
destination when the transaction is committed
- FUSE: mount any path from any backend to a point on your file-system
- cached instances tokenised on the instance parameters
``fsspec`` provides two main concepts: a set of filesystem classes with uniform APIs
(i.e., functions such as ``cp``, ``rm``, ``cat``, ``mkdir``, ...) supplying operations on a range of
storage systems; and top-level convenience functions like :func:`fsspec.open`, to allow
you to quickly get from a URL to a file-like object that you can use with a third-party
library or your own code.

These are described further in the :doc:`features` section.
The section :doc:`background` gives motivation and history of this project, but
most users will want to skip straight to :doc:`usage` to find out how to use
the package and :doc:`features` to see the long list of added functionality
included along with the basic file-system interface.

Installation
------------

.. code-block:: sh
Who uses ``fsspec``?
--------------------

pip install fsspec
You can use ``fsspec``'s file objects with any python function that accepts
file objects, because of *duck typing*.
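
As a small illustration of duck typing with file objects, the standard-library sketch below passes two unrelated objects to the same function; all that matters is that each has a ``read`` method. ``MinimalReader`` and ``load_records`` are hypothetical names for this example only.

```python
import io
import json


class MinimalReader:
    """Not a real file: it only provides the one method that is needed."""

    def __init__(self, text):
        self._text = text

    def read(self):
        return self._text


def load_records(f):
    """Accept anything file-like; ``json.load`` only calls ``f.read()``."""
    return json.load(f)


from_stringio = load_records(io.StringIO('{"a": 1}'))
from_custom = load_records(MinimalReader('{"a": 1}'))
```

A file object opened on a remote store can be handed to such a function in the same way, provided it offers the methods the function actually calls.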

Not all included filesystems are usable by default without installing extra
dependencies. For example to be able to access data in S3:
You may well be using ``fsspec`` already without knowing it.
The following libraries use ``fsspec`` internally for path and file handling:

.. code-block:: sh
#. `Dask`_, the parallel, out-of-core and distributed
   programming platform
#. `Intake`_, the data source cataloguing and loading
   library and its plugins
#. `pandas`_, the tabular data analysis package
#. `xarray`_ and `zarr`_, multidimensional array
   storage and labelled operations
#. `DVC`_, version control system
   for machine learning projects

``fsspec`` filesystems are also supported by:

#. `pyarrow`_, the in-memory data layout engine

... plus many more that we don't know about.

.. _Dask: https://dask.org/
.. _Intake: https://intake.readthedocs.io/
.. _pandas: https://pandas.pydata.org/
.. _xarray: http://xarray.pydata.org/
.. _zarr: https://zarr.readthedocs.io/
.. _DVC: https://dvc.org/
.. _pyarrow: https://arrow.apache.org/docs/python/

pip install fsspec[s3]

or
Installation
------------

``fsspec`` can be installed from PyPI or conda, and has no dependencies of its own:

.. code-block:: sh

    pip install fsspec
    conda install -c conda-forge fsspec

Implementations
---------------
Not all filesystem implementations are available without installing extra
dependencies. For example, to be able to access data in GCS, you can use the optional
pip install syntax below, or install the specific package required:

This repo contains several file-system implementations, see :ref:`implementations`. However,
the external projects ``s3fs`` and ``gcsfs`` depend on ``fsspec`` and share the same behaviours.
``Dask`` and ``Intake`` use ``fsspec`` internally for their IO needs.
.. code-block:: sh

    pip install fsspec[gcs]
    conda install -c conda-forge gcsfs

``fsspec`` attempts to give a helpful error message when you try to use a filesystem
for which you need additional dependencies.
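
One way to picture this behaviour is a registry that imports a backend only when it is requested, and swaps a bare import failure for an installation hint. The sketch below is a standard-library illustration under that assumption, not the registry code in ``fsspec``; ``KNOWN``, ``get_backend`` and the module and package names are invented for the example.

```python
import importlib

# Hypothetical registry: protocol -> module to import, plus a hint on failure
KNOWN = {
    "json": {"module": "json", "err": "json ships with Python"},
    "demo": {
        "module": "fsspec_docs_sketch_missing_backend",
        "err": "Install the extra dependency with: pip install example-backend",
    },
}


def get_backend(protocol):
    """Import the backend on request, raising a helpful message if it is missing."""
    entry = KNOWN[protocol]
    try:
        return importlib.import_module(entry["module"])
    except ImportError as exc:
        raise ImportError(entry["err"]) from exc


backend = get_backend("json")  # importable, so the module is returned

message = None
try:
    get_backend("demo")  # not installed, so the hint is raised instead
except ImportError as exc:
    message = str(exc)
```

Deferring the import in this way means that a missing optional dependency only affects users who actually ask for that backend.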
The current list of known implementations can be found as follows

.. code-block:: python
Expand All @@ -61,12 +94,10 @@ The current list of known implementations can be found as follows

known_implementations

These are only imported on request, which may fail if a required dependency is missing. The dictionary
``fsspec.registry`` contains all imported implementations, and can be mutated by user code, if necessary.


.. toctree::
:maxdepth: 2
:maxdepth: 1
:caption: Contents:

intro.rst
Expand All @@ -76,11 +107,3 @@ These are only imported on request, which may fail if a required dependency is m
async.rst
api.rst
changelog.rst


Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
35 changes: 17 additions & 18 deletions docs/source/intro.rst
@@ -1,11 +1,5 @@
Introduction
============

To get stuck into using the package, rather than reading about its philosophy and history, you can
skip to :doc:`usage`.

Background
----------
==========

Python provides a standard interface for open files, so that alternate implementations of file-like objects can
work seamlessly with many functions which rely only on the methods of that standard interface. A number of libraries
Expand All @@ -21,7 +15,7 @@ other file-system implementations simpler.
History
-------

I (Martin Durant) have been involved in building a number of remote-data file-system implementations, principally
We have been involved in building a number of remote-data file-system implementations, principally
in the context of the `Dask`_ project. In particular, several are listed
in `docs`_ with links to the specific repositories.
With common authorship, there is much that is similar between the implementations, for example posix-like naming
Expand Down Expand Up @@ -57,21 +51,21 @@ Influences
The following are places to consider when choosing the definitions of how we would like the file-system specification
to look:

- python's `os`_ module and its `path` namespace; also other file-connected
functionality in the standard library
- posix/bash method naming conventions that linux/unix/osx users are familiar with; or perhaps their Windows variants
- the existing implementations for the various backends (e.g.,
`gcsfs`_ or Arrow's
`hdfs`_)
- `pyfilesystems`_, an attempt to do something similar, with a
plugin architecture. This conception has several types of local file-system, and a lot of well-thought-out
validation code.
#. python's `os`_ module and its ``path`` namespace; also other file-connected
   functionality in the standard library
#. posix/bash method naming conventions that linux/unix/osx users are familiar with; or perhaps their Windows variants
#. the existing implementations for the various backends (e.g.,
   `gcsfs`_ or Arrow's
   `hdfs`_)
#. `pyfilesystems`_, an attempt to do something similar, with a
   plugin architecture. This conception has several types of local file-system, and a lot of well-thought-out
   validation code.

.. _os: https://docs.python.org/3/library/os.html
.. _gcsfs: http://gcsfs.readthedocs.io/en/latest/api.html#gcsfs.core.GCSFileSystem
.. _pyfilesystems: https://docs.pyfilesystem.org/en/latest/index.html

Not pyfilesystems?
Other similar work
------------------

It might have been conceivable to reuse code in ``pyfilesystems``, which has an established interface and several
Expand All @@ -83,6 +77,11 @@ have an interface as close to those as possible. See a

.. _discussion: https://github.com/intake/filesystem_spec/issues/5

Other newer technologies, such as `smart_open`_ and ``pyarrow``'s newer file-system rewrite, also offer
parts of the functionality presented here and might suit some use cases better.

.. _smart_open: https://github.com/RaRe-Technologies/smart_open

Structure of the package
------------------------
