.. _slep_012:

==========
InputArray
==========

:Author: Adrin Jalali
:Status: Draft
:Type: Standards Track
:Created: 2019-12-20

Motivation
**********

This proposal offers a solution for propagating feature names through
transformers, pipelines, and the column transformer. Ideally, we would have::

    df = pd.read_csv('tabular.csv')
    # transforming the data in an arbitrary way
    transformer0 = ColumnTransformer(...)
    # a pipeline preprocessing the data and then a classifier (or a regressor)
    clf = make_pipeline(transformer0, ..., SVC())

    # now we can investigate features at each stage of the pipeline
    clf[-1].input_feature_names_

The feature names are propagated through the pipeline, and the user can
investigate them at each step.

This proposal suggests adding a new data structure, called ``InputArray``,
which augments the data array ``X`` with additional meta-data. In this proposal
we assume the feature names (and other potential meta-data) are attached to the
data when passed to an estimator. Alternative solutions are discussed later in
this document.

A main constraint on this data structure is that it should be backward
compatible, *i.e.* code which expects a ``numpy.ndarray`` as the output of a
transformer should not break. This SLEP focuses on *feature names* as the only
meta-data attached to the data. Support for other meta-data can be added later.

Backward/NumPy/Pandas Compatibility
***********************************

Since transformers currently return a ``numpy`` or a ``scipy`` array, backward
compatibility in this context means the operations which are valid on those
arrays should also be valid on the new data structure.

All operations are delegated to the *data* part of the container: the
meta-data is lost immediately after each operation, and the result is a plain
``numpy.ndarray``. This includes indexing and slicing; *i.e.*, to avoid
performance degradation, ``__getitem__`` is not overloaded, and if the user
wishes to preserve the meta-data, they shall do so by explicitly calling a
method such as ``select()``. Operations between two ``InputArray``\ s will not
try to align rows and/or columns of the two given objects.

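This delegation can be sketched with a ``numpy.ndarray`` subclass. The
following is only an illustration under stated assumptions: the subclassing
hooks used here and the ``select()`` body are not prescribed by this SLEP:

```python
import numpy as np

class InputArray(np.ndarray):
    """Illustrative sketch only: an ndarray carrying optional feature names."""

    def __new__(cls, data, feature_names=None):
        obj = np.asarray(data).view(cls)
        if feature_names is not None:
            feature_names = np.asarray(feature_names, dtype=object)
        obj.feature_names = feature_names
        return obj

    def __array_finalize__(self, obj):
        # Views and slices created by numpy do not inherit the names:
        # the meta-data is lost immediately, as the proposal requires.
        self.feature_names = None

    def __array_wrap__(self, out_arr, context=None, return_scalar=False):
        # Any ufunc or reduction returns a plain ndarray, not an InputArray.
        return np.asarray(out_arr)

    def select(self, columns):
        # Explicit column selection is the one operation that keeps the
        # (aligned) feature names attached to the result.
        data = np.asarray(self)[:, columns]
        names = None if self.feature_names is None else self.feature_names[columns]
        return InputArray(data, names)
```

With this sketch, ``X + 1`` and ``X.sum(axis=0)`` come back as plain arrays,
``X[0:2]`` comes back without the names, and only ``X.select([1])`` keeps the
aligned names.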
``pandas`` compatibility would ideally come as ``pd.DataFrame(inputarray)``,
for which ``pandas`` does not provide a clean API at the moment.
Alternatively, ``inputarray.todataframe()`` would return a
``pandas.DataFrame`` with the relevant meta-data attached.

Feature Names
*************

Feature names are an object ``ndarray`` of strings aligned with the columns.
They can be ``None``.

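Concretely (an illustration, not prescribed API), the names for a two-column
``X`` could look like:

```python
import numpy as np

# One name per column of X, stored as an object-dtype ndarray:
feature_names = np.asarray(["sepal_length", "sepal_width"], dtype=object)

# Alternatively, the meta-data may be absent altogether:
feature_names_absent = None
```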
Operations
**********

Estimators understand the ``InputArray`` and extract the feature names from the
given data before applying their operations and transformations to the data.

All transformers return an ``InputArray`` with feature names attached to it.
The way feature names are generated is discussed in *SLEP007 - The Style of The
Feature Names*.

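As a sketch of this contract (``NamedArray`` and ``DemoScaler`` are
hypothetical stand-ins used only for illustration), a transformer would read
the names off the input in ``fit`` and attach names to its output in
``transform``; centering keeps the columns, so the names simply pass through
here:

```python
import numpy as np

class NamedArray:
    """Minimal stand-in for the proposed container (illustration only)."""
    def __init__(self, data, feature_names=None):
        self.data = np.asarray(data)
        self.feature_names = feature_names

class DemoScaler:
    """Sketch of the pattern: extract names in fit, attach names in transform."""

    def fit(self, X, y=None):
        # Works whether or not the input carries feature names.
        self.input_feature_names_ = getattr(X, "feature_names", None)
        self.mean_ = np.asarray(getattr(X, "data", X)).mean(axis=0)
        return self

    def transform(self, X):
        Xt = np.asarray(getattr(X, "data", X)) - self.mean_
        # Centering keeps the columns, so the names pass through unchanged;
        # other transformers would generate output names per SLEP007.
        return NamedArray(Xt, self.input_feature_names_)
```

A plain ``np.ndarray`` input still works; its output simply carries no names.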
Sparse Arrays
*************

Ideally sparse arrays would follow the same pattern, but since ``scipy.sparse``
does not provide the kind of API that ``numpy`` provides, we may need to find
compromises.

Factory Methods
***************

There will be factory methods creating an ``InputArray`` given a
``pandas.DataFrame``, an ``xarray.DataArray``, a plain ``np.ndarray``, or an
``sp.SparseMatrix``, together with a given set of feature names.

An ``InputArray`` can also be converted to a ``pandas.DataFrame`` using a
``todataframe()`` method.

``X`` being an ``InputArray``::

    >>> np.array(X)
    >>> X.todataframe()
    >>> pd.DataFrame(X)  # only if pandas implements the API

And given ``X`` an ``np.ndarray`` or an ``sp.sparse`` matrix and a set of
feature names, one can make the right ``InputArray`` using::

    >>> make_inputarray(X, feature_names)

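A rough sketch of such a factory pair follows; only the names
``make_inputarray`` and ``todataframe`` come from this SLEP, while the
placeholder container and the function bodies below are assumptions:

```python
import numpy as np

class InputArrayStub:
    """Minimal placeholder for the proposed container (illustration only)."""
    def __init__(self, data, feature_names=None):
        self.data = data
        self.feature_names = feature_names

def make_inputarray(X, feature_names=None):
    # DataFrame-like inputs already carry their own column names.
    if hasattr(X, "columns"):
        return InputArrayStub(np.asarray(X), [str(c) for c in X.columns])
    # ndarray or scipy.sparse inputs rely on explicitly passed names;
    # sparse data is kept as-is rather than densified.
    return InputArrayStub(X, feature_names)

def todataframe(ia):
    # Requires pandas; hands the names over as DataFrame columns.
    import pandas as pd
    return pd.DataFrame(ia.data, columns=ia.feature_names)
```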
Alternative Solutions
*********************

Since we expect the feature names to be attached to the data given to an
estimator, there are a few potential approaches we could take:

- ``pandas`` in, ``pandas`` out: this means we expect the user to give the data
  as a ``pandas.DataFrame``, and if so, the transformer would output a
  ``pandas.DataFrame`` which also includes the [generated] feature names. This
  is not a feasible solution since ``pandas`` plans to move to a per-column
  representation, which means ``pd.DataFrame(np.asarray(df))`` incurs two
  guaranteed memory copies.
- ``XArray``: we could accept a ``pandas.DataFrame``, and use
  ``xarray.DataArray`` as the output of transformers, including feature names.
  However, ``xarray`` has a hard dependency on ``pandas``, and uses
  ``pandas.Index`` to handle row labels and aligns rows when an operation
  between two ``xarray.DataArray`` objects is performed, which can be time
  consuming and is not the semantic expected in ``scikit-learn``; we only
  expect the number of rows to be equal, and that the rows always correspond to
  one another in the same order.

As a result, we need another data structure which we'll use to transfer
data-related information (such as feature names), which is lightweight and
doesn't interfere with existing user code.

Another alternative to the problem of passing meta-data around is to pass it as
a parameter to ``fit``. This would involve heavily modifying meta-estimators,
since they'd need to pass that information along, and to extract the relevant
information from their estimators to pass it to the next estimator. Our
prototype implementations showed significant challenges compared to attaching
the meta-data to the data.