Commit 37ab0c1

Merge pull request #25 from adrinjalali/slep012/DataArray

SLEP012 - DataArray

2 parents 34a82fd + 48ce7f4

File tree: 2 files changed, +136 -0 lines changed

slep012/proposal.rst

Lines changed: 135 additions & 0 deletions
@@ -0,0 +1,135 @@
.. _slep_012:

==========
InputArray
==========

:Author: Adrin Jalali
:Status: Draft
:Type: Standards Track
:Created: 2019-12-20

Motivation
**********

This proposal results in a solution to propagating feature names through
transformers, pipelines, and the column transformer. Ideally, we would have::

    df = pd.read_csv('tabular.csv')
    # transforming the data in an arbitrary way
    transformer0 = ColumnTransformer(...)
    # a pipeline preprocessing the data and then a classifier (or a regressor)
    clf = make_pipeline(transformer0, ..., SVC())

    # now we can investigate features at each stage of the pipeline
    clf[-1].input_feature_names_

The feature names are propagated throughout the pipeline and the user can
investigate them at each step of the pipeline.

This proposal suggests adding a new data structure, called ``InputArray``,
which augments the data array ``X`` with additional meta-data. In this proposal
we assume the feature names (and other potential meta-data) are attached to the
data when passed to an estimator. Alternative solutions are discussed later in
this document.

A main constraint of this data structure is that it should be backward
compatible, *i.e.* code which expects a ``numpy.ndarray`` as the output of a
transformer would not break. This SLEP focuses on *feature names* as the only
meta-data attached to the data. Support for other meta-data can be added later.

Backward/NumPy/Pandas Compatibility
***********************************

Since transformers currently return a ``numpy`` or a ``scipy`` array, backward
compatibility in this context means that the operations which are valid on
those arrays should also be valid on the new data structure.

All operations are delegated to the *data* part of the container, and the
meta-data is lost immediately after each operation: operations result in a
plain ``numpy.ndarray``. This includes indexing and slicing, *i.e.* to avoid
performance degradation, ``__getitem__`` is not overloaded; if the user
wishes to preserve the meta-data, they shall do so by explicitly calling a
method such as ``select()``. Operations between two ``InputArray``\ s will not
try to align rows and/or columns of the two given objects.

``pandas`` compatibility would ideally come as ``pd.DataFrame(inputarray)``,
for which ``pandas`` does not provide a clean API at the moment. Alternatively,
``inputarray.todataframe()`` would return a ``pandas.DataFrame`` with the
relevant meta-data attached.
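
A minimal sketch of how such a container could behave, assuming a
``numpy.ndarray`` subclass; ``select()``, ``todataframe()`` and the
feature-names attribute are named in this document, everything else below is
illustrative::

    import numpy as np
    import pandas as pd

    class InputArray(np.ndarray):
        """Sketch: an ndarray whose meta-data is dropped by any operation."""

        def __new__(cls, data, feature_names=None):
            # assumes a 2-D feature matrix
            obj = np.asarray(data).view(cls)
            if feature_names is not None:
                feature_names = np.asarray(feature_names, dtype=object)
            obj.feature_names = feature_names
            return obj

        def __array_wrap__(self, out_arr, context=None, return_scalar=False):
            # results of ufuncs and arithmetic degrade to a plain ndarray,
            # so no (possibly stale) meta-data is carried over
            return np.asarray(out_arr)

        def __getitem__(self, key):
            # indexing/slicing is not overloaded: it returns a plain ndarray
            return np.asarray(np.ndarray.__getitem__(self, key))

        def select(self, columns):
            # the one explicit way to keep the meta-data around
            names = None
            if self.feature_names is not None:
                names = self.feature_names[columns]
            return InputArray(np.asarray(self)[:, columns], names)

        def todataframe(self):
            # pandas compatibility: attach the meta-data to a DataFrame
            return pd.DataFrame(np.asarray(self), columns=self.feature_names)

With this sketch, ``X + 1`` and ``X[0]`` return plain ``numpy.ndarray``
objects, while ``X.select([0, 2])`` returns a new ``InputArray`` carrying the
corresponding feature names.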

Feature Names
*************

Feature names are an object ``ndarray`` of strings aligned with the columns.
They can be ``None``.

Operations
**********

Estimators understand the ``InputArray`` and extract the feature names from the
given data before applying the operations and transformations on the data.

All transformers return an ``InputArray`` with feature names attached to it.
The way feature names are generated is discussed in *SLEP007 - The Style of The
Feature Names*.
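
A minimal sketch, reusing the illustrative ``InputArray`` class above, of a
transformer following these two rules; ``DemeanTransformer`` is hypothetical,
while ``input_feature_names_`` is the attribute used in *Motivation*::

    import numpy as np
    from sklearn.base import BaseEstimator, TransformerMixin

    class DemeanTransformer(BaseEstimator, TransformerMixin):
        """Sketch: subtract the column means, propagating feature names."""

        def fit(self, X, y=None):
            # extract the feature names (if any) before touching the data
            self.input_feature_names_ = getattr(X, 'feature_names', None)
            self.mean_ = np.asarray(X).mean(axis=0)
            return self

        def transform(self, X):
            # operate on the data part only ...
            Xt = np.asarray(X) - self.mean_
            # ... and return an InputArray with the feature names attached
            # (here unchanged; SLEP007 covers how new names are generated)
            return InputArray(Xt, feature_names=self.input_feature_names_)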

Sparse Arrays
*************

Ideally sparse arrays follow the same pattern, but since ``scipy.sparse`` does
not provide the kind of API provided by ``numpy``, we may need to find
compromises.

Factory Methods
***************

There will be factory methods creating an ``InputArray`` given a
``pandas.DataFrame`` or an ``xarray.DataArray`` or simply an ``np.ndarray`` or
an ``sp.SparseMatrix`` and a given set of feature names.

An ``InputArray`` can also be converted to a ``pandas.DataFrame`` using a
``todataframe()`` method.

``X`` being an ``InputArray``::

    >>> np.array(X)
    >>> X.todataframe()
    >>> pd.DataFrame(X)  # only if pandas implements the API

And given ``X`` a ``np.ndarray`` or an ``sp.sparse`` matrix and a set of
feature names, one can make the right ``InputArray`` using::

    >>> make_inputarray(X, feature_names)
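
One possible shape for such a factory, sketched with the illustrative
``InputArray`` class above; ``xarray.DataArray`` support is omitted for
brevity, and densifying sparse input is only a placeholder given the open
questions under *Sparse Arrays*::

    import numpy as np
    import pandas as pd
    from scipy import sparse

    def make_inputarray(X, feature_names=None):
        """Sketch: build an InputArray from common containers."""
        if isinstance(X, pd.DataFrame):
            # default to the DataFrame's own column names
            if feature_names is None:
                feature_names = X.columns.to_numpy(dtype=object)
            return InputArray(X.to_numpy(), feature_names)
        if sparse.issparse(X):
            # placeholder: a real implementation should keep the sparsity
            return InputArray(X.toarray(), feature_names)
        return InputArray(np.asarray(X), feature_names)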

Alternative Solutions
*********************

Since we expect the feature names to be attached to the data given to an
estimator, there are a few potential approaches we can take:

- ``pandas`` in, ``pandas`` out: this means we expect the user to give the data
  as a ``pandas.DataFrame``, and if so, the transformer would output a
  ``pandas.DataFrame`` which also includes the [generated] feature names. This
  is not a feasible solution since ``pandas`` plans to move to a per-column
  representation, which means ``pd.DataFrame(np.asarray(df))`` has two
  guaranteed memory copies.
- ``XArray``: we could accept a ``pandas.DataFrame``, and use
  ``xarray.DataArray`` as the output of transformers, including feature names.
  However, ``xarray`` has a hard dependency on ``pandas``, and uses
  ``pandas.Index`` to handle row labels and aligns rows when an operation
  between two ``xarray.DataArray`` objects is done, which can be time
  consuming, and is not the semantics expected in ``scikit-learn``; we only
  expect the number of rows to be equal, and that the rows always correspond
  to one another in the same order.

As a result, we need another data structure which we'll use to transfer
data-related information (such as feature names), which is lightweight and
doesn't interfere with existing user code.

Another alternative to the problem of passing meta-data around is to pass it
as a parameter to ``fit``. This would heavily involve modifying
meta-estimators, since they'd need to pass that information to their
sub-estimators, and extract the relevant information from each estimator to
pass it along to the next one. Our prototype implementations showed
significant challenges compared to when the meta-data is attached to the data.
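
To illustrate the routing burden, compare the two calling conventions; the
``feature_names`` fit parameter below is hypothetical::

    # meta-data as a fit parameter: every meta-estimator in the chain must
    # route it to its steps and collect regenerated names between steps
    clf.fit(X, y, feature_names=feature_names)

    # meta-data attached to the data: nothing extra to route
    clf.fit(make_inputarray(X, feature_names), y)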

under_review.rst

Lines changed: 1 addition & 0 deletions
@@ -9,3 +9,4 @@ SLEPs under review
    :maxdepth: 1

    slep007/proposal
+   slep012/proposal
