.. _slep_012:

==========
InputArray
==========

:Author: Adrin Jalali
:Status: Draft
:Type: Standards Track
:Created: 2019-12-20

Motivation
**********

This proposal offers a solution for propagating feature names through
transformers, pipelines, and the column transformer. Ideally, we would have::

    df = pd.read_csv('tabular.csv')
    # transforming the data in an arbitrary way
    transformer0 = ColumnTransformer(...)
    # a pipeline preprocessing the data and then a classifier (or a regressor)
    clf = make_pipeline(transformer0, ..., SVC())

    # now we can investigate features at each stage of the pipeline
    clf[-1].input_feature_names_

The feature names are propagated through the pipeline, and the user can
investigate them at each step.

This proposal suggests adding a new data structure, called ``InputArray``,
which augments the data array ``X`` with additional meta-data. In this proposal
we assume the feature names (and other potential meta-data) are attached to the
data when passed to an estimator. Alternative solutions are discussed later in
this document.

A main constraint on this data structure is that it should be backward
compatible, *i.e.* code which expects a ``numpy.ndarray`` as the output of a
transformer should not break. This SLEP focuses on *feature names* as the only
meta-data attached to the data. Support for other meta-data can be added later.

Backward/NumPy/Pandas Compatibility
***********************************

Since transformers currently return a ``numpy`` or a ``scipy`` array, backward
compatibility in this context means the operations which are valid on those
arrays should also be valid on the new data structure.

All operations are delegated to the *data* part of the container: the
meta-data is lost immediately after each operation, and the result is a plain
``numpy.ndarray``. This includes indexing and slicing; *i.e.*, to avoid
performance degradation, ``__getitem__`` is not overloaded, and if the user
wishes to preserve the meta-data, they shall do so by explicitly calling a
method such as ``select()``. Operations between two ``InputArray``\ s will not
try to align rows and/or columns of the two given objects.

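This delegation can be sketched with a ``numpy.ndarray`` subclass. The
following is only an illustration under stated assumptions: the subclassing
hooks used here and the ``select()`` body are not prescribed by this SLEP:

```python
import numpy as np

class InputArray(np.ndarray):
    """Illustrative sketch only: an ndarray carrying optional feature names."""

    def __new__(cls, data, feature_names=None):
        obj = np.asarray(data).view(cls)
        if feature_names is not None:
            feature_names = np.asarray(feature_names, dtype=object)
        obj.feature_names = feature_names
        return obj

    def __array_finalize__(self, obj):
        # Views and slices created by numpy do not inherit the names:
        # the meta-data is lost immediately, as the proposal requires.
        self.feature_names = None

    def __array_wrap__(self, out_arr, context=None, return_scalar=False):
        # Any ufunc or reduction returns a plain ndarray, not an InputArray.
        return np.asarray(out_arr)

    def select(self, columns):
        # Explicit column selection is the one operation that keeps the
        # (aligned) feature names attached to the result.
        data = np.asarray(self)[:, columns]
        names = None if self.feature_names is None else self.feature_names[columns]
        return InputArray(data, names)
```

With this sketch, ``X + 1`` and ``X.sum(axis=0)`` come back as plain arrays,
``X[0:2]`` comes back without the names, and only ``X.select([1])`` keeps the
aligned names.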
``pandas`` compatibility would ideally come as ``pd.DataFrame(inputarray)``,
for which ``pandas`` does not provide a clean API at the moment.
Alternatively, ``inputarray.todataframe()`` would return a
``pandas.DataFrame`` with the relevant meta-data attached.

Feature Names
*************

Feature names are an object ``ndarray`` of strings aligned with the columns.
They can be ``None``.

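Concretely (an illustration, not prescribed API), the names for a two-column
``X`` could look like:

```python
import numpy as np

# One name per column of X, stored as an object-dtype ndarray:
feature_names = np.asarray(["sepal_length", "sepal_width"], dtype=object)

# Alternatively, the meta-data may be absent altogether:
feature_names_absent = None
```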
Operations
**********

Estimators understand the ``InputArray`` and extract the feature names from the
given data before applying their operations and transformations to the data.

All transformers return an ``InputArray`` with feature names attached to it.
The way feature names are generated is discussed in *SLEP007 - The Style of The
Feature Names*.

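As a sketch of this contract (``NamedArray`` and ``DemoScaler`` are
hypothetical stand-ins used only for illustration), a transformer would read
the names off the input in ``fit`` and attach names to its output in
``transform``; centering keeps the columns, so the names simply pass through
here:

```python
import numpy as np

class NamedArray:
    """Minimal stand-in for the proposed container (illustration only)."""
    def __init__(self, data, feature_names=None):
        self.data = np.asarray(data)
        self.feature_names = feature_names

class DemoScaler:
    """Sketch of the pattern: extract names in fit, attach names in transform."""

    def fit(self, X, y=None):
        # Works whether or not the input carries feature names.
        self.input_feature_names_ = getattr(X, "feature_names", None)
        self.mean_ = np.asarray(getattr(X, "data", X)).mean(axis=0)
        return self

    def transform(self, X):
        Xt = np.asarray(getattr(X, "data", X)) - self.mean_
        # Centering keeps the columns, so the names pass through unchanged;
        # other transformers would generate output names per SLEP007.
        return NamedArray(Xt, self.input_feature_names_)
```

A plain ``np.ndarray`` input still works; its output simply carries no names.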
Sparse Arrays
*************

Ideally sparse arrays would follow the same pattern, but since ``scipy.sparse``
does not provide the kind of API that ``numpy`` provides, we may need to find
compromises.

Factory Methods
***************

There will be factory methods creating an ``InputArray`` given a
``pandas.DataFrame``, an ``xarray.DataArray``, a plain ``np.ndarray``, or an
``sp.SparseMatrix``, together with a given set of feature names.

An ``InputArray`` can also be converted to a ``pandas.DataFrame`` using a
``todataframe()`` method.

``X`` being an ``InputArray``::

    >>> np.array(X)
    >>> X.todataframe()
    >>> pd.DataFrame(X)  # only if pandas implements the API

And given ``X`` an ``np.ndarray`` or an ``sp.sparse`` matrix and a set of
feature names, one can make the right ``InputArray`` using::

    >>> make_inputarray(X, feature_names)

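A rough sketch of such a factory pair follows; only the names
``make_inputarray`` and ``todataframe`` come from this SLEP, while the
placeholder container and the function bodies below are assumptions:

```python
import numpy as np

class InputArrayStub:
    """Minimal placeholder for the proposed container (illustration only)."""
    def __init__(self, data, feature_names=None):
        self.data = data
        self.feature_names = feature_names

def make_inputarray(X, feature_names=None):
    # DataFrame-like inputs already carry their own column names.
    if hasattr(X, "columns"):
        return InputArrayStub(np.asarray(X), [str(c) for c in X.columns])
    # ndarray or scipy.sparse inputs rely on explicitly passed names;
    # sparse data is kept as-is rather than densified.
    return InputArrayStub(X, feature_names)

def todataframe(ia):
    # Requires pandas; hands the names over as DataFrame columns.
    import pandas as pd
    return pd.DataFrame(ia.data, columns=ia.feature_names)
```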
Alternative Solutions
*********************

Since we expect the feature names to be attached to the data given to an
estimator, there are a few potential approaches we could take:

- ``pandas`` in, ``pandas`` out: this means we expect the user to give the data
  as a ``pandas.DataFrame``, and if so, the transformer would output a
  ``pandas.DataFrame`` which also includes the [generated] feature names. This
  is not a feasible solution since ``pandas`` plans to move to a per-column
  representation, which means ``pd.DataFrame(np.asarray(df))`` incurs two
  guaranteed memory copies.
- ``XArray``: we could accept a ``pandas.DataFrame``, and use
  ``xarray.DataArray`` as the output of transformers, including feature names.
  However, ``xarray`` has a hard dependency on ``pandas``, and uses
  ``pandas.Index`` to handle row labels and aligns rows when an operation
  between two ``xarray.DataArray`` objects is performed, which can be time
  consuming and is not the semantic expected in ``scikit-learn``; we only
  expect the number of rows to be equal, and that the rows always correspond to
  one another in the same order.

As a result, we need another data structure which we'll use to transfer
data-related information (such as feature names), which is lightweight and
doesn't interfere with existing user code.

Another alternative to the problem of passing meta-data around is to pass it as
a parameter to ``fit``. This would involve heavily modifying meta-estimators,
since they'd need to pass that information along, and to extract the relevant
information from their estimators to pass it to the next estimator. Our
prototype implementations showed significant challenges compared to attaching
the meta-data to the data.