Commit 263828c

ENH: Add new implementation of DataFrame.stack (#53921)
* DEPR: Add new implementation of DataFrame.stack and deprecate old
* Merge cleanup
* Revert filterwarnings in conf.py
* Merge fixup
* Rename inner function
* v3->future_stack; other refinements
* Fixup docstring
* Docstring fixup
1 parent 46386f0 commit 263828c

28 files changed: +662 -275 lines changed

doc/source/getting_started/comparison/comparison_with_r.rst

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -438,7 +438,7 @@ In Python, the :meth:`~pandas.melt` method is the R equivalent:
438438
)
439439
440440
pd.melt(cheese, id_vars=["first", "last"])
441-
cheese.set_index(["first", "last"]).stack() # alternative way
441+
cheese.set_index(["first", "last"]).stack(future_stack=True) # alternative way
442442
443443
For more details and examples see :ref:`the reshaping documentation
444444
<reshaping.melt>`.

doc/source/user_guide/10min.rst

Lines changed: 1 addition & 1 deletion
@@ -579,7 +579,7 @@ columns:
579579

580580
.. ipython:: python
581581
582-
stacked = df2.stack()
582+
stacked = df2.stack(future_stack=True)
583583
stacked
584584
585585
With a "stacked" DataFrame or Series (having a :class:`MultiIndex` as the

doc/source/user_guide/cookbook.rst

Lines changed: 2 additions & 2 deletions
@@ -311,7 +311,7 @@ The :ref:`multindexing <advanced.hierarchical>` docs.
311311
df.columns = pd.MultiIndex.from_tuples([tuple(c.split("_")) for c in df.columns])
312312
df
313313
# Now stack & Reset
314-
df = df.stack(0).reset_index(1)
314+
df = df.stack(0, future_stack=True).reset_index(1)
315315
df
316316
# And fix the labels (Notice the label 'level_1' got added automatically)
317317
df.columns = ["Sample", "All_X", "All_Y"]
@@ -688,7 +688,7 @@ The :ref:`Pivot <reshaping.pivot>` docs.
688688
aggfunc="sum",
689689
margins=True,
690690
)
691-
table.stack("City")
691+
table.stack("City", future_stack=True)
692692
693693
`Frequency table like plyr in R
694694
<https://stackoverflow.com/questions/15589354/frequency-tables-in-pandas-like-plyr-in-r>`__

doc/source/user_guide/groupby.rst

Lines changed: 1 addition & 1 deletion
@@ -1713,4 +1713,4 @@ column index name will be used as the name of the inserted column:
17131713
17141714
result
17151715
1716-
result.stack()
1716+
result.stack(future_stack=True)

doc/source/user_guide/reshaping.rst

Lines changed: 10 additions & 10 deletions
@@ -127,7 +127,7 @@ stacked level becomes the new lowest level in a :class:`MultiIndex` on the colum
127127

128128
.. ipython:: python
129129
130-
stacked = df2.stack()
130+
stacked = df2.stack(future_stack=True)
131131
stacked
132132
133133
With a "stacked" :class:`DataFrame` or :class:`Series` (having a :class:`MultiIndex` as the
@@ -163,7 +163,7 @@ will result in a **sorted** copy of the original :class:`DataFrame` or :class:`S
163163
index = pd.MultiIndex.from_product([[2, 1], ["a", "b"]])
164164
df = pd.DataFrame(np.random.randn(4), index=index, columns=["A"])
165165
df
166-
all(df.unstack().stack() == df.sort_index())
166+
all(df.unstack().stack(future_stack=True) == df.sort_index())
167167
168168
The above code will raise a ``TypeError`` if the call to :meth:`~DataFrame.sort_index` is
169169
removed.
@@ -191,16 +191,16 @@ processed individually.
191191
df = pd.DataFrame(np.random.randn(4, 4), columns=columns)
192192
df
193193
194-
df.stack(level=["animal", "hair_length"])
194+
df.stack(level=["animal", "hair_length"], future_stack=True)
195195
196196
The list of levels can contain either level names or level numbers (but
197197
not a mixture of the two).
198198

199199
.. ipython:: python
200200
201-
# df.stack(level=['animal', 'hair_length'])
201+
# df.stack(level=['animal', 'hair_length'], future_stack=True)
202202
# from above is equivalent to:
203-
df.stack(level=[1, 2])
203+
df.stack(level=[1, 2], future_stack=True)
204204
205205
Missing data
206206
~~~~~~~~~~~~
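
The hunk above states that a list of levels may hold level names or level numbers, but not a mixture. As a rough sketch of how such a rule can be checked, the snippet below mirrors the condition that appears in the `pandas/core/frame.py` hunk of this commit. The function name `check_levels` and the plain list of column level names are assumptions made for illustration; they are not part of pandas.

```python
def check_levels(level, column_level_names):
    """Return `level` as a list, rejecting a mix of names and numbers.

    `check_levels` and `column_level_names` are hypothetical stand-ins;
    pandas itself reads the names from `self.columns.names`.
    """
    if isinstance(level, (tuple, list)):
        all_names = all(lev in column_level_names for lev in level)
        all_ints = all(isinstance(lev, int) for lev in level)
        # Only a list that is neither all names nor all integers is rejected.
        if not all_names and not all_ints:
            raise ValueError(
                "level should contain all level names or all level "
                "numbers, not a mixture of the two."
            )
        return list(level)
    # A single level is wrapped so callers always get a list back.
    return [level]


names = ["exp", "animal", "hair_length"]
by_name = check_levels(["animal", "hair_length"], names)
by_number = check_levels([1, 2], names)

try:
    check_levels(["animal", 2], names)
    mixed_rejected = False
except ValueError:
    mixed_rejected = True
```

Here `by_name` and `by_number` pass through unchanged, while the mixed list `["animal", 2]` is rejected.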
@@ -233,8 +233,8 @@ which level in the columns to stack:
233233

234234
.. ipython:: python
235235
236-
df2.stack("exp")
237-
df2.stack("animal")
236+
df2.stack("exp", future_stack=True)
237+
df2.stack("animal", future_stack=True)
238238
239239
Unstacking can result in missing values if subgroups do not have the same
240240
set of labels. By default, missing values will be replaced with the default
@@ -345,12 +345,12 @@ some very expressive and fast data manipulations.
345345
.. ipython:: python
346346
347347
df
348-
df.stack().mean(1).unstack()
348+
df.stack(future_stack=True).mean(1).unstack()
349349
350350
# same result, another way
351351
df.T.groupby(level=1).mean()
352352
353-
df.stack().groupby(level=1).mean()
353+
df.stack(future_stack=True).groupby(level=1).mean()
354354
355355
df.mean().unstack(0)
356356
@@ -460,7 +460,7 @@ as having a multi-level index:
460460

461461
.. ipython:: python
462462
463-
table.stack()
463+
table.stack(future_stack=True)
464464
465465
.. _reshaping.crosstabulations:
466466

doc/source/whatsnew/v2.1.0.rst

Lines changed: 40 additions & 1 deletion
@@ -78,7 +78,7 @@ Copy-on-Write improvements
7878
- DataFrame.fillna / Series.fillna
7979
- DataFrame.replace / Series.replace
8080

81-
.. _whatsnew_210.enhancements.enhancement2:
81+
.. _whatsnew_210.enhancements.map_na_action:
8282

8383
``map(func, na_action="ignore")`` now works for all array types
8484
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -128,6 +128,45 @@ Also, note that :meth:`Categorical.map` implicitly has had its ``na_action`` set
128128
This has been deprecated and :meth:`Categorical.map` will in the future change the default
129129
to ``na_action=None``, like for all the other array types.
130130

131+
.. _whatsnew_210.enhancements.new_stack:
132+
133+
New implementation of :meth:`DataFrame.stack`
134+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
135+
136+
pandas has reimplemented :meth:`DataFrame.stack`. To use the new implementation, pass the argument ``future_stack=True``. This will become the only option in pandas 3.0.
137+
138+
The previous implementation had two main behavioral downsides.
139+
140+
1. The previous implementation would unnecessarily introduce NA values into the result. The user could have NA values automatically removed by passing ``dropna=True`` (the default), but doing this could also remove NA values from the result that existed in the input. See the examples below.
141+
2. The previous implementation with ``sort=True`` (the default) would sometimes sort part of the resulting index, and sometimes not. If the input's columns are *not* a :class:`MultiIndex`, then the resulting index would never be sorted. If the columns are a :class:`MultiIndex`, then in most cases the level(s) in the resulting index that come from stacking the column level(s) would be sorted. In rare cases such level(s) would be sorted in a non-standard order, depending on how the columns were created.
142+
143+
The new implementation (``future_stack=True``) will no longer unnecessarily introduce NA values when stacking multiple levels and will never sort. As such, the arguments ``dropna`` and ``sort`` are not utilized and must remain unspecified when using ``future_stack=True``. These arguments will be removed in the next major release.
144+
145+
.. ipython:: python
146+
147+
columns = pd.MultiIndex.from_tuples([("B", "d"), ("A", "c")])
148+
df = pd.DataFrame([[0, 2], [1, 3]], index=["z", "y"], columns=columns)
149+
df
150+
151+
In the previous version (``future_stack=False``), the default of ``dropna=True`` would remove unnecessarily introduced NA values but still coerce the dtype to ``float64`` in the process. In the new version, no NAs are introduced and so there is no coercion of the dtype.
152+
153+
.. ipython:: python
154+
:okwarning:
155+
156+
df.stack([0, 1], future_stack=False, dropna=True)
157+
df.stack([0, 1], future_stack=True)
158+
159+
If the input contains NA values, the previous version would drop those as well with ``dropna=True`` or introduce new NA values with ``dropna=False``. The new version persists all values from the input.
160+
161+
.. ipython:: python
162+
:okwarning:
163+
164+
df = pd.DataFrame([[0, 2], [np.nan, np.nan]], columns=columns)
165+
df
166+
df.stack([0, 1], future_stack=False, dropna=True)
167+
df.stack([0, 1], future_stack=False, dropna=False)
168+
df.stack([0, 1], future_stack=True)
169+
131170
.. _whatsnew_210.enhancements.other:
132171

133172
Other enhancements
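
The release note added in this file describes two differences: the new implementation keeps the column order instead of sorting, and it keeps NA values that were already in the input. The toy model below illustrates only those two points on plain Python lists. It is a sketch, not pandas' algorithm: `toy_stack_all` and its `legacy` flag are hypothetical names, and the "legacy" branch simply sorts labels and drops NaN to imitate the old defaults as the note describes them.

```python
import math


def toy_stack_all(index, columns, rows, legacy=False):
    """Stack every column level of a tiny table into the row labels.

    index   -- list of row labels
    columns -- list of tuples, one tuple of level labels per column
    rows    -- list of value lists, one per row
    legacy  -- hypothetical flag: sort stacked labels and drop NaN values
    """
    out = []
    for row_label, row in zip(index, rows):
        pairs = list(zip(columns, row))
        if legacy:
            # Imitate the old defaults: sorted labels, NaN values dropped.
            pairs = sorted(pairs, key=lambda pair: pair[0])
            pairs = [
                (col, val)
                for col, val in pairs
                if not (isinstance(val, float) and math.isnan(val))
            ]
        for col, value in pairs:
            out.append(((row_label, *col), value))
    return out


columns = [("B", "d"), ("A", "c")]
index = ["z", "y"]
rows = [[0, 2], [float("nan"), float("nan")]]

new_style = toy_stack_all(index, columns, rows)
legacy_style = toy_stack_all(index, columns, rows, legacy=True)
```

In this model `new_style` has four entries in the original `B`, `A` column order, including the two NaN values from row `y`, while `legacy_style` has only the two entries from row `z`, reordered to `A`, `B`.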

pandas/core/frame.py

Lines changed: 69 additions & 21 deletions
@@ -9166,7 +9166,13 @@ def pivot_table(
91669166
sort=sort,
91679167
)
91689168

9169-
def stack(self, level: IndexLabel = -1, dropna: bool = True, sort: bool = True):
9169+
def stack(
9170+
self,
9171+
level: IndexLabel = -1,
9172+
dropna: bool | lib.NoDefault = lib.no_default,
9173+
sort: bool | lib.NoDefault = lib.no_default,
9174+
future_stack: bool = False,
9175+
):
91709176
"""
91719177
Stack the prescribed level(s) from columns to index.
91729178
@@ -9194,6 +9200,11 @@ def stack(self, level: IndexLabel = -1, dropna: bool = True, sort: bool = True):
91949200
section.
91959201
sort : bool, default True
91969202
Whether to sort the levels of the resulting MultiIndex.
9203+
future_stack : bool, default False
9204+
Whether to use the new implementation that will replace the current
9205+
implementation in pandas 3.0. When True, dropna and sort have no impact
9206+
on the result and must remain unspecified. See :ref:`pandas 2.1.0 Release
9207+
notes <whatsnew_210.enhancements.new_stack>` for more details.
91979208
91989209
Returns
91999210
-------
@@ -9233,7 +9244,7 @@ def stack(self, level: IndexLabel = -1, dropna: bool = True, sort: bool = True):
92339244
weight height
92349245
cat 0 1
92359246
dog 2 3
9236-
>>> df_single_level_cols.stack()
9247+
>>> df_single_level_cols.stack(future_stack=True)
92379248
cat weight 0
92389249
height 1
92399250
dog weight 2
@@ -9255,7 +9266,7 @@ def stack(self, level: IndexLabel = -1, dropna: bool = True, sort: bool = True):
92559266
kg pounds
92569267
cat 1 2
92579268
dog 2 4
9258-
>>> df_multi_level_cols1.stack()
9269+
>>> df_multi_level_cols1.stack(future_stack=True)
92599270
weight
92609271
cat kg 1
92619272
pounds 2
@@ -9280,7 +9291,7 @@ def stack(self, level: IndexLabel = -1, dropna: bool = True, sort: bool = True):
92809291
kg m
92819292
cat 1.0 2.0
92829293
dog 3.0 4.0
9283-
>>> df_multi_level_cols2.stack()
9294+
>>> df_multi_level_cols2.stack(future_stack=True)
92849295
weight height
92859296
cat kg 1.0 NaN
92869297
m NaN 2.0
@@ -9291,17 +9302,17 @@ def stack(self, level: IndexLabel = -1, dropna: bool = True, sort: bool = True):
92919302
92929303
The first parameter controls which level or levels are stacked:
92939304
9294-
>>> df_multi_level_cols2.stack(0)
9305+
>>> df_multi_level_cols2.stack(0, future_stack=True)
92959306
kg m
9296-
cat height NaN 2.0
9297-
weight 1.0 NaN
9298-
dog height NaN 4.0
9299-
weight 3.0 NaN
9300-
>>> df_multi_level_cols2.stack([0, 1])
9301-
cat height m 2.0
9302-
weight kg 1.0
9303-
dog height m 4.0
9304-
weight kg 3.0
9307+
cat weight 1.0 NaN
9308+
height NaN 2.0
9309+
dog weight 3.0 NaN
9310+
height NaN 4.0
9311+
>>> df_multi_level_cols2.stack([0, 1], future_stack=True)
9312+
cat weight kg 1.0
9313+
height m 2.0
9314+
dog weight kg 3.0
9315+
height m 4.0
93059316
dtype: float64
93069317
93079318
**Dropping missing values**
@@ -9331,15 +9342,52 @@ def stack(self, level: IndexLabel = -1, dropna: bool = True, sort: bool = True):
93319342
dog kg 2.0 NaN
93329343
m NaN 3.0
93339344
"""
9334-
from pandas.core.reshape.reshape import (
9335-
stack,
9336-
stack_multiple,
9337-
)
9345+
if not future_stack:
9346+
from pandas.core.reshape.reshape import (
9347+
stack,
9348+
stack_multiple,
9349+
)
9350+
9351+
if dropna is lib.no_default:
9352+
dropna = True
9353+
if sort is lib.no_default:
9354+
sort = True
93389355

9339-
if isinstance(level, (tuple, list)):
9340-
result = stack_multiple(self, level, dropna=dropna, sort=sort)
9356+
if isinstance(level, (tuple, list)):
9357+
result = stack_multiple(self, level, dropna=dropna, sort=sort)
9358+
else:
9359+
result = stack(self, level, dropna=dropna, sort=sort)
93419360
else:
9342-
result = stack(self, level, dropna=dropna, sort=sort)
9361+
from pandas.core.reshape.reshape import stack_v3
9362+
9363+
if dropna is not lib.no_default:
9364+
raise ValueError(
9365+
"dropna must be unspecified with future_stack=True as the new "
9366+
"implementation does not introduce rows of NA values. This "
9367+
"argument will be removed in a future version of pandas."
9368+
)
9369+
9370+
if sort is not lib.no_default:
9371+
raise ValueError(
9372+
"Cannot specify sort with future_stack=True, this argument will be "
9373+
"removed in a future version of pandas. Sort the result using "
9374+
".sort_index instead."
9375+
)
9376+
9377+
if (
9378+
isinstance(level, (tuple, list))
9379+
and not all(lev in self.columns.names for lev in level)
9380+
and not all(isinstance(lev, int) for lev in level)
9381+
):
9382+
raise ValueError(
9383+
"level should contain all level names or all level "
9384+
"numbers, not a mixture of the two."
9385+
)
9386+
9387+
if not isinstance(level, (tuple, list)):
9388+
level = [level]
9389+
level = [self.columns._get_level_number(lev) for lev in level]
9390+
result = stack_v3(self, level)
93439391

93449392
return result.__finalize__(self, method="stack")
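
The hunk above changes the defaults of `dropna` and `sort` from `True` to a sentinel (`lib.no_default`) so that the method can tell "not passed" apart from "passed explicitly". A minimal standalone sketch of that pattern follows. The names `_NoDefault`, `no_default`, and `resolve_stack_options` are invented for illustration and are not pandas' internals, and the error messages are shortened.

```python
class _NoDefault:
    """Sentinel type marking an argument the caller did not pass."""

    def __repr__(self):
        return "<no_default>"


no_default = _NoDefault()


def resolve_stack_options(dropna=no_default, sort=no_default, future_stack=False):
    """Return the effective legacy options, or reject them for the new path."""
    if not future_stack:
        # Legacy path: an unspecified argument falls back to the old default.
        return {
            "dropna": True if dropna is no_default else dropna,
            "sort": True if sort is no_default else sort,
        }
    # New path: any explicit value is an error, even one equal to the old default.
    if dropna is not no_default:
        raise ValueError("dropna must be unspecified with future_stack=True")
    if sort is not no_default:
        raise ValueError("sort must be unspecified with future_stack=True")
    return {}


legacy_defaults = resolve_stack_options()
legacy_explicit = resolve_stack_options(dropna=False)
new_path = resolve_stack_options(future_stack=True)

try:
    resolve_stack_options(sort=True, future_stack=True)
    explicit_sort_rejected = False
except ValueError:
    explicit_sort_rejected = True
```

A plain `True` default could not support this, because `sort=True` passed by the caller would be indistinguishable from the argument being left out.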
93459393

pandas/core/groupby/generic.py

Lines changed: 1 addition & 1 deletion
@@ -416,7 +416,7 @@ def _wrap_applied_output(
416416
res_df = self._reindex_output(res_df)
417417
# if self.observed is False,
418418
# keep all-NaN rows created while re-indexing
419-
res_ser = res_df.stack(dropna=self.observed)
419+
res_ser = res_df.stack(future_stack=True)
420420
res_ser.name = self.obj.name
421421
return res_ser
422422
elif isinstance(values[0], (Series, DataFrame)):

pandas/core/indexes/multi.py

Lines changed: 4 additions & 0 deletions
@@ -2440,6 +2440,10 @@ def reorder_levels(self, order) -> MultiIndex:
24402440
names=['y', 'x'])
24412441
"""
24422442
order = [self._get_level_number(i) for i in order]
2443+
result = self._reorder_ilevels(order)
2444+
return result
2445+
2446+
def _reorder_ilevels(self, order) -> MultiIndex:
24432447
if len(order) != self.nlevels:
24442448
raise AssertionError(
24452449
f"Length of order must be same as number of levels ({self.nlevels}), "

pandas/core/resample.py

Lines changed: 1 addition & 1 deletion
@@ -1497,7 +1497,7 @@ def size(self):
14971497
# If the result is a non-empty DataFrame we stack to get a Series
14981498
# GH 46826
14991499
if isinstance(result, ABCDataFrame) and not result.empty:
1500-
result = result.stack()
1500+
result = result.stack(future_stack=True)
15011501

15021502
if not len(self.ax):
15031503
from pandas import Series

pandas/core/reshape/pivot.py

Lines changed: 1 addition & 1 deletion
@@ -418,7 +418,7 @@ def _all_key(key):
418418

419419
if len(cols) > 0:
420420
row_margin = data[cols + values].groupby(cols, observed=observed).agg(aggfunc)
421-
row_margin = row_margin.stack()
421+
row_margin = row_margin.stack(future_stack=True)
422422

423423
# slight hack
424424
new_order = [len(cols)] + list(range(len(cols)))
