Closed
Labels
Bug, Deprecate, Needs Discussion, Reshaping
Description
With MultiIndex columns that contain duplicate labels, stack gives incorrect results:
import numpy as np
import pandas as pd

# The label ("a", "x") appears twice: at column positions 0 and 2.
columns = pd.MultiIndex(
    levels=[["a", "b"], ["x", "y"]],
    codes=[[0, 1, 0], [0, 1, 0]],
    names=["l1", "l2"],
)
df = pd.DataFrame(np.arange(6).reshape((2, 3)), columns=columns)
print(df)
# l1  a  b  a
# l2  x  y  x
# 0   0  1  2
# 1   3  4  5
print(df.stack(0))
# l2    x    y
#   l1
# 0 a   0  NaN
#   b   2  1.0
# 1 a   3  NaN
#   b   5  4.0
In particular, the value 2 sits in row 0 of df under the second of the two columns labeled (a, x), and stack moves it to the entry indexed by ((0, b), x).
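To make the duplicate explicit, here is a minimal sketch that rebuilds the same frame and locates the repeated column label. It deliberately avoids calling stack, since the behavior reported here may differ between pandas versions; the names dup_mask and row0_values are illustrative only.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex(
    levels=[["a", "b"], ["x", "y"]],
    codes=[[0, 1, 0], [0, 1, 0]],
    names=["l1", "l2"],
)
df = pd.DataFrame(np.arange(6).reshape((2, 3)), columns=columns)

# Mark every column whose label occurs more than once.
# keep=False flags all occurrences, not just the later ones.
dup_mask = columns.duplicated(keep=False)
print(dup_mask.tolist())

# Row 0 holds two different values under the same label ("a", "x").
row0_values = df.to_numpy()[0][dup_mask].tolist()
print(row0_values)
```

Both values in row0_values share the label (a, x), so any reshaping keyed on labels alone has no way to tell them apart.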
Taking the same example but with an Index gives a more reasonable result:
df = df.droplevel(1, axis=1)
print(df)
# l1  a  b  a
# 0   0  1  2
# 1   3  4  5
print(df.stack(0))
#    l1
# 0  a     0
#    b     1
#    a     2
# 1  a     3
#    b     4
#    a     5
# dtype: int64
However, I think this is still inconsistent with the rest of stack: it's the only case where the index level that comes from the columns contains duplicate values.
Accessing particular subsets of a DataFrame when the columns have duplicate values is fraught with difficulty. I think we should deprecate support for duplicate column values in stack and, in the future, raise instead.
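As a rough illustration of what "raise instead" could look like from the user's side, here is a sketch of a guard run before stacking. The helper check_unique_columns and its error message are hypothetical and are not part of pandas; only Index.is_unique and Index.duplicated are existing API.

```python
import numpy as np
import pandas as pd


def check_unique_columns(df):
    """Hypothetical helper: raise if any column label is repeated."""
    if not df.columns.is_unique:
        # duplicated() flags the second and later occurrences of each label.
        dupes = df.columns[df.columns.duplicated()].unique().tolist()
        raise ValueError(f"Cannot stack with duplicate column labels: {dupes}")


columns = pd.MultiIndex(
    levels=[["a", "b"], ["x", "y"]],
    codes=[[0, 1, 0], [0, 1, 0]],
    names=["l1", "l2"],
)
df = pd.DataFrame(np.arange(6).reshape((2, 3)), columns=columns)

try:
    check_unique_columns(df)
    raised = False
except ValueError as err:
    raised = True
    message = str(err)
    print(message)
```

A frame with unique column labels passes the check silently, so the guard only affects the ambiguous case described in this issue.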