-
-
Notifications
You must be signed in to change notification settings - Fork 18.7k
Closed
Labels
API DesignCompatpandas objects compatability with Numpy or Python functionspandas objects compatability with Numpy or Python functionsGroupbyMultiIndex
Milestone
Description
Opening a new issue so this isn't lost.
In #18882 banned duplicate names in a MultiIndex. I think this is a good change since allowing duplicates hit a lot of edge cases when you went to actually do something. I want to make sure we understand all the cases that actually produce duplicate names in the MI though, specifically groupby.apply.
In [1]: import dask.dataframe as dd
In [2]: import pandas as pd
In [3]: pdf = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 9],
...: 'b': [4, 5, 6, 3, 2, 1, 0, 0, 0]},
...: index=[0, 1, 3, 5, 6, 8, 9, 9, 9]).set_index("a")
...:
...:
In [4]: pdf.groupby(pdf.index).apply(lambda x: x.b)
Another, more realistic example: groupwise drop_duplicates:
In [18]: df = pd.DataFrame({"B": [0, 0, 0, 1, 1, 1, 2, 2, 2]}, index=pd.Index([0, 1, 1, 2, 2, 2, 0, 0, 1], name='a'))
In [19]: df
Out[19]:
B
a
0 0
1 0
1 0
2 1
2 1
2 1
0 2
0 2
1 2
In [20]: df.groupby('a').apply(pd.DataFrame.drop_duplicates)
Out[20]:
B
a a
0 0 0
0 2
1 1 0
1 2
2 2 1
Is it possible to throw a warning on this for now, in case duplicate names are more common than we thought?
Metadata
Metadata
Assignees
Labels
API DesignCompatpandas objects compatability with Numpy or Python functionspandas objects compatability with Numpy or Python functionsGroupbyMultiIndex