drop_duplicates() is dropping more than just duplicates in 0.17.0 #11512

@andersonjacob

When I upgraded from 0.16.2 to 0.17.0, I was met with a nasty surprise when dropping duplicates: DataFrame.drop_duplicates() no longer behaves the way it did in the previous version. I have a dataframe df on which I run:

test_ids = df['test_id'].unique()
print('N test ids: {}'.format(test_ids.shape))
print('N tests: {}'.format(df[['test_id', <some other columns>]].drop_duplicates().shape))

In 0.17.0 the output is:

N test ids: (341334,)
N tests: (237426, 10)

When I run the same code in 0.16.2, the output is:

N test ids: (341334,)
N tests: (341334, 10)

Because test_id is one of the selected columns, drop_duplicates() should never return fewer rows than the number of unique test_id values. In 0.17.0 it returns 237426 rows against 341334 unique ids.
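The expectation can be checked on a small frame. This is only a minimal sketch with made-up toy data: the `site` column and all values are hypothetical and are not taken from the dataframe in this report. It is not claimed to reproduce the 0.17.0 behaviour. It compares the row count after `drop_duplicates()` against the number of unique ids and against a count of distinct rows computed independently with a plain Python set.

```python
import pandas as pd

# Hypothetical toy data; only the name "test_id" comes from the report.
df = pd.DataFrame({
    "test_id": [1, 1, 2, 3, 3, 3],
    "site": ["a", "a", "b", "c", "c", "d"],
})

cols = ["test_id", "site"]

# Number of distinct values in the single key column.
n_unique_ids = df["test_id"].nunique()

# Rows remaining after de-duplicating on the selected columns.
deduped = df[cols].drop_duplicates()

# Independent reference: distinct (test_id, site) pairs via a Python set.
n_distinct_rows = len(set(zip(df["test_id"], df["site"])))

# De-duplicating on columns that include test_id cannot yield fewer
# rows than there are unique test_id values.
assert len(deduped) >= n_unique_ids
# The result should match the independently computed distinct-row count.
assert len(deduped) == n_distinct_rows

print("N test ids: {}".format(n_unique_ids))
print("N tests: {}".format(deduped.shape))
```

Running the same two assertions against the real dataframe, with the actual column list substituted for `cols`, would show whether the row count or the set-based reference is the one that changed between versions.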

Metadata

Labels: Duplicate Report (duplicate issue or pull request); Reshaping (concat, merge/join, stack/unstack, explode)
