
Why does .sort_values() on a column containing identical values shuffle the entire dataframe? #39877


Closed
artsheiko opened this issue Feb 18, 2021 · 7 comments
Labels
Closing Candidate May be closeable, needs more eyeballs

Comments

@artsheiko

import pandas as pd

df = pd.DataFrame({
    'col1': ['A', 'A', 'A', 'A', 'A', 'A']*15,
    'col2': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]*15,
    'col3': [0, 1, 9, 4, 2, 3]*15,
    'col4': ['a', 'B', 'c', 'D', 'e', 'F']*15
})
df.sort_values('col2', ascending=True)

Why does this operation reorder the entire dataframe? The same happens when the function is applied to col1.


@lithomas1
Member

@artsheiko I think this behavior is intended for sort_values (nothing is moved for col1 since it's already sorted): the entire row, including its index label, is moved in the DataFrame during sorting. If you want to reset the index (which I think you're trying to do), you can add .reset_index(drop=True) afterwards or pass ignore_index=True.
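To make the two suggestions concrete, here is a minimal sketch on a small made-up frame (the column names `key` and `val` are illustrative, not from the issue). It suggests that both options only relabel the index of the sorted result; neither changes the order in which the rows come out of the sort.

```python
import pandas as pd

# Illustrative frame with distinct keys, so the sorted order is unambiguous.
df = pd.DataFrame({"key": [2, 1, 3], "val": ["b", "a", "c"]})

# Default: each row keeps its original index label after sorting.
sorted_keep = df.sort_values("key")

# Both variants below replace the labels with 0..n-1 after sorting.
sorted_reset = df.sort_values("key").reset_index(drop=True)
sorted_ignore = df.sort_values("key", ignore_index=True)

print(sorted_keep.index.tolist())
print(sorted_ignore.index.tolist())
```

The row contents are identical in all three results; only the index labels differ.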

@lithomas1 lithomas1 added the Closing Candidate May be closeable, needs more eyeballs label Feb 18, 2021
@artsheiko
Author

@lithomas1,
reset_index() and ignore_index=True do not achieve the expected result. Yes, both rewrite the index so that it becomes 0, 1, ... n, but the rows of the dataframe are still shuffled.
We obtain 0, 2, 4 ... in col3, but it should be 0, 1, 9 ...

@attack68
Contributor

col2 contains the same value so sorting by it as a key is ambiguous. It could leave the dataframe unchanged or it could re-arrange any row to any location based on efficient memory use of the sorting algorithm, since the resultant col2 will still be 'sorted' according to non-decreasing order.

What do you expect?
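The ambiguity can be illustrated with a small sketch. It assumes a cut-down version of the frame from the issue and uses `DataFrame.sample` only to produce an arbitrary row order; it is not what `sort_values` does internally. Whatever order the rows end up in, the constant column still counts as sorted.

```python
import pandas as pd

# Cut-down version of the frame from the issue: col2 is constant.
df = pd.DataFrame({
    "col2": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    "col3": [0, 1, 9, 4, 2, 3],
})

# Put the rows into an arbitrary order.
shuffled = df.sample(frac=1, random_state=0)

# A constant column is non-decreasing in any row order,
# so every permutation is a valid "sorted by col2" result.
still_sorted = shuffled["col2"].is_monotonic_increasing
print(still_sorted)
```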

@artsheiko
Author

The problem is that sorting returns a seemingly randomly shuffled dataframe, which is not obvious when we do not know in advance whether a column holds a single unique value.
The expected result can be achieved with df.sort_values('col1', ascending=True, kind='mergesort'). Are there any considerations or restrictions to take into account when sorting big dataframes this way?

@attack68
Contributor

The documentation points to np.ndarray.sort for info on the algorithms. The only option that is stable, i.e. where equal items retain their relative order, is mergesort, as you have found. There do not appear to be many downsides, except that the workspace is larger than for the other two options. Not sure there is anything to do here, please close if you agree.
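A short sketch of what "stable" means here, using a made-up frame with repeated keys (the columns `key` and `tag` are illustrative). With kind='mergesort', rows that share a key should come out in the same relative order they had in the input.

```python
import pandas as pd

# Illustrative frame: key 1 appears three times, key 2 twice.
df = pd.DataFrame({
    "key": [2, 1, 2, 1, 1],
    "tag": ["a", "b", "c", "d", "e"],
})

# Stable sort: ties keep their original relative order.
stable = df.sort_values("key", kind="mergesort")

print(stable["tag"].tolist())
print(stable.index.tolist())
```

Within key 1 the tags stay in input order (b, d, e), and within key 2 likewise (a, c).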

@Quetzalcohuatl

I just got burnt by this. I sorted my dataframe once on GPU Machine 1, then sorted it on CPU Machine 2, and found out that they have different indices! Now I have to write a workaround for my project.

@attack68 Would you be interested if we wrote a warning that sort_values is not reproducible across machines, or maybe added a random_state param to ensure reproducibility? Or is it not worthwhile? It was certainly a surprise to me!!!
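One possible workaround for this kind of mismatch, sketched under assumptions: add an explicit tie-breaker so that no two rows compare equal. Once the sort keys form a total order there is only one valid result, so any correct sort has to return it. The helper column name `_row` is hypothetical, and the thread does not confirm whether this fits the GPU setup described above.

```python
import pandas as pd

df = pd.DataFrame({
    "col2": [1.0, 1.0, 0.5, 1.0],
    "col4": ["a", "B", "c", "D"],
})

# Hypothetical helper column: the original row position, unique per row.
with_pos = df.assign(_row=range(len(df)))

# Sorting by (col2, _row) leaves no ties, so the order is fully determined.
result = with_pos.sort_values(["col2", "_row"]).drop(columns="_row")

print(result["col4"].tolist())
```

The cost is one extra integer column for the duration of the sort.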

@sodisga

sodisga commented Jan 5, 2024

df.sort_values('col2', ascending=True, kind='stable') will not disorder the index of rows that share the same value.
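Applied to the frame from the original report, this can be checked with a minimal sketch. It assumes a pandas version whose sort_values accepts kind='stable', as the comment above indicates.

```python
import pandas as pd

# Frame from the original report: col1 and col2 are constant.
df = pd.DataFrame({
    'col1': ['A', 'A', 'A', 'A', 'A', 'A']*15,
    'col2': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]*15,
    'col3': [0, 1, 9, 4, 2, 3]*15,
    'col4': ['a', 'B', 'c', 'D', 'e', 'F']*15
})

out = df.sort_values('col2', ascending=True, kind='stable')

# With all keys equal, a stable sort leaves the rows where they were.
unchanged = out.index.tolist() == list(range(len(df)))
print(unchanged)
```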
