
DataFrame.corr(method="spearman") much slower than DataFrame.rank().corr(method="pearson") #28139


Closed
dsaxton opened this issue Aug 25, 2019 · 4 comments · Fixed by #28151
Labels
Performance Memory or execution speed performance

dsaxton (Member) commented Aug 25, 2019

import numpy as np
import pandas as pd

np.random.seed(1234)

df = pd.DataFrame(np.random.random((5000, 100)))

%timeit df.corr(method="spearman")
# 4.54 s ± 113 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df.rank().corr(method="pearson")
# 105 ms ± 2.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

pd.testing.assert_frame_equal(df.corr(method="spearman"), df.rank().corr(method="pearson"))
# No error

pd.__version__
# '0.25.0'

Problem description

DataFrame.corr(method="spearman") is far slower (roughly 40x in the example above) than DataFrame.rank().corr(method="pearson"), even though the two are ultimately doing the same calculation. While probably not the cleanest option, I would think that corr(method="spearman") could at least be made an alias for rank().corr(method="pearson"), or maybe a change could be made in algos.pyx.
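Until this is fixed internally, the equivalence above can be wrapped in a small helper. This is a hypothetical workaround (the function name is illustrative, not a pandas API), and it only matches method="spearman" when the frame has no missing values, since pandas computes Spearman by re-ranking each column pair after dropping pairwise NaNs:

```python
import numpy as np
import pandas as pd

def fast_spearman(df):
    # Hypothetical helper: rank each column once, then take the much
    # faster Pearson path over the ranks. Equivalent to
    # df.corr(method="spearman") only for frames without NaNs.
    return df.rank().corr(method="pearson")

np.random.seed(1234)
df = pd.DataFrame(np.random.random((500, 20)))

# Matches the slow path on complete (NaN-free) data, as shown above
pd.testing.assert_frame_equal(df.corr(method="spearman"), fast_spearman(df))
```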

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.25.0
numpy : 1.16.4
pytz : 2019.1
dateutil : 2.8.0
pip : 19.1.1
setuptools : 41.0.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.5.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.0
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.0
sqlalchemy : None
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None

WillAyd (Member) commented Aug 25, 2019

Any chance you can profile to see where the overhead is? I think both are cythonized in pandas._libs.algos.pyx, so I'm surprised to see such a difference.

@WillAyd WillAyd added the Performance Memory or execution speed performance label Aug 25, 2019
@WillAyd WillAyd added this to the Contributions Welcome milestone Aug 25, 2019
dsaxton (Member, Author) commented Aug 25, 2019

> Any chance you can profile to see where the overhead is? I think both are cythonized in pandas._libs.algos.pyx so surprised to see such a difference

Here's the truncated output for both:

cProfile.run("df.corr(method='spearman')", sort="cumtime")

#          81453 function calls (81435 primitive calls) in 4.742 seconds
# 
#    Ordered by: cumulative time
# 
#    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
#         1    0.000    0.000    4.742    4.742 {built-in method builtins.exec}
#         1    0.000    0.000    4.742    4.742 <string>:1(<module>)
#         1    0.000    0.000    4.742    4.742 frame.py:7440(corr)
#         1    4.614    4.614    4.741    4.741 {pandas._libs.algos.nancorr_spearman}
#     10100    0.104    0.000    0.118    0.000 function_base.py:1149(diff)
#     20205    0.009    0.000    0.009    0.000 {built-in method numpy.array}
#     10103    0.004    0.000    0.009    0.000 numeric.py:469(asarray)
#     10101    0.003    0.000    0.008    0.000 numeric.py:541(asanyarray)
#     10100    0.004    0.000    0.004    0.000 {built-in method numpy.core._multiarray_umath.normalize_axis_index}
#     ...

cProfile.run("df.rank().corr(method='pearson')", sort="cumtime")

#          1017 function calls (987 primitive calls) in 0.117 seconds
# 
#    Ordered by: cumulative time
# 
#    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
#         1    0.000    0.000    0.117    0.117 {built-in method builtins.exec}
#         1    0.000    0.000    0.117    0.117 <string>:1(<module>)
#         1    0.000    0.000    0.070    0.070 frame.py:7440(corr)
#         1    0.069    0.069    0.069    0.069 {pandas._libs.algos.nancorr}
#         1    0.000    0.000    0.047    0.047 generic.py:8578(rank)
#         1    0.000    0.000    0.047    0.047 generic.py:8674(ranker)
#         1    0.000    0.000    0.046    0.046 algorithms.py:887(rank)
#         1    0.046    0.046    0.046    0.046 {pandas._libs.algos.rank_2d_float64}
#    ...

Looks like pretty much all of the time is spent in the Cythonized nancorr_spearman (it also seems strange that method="spearman" triggers tens of thousands of calls to NumPy functions like diff, which hints at work being repeated per column pair).

@dsaxton
Copy link
Member Author

dsaxton commented Aug 25, 2019

@WillAyd Not sure if this would fully explain it, but it looks to me like we're ranking the columns inside a nested for loop within nancorr_spearman: https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/algos.pyx#L328, in which case we'd be duplicating a lot of work since each column only needs to be ranked once. Am I reading this right?
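To illustrate the idea, here is a rough NumPy/SciPy sketch of ranking every column a single time and then running one Pearson pass over the rank matrix. The function name is illustrative and this deliberately ignores the pairwise-NaN handling that the real nancorr_spearman has to keep (which is why the actual fix can't be quite this simple):

```python
import numpy as np
from scipy.stats import rankdata

def spearman_rank_once(mat):
    # Rank each column exactly once (average ranks for ties) instead
    # of re-ranking inside the O(K^2) pairwise loop, then compute a
    # plain Pearson correlation of the ranks. Valid for complete
    # (NaN-free) input only.
    ranks = np.apply_along_axis(rankdata, 0, mat)
    return np.corrcoef(ranks, rowvar=False)
```

For NaN-free data this agrees with scipy.stats.spearmanr while doing the ranking work only K times rather than K^2 times.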

WillAyd (Member) commented Aug 25, 2019

I'm not overly familiar with this code, but at a quick glance I think your inclination is correct. It would probably be more efficient to do the ranking once outside the loop and handle it accordingly within.

If you've got a way to make it work, we would certainly take a PR.
