REGR: Performance regression on RollingGroupby #38038
Comments
@cpmbailey thanks for the report! I can confirm this; for me, the snippet (without the dataframe creation) takes 2s on pandas 1.0.5 and 20s on pandas 1.1.4. @mroeschke do you know if this is something "known"? From looking at the profile, it seems now the
And it seems that on master it slowed down a bit further.
Looking at a profile, it seems that most of the time comes from constructing the MultiIndex. Illustrating this with the
(The profiler adds a lot of overhead, roughly 3-4x in total time, so the relative numbers are not necessarily reliable, but the overall picture is certainly interesting.)
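For anyone wanting to reproduce a profile like the one discussed here, the following is a minimal sketch using the standard-library `cProfile`. The helper name `profile_rolling_groupby` and the reduced default row count are assumptions made for illustration; they are not taken from the thread, and the report will differ between pandas versions.

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd


def profile_rolling_groupby(n_rows=100_000, top=10):
    """Profile the groupby-rolling mean and return the report as text.

    `n_rows` defaults to a much smaller size than the 10000000 rows in the
    report (an assumption, to keep the profiled run short).
    """
    # Build the frame outside the profiled region, as in the timing above.
    df = pd.DataFrame({"a": np.random.randn(n_rows), "b": 1})

    profiler = cProfile.Profile()
    profiler.enable()
    df.groupby("b").rolling(3).mean()
    profiler.disable()

    # Sort by cumulative time so expensive call chains appear first.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()
```

Calling `print(profile_rolling_groupby())` shows the top entries; as noted above, profiler overhead means only the overall picture should be trusted.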
Edit: Messed something up. MultiIndex takes way longer
takes 40 seconds on my machine; the other example took 18-19.
My initial performance test case used only 1000 points, unlike this issue's example, where 10000000 points are used: #34052 (comment). I didn't anticipate it, but it makes sense that at this scale the creation of the resulting MultiIndex dominates the timings.
We could maybe avoid the inner loop, but that reduces the time only by 50% and does not get us anywhere near 2 seconds.
you can likely use
Indeed, my guess is that we should be able to reduce most of the time taken by the index creation by avoiding creating all the tuples.
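To illustrate the idea of avoiding per-row tuples, here is a small sketch comparing two public constructors, `MultiIndex.from_tuples` and `MultiIndex.from_arrays`. It is not the code used inside pandas or in any fix for this issue; the helper names, the level name `"b"`, and the array sizes are assumptions chosen to mirror the example in the report.

```python
import numpy as np
import pandas as pd


def index_from_tuples(group_keys, row_labels):
    # Materializes one Python tuple per row before building the index.
    tuples = list(zip(group_keys.tolist(), row_labels.tolist()))
    return pd.MultiIndex.from_tuples(tuples, names=["b", None])


def index_from_arrays(group_keys, row_labels):
    # Hands over whole arrays, so no per-row tuples are created.
    return pd.MultiIndex.from_arrays([group_keys, row_labels], names=["b", None])


n_rows = 1_000  # hypothetical size; the report uses 10000000 rows
group_keys = np.ones(n_rows, dtype="int64")    # a single group, like b == 1
row_labels = np.arange(n_rows, dtype="int64")  # original row positions

by_tuples = index_from_tuples(group_keys, row_labels)
by_arrays = index_from_arrays(group_keys, row_labels)
```

Both constructors produce an equivalent two-level index here; timing them at larger `n_rows` (for example with `timeit`) is one way to see how much the tuple construction costs.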
I made an attempt at rewriting this index creation, see #38057. It has somewhat different behaviour in a few corner cases (like the dtype of an empty index, and NaNs in the levels vs in the codes).
But it brings the time of the snippet above back to 2s, as it was for pandas 1.0.
pd.DataFrame({'a': np.random.randn(10000000), 'b': 1}).groupby('b').rolling(3).mean() is approximately 10x slower on 1.1.x than on 1.0.5
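The reported expression can be timed with a small helper like the sketch below. The function name `time_rolling_groupby` and its `n_rows` parameter are hypothetical; the helper excludes the dataframe creation from the measurement, and absolute timings will depend on the machine and pandas version.

```python
import time

import numpy as np
import pandas as pd


def time_rolling_groupby(n_rows=10_000_000):
    """Return the rolling-groupby result and the seconds it took to compute."""
    # Frame creation is excluded from the timed region.
    df = pd.DataFrame({"a": np.random.randn(n_rows), "b": 1})

    start = time.perf_counter()
    result = df.groupby("b").rolling(3).mean()
    elapsed = time.perf_counter() - start
    return result, elapsed
```

Running `time_rolling_groupby()` under two different pandas installations and comparing the elapsed values is one way to check the regression described here; passing a smaller `n_rows` gives a quicker sanity check.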