-
Notifications
You must be signed in to change notification settings - Fork 734
[ENH] Improve performance of TimeSeriesDataSet.__getitem__
#806
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Codecov Report
@@ Coverage Diff @@
## master #806 +/- ##
=======================================
Coverage 89.05% 89.06%
=======================================
Files 24 24
Lines 3829 3832 +3
=======================================
+ Hits 3410 3413 +3
Misses 419 419
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
|
I am tempted to merge this. Think we should run the example notebooks also because things might change there - even if only visual. |
|
any news on this? |
TimeSeriesDataSet.__getitem__
fkiraly
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I merged this manually, as I had already edited the same location in __getitem__, and the file has moved.
How would we know this is an actual performance improvement? Have you tested it, @jobs-git?
Let's see if the tests pass.
fkiraly
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It appears the changes in this PR break internal API assumptions in other methods, e.g., get_groups - so it cannot be merged in its current state.
Still worth to keep open as long as we are reworking for v2.
Description
Pandas DataFrame is quite slow in comparison to numpy due to additional checks.
By replacing it with np.recarray I was able to improve performance by 5-10%.
Recarray allows us to have nice attribute access as in pandas, while improving performance.
The raw numpy arrays are a bit faster than recarray, however the difference is not as big as between pandas and recarray.
I have tested on Demand Forecasting with gpu=1, 0 workers and pin_memory=True.