@denix56 denix56 commented Dec 21, 2021

Description

Accessing a pandas DataFrame is considerably slower than accessing a NumPy array because of the additional checks pandas performs.
By replacing the DataFrame with np.recarray I was able to improve performance by 5-10%.
A recarray still offers attribute access as in pandas, but with less overhead.
Raw NumPy arrays are slightly faster than a recarray, but that gap is much smaller than the one between pandas and recarray.
I tested this on Demand Forecasting with gpu=1, 0 workers and pin_memory=True.
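
The swap described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the index table and its column names (`index_start`, `sequence_length`) are hypothetical, and the only library calls assumed are `DataFrame.to_records`, `DataFrame.iloc`, and field access on a NumPy record.

```python
import numpy as np
import pandas as pd

# Hypothetical index table; the column names are illustrative only
# and are not taken from pytorch_forecasting.
index_df = pd.DataFrame(
    {
        "index_start": np.arange(0, 50, 5),
        "sequence_length": np.full(10, 5),
    }
)

# to_records() returns a np.recarray; index=False drops the pandas index.
index_rec = index_df.to_records(index=False)


def lookup_pandas(i):
    # Row lookup through pandas: builds a Series for every call.
    row = index_df.iloc[i]
    return int(row.index_start), int(row.sequence_length)


def lookup_recarray(i):
    # Row lookup through the recarray: returns a lightweight np.record
    # that still supports attribute access to its fields.
    row = index_rec[i]
    return int(row.index_start), int(row.sequence_length)
```

Both functions return the same values for the same row, so the attribute-style call sites can stay unchanged; only the container type differs. Any speed-up would have to be measured on the actual dataset.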

codecov-commenter commented Dec 28, 2021

Codecov Report

Merging #806 (eb706f9) into master (0b5892a) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master     #806   +/-   ##
=======================================
  Coverage   89.05%   89.06%           
=======================================
  Files          24       24           
  Lines        3829     3832    +3     
=======================================
+ Hits         3410     3413    +3     
  Misses        419      419           
Flag     Coverage Δ
cpu      89.06% <100.00%> (+<0.01%) ⬆️
pytest   89.06% <100.00%> (+<0.01%) ⬆️

Impacted Files                           Coverage Δ
pytorch_forecasting/data/timeseries.py   93.12% <100.00%> (+0.02%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

jdb78 (Collaborator) commented Feb 20, 2022

I am tempted to merge this. I think we should also run the example notebooks, because things might change there, even if only visually.

jobs-git (Contributor) commented Jun 7, 2025

Any news on this?

fkiraly changed the title from "Improve performance of __getitem__ of TimeSeriesDataSet" to "[ENH] Improve performance of TimeSeriesDataSet.__getitem__" on Jun 8, 2025

fkiraly (Collaborator) left a comment

I merged this manually, as I had already edited the same location in __getitem__, and the file has moved.

How would we know this is an actual performance improvement? Have you tested it, @jobs-git?
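
One way to answer the question above is to time `__getitem__` directly on both branches. The helper below is a hedged sketch: `time_getitem` is a hypothetical name, not part of pytorch_forecasting, and it works with any object that supports `len()` and integer indexing. A plain list stands in for the dataset here.

```python
import random
import timeit


def time_getitem(dataset, n_samples=1000, repeats=5, seed=0):
    """Return the best observed time per dataset[i] call, in seconds."""
    # Fix the random indices so both branches are timed on the same accesses.
    rng = random.Random(seed)
    indices = [rng.randrange(len(dataset)) for _ in range(n_samples)]

    def run():
        for i in indices:
            dataset[i]

    # Take the minimum over several repeats to reduce noise from other processes.
    timings = timeit.repeat(run, number=1, repeat=repeats)
    return min(timings) / n_samples


# Stand-in dataset; replace with the real dataset object on each branch.
toy_dataset = list(range(100))
per_item = time_getitem(toy_dataset)
print(f"{per_item * 1e6:.3f} µs per __getitem__ call")
```

Running the same helper against the dataset built from master and from the PR branch, with identical data and seed, would give comparable per-call numbers.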

Let's see if the tests pass.

fkiraly (Collaborator) left a comment

It appears the changes in this PR break internal API assumptions in other methods, e.g., get_groups, so it cannot be merged in its current state.

Still worth keeping open while we are reworking for v2.
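
The kind of breakage described above can be illustrated with a small sketch. It does not reproduce `get_groups`; `count_per_group` is a hypothetical helper that, like much DataFrame-based code, assumes `.groupby` is available, which a `np.recarray` does not provide.

```python
import numpy as np
import pandas as pd

# Hypothetical index table; the column names are illustrative only.
df = pd.DataFrame({"group": [0, 0, 1], "time": [0, 1, 0]})
rec = df.to_records(index=False)


def count_per_group(index):
    # Written against the DataFrame API: relies on .groupby().
    return index.groupby("group").size().to_dict()


def count_per_group_np(index):
    # Container-agnostic variant: field access by name works on both
    # a DataFrame and a recarray.
    values, counts = np.unique(index["group"], return_counts=True)
    return dict(zip(values.tolist(), counts.tolist()))


pandas_result = count_per_group(df)

recarray_failed = False
try:
    count_per_group(rec)
except AttributeError:
    # A recarray only exposes its fields as attributes, not pandas methods.
    recarray_failed = True

numpy_result = count_per_group_np(rec)
```

Every method that touches the swapped attribute would need a similar audit, or a variant that only uses operations common to both containers.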

5 participants