PERF: improve resample perf #7673

Merged 1 commit on Jul 7, 2014

Conversation

@sinhrks sinhrks commented Jul 5, 2014

Related to #7633. Performance is better than the result attached to #7633, but still more than 1.2 times slower than 0.14.0.

Modified:

  • Avoid importing the module on every call in Index.max/min.
  • Avoid the duplicated max call in resample/_get_time_bins and _get_range_edges.
  • Optimize lib/generate_bins_dt64 and tslib/period_asfreq_arr.

Remaining bottlenecks are NaT masking performed in lib/generate_bins_dt64 and tslib/period_asfreq_arr. Is there any better way to do that?

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
dataframe_resample_mean_numpy                |   4.9963 |   3.7940 |   1.3169 |
dataframe_resample_mean_string               |   5.0424 |   3.8280 |   1.3172 |
dataframe_resample_max_numpy                 |   4.1796 |   3.0069 |   1.3900 |
dataframe_resample_min_numpy                 |   4.2127 |   2.9987 |   1.4049 |
dataframe_resample_min_string                |   4.1687 |   2.9490 |   1.4136 |
dataframe_resample_max_string                |   4.3443 |   2.9283 |   1.4835 |
timeseries_timestamp_downsample_mean         |  16.1959 |   8.6366 |   1.8753 |
timeseries_period_downsample_mean            |  47.6096 |  19.7030 |   2.4164 |
-------------------------------------------------------------------------------


Ratio < 1.0 means the target commit is faster than the baseline.
Seed used: 1234

Target [54fb875] : PERF: Improve index.min and max perf
Base   [da0f7ae] : RLS: 0.14.0 final
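
For reading the table, the ratio column appears to be the head time divided by the base time, so the first row can be checked by hand. This is only a sketch of that arithmetic; the helper name `perf_ratio` is made up for illustration and is not part of the benchmark tooling.

```python
def perf_ratio(head_ms, base_ms):
    """Return head/base; values above 1.0 mean the target commit is slower."""
    return head_ms / base_ms


# Values copied from the dataframe_resample_mean_numpy row above.
ratio = perf_ratio(4.9963, 3.7940)
print(round(ratio, 4))
```

A ratio of about 1.32 here means the target commit takes roughly 32% longer than the 0.14.0 baseline on that benchmark.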

else:
max_stamp = masked.asi8.max()
try:
max_stamp = self[self.asi8 != tslib.iNaT].asi8.max()
Since NaTs are normally not present, you should check mask.any() first.
If that is False, you don't need to do self[mask]; you can just use self.
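
The suggestion above can be sketched on a plain NumPy int64 array. This is a minimal illustration, not the code that was merged: the function name `max_ignoring_nat` is hypothetical, `INAT` stands in for `tslib.iNaT` (the minimum int64 value), and returning `INAT` for an all-NaT input is an assumption made only to keep the example total.

```python
import numpy as np

# Stand-in for tslib.iNaT: NaT is stored as the minimum int64 value.
INAT = np.iinfo(np.int64).min


def max_ignoring_nat(values):
    """Return the max of an int64 array, skipping iNaT entries."""
    mask = values == INAT
    if mask.any():
        # Slow path: only pay for the boolean-indexed copy when NaT is present.
        values = values[~mask]
    if values.size == 0:
        # Assumption for this sketch: an all-NaT input yields iNaT.
        return INAT
    # Fast path: in the common NaT-free case the original array is used as is.
    return values.max()


stamps = np.array([3, INAT, 7], dtype=np.int64)
print(max_ignoring_nat(stamps))
```

The point of the check is that the comparison and `any()` are cheap relative to building a filtered copy on every call.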

sinhrks commented Jul 6, 2014

Thanks, modified. Based on cProfile, min/max no longer seems to be critical.

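-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
dataframe_resample_mean_string               |   4.8401 |   3.6057 |   1.3423 |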
dataframe_resample_max_string                |   3.9593 |   2.8160 |   1.4060 |
dataframe_resample_min_string                |   4.2106 |   2.8293 |   1.4882 |
dataframe_resample_mean_numpy                |   5.5540 |   3.6797 |   1.5094 |
dataframe_resample_max_numpy                 |   5.5197 |   3.4370 |   1.6059 |
dataframe_resample_min_numpy                 |   6.0006 |   3.1863 |   1.8832 |
timeseries_timestamp_downsample_mean         |  20.7520 |  10.1076 |   2.0531 |
timeseries_period_downsample_mean            |  45.5034 |  19.4256 |   2.3424 |

jreback commented Jul 6, 2014

It's the resampling; this shouldn't be 2x slower.

sinhrks commented Jul 6, 2014

The period resample issue has been solved.

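-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
timeseries_period_downsample_mean            |  20.4194 |  19.6237 |   1.0405 |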
dataframe_resample_min_string                |   3.9610 |   3.1497 |   1.2576 |
dataframe_resample_mean_numpy                |   4.8030 |   3.6697 |   1.3088 |
dataframe_resample_max_numpy                 |   4.0140 |   3.0653 |   1.3095 |
dataframe_resample_mean_string               |   4.7546 |   3.6027 |   1.3198 |
dataframe_resample_min_numpy                 |   4.0313 |   2.8987 |   1.3907 |
dataframe_resample_max_string                |   3.9481 |   2.7956 |   1.4122 |
timeseries_timestamp_downsample_mean         |  16.6246 |   8.6427 |   1.9235 |

@jreback jreback added this to the 0.14.1 milestone Jul 6, 2014
jreback commented Jul 6, 2014

OK, looks a lot better. Ping me when green (and see if the perf diff in timeseries resample is fixable).
That is used quite a bit.

jreback added a commit that referenced this pull request Jul 7, 2014
@jreback jreback merged commit f6ba5c4 into pandas-dev:master Jul 7, 2014
jreback commented Jul 7, 2014

@sinhrks merged, but still have a look at the resample perf degradation.

@sinhrks sinhrks deleted the minmax_perf branch July 9, 2014 12:37
Labels: Performance