PERF: performance problem when comparing Timestamp to DatetimeIndex #52080

Closed
3 tasks done
phofl opened this issue Mar 19, 2023 · 6 comments · Fixed by #52111
Labels
Non-Nano (datetime64/timedelta64 with non-nanosecond resolution), Performance (memory or execution speed performance), Timestamp (pd.Timestamp and associated methods)
Milestone

Comments

@phofl
Member

phofl commented Mar 19, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this issue exists on the latest version of pandas.

  • I have confirmed this issue exists on the main branch of pandas.

Reproducible Example

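import pandas as pd

# 100,000 timestamps at one-second frequency; the index has nanosecond resolution
rg = pd.date_range("2020-01-01", periods=100_000, freq="s")

ts_ns = pd.Timestamp("1996-01-01 00:00:00.00000000000")  # nanosecond resolution
ts_s = pd.Timestamp("1996-01-01")  # second resolution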

Following timings:

%timeit rg < ts_s
2.27 ms ± 44.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit rg < ts_ns
108 µs ± 572 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

I guess a lot of users will define timestamps with less than nanosecond precision and hence get mismatched resolutions, which causes a really big slowdown (roughly 20x in the timings above). Can we fix this somehow for 2.0?

Time is almost exclusively spent in

{pandas._libs.tslibs.np_datetime.compare_mismatched_resolutions}
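For readers outside IPython, the comparison above can be timed with the standard-library `timeit` module. This is a minimal sketch, assuming pandas 2.0 or later (for `Timestamp.as_unit` and `DatetimeIndex.unit`). The helper name `seconds_per_loop` is made up for illustration, and `as_unit("ns")` is used as a stand-in for the long fractional-seconds string in the reproducer. Absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

rg = pd.date_range("2020-01-01", periods=100_000, freq="s")

ts_s = pd.Timestamp("1996-01-01")  # unit inferred from the string
ts_ns = ts_s.as_unit("ns")         # same instant, nanosecond unit


def seconds_per_loop(ts, number=100):
    """Average wall-clock seconds for one `rg < ts` comparison."""
    return timeit.timeit(lambda: rg < ts, number=number) / number


# Print the units first so a mismatch (or its absence) is visible.
print(f"index unit: {rg.unit}, ts_s unit: {ts_s.unit}, ts_ns unit: {ts_ns.unit}")
print(f"rg < ts_s : {seconds_per_loop(ts_s) * 1e6:8.1f} µs per loop")
print(f"rg < ts_ns: {seconds_per_loop(ts_ns) * 1e6:8.1f} µs per loop")
```

Both comparisons return the same boolean mask; only the time taken is expected to differ when the units do not match.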

cc @jbrockmendel @MarcoGorelli

Installed Versions

main

Prior Performance

No response

@phofl added the Performance and Needs Triage labels Mar 19, 2023
@phofl added this to the 2.0 milestone Mar 19, 2023
@phofl added the Non-Nano label Mar 19, 2023
@jbrockmendel
Member

could do a try/except for lossless conversion to shared reso and fall back to compare_mismatched_resolutions
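The suggestion above can be illustrated outside pandas with plain NumPy `datetime64` values. This is only a sketch of the idea, not the change that was eventually merged: the function names (`lossless_cast`, `slow_less_than`, `less_than`) are hypothetical, only four units are handled, NaT is ignored, and `slow_less_than` is a deliberately simple stand-in for the slow mismatched-resolution path.

```python
import numpy as np

# Nanoseconds per unit, for the few units this sketch supports.
_NS_PER_UNIT = {"s": 10**9, "ms": 10**6, "us": 10**3, "ns": 1}


def _unit(dtype):
    """Return the unit string ('s', 'ns', ...) of a datetime64 dtype."""
    return np.datetime_data(dtype)[0]


def lossless_cast(scalar, unit):
    """Cast `scalar` to `unit`; raise ValueError if the value would change."""
    converted = scalar.astype(f"datetime64[{unit}]")
    # The round trip fails on truncation; a wrapped overflow would fail it too.
    if converted.astype(scalar.dtype) != scalar:
        raise ValueError(f"cannot losslessly convert {scalar!r} to {unit!r}")
    return converted


def slow_less_than(values, scalar):
    """Exact but slow fallback: compare as arbitrary-precision Python ints."""
    v_mult = _NS_PER_UNIT[_unit(values.dtype)]
    s_ns = int(scalar.astype("int64")) * _NS_PER_UNIT[_unit(scalar.dtype)]
    return np.array([int(v) * v_mult < s_ns for v in values.astype("int64")])


def less_than(values, scalar):
    """Try the fast same-unit comparison first, then fall back."""
    try:
        same_unit = lossless_cast(scalar, _unit(values.dtype))
    except ValueError:
        return slow_less_than(values, scalar)
    return values < same_unit  # vectorized, single-unit comparison


# Ten one-second steps stored with nanosecond resolution.
values = np.datetime64("2020-01-01", "ns") + np.arange(10) * np.timedelta64(1, "s")

# Fast path: a second-resolution scalar converts to ns without loss.
fast = less_than(values, np.datetime64("2020-01-01T00:00:05", "s"))

# Fallback: a scalar with a non-zero nanosecond part cannot become seconds.
coarse = values.astype("datetime64[s]")
fine = np.datetime64("2020-01-01T00:00:05.000000001", "ns")
slow = less_than(coarse, fine)
```

The point of the design is that the common case (a coarse scalar against a fine array) never touches the slow path, while inputs that cannot be represented in the shared unit still get a correct answer.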

@MarcoGorelli
Member

thanks for noticing!

tbh I'm a bit surprised that pd.date_range("2020-01-01", periods=100_000, freq="s") isn't of unit 's' - if it was then the performance issue would be addressed (you can try this by passing unit='s' to date_range)
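A short sketch of that experiment, assuming pandas 2.0 or later, where `date_range` accepts a `unit` argument. The variable names are illustrative only.

```python
import pandas as pd

# Ask for second resolution explicitly instead of the default.
rg_s = pd.date_range("2020-01-01", periods=100_000, freq="s", unit="s")
ts_s = pd.Timestamp("1996-01-01")

print(rg_s.dtype)   # resolution of the index
print(ts_s.unit)    # resolution inferred for the scalar

# When both sides share a unit, no mismatched-resolution handling is needed.
mask = rg_s < ts_s
```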

@phofl
Member Author

phofl commented Mar 20, 2023

This is just a small reproducer; the initial problem came from parquet files where the timestamps were stored with ns reso
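When the data's resolution is fixed by the file, as described here, one possible user-side workaround (not something proposed in this thread) is to convert the scalar to the index's unit before comparing. A minimal sketch, assuming pandas 2.0 or later; the `date_range` call with `unit="ns"` merely stands in for timestamps loaded from a parquet file.

```python
import pandas as pd

# Stand-in for an index read from parquet with nanosecond resolution.
idx = pd.date_range("2020-01-01", periods=100_000, freq="s", unit="ns")

ts = pd.Timestamp("1996-01-01")

# round_ok=False makes a lossy conversion raise instead of rounding silently.
ts_matched = ts.as_unit(idx.unit, round_ok=False)

mask = idx < ts_matched
```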

@MarcoGorelli
Member

could do a try/except for lossless conversion to shared reso and fall back to compare_mismatched_resolutions

is this something you have time to take on?

@jbrockmendel
Member

I think so yes

@jbrockmendel
Member

tbh I'm a bit surprised that pd.date_range("2020-01-01", periods=100_000, freq="s") isn't of unit 's'

I considered inferring reso in date_range, but it became really messy because you could have start/end with different resos (which themselves might be inferred or already present in Timestamps).
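To make the ambiguity concrete, here is a small sketch, assuming pandas 2.0 or later. The units are set explicitly with `as_unit` so the example does not depend on how a given version infers them from strings.

```python
import pandas as pd

start = pd.Timestamp("2020-01-01").as_unit("s")
end = pd.Timestamp("2020-01-05").as_unit("ms")

# Two endpoints, two resolutions: which one should the range inherit?
print(start.unit, end.unit)

# Passing `unit` explicitly sidesteps the question.
idx = pd.date_range(start, end, freq="D", unit="ms")
print(idx.unit, len(idx))
```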

@DeaMariaLeon added the Timestamp label and removed the Needs Triage label Mar 21, 2023
4 participants