BUG: Fix Timestamp constructor changes value on ambiguous DST #30995

Merged

@AlexKirko
Member Author

AlexKirko commented Jan 14, 2020

This took some digging, so please bear with me.

Short version
When we make a Timestamp from an epoch time that corresponds to the second occurrence of an ambiguous DST time, the Timestamp is created with the correct Timestamp.value. However, when we call the Timestamp constructor again on this Timestamp, this is what happens:

  1. convert_datetime_to_tsobject in conversion.pyx gets called.
  2. It eventually calls npy_datetimestruct_to_datetime, which doesn't care about DST: it simply adds up the date components after converting them to seconds.
  3. Then we attempt to shift Timestamp.value using get_utcoffset. It looks at the state of its arguments and, if we use dateutil, assumes that we are in DST. pytz assumes we aren't in DST, which means that only dateutil breaks: we end up shifting Timestamp.value by the DST timedelta (npy_datetimestruct_to_datetime doesn't shift, and then get_utcoffset does).
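
The arithmetic in the three steps above can be sketched in plain Python. This is only an illustration under stated assumptions: `naive_fields_to_ns` and `reconstruct_value` are hypothetical stand-ins for what npy_datetimestruct_to_datetime and the get_utcoffset shift do, not the actual pandas code, and the two offsets are hard-coded to what dateutil and pytz are described as assuming for the ambiguous hour.

```python
import calendar
from datetime import datetime, timedelta

NS_PER_SECOND = 1_000_000_000


def naive_fields_to_ns(dt):
    """Hypothetical stand-in for step 2: add up the wall-clock fields
    as if they were UTC, ignoring tzinfo and DST."""
    return calendar.timegm(dt.timetuple()) * NS_PER_SECOND + dt.microsecond * 1000


def reconstruct_value(wall, assumed_utcoffset):
    """Hypothetical stand-in for step 3: shift the naive sum by the
    UTC offset that the timezone library reports."""
    shift_ns = int(assumed_utcoffset.total_seconds()) * NS_PER_SECOND
    return naive_fields_to_ns(wall) - shift_ns


# Second occurrence of 01:00 on 2013-10-27 in London: DST is already off,
# so the wall clock equals UTC and the naive sum is the correct value.
wall = datetime(2013, 10, 27, 1, 0)
correct_value = naive_fields_to_ns(wall)

dateutil_like = reconstruct_value(wall, timedelta(hours=1))  # assumes DST
pytz_like = reconstruct_value(wall, timedelta(0))            # assumes no DST

print(correct_value - dateutil_like)  # one hour, in nanoseconds
print(correct_value - pytz_like)      # no difference
```

With the DST assumption the reconstructed value lands one hour before the original, which is the change in Timestamp.value reported in the linked issue.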

Notes
The core of the problem is that when we go from summer to winter time in DST timezones, the same time occurs twice, because the clock is moved back one hour at a certain time. In Europe/London (which is used in the test), the clocks were moved back on 27 October 2013 at 2:00 A.M. This means that we get 1:00 A.M. twice: before and after the shift.
dateutil handles the move differently from pytz here. For the ambiguous period it assumes that we are still in DST, and this is why the bug manifested only with dateutil timezones.
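
To make the ambiguity concrete, here is a small standard-library sketch. `London2013` is a hypothetical, hard-coded tzinfo that models only the 2013-10-27 transition (it is neither dateutil nor pytz). It uses the `fold` attribute of Python's datetime to tell the two occurrences of 01:00 apart, and treats fold=0 as still being in DST, which is the assumption described above for dateutil.

```python
from datetime import datetime, timedelta, tzinfo

HOUR = timedelta(hours=1)
ZERO = timedelta(0)


class London2013(tzinfo):
    """Hypothetical model of Europe/London around 2013-10-27 only."""

    # Wall-clock bounds of the hour that occurs twice.
    AMBIG_START = datetime(2013, 10, 27, 1, 0)
    AMBIG_END = datetime(2013, 10, 27, 2, 0)

    def is_ambiguous(self, dt):
        naive = dt.replace(tzinfo=None)
        return self.AMBIG_START <= naive < self.AMBIG_END

    def dst(self, dt):
        naive = dt.replace(tzinfo=None)
        if naive < self.AMBIG_START:
            return HOUR  # summer time
        if naive < self.AMBIG_END:
            # fold=0 is the first occurrence (DST), fold=1 the second.
            return HOUR if dt.fold == 0 else ZERO
        return ZERO  # winter time

    def utcoffset(self, dt):
        return self.dst(dt)  # standard offset for London is zero

    def tzname(self, dt):
        return "BST" if self.dst(dt) else "GMT"


tz = London2013()
first = datetime(2013, 10, 27, 1, 0, tzinfo=tz, fold=0)
second = datetime(2013, 10, 27, 1, 0, tzinfo=tz, fold=1)

# Same wall-clock fields, but two different instants one hour apart.
print(first.utcoffset(), second.utcoffset())
print(second.timestamp() - first.timestamp())
```

Any code that rebuilds the instant from the wall-clock fields alone has to pick one of these two offsets, and that choice is where dateutil and pytz differ.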

PS
I've come upon another bug while working on this, but I think it's out-of-scope for this PR and would like to handle it in a new issue and PR:

IN:
import pandas as pd
t = pd.Timestamp('2013-10-27 01:00:00+0000', tz='dateutil/Europe/London')
t

OUT:
Timestamp('2013-10-27 01:00:00+0100', tz='dateutil/GB-Eire')

@AlexKirko AlexKirko force-pushed the FIX-timestamp-constructor-ambiguous branch from b4bec5f to 2abc37b Compare January 14, 2020 07:27
@AlexKirko
Member Author

AlexKirko commented Jan 14, 2020

Wasn't able to reproduce the test failure on Windows. Will need to set up a dev environment in Linux and try to reproduce it. Ideas on why this might be happening would be appreciated.
Update: for some reason a dateutil timezone object belongs to different classes on Windows and Linux/Mac. Both classes are called tzfile, which makes aliases in the import section necessary.
Update 2: imported dateutil.tz.tzfile instead, same way it's used in timezones.pyx. Works like a charm.

# Two checks in the if are necessary because the class
# differs on Windows and Linux
if (isinstance(ts.tzinfo, du_tzfile1) or
        isinstance(ts.tzinfo, du_tzfile2)):

This comment was marked as outdated.

        isinstance(ts.tzinfo, du_tzfile2)):
    if ts.tzinfo.is_ambiguous(ts):
        dst_offset = ts.tzinfo.dst(ts)
        obj.value += int(dst_offset.total_seconds() * 1e9)
Member Author

Because pydatetime_to_dt64 doesn't take DST into account but get_utcoffset does, we need to add the DST timedelta to the result of pydatetime_to_dt64. This is necessary only for dateutil, because for ambiguous dates pytz immediately switches off DST and reduces UTC offset, so no correction is necessary for pytz timezones.
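
A rough restatement of this correction in plain Python, for the second occurrence of an ambiguous time only. `value_after_fix` and `to_ns` are hypothetical helpers written for illustration, not the Cython code in conversion.pyx; the boolean flags stand in for the treat-as-dateutil check and for `is_ambiguous`.

```python
from datetime import timedelta

NS_PER_US = 1_000


def to_ns(delta):
    """Convert a timedelta to integer nanoseconds without using floats."""
    return (delta // timedelta(microseconds=1)) * NS_PER_US


def value_after_fix(naive_sum_ns, utcoffset, dst_offset, is_dateutil, is_ambiguous):
    """Hypothetical restatement of the corrected flow."""
    value = naive_sum_ns
    if is_dateutil and is_ambiguous:
        # dateutil reports the DST offset for the ambiguous hour, so add the
        # DST delta first; the utcoffset subtraction below then cancels it.
        value += to_ns(dst_offset)
    value -= to_ns(utcoffset)
    return value


HOUR = timedelta(hours=1)
naive_sum = 1_382_835_600 * 1_000_000_000  # 2013-10-27 01:00:00 summed as if UTC

# dateutil-style: ambiguous hour reported as DST (+01:00), correction applied.
dateutil_value = value_after_fix(naive_sum, HOUR, HOUR, True, True)
# pytz-style: ambiguous hour reported as standard time (+00:00), no correction.
pytz_value = value_after_fix(naive_sum, timedelta(0), timedelta(0), False, True)

print(dateutil_value == pytz_value == naive_sum)
```

Both paths end at the same value, which is why the separate dateutil and pytz expectations in the test can be merged.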

Contributor

can you add a comment to this effect here

-        assert times[-1] == Timestamp("2013-10-27 01:00:00+0100", tz=tz, freq="H")
-    else:
-        assert times[-1] == Timestamp("2013-10-27 01:00:00+0000", tz=tz, freq="H")
+    assert times[-1] == Timestamp("2013-10-27 01:00:00+0000", tz=tz, freq="H")
Member Author

Now no matter whether we use dateutil or pytz timezones, we get the same date range, so separate testing conditions are no longer necessary.

@@ -5,6 +5,8 @@ cimport numpy as cnp
from numpy cimport int64_t, int32_t, intp_t, ndarray
cnp.import_array()

from dateutil.tz import tzfile as _dateutil_tzfile
Member Author

This is used in timezones.pyx when we need to check for dateutil timezone and make sure it works both on Windows and Linux. Seems to work.

# GH 24329 Take DST offset into account
# use dateutil.tz.tzfile to check type
if ts.tzinfo is not None:
    if isinstance(ts.tzinfo, _dateutil_tzfile):
Member

Think you can use treat_tz_as_dateutil

cdef inline bint treat_tz_as_dateutil(object tz):

Member

And can also remove the if ts.tzinfo is not None check if you use treat_tz_as_dateutil directly.

@mroeschke
Member

Does this also fix the case for nonexistent times near DST?

#24329 (comment)

@mroeschke mroeschke added the Timezones (Timezone data dtype) and Bug labels Jan 14, 2020
@AlexKirko
Member Author

@mroeschke Unfortunately, no. Shifts it back by an hour. I would have to look into why this happens.

@AlexKirko
Member Author

AlexKirko commented Jan 14, 2020

@mroeschke Okay, looked into it. You'll be happy to know that, according to dateutil, summer time starts 128 nanoseconds before 2 A.M. Looks to be a bug in dateutil that isn't connected to the issue that this PR is trying to handle. The bug manifests starting 128 nanoseconds before 2 A.M., both on master and on my branch.
Details
This is fine:

>>> pd.__version__
'0.26.0.dev0+1773.g664d928fd'
>>> epoch =  1552211999999999871
>>> t = pd.Timestamp(epoch, tz='dateutil/US/Pacific')
>>> t
Timestamp('2019-03-10 01:59:59.999999871-0800', tz='dateutil/US/Pacific')
>>> t.value
1552211999999999871
>>> pd.Timestamp(t)
Timestamp('2019-03-10 01:59:59.999999871-0800', tz='dateutil/US/Pacific')
>>> pd.Timestamp(t).value
1552211999999999871

This is also fine:

>>> epoch =  1552212000000000000
>>> t = pd.Timestamp(epoch, tz='dateutil/US/Pacific')
>>> t
Timestamp('2019-03-10 03:00:00-0700', tz='dateutil/US/Pacific')
>>>
>>> t.value
1552212000000000000
>>> pd.Timestamp(t)
Timestamp('2019-03-10 03:00:00-0700', tz='dateutil/US/Pacific')
>>> pd.Timestamp(t).value
1552212000000000000

Meanwhile, this breaks representation and gets us nonexistent times:

>>> epoch =  1552211999999999872
>>> t = pd.Timestamp(epoch, tz='dateutil/US/Pacific')
>>> t
Timestamp('2019-03-10 01:59:59.999999872-0700', tz='dateutil/US/Pacific')
>>> t.value
1552211999999999872
>>> pd.Timestamp(t)
Timestamp('2019-03-10 01:59:59.999999872-0800', tz='dateutil/US/Pacific')
>>> pd.Timestamp(t).value
1552208399999999872

And right on the cusp, the value breaks too:

>>> epoch =  1552211999999999999
>>> t = pd.Timestamp(epoch, tz='dateutil/US/Pacific')
>>> t
Timestamp('2019-03-10 01:59:59.999999999-0700', tz='dateutil/US/Pacific')
>>> t.value
1552211999999999999
>>> pd.Timestamp(t)
Timestamp('2019-03-10 01:59:59.999999999-0800', tz='dateutil/US/Pacific')
>>> pd.Timestamp(t).value
1552208399999999999

So the bug is there, but it turns out to be unrelated and matters only for 128 nanoseconds before the switch to summer time.
I can make a separate issue, but frankly, I think it should be a dateutil issue. The chances that a user will encounter this are minuscule, so it's not like we have to compensate for dateutil.
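
As a quick check on the numbers above: the round trip through the constructor moves the value by exactly one hour. Separately, and purely as a guess that nothing in this thread confirms, 128 is half the spacing between adjacent float64 values at this magnitude (256 ns), so an integer-to-float conversion somewhere in the offset lookup would produce exactly this threshold. The sketch below only demonstrates the arithmetic; it does not show where, or whether, dateutil or pandas performs such a conversion.

```python
transition = 1552212000000000000  # 2019-03-10 03:00:00-0700, from the examples above
broken_in = 1552211999999999872   # 128 ns before the transition
broken_out = 1552208399999999872  # value after calling pd.Timestamp(t) again
fine_in = 1552211999999999871     # 129 ns before the transition

# The round trip shifts the value by exactly one hour.
shift_ns = broken_in - broken_out
print(shift_ns == 3600 * 10**9)

# float64 cannot tell the value 128 ns before the transition apart from the
# transition itself, while the value 129 ns before rounds the other way.
print(float(broken_in) == float(transition))
print(float(fine_in) == float(transition))
```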

@@ -382,6 +382,11 @@ cdef _TSObject convert_datetime_to_tsobject(datetime ts, object tz,
        obj.tzinfo = tz
    else:
        obj.value = pydatetime_to_dt64(ts, &obj.dts)
        # GH 24329 Take DST offset into account
        if treat_tz_as_dateutil(ts.tzinfo):
Member Author

@mroeschke Switched to treat_tz_as_dateutil. Much cleaner now, thank you.

@AlexKirko AlexKirko requested a review from mroeschke January 14, 2020 18:37
Member

@mroeschke mroeschke left a comment

Nice investigative work! LGTM.

Yes, please make a separate issue for the nonexistent cases. Your examples are really good.

@@ -968,6 +968,7 @@ Datetimelike
- Bug in :func:`date_range` with custom business hours as ``freq`` and given number of ``periods`` (:issue:`30593`)
- Bug in :class:`PeriodIndex` comparisons with incorrectly casting integers to :class:`Period` objects, inconsistent with the :class:`Period` comparison behavior (:issue:`30722`)
- Bug in :meth:`DatetimeIndex.insert` raising a ``ValueError`` instead of a ``TypeError`` when trying to insert a timezone-aware :class:`Timestamp` into a timezone-naive :class:`DatetimeIndex`, or vice-versa (:issue:`30806`)
- Bug in :class:`Timestamp` where constructing :class:`Timestamp` from ambiguous epoch time and calling constructor again changed :meth:`Timestamp.value` property (:issue:`24329`)
Member

Sorry one more thing, can you move to whatsnew version 1.1.0?

@mroeschke mroeschke added this to the 1.1 milestone Jan 14, 2020
Contributor

@jreback jreback left a comment

minor comments, otherwise lgtm (ex moving the whatsnew to 1.1). @mroeschke pls merge when satisfied.

@AlexKirko AlexKirko force-pushed the FIX-timestamp-constructor-ambiguous branch from 2a14b57 to 322178c Compare January 15, 2020 07:25
@AlexKirko AlexKirko force-pushed the FIX-timestamp-constructor-ambiguous branch from d980836 to 85767af Compare January 15, 2020 07:39
# GH 24329 When datetime is ambiguous,
# pydatetime_to_dt64 doesn't take DST into account
# but with dateutil timezone, get_utcoffset does
# so we need to correct for it
Member Author

Added a comment explaining the reasoning for this solution. Tried to make it as brief as possible.

@@ -59,6 +59,7 @@ Categorical

Datetimelike
^^^^^^^^^^^^
- Bug in :class:`Timestamp` where constructing :class:`Timestamp` from ambiguous epoch time and calling constructor again changed :meth:`Timestamp.value` property (:issue:`24329`)
Member Author

Moved to whatsnew for v1.1.0

@AlexKirko
Member Author

@mroeschke Made the required changes. Also performed the final cleanup: rebased onto updated master and squashed tiny commits.
Please take a look and see whether it's ready for merging.

@mroeschke mroeschke merged commit bc9d329 into pandas-dev:master Jan 15, 2020
@mroeschke
Member

Thanks @AlexKirko!

If you'd like to continue investigating #31043, it would be much appreciated.

AlexKirko added a commit to AlexKirko/pandas that referenced this pull request Jan 15, 2020
test_dti_construction_nonexistent_endpoint had an expected fail
because of unresolved 24329. This fix in conjunction with pandas-dev#30995
means it now returns the expected value and does not fail
@AlexKirko AlexKirko deleted the FIX-timestamp-constructor-ambiguous branch January 16, 2020 12:40
Labels
Bug Timezones Timezone data dtype
Development

Successfully merging this pull request may close these issues.

BUG: Timestamp(Timestamp(Ambiguous time)) modifies .value with dateutil tz
3 participants