
BUG: Avoid Timedelta rounding when specifying unit and integer (#12690) #19732


Merged · 20 commits merged into pandas-dev:master from mroeschke:timedelta_rounding on May 21, 2018

Conversation

@mroeschke (Member) commented Feb 16, 2018

@mroeschke mroeschke changed the title BUG: Avoid rounding when specifying unit and integer (#12690) BUG: Avoid Timedelta rounding when specifying unit and integer (#12690) Feb 16, 2018
@pep8speaks commented Feb 16, 2018

Hello @mroeschke! Thanks for updating the PR.

Cheers! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on May 19, 2018 at 23:42 UTC

result = Timedelta(value, unit=unit)
assert result.value == expected
result = Timedelta(str(value) + unit)
assert result.value
Review comment (Member):

looks like the last line is missing == expected

@@ -716,6 +716,7 @@ Datetimelike
- Bug in :class:`Timestamp` and :func:`to_datetime` where a string representing a barely out-of-bounds timestamp would be incorrectly rounded down instead of raising ``OutOfBoundsDatetime`` (:issue:`19382`)
- Bug in :func:`Timestamp.floor` :func:`DatetimeIndex.floor` where time stamps far in the future and past were not rounded correctly (:issue:`19206`)
- Bug in :func:`to_datetime` where passing an out-of-bounds datetime with ``errors='coerce'`` and ``utc=True`` would raise ``OutOfBoundsDatetime`` instead of parsing to ``NaT`` (:issue:`19612`)
- Bug in :class:`Timedelta`: where a numerical value with a unit would round values (:issue: `12690`)
Review comment (Member):

there's an extra colon at the end of :class:`Timedelta`: that should be removed, and the space between :issue: and the number should also be removed

@mroeschke (Member, author):

Although I believe these rounding constants are correct, it appears there are now precision issues for larger values.

# Example from the failing test

In [6]: v
Out[6]: 946688461.0005

In [7]: pd._libs.tslibs.timedeltas.cast_from_unit(v, 's')
Out[7]: 946688461000499968
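For reference, the same numbers reproduce in plain Python, which shows the error comes from the float64 representation itself rather than from cast_from_unit:

v = 946688461.0005

# The fractional part is already off before any rounding happens:
print(v - int(v))      # 0.0004999637603759766

# so a direct conversion to nanoseconds inherits the error:
print(int(v * 10**9))  # 946688461000499968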

@@ -200,22 +200,22 @@ cpdef inline int64_t cast_from_unit(object ts, object unit) except? -1:

if unit == 'D' or unit == 'd':
m = 1000000000L * 86400
p = 6
Review comment (Contributor):

these were set originally to avoid precision issues.
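(For context: in cast_from_unit, m is the number of nanoseconds per unit and p is the number of fractional decimal digits kept when rounding. A rough Python sketch of the scheme; the names mirror the cython code, the values are illustrative:)

m = 1000000000 * 86400            # nanoseconds per day (unit 'D')
p = 6                             # fractional digits kept

ts = 1.5                          # e.g. 1.5 days
base = int(ts)                    # whole units: 1
frac = round(ts - base, p)        # 0.5
print(base * m + int(frac * m))   # 129600000000000 ns == 1.5 days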

@@ -229,10 +229,10 @@ cpdef inline int64_t cast_from_unit(object ts, object unit) except? -1:
    # cast the unit, multiply base/frac separately
    # to avoid precision issues from float -> int
    base = <int64_t> ts
-   frac = ts -base
+   frac = ts - base
    if p:
        frac = round(frac, p)
Review comment (Contributor):

you might be able to do something like this:

round(frac * 1000, p - 3) // 1000 (for p >= 3), but haven't looked really closely.

@jreback jreback added Bug Timedelta Timedelta data type labels Feb 18, 2018
@mroeschke (Member, author):

Unfortunately using a 1000 multiplier didn't solve the issue, @jreback.

The issue with the example above is with floating point errors before rounding.

In [2]: v = 946688461.0005

In [3]: v - int(v)
Out[3]: 0.0004999637603759766

In [4]: round(v - int(v), 6) # precision in master
Out[4]: 0.0005

In [5]: round(v - int(v), 9) # precision in this PR (in order to preserve nanoseconds if specified)
Out[5]: 0.000499964

Would it be reasonable to cast v to a string in order to parse the whole number from the decimal?

In [8]: str(v).split('.')
Out[8]: ['946688461', '0005']

@jreback (Contributor) commented Feb 25, 2018

you cannot use strings; rather you might look at how CPython does this
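(CPython's float rounding is built on correctly rounded float-to-string conversion, so at the Python level the idea being alluded to can be sketched as follows; this is only an illustration, not the route the PR ended up taking:)

from decimal import Decimal

v = 946688461.0005
# repr(v) is the shortest decimal string that round-trips to v,
# so parsing it recovers the digits the user most likely typed:
print(Decimal(repr(v)) - int(v))   # 0.0005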

Commits pushed:
- Add tests
- Add whatsnew
- flake8
- Add additional formatting
- remove tabs
- address review
@jorisvandenbossche (Member):

To come back to this, the question is whether we should care. In the example of v = 946688461.0005, even the full number (not only the split-off fractional part) has an actual precision of 946688461.00049996. So this is an inherent problem of floating point in the input itself, not only in our calculation.

As a user you can always solve this by e.g. multiplying your data by 1000 and specifying a different unit.
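Concretely, that workaround might look like the following (a sketch reusing the value from the earlier example, scaled here to microseconds so the input becomes an exact integer, since 0.0005 s is exactly 500 µs):

import pandas as pd

# Fractional seconds go through float64 and pick up error:
pd.Timedelta(946688461.0005, unit='s')

# An integer count of a smaller unit is exact:
pd.Timedelta(946688461000500, unit='us')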

@Sup3rGeo (Contributor) commented Mar 26, 2018

To complement @jorisvandenbossche's point, I believe the current rounding behavior is far more detrimental to users than the precision issue introduced by these changes.

This means that even if we care about the precision issue, it would be better to merge these changes (correctly reflecting the timedelta value for a given number and unit) and open a separate issue for the precision problem with large numbers.

@mroeschke (Member, author):

I agree as well that premature rounding is less desirable than floating point errors. I could issue a warning for now for floats with many significant digits.

@mroeschke force-pushed the timedelta_rounding branch from 457a265 to 3ffd35c on March 29, 2018 04:41
@jorisvandenbossche (Member):

I could issue a warning for now for floats with high significant digits.

Could we easily detect the case when the number of decimals is too high to preserve precision? In that case, a warning might be a good idea (then we can also give a hint on how to solve it)

@mroeschke (Member, author):

@jorisvandenbossche after searching I don't think there's a reliable way to detect significant digits since not all floats are stored exactly.

An alternative for now is to live with the precision issue and offer a note in the docs?
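The underlying difficulty is that the stored double already differs from the typed literal, so there is no reliable digit count left to inspect. For example:

from decimal import Decimal

# Decimal(float) shows the exact stored binary value; for most
# decimal literals it is close to, but not exactly, what was typed:
print(Decimal(0.0005))   # not exactly 0.0005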

@jorisvandenbossche (Member):

An alternative for now is to live with the precision issue and offer a note in the docs?

Yes, that would be my preference. @jreback what's your view on this?

@jreback (Contributor) commented Apr 9, 2018

sure, ok with simply documenting this as a known limitation (docstring and/or main docs)

@codecov (bot) commented Apr 24, 2018

Codecov Report

Merging #19732 into master will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master   #19732   +/-   ##
=======================================
  Coverage   91.84%   91.84%           
=======================================
  Files         153      153           
  Lines       49499    49499           
=======================================
  Hits        45460    45460           
  Misses       4039     4039
Flag        Coverage     Δ
#multiple   90.23% <ø>   (ø) ⬆️
#single     41.88% <ø>   (ø) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bc37ea2...7ba45b4.

@mroeschke (Member, author):

Sorry for the delay on this PR.

I had to adjust some existing tests to account for floating point errors after this fix. Additionally, it appears we already have a note in the main docs about floating point errors when creating timestamps at the bottom of this section:

https://pandas.pydata.org/pandas-docs/stable/timeseries.html#epoch-timestamps

Ready for a final look and all green.

@jreback (Contributor) left a comment:

can you show a copy-pastable example which works now (and didn't before). I know your tests cover this, but easier to copy-paste it.
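(For readers of this thread, a hypothetical before/after along the lines requested; the values are illustrative and are not taken from the PR's tests:)

import pandas as pd

# A float with nanosecond-level detail, passed with a unit:
td = pd.Timedelta(1.000000005, unit='s')

# Before this PR the fractional part was rounded prematurely and the
# 5 ns could be silently dropped; after the PR it is preserved
# (up to float64 precision):
print(td.value)   # expected: 1000000005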

@@ -623,7 +623,7 @@ def test_basics_nanos(self):

def test_unit(self):

def check(val, unit=None, h=1, s=1, us=0):
Review comment (Contributor):

would be nice to parameterize this test
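(A sketch of what that parametrization could look like; the cases and names here are hypothetical, not the ones that landed:)

import pytest
from pandas import Timedelta

@pytest.mark.parametrize('value, unit, expected', [
    (10, 's', 10 * 10**9),     # seconds -> nanoseconds
    (10, 'ms', 10 * 10**6),    # milliseconds -> nanoseconds
])
def test_unit(value, unit, expected):
    assert Timedelta(value, unit=unit).value == expected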

@@ -182,6 +182,11 @@ def test_date_time():
fname = os.path.join(dirpath, "datetime.csv")
df0 = pd.read_csv(fname, parse_dates=['Date1', 'Date2', 'DateTime',
'DateTimeHi', 'Taiw'])
# GH 19732: Timestamps imported from sas will incur floating point errors
Review comment (Contributor):

why did this need to be changed?

@mroeschke (Member, author) replied:

I believe there are floating point errors when importing dates from the sas file.

In [5]: pd.read_sas('pandas/tests/io/sas/data/datetime.sas7bdat')['DateTimeHi']
Out[5]:
0   1677-09-21 00:12:43.145225525
1   1960-01-01 00:00:00.000000000
2   2016-02-29 23:59:59.123456001
3   2262-04-11 23:47:16.854774475
Name: DateTimeHi, dtype: datetime64[ns]

# This matches dates in datetime.csv
In [6]: pd.read_csv('pandas/tests/io/sas/data/datetime.csv')['DateTimeHi']
Out[6]:
0    1677-09-21 00:12:43.145226
1    1960-01-01 00:00:00.000000
2    2016-02-29 23:59:59.123456
3    2262-04-11 23:47:16.854774
Name: DateTimeHi, dtype: object

I am not familiar with read_sas or how the sas file was created, but I am fairly certain it's due to floating point errors.

Review comment (Contributor):

can you round these instead of specifying exact values?
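(One way to do that, as a sketch; df here stands for the frame read from the SAS file, and the actual change that landed may differ:)

# Compare at microsecond precision instead of asserting exact
# nanosecond values:
df['DateTimeHi'] = df['DateTimeHi'].dt.round('us')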

@jreback (Contributor) commented May 10, 2018

@mroeschke following up on my last comment, is there an explicit test from the OP?

@mroeschke (Member, author):

Added an explicit test from the original issue (which was reported by you, @jreback) and parameterized an existing test.

@jreback (Contributor) left a comment:

small comments, otherwise lgtm.



Timedelta
^^^^^^^^^
- Bug in :class:`Timedelta`: where a numerical value with a unit would round values (:issue: `12690`)
Review comment (Contributor):

can you be a little more precise here? e.g. I think a reader might not get what this change is

@mroeschke (Member, author):

Clarified the whatsnew entry and rounded the datetimes from the imported sas file. All green.


Timedelta
^^^^^^^^^
- Bug in :class:`Timedelta`: where passing a float with a unit would prematurely round the float precision (:issue: `12690`)
Review comment (Contributor):

shouldn't this be 14156?

@jreback jreback added this to the 0.23.1 milestone May 19, 2018
@mroeschke (Member, author):

Thanks @jreback. I had referenced the older issue that was superseded by the OP's.

@jreback jreback merged commit 81358e8 into pandas-dev:master May 21, 2018
@jreback (Contributor) commented May 21, 2018

thanks @mroeschke

@mroeschke mroeschke deleted the timedelta_rounding branch May 21, 2018 16:58
@jorisvandenbossche (Member):

As this is, strictly speaking, an API change for those who used a float with less precision and now get higher precision with floating point errors, I would rather keep this for 0.24.0?

Labels: Bug · Timedelta (Timedelta data type)

Successfully merging this pull request may close this issue:
BUG: conversion precision on Timedeltas
6 participants