
ENH: parse 8 or 9 digit delimited dates #47880


Closed
1 of 3 tasks
MarcoGorelli opened this issue Jul 27, 2022 · 0 comments · Fixed by #47894
Labels
Enhancement, Timestamp (pd.Timestamp and associated methods)

Comments

@MarcoGorelli
Member

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

Currently, a string such as 01-01-2020 is parsed by pandas' own delimited-date parser (_parse_delimited_date), whereas 1-1-2020 falls through to dateutil.

One consequence of this is that warnings about e.g. dayfirst aren't shown in the latter case, e.g.:

>>> pd.to_datetime(['13-01-2020'], dayfirst=False)
<stdin>:1: UserWarning: Parsing dates in DD/MM/YYYY format when dayfirst=False (the default) was specified. This may lead to inconsistently parsed dates! Specify a format to ensure consistent parsing.
DatetimeIndex(['2020-01-13'], dtype='datetime64[ns]', freq=None)
>>> pd.to_datetime(['13-1-2020'], dayfirst=False)
DatetimeIndex(['2020-01-13'], dtype='datetime64[ns]', freq=None)
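The asymmetry seems to come from where the check lives: the warning is raised on the delimited-date path when day and month have to be swapped relative to what dayfirst asked for, and a string handed to dateutil never reaches that check. A minimal Python sketch of that decision follows. `resolve_day_month` is a hypothetical name, the warning text is a placeholder, and the logic is an assumption modeled on the docstring quoted under Feature Description, not the actual Cython implementation.

```python
import warnings


def resolve_day_month(first, second, dayfirst):
    """Hypothetical helper: decide which of two numeric fields is the day.

    Returns (day, month), or None when neither ordering gives a valid month.
    """
    # Preferred ordering: DD?MM when dayfirst is True, MM?DD otherwise.
    day, month = (first, second) if dayfirst else (second, first)
    swapped = False
    if month > 12:
        # The preferred ordering is impossible, so try the other one.
        day, month = month, day
        swapped = True
    if not (1 <= month <= 12 and 1 <= day <= 31):
        return None
    if swapped:
        # Placeholder message; the real pandas wording differs.
        warnings.warn(
            "Parsed against the requested dayfirst setting; "
            "specify a format to ensure consistent parsing.",
            UserWarning,
        )
    return day, month
```

Under this model, `resolve_day_month(13, 1, dayfirst=False)` warns and returns day 13, month 1, which mirrors the 13-01-2020 call above. The 13-1-2020 call never gets this far because it is routed to dateutil first.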

Feature Description

In

cdef inline object _parse_delimited_date(str date_string, bint dayfirst):
    """
    Parse special cases of dates: MM/DD/YYYY, DD/MM/YYYY, MM/YYYY.
    At the beginning function tries to parse date in MM/DD/YYYY format, but
    if month > 12 - in DD/MM/YYYY (`dayfirst == False`).
    With `dayfirst == True` function makes an attempt to parse date in
    DD/MM/YYYY, if an attempt is wrong - in DD/MM/YYYY
    For MM/DD/YYYY, DD/MM/YYYY: delimiter can be a space or one of /-.
    For MM/YYYY: delimiter can be a space or one of /-
    If `date_string` can't be converted to date, then function returns
    None, None
    Parameters
    ----------
    date_string : str
    dayfirst : bool
    Returns:
    --------
    datetime or None
    str or None
        Describing resolution of the parsed string.
    """
    cdef:
        const char* buf
        Py_ssize_t length
        int day = 1, month = 1, year
        bint can_swap = 0
    buf = get_c_string_buf_and_size(date_string, &length)
    if length == 10:
        # parsing MM?DD?YYYY and DD?MM?YYYY dates
        if _is_not_delimiter(buf[2]) or _is_not_delimiter(buf[5]):
            return None, None
        month = _parse_2digit(buf)
        day = _parse_2digit(buf + 3)
        year = _parse_4digit(buf + 6)
        reso = 'day'
        can_swap = 1
    elif length == 7:
        # parsing MM?YYYY dates
        if buf[2] == b'.' or _is_not_delimiter(buf[2]):
            # we cannot reliably tell whether e.g. 10.2010 is a float
            # or a date, thus we refuse to parse it here
            return None, None
        month = _parse_2digit(buf)
        year = _parse_4digit(buf + 3)
        reso = 'month'
    else:
        return None, None

some code could be added to deal with cases where the string is of length 8 or 9, i.e. where the day, the month, or both are written with a single digit.
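One way the extra cases might be handled is to stop assuming fixed offsets and instead locate the two delimiters, relying on the year always being the last four characters. The pure-Python sketch below illustrates the idea only; `split_delimited_date` and `DELIMITERS` are hypothetical names, the delimiter set is taken from the docstring above, and this is not necessarily how the fix in #47894 is written.

```python
DELIMITERS = " /-."


def split_delimited_date(date_string):
    """Hypothetical helper: split 'M?D?YYYY'-style strings into fields.

    Returns (first, second, year) as ints, or None if the string is not
    two 1-2 digit fields followed by a 4-digit year.
    """
    if not 8 <= len(date_string) <= 10:
        return None
    # The first delimiter sits at index 1 (one-digit field) or 2 (two-digit).
    for first_end in (1, 2):
        if date_string[first_end] in DELIMITERS:
            break
    else:
        return None
    # The year is the last four characters, preceded by a delimiter.
    second_end = len(date_string) - 5
    if second_end <= first_end or date_string[second_end] not in DELIMITERS:
        return None
    first = date_string[:first_end]
    second = date_string[first_end + 1:second_end]
    year = date_string[second_end + 1:]
    fields = (first, second, year)
    if not 1 <= len(second) <= 2:
        return None
    if not all(f.isascii() and f.isdigit() for f in fields):
        return None
    return int(first), int(second), int(year)
```

With this shape, 1-1-2020 (length 8), 13-1-2020 and 1-13-2020 (length 9), and 01-01-2020 (length 10) all yield two numeric fields plus a year, so the existing day/month swap logic and its dayfirst warning could then apply uniformly.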

Alternative Solutions

Always warn when using dateutil, but I don't think a warning should be necessary here.

Additional Context

If we wanted to warn whenever dateutil is called (e.g. #47828), then this'd really simplify the adjustments necessary to the test suite, as a lot of tests could be kept as they are

@MarcoGorelli added the Enhancement, Needs Triage, and Timestamp labels and removed the Needs Triage label on Jul 27, 2022