Skip to content

feat(python): Allow %f in strptime format strings #8404

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Apr 22, 2023

Conversation

stinodego
Copy link
Contributor

@stinodego stinodego commented Apr 21, 2023

I looked into this and I could not find a reason why this would not be permitted. I updated the tests and it all works as expected.

Pinging @MarcoGorelli as I expect he introduced this check - could you elaborate?

I also removed a check on the dtype input - this is already handled by the if/elif/else statement.

@github-actions github-actions bot added fix Bug fix python Related to Python Polars labels Apr 21, 2023
@MarcoGorelli
Copy link
Collaborator

MarcoGorelli commented Apr 21, 2023

hey @stinodego

it's because it differs from the Python stdlib and so is a recipe for unexpected behaviour

e.g. compare:

In [1]: pl.Series(['2020-01-01T00:00:00.12']).str.strptime(pl.Datetime, '%Y-%m-%dT%H:%M:%S.%f')
Out[1]:
shape: (1,)
Series: '' [datetime[ns]]
[
        2020-01-01 00:00:00.000000012
]

In [2]: datetime.strptime('2020-01-01T00:00:00.12', '%Y-%m-%dT%H:%M:%S.%f')
Out[2]: datetime.datetime(2020, 1, 1, 0, 0, 0, 120000)

whereas the expected behaviour was almost certainly

In [2]: datetime.strptime('2020-01-01T00:00:00.12', '%Y-%m-%dT%H:%M:%S.%f')
Out[2]: datetime.datetime(2020, 1, 1, 0, 0, 0, 120000)

In [3]: pl.Series(['2020-01-01T00:00:00.12']).str.strptime(pl.Datetime, '%Y-%m-%dT%H:%M:%S%.f')
Out[3]:
shape: (1,)
Series: '' [datetime[ns]]
[
        2020-01-01 00:00:00.120
]

@stinodego
Copy link
Contributor Author

stinodego commented Apr 21, 2023

I see. The chrono crate does very clearly document this behaviour, however. %f = number of nanoseconds.

So it feels strange to disable this functionality, which might be useful to some users, purely because some other users do not read the docs and may get unexpected behavior. We clearly link to the chrono docs, we may even add a warning that %f has different behavior than Python stdlib, but disabling it entirely seems like a strange choice.

It also introduced differences between Python Polars and Rust Polars.

Do you have an opinion here, @ritchie46 ?

@MarcoGorelli
Copy link
Collaborator

You're correct that it's documented. I'll give a pandas experience example if I may: the behaviour in this issue had been clearly documented for at least a decade, as had the supported workaround (specify dayfirst=True). However, it would still cause endless amounts of confusion and duplicate bug reports because nobody was expecting that default behaviour. The eventual solution was to change the default (PDEP4)

There's already been one issue on github and one post on Discord about unexpected behaviour due to this, it's already biting people
Adding a warning instead might be OK, but would there be any way for users to silence it (other than filterwarnings) if it's really what they want to use?

I don't think anyone reads "10.5 seconds" and thinks "ah, that's 10 seconds and 5 nanoseconds", but that's exactly what %S.%f does in Chrono and I find it really scary

In [15]: pl.Series(['2020-01-01T00:00:10.5']).str.strptime(pl.Datetime, '%Y-%m-%dT%H:%M:%S.%f')
Out[15]:
shape: (1,)
Series: '' [datetime[ns]]
[
        2020-01-01 00:00:10.000000005
]

That's for questioning the decision anyway, it's good to talk it through!

@stinodego
Copy link
Contributor Author

stinodego commented Apr 21, 2023

Here's some example data where I would like to use %f:

s = pl.Series(["05:10:10+ns12345", "05:10:10+ns1234567"])
result = s.str.strptime(pl.Time, "%H:%M:%S+ns%f")

Resulting in:

shape: (2,)
Series: '' [time]
[
        05:10:10.000012345
        05:10:10.001234567
]

The other options %9f, .9f etc do not suffice to parse this, as far as I've seen.

So now we've disabled some perfectly good functionality, that the chrono team has worked hard on, just because some people do not read the docs. We deviate slightly from what Python users would expect coming from the stdlib, but we do that all the time. Our philosophy is that we are not afraid to break with standards if the standards can clearly be improved.

The chrono formats are actually very clear and well thought out. In the formats that you present as confusing, it's seconds followed by a period followed by some numbers. In other words, decimals. Then you should use %.f.
But there are also formats where you want to parse subseconds where they are not decimals. See my example. There we want to use %f. Of course you will get confusing behaviour if you use this in the wrong context (decimals).

So in this case, I'd vouch for educating Python users in the (perhaps superior) chrono format, rather than disabling perfectly good functionality. That does leave open the question of how to teach this properly, which is not easy as you have stated...

@MarcoGorelli
Copy link
Collaborator

Here's some example data where I would like to use %f:

that's a good example, thanks! maybe it's just %S.%f that could raise then?

@stinodego
Copy link
Contributor Author

maybe it's just %S.%f that could raise then?

That would be better than the existing check. However, even in this case, you're making assumptions for the user. Maybe they really intended to use %S.%f - we don't know, and we shouldn't try to guess. And we definitely shouldn't disable functionality based on a guess.

Maybe a compromise would be to throw a warning in this case. We can define a custom Python warning ChronoFormatWarning or something and throw this when the format contains %S.%f. If users really intend to use this format then they can disable it explicitly with warnings.filterwarnings("ignore", category=ChronoFormatWarning).

Does that seem like a good solution to you?

@MarcoGorelli
Copy link
Collaborator

Yup, nice one! Thanks for the discussion, that sounds like the best solution to me

@stinodego
Copy link
Contributor Author

Great! I'll update the PR accordingly 👍 thanks for the insights!

@stinodego stinodego marked this pull request as draft April 21, 2023 11:57
@stinodego stinodego changed the title fix(python): Allow %f in strptime format strings feat(python): Allow %f in strptime format strings Apr 22, 2023
@stinodego stinodego removed the fix Bug fix label Apr 22, 2023
@github-actions github-actions bot added the enhancement New feature or an improvement of an existing feature label Apr 22, 2023
@stinodego stinodego force-pushed the fix-strptime-format-f branch from e23276c to 2aa84b0 Compare April 22, 2023 08:59
@stinodego stinodego marked this pull request as ready for review April 22, 2023 09:10
@stinodego stinodego requested a review from ritchie46 April 22, 2023 09:11
Copy link
Collaborator

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this, it's a better solution! Over to Ritchie

@ritchie46
Copy link
Member

Thanks for the discourse/breakdown. Interesting read. And we have a better solution now! 💯

Great stuff. 👍

@ritchie46 ritchie46 merged commit d23bbd2 into pola-rs:main Apr 22, 2023
@stinodego stinodego deleted the fix-strptime-format-f branch May 25, 2023 05:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature python Related to Python Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants