DOC: add pandas 3.0 migration guide for the string dtype #61705
@jorisvandenbossche i'll post these few now rather than doing too many in a batch, but feel free to wait until i'm done, whatever is more convenient for you.
not yet been made the default, and uses the ``pd.NA`` scalar to represent
missing values.

Pandas 3.0 changes the default dtype for strings to a new string data type,
- Pandas 3.0 changes the default dtype for strings to a new string data type,
+ Pandas 3.0 changes the default inferred dtype for strings to a new string data type,
.. - Breaking changes:
..   - dtype is no longer object dtype
..   - None gets coerced to NaN
..   - setitem raises an error for non-string data
the above is not rendered?
No, these are comments; it was my outline when writing it (I can remove it at the end).
no problem.
cool. Thanks @jorisvandenbossche
True

One caveat: this function works both on scalars and on array-likes, and in the
latter case it will return an array of boolean dtype. When using it in a boolean
- latter case it will return an array of boolean dtype. When using it in a boolean
+ latter case it will return an array of Boolean dtype. When using it in a Boolean
Not to be confused with the pandas nullable type: should this be capitalized, as it is named after George Boole?
numpy uses "boolean" as well, so would rather leave it like this, or can make it an "array of bools"
sure
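For reference, a minimal sketch of the scalar versus array-like behaviour mentioned in this thread. It assumes the function under discussion is `pd.isna` (the excerpt does not name it), and the sample values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Scalar input: a plain Python bool comes back.
scalar_result = pd.isna(None)

# Array-like input: an elementwise NumPy array of bools comes back.
array_result = pd.isna(np.array(["a", None, "c"], dtype=object))

# Using the multi-element array directly in an `if` raises ValueError
# (ambiguous truth value), so reduce it explicitly first.
has_missing = bool(array_result.any())
```

Reducing with `.any()` or `.all()` makes the intent explicit when the input may be array-like.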
.. code-block:: python

    >>> ser = pd.Series(["a", "b", None], dtype="str")
    >>> ser[1] = 2.5
I notice you can do `ser[1] = pd.NA`, so we are accepting this as a missing value. Should we disallow this, or instead encourage it to make migration to the `pd.NA` variant simpler?
Yeah, I am not a super big fan of already allowing to assign `pd.NA` for dtypes that don't use `pd.NA`, although I am also fine with keeping it as is.
But this also works this way for other dtypes (such as numpy float64 or datetime64, coercing NA to NaN or NaT respectively, similar to how we also coerce `None` for those dtypes), so changing that is a bigger discussion, not just about the string dtype.
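A small sketch of the coercion described in this comment, using NumPy-backed float64 and datetime64 Series. The sample values and variable names are made up for illustration:

```python
import pandas as pd

# float64: both None and pd.NA are stored as NaN, and the dtype is unchanged.
floats = pd.Series([1.0, 2.0, 3.0])
floats[0] = None
floats[1] = pd.NA

# datetime64: both None and pd.NA are stored as NaT.
dates = pd.Series(pd.to_datetime(["2024-01-01", "2024-01-02"]))
dates[0] = None
dates[1] = pd.NA
```

In both cases the assigned sentinel is replaced by the dtype's own missing value rather than being stored as `pd.NA`.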
> Yeah, I am not a super big fan of already allowing to assign `pd.NA` for dtypes that don't use `pd.NA`, although I am also fine with keeping it as is.

sure. It doesn't create any issues really like with `object` dtype.
@simonjayhawkins thanks a lot for the proofreading!
Co-authored-by: Simon Hawkins <[email protected]>
/preview
Added three more sections based on the items listed in #59328
This new string dtype should otherwise work the same as how you have been
using pandas with string data today. For example, all string-specific methods
- This new string dtype should otherwise work the same as how you have been
- using pandas with string data today. For example, all string-specific methods
+ This new string dtype should otherwise behave the same as the existing ``object`` dtype users are used to. For example, all string-specific methods
...
TypeError: Cannot perform reduction 'prod' with string dtype

For existing users of the nullable ``StringDtype``
if you really want to keep writing i have no objection, but by construction these are advanced users who i don't think need as much hand-holding
I mostly want to briefly mention in the docs (as I don't think we really do that anywhere, except in the PDEP) that we made this backwards compatible: if you were using `"string"`, that should keep working, except that we also switched the default from "python" to "pyarrow" storage.
(And maybe mention that if you were using it to get the faster pyarrow one, but don't care about the missing value sentinel, you could also just use the default dtype now. But that might be a bit subjective/controversial to say, and indeed at that point they probably understand that themselves as well.)
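As a rough illustration of the backwards-compatibility point, here is a sketch of explicitly opting in to the nullable string dtype. The storage is pinned to "python" purely as an assumption to keep the example independent of pyarrow being installed, and the sample data is made up:

```python
import pandas as pd

# Opt in to the nullable string dtype explicitly; storage="python" is chosen
# here only so the example does not depend on pyarrow being installed.
ser = pd.Series(["a", "b", None], dtype=pd.StringDtype(storage="python"))

# The nullable variant keeps pd.NA as its missing value sentinel.
missing = ser[2]

# String methods work as usual and propagate the missing value.
upper = ser.str.upper()
```

Code that spells the dtype out like this is not affected by which storage the `"string"` alias defaults to.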
This PR starts adding a migration guide with some typical issues one might run into regarding the new string dtype when upgrading to pandas 3.0 (or when enabling it in pandas 2.3).
(for now I just added it to the user guide, which is already a long list of pages, so we might need to think about better organizing this or putting it elsewhere)
Closes #59328