Commit 561de87

address part of the feedback
1 parent f03f54d commit 561de87

1 file changed: +16 -10 lines changed

web/pandas/pdeps/00xx-string-dtype.md

Lines changed: 16 additions & 10 deletions
@@ -2,7 +2,7 @@
 
 - Created: May 3, 2024
 - Status: Under discussion
-- Discussion:
+- Discussion: https://github.com/pandas-dev/pandas/pull/58551
 - Author: [Joris Van den Bossche](https://github.com/jorisvandenbossche)
 - Revision: 1
 
@@ -71,10 +71,11 @@ data type in pandas that is not backed by Python objects.
 After acceptance of PDEP-10, two aspects of the proposal have been under
 reconsideration:
 
-- Based on user feedback, it has been considered to relax the new `pyarrow`
-requirement to not be a _hard_ runtime dependency. In addition, NumPy 2.0 can
-potentially reduce the need to make PyArrow a required dependency specifically
-for a dedicated pandas string dtype.
+- Based on user feedback (mostly around installation complexity and size), it
+has been considered to relax the new `pyarrow` requirement to not be a _hard_
+runtime dependency. In addition, NumPy 2.0 could in the future potentially
+reduce the need to make PyArrow a required dependency specifically for a
+dedicated pandas string dtype.
 - The PDEP did not consider the usage of the experimental `pd.NA` as a
 consequence of adopting one of the existing implementations of the
 `StringDtype`.
@@ -105,6 +106,9 @@ To be able to move forward with a string data type in pandas 3.0, this PDEP prop
 4. We update installation guidelines to clearly encourage users to install
 pyarrow for the default user experience.
 
+Those string dtypes enabled by default will then no longer be considered as
+experimental.
+
 ### Default inference of a string dtype
 
 By default, pandas will infer this new string dtype for string data (when
@@ -141,15 +145,17 @@ existing dtypes that uses `NaN` and default data types is needed.
 
 To avoid a hard dependency on PyArrow for pandas 3.0, this PDEP proposes to keep
 a "fallback" option in case PyArrow is not installed. The original `StringDtype`
-backed by a numpy object-dtype array of Python strings can be used for this, and
-only need minor updates to follow the above-mentioned missing value semantics
+backed by a numpy object-dtype array of Python strings can be mostly reused for
+this (adding a new variant of the dtype) and a new `StringArray` subclass only
+needs minor changes to follow the above-mentioned missing value semantics
 ([GH-58451](https://github.com/pandas-dev/pandas/pull/58451)).
 
 For pandas 3.0, this is the most realistic option given this implementation is
 already available for a long time. Beyond 3.0, we can still explore further
-improvements such as using nanoarrow or NumPy 2.0, but at that point that is an
-implementation detail that should not have a direct impact on users (except for
-performance).
+improvements such as using NumPy 2.0 ([GH-58503](https://github.com/pandas-dev/pandas/issues/58503))
+or nanoarrow ([GH-58552](https://github.com/pandas-dev/pandas/issues/58552)),
+but at that point that is an implementation detail that should not have a
+direct impact on users (except for performance).
 
 ### Naming
 
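
Note on the last hunk: the "fallback" idea it describes (use the PyArrow-backed implementation when PyArrow is installed, otherwise fall back to the object-dtype implementation) can be illustrated with a minimal sketch. This is only an illustration under stated assumptions, not the code from this commit or from pandas: the helper name `choose_string_storage` is hypothetical, and the returned labels are assumed to mirror the `"pyarrow"` and `"python"` storage names accepted by the existing `StringDtype`.

```python
import importlib.util


def choose_string_storage(module_name: str = "pyarrow") -> str:
    """Hypothetical helper: pick a string storage backend.

    Returns "pyarrow" when the optional dependency can be imported,
    and "python" (the object-dtype fallback) when it cannot.
    """
    # find_spec returns None for a top-level module that is not installed,
    # so nothing is actually imported by this check.
    available = importlib.util.find_spec(module_name) is not None
    return "pyarrow" if available else "python"


if __name__ == "__main__":
    # The result depends on whether pyarrow is installed in this environment.
    print(choose_string_storage())
```

Checking availability with `find_spec` rather than a `try`/`except ImportError` around the import keeps the check side-effect free; either approach would express the same fallback.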