@@ -13,7 +13,7 @@ default in pandas 3.0:

* In pandas 3.0, enable a "string" dtype by default, using PyArrow if available
or otherwise the numpy object-dtype alternative.
- * The default string dtype will use missing value semantics using NaN consistent
+ * The default string dtype will use missing value semantics (using NaN) consistent
with the other default data types.

This will give users a long-awaited proper string dtype for 3.0, while 1) not
@@ -26,12 +26,12 @@ using NumPy 2.0, etc).
## Background

Currently, pandas by default stores text data in an `object`-dtype NumPy array.
- The current implementation has two primary drawbacks: First, `object`-dtype is
+ The current implementation has two primary drawbacks. First, `object` dtype is
not specific to strings: any Python object can be stored in an `object`-dtype
array, not just strings, and seeing `object` as the dtype for a column with
strings is confusing for users. Second: this is not efficient (all string
- methods on a Series are eventually done by calling Python methods on the
- individual string objects).
+ methods on a Series are eventually calling Python methods on the individual
+ string objects).

To solve the first issue, a dedicated extension dtype for string data has
already been
@@ -51,8 +51,9 @@ This could be specified with the `storage` keyword in the opt-in string dtype
Since its introduction, the `StringDtype` has always been opt-in, and has used
the experimental `pd.NA` sentinel for missing values (which was also [introduced
in pandas 1.0](https://pandas.pydata.org/docs/whatsnew/v1.0.0.html#experimental-na-scalar-to-denote-missing-values)).
- However, up to this date, pandas has not yet made the step to use `pd.NA` by
- default.
+ However, up to this date, pandas has not yet taken the step to use `pd.NA` by
+ default, and thus the `StringDtype` deviates in missing value behaviour compared
+ to the default data types.

In 2023, [PDEP-10](https://pandas.pydata.org/pdeps/0010-required-pyarrow-dependency.html)
proposed to start using a PyArrow-backed string dtype by default in pandas 3.0
@@ -125,15 +126,15 @@ This option will be expanded to also work when PyArrow is not installed.

### Missing value semantics

- Given that all other default data types uses NaN semantics for missing values,
+ Given that all other default data types use NaN semantics for missing values,
this proposal says that a new default string dtype should still use the same
default semantics. Further, it should result in default data types when doing
operations on the string column that result in a boolean or numeric data type
(e.g., methods like `.str.startswith(..)` or `.str.len(..)`, or comparison
operators like `==`, should result in default `int64` and `bool` data types).

- Because the current original `StringDtype` implementations already use `pd.NA`
- and return masked integer and boolean arrays in operations, a new variant of the
+ Because the original `StringDtype` implementations already use `pd.NA` and
+ return masked integer and boolean arrays in operations, a new variant of the
existing dtypes that uses `NaN` and default data types is needed.

### Object-dtype "fallback" implementation
@@ -175,7 +176,7 @@ To avoid introducing a new string dtype while other discussions and changes are
in flux (eventually making pyarrow a required dependency? adopting `pd.NA` as
the default missing value sentinel? using the new NumPy 2.0 capabilities?), we
could also delay introducing a default string dtype until there is more clarity
- for those other discussions.
+ in those other discussions.

However:

@@ -184,11 +185,12 @@ However:
significant part of the user base that has PyArrow installed) in performance.
2. In case we eventually transition to use `pd.NA` as the default missing value
sentinel, we will need a migration path for _all_ our data types, and thus
- the challenges around this will not be unique to the string dtype.
+ the challenges around this will not be unique to the string dtype and
+ therefore not a reason to delay this.

### Why not use the existing StringDtype with `pd.NA`?

- Because adding even more variants of the string dtype will make things only more
+ Wouldn't adding even more variants of the string dtype will make things only more
confusing? Indeed, this proposal unfortunately introduces more variants of the
string dtype. However, the reason for this is to ensure the actual default user
experience is _less_ confusing, and the new string dtype fits better with the