Commit f03f54d

small textual edits and typos
1 parent fbeb69d commit f03f54d

1 file changed: +14 -12 lines changed

web/pandas/pdeps/00xx-string-dtype.md

Lines changed: 14 additions & 12 deletions
@@ -13,7 +13,7 @@ default in pandas 3.0:
 
 * In pandas 3.0, enable a "string" dtype by default, using PyArrow if available
 or otherwise the numpy object-dtype alternative.
-* The default string dtype will use missing value semantics using NaN consistent
+* The default string dtype will use missing value semantics (using NaN) consistent
 with the other default data types.
 
 This will give users a long-awaited proper string dtype for 3.0, while 1) not
@@ -26,12 +26,12 @@ using NumPy 2.0, etc).
 ## Background
 
 Currently, pandas by default stores text data in an `object`-dtype NumPy array.
-The current implementation has two primary drawbacks: First, `object`-dtype is
+The current implementation has two primary drawbacks. First, `object` dtype is
 not specific to strings: any Python object can be stored in an `object`-dtype
 array, not just strings, and seeing `object` as the dtype for a column with
 strings is confusing for users. Second: this is not efficient (all string
-methods on a Series are eventually done by calling Python methods on the
-individual string objects).
+methods on a Series are eventually calling Python methods on the individual
+string objects).
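
The first drawback described in this hunk can be illustrated with a small sketch. It assumes only that pandas is installed; the variable names are illustrative and do not come from the PDEP text.

```python
import pandas as pd

# With an explicit object dtype, the column can hold arbitrary Python
# objects, not only strings.
mixed = pd.Series(["a", 1, 2.5], dtype=object)
element_types = [type(x).__name__ for x in mixed]

# String methods still work on object dtype when the values are strings.
upper = pd.Series(["a", "b"], dtype=object).str.upper()

print(mixed.dtype)      # the dtype says nothing about the values being strings
print(element_types)
print(upper.tolist())
```

Nothing in the reported dtype tells the user that a column is meant to contain text, which is the confusion the Background paragraph points at.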
 
 To solve the first issue, a dedicated extension dtype for string data has
 already been
@@ -51,8 +51,9 @@ This could be specified with the `storage` keyword in the opt-in string dtype
 Since its introduction, the `StringDtype` has always been opt-in, and has used
 the experimental `pd.NA` sentinel for missing values (which was also [introduced
 in pandas 1.0](https://pandas.pydata.org/docs/whatsnew/v1.0.0.html#experimental-na-scalar-to-denote-missing-values)).
-However, up to this date, pandas has not yet made the step to use `pd.NA` by
-default.
+However, up to this date, pandas has not yet taken the step to use `pd.NA` by
+default, and thus the `StringDtype` deviates in missing value behaviour compared
+to the default data types.
 
 In 2023, [PDEP-10](https://pandas.pydata.org/pdeps/0010-required-pyarrow-dependency.html)
 proposed to start using a PyArrow-backed string dtype by default in pandas 3.0
@@ -125,15 +126,15 @@ This option will be expanded to also work when PyArrow is not installed.
 
 ### Missing value semantics
 
-Given that all other default data types uses NaN semantics for missing values,
+Given that all other default data types use NaN semantics for missing values,
 this proposal says that a new default string dtype should still use the same
 default semantics. Further, it should result in default data types when doing
 operations on the string column that result in a boolean or numeric data type
 (e.g., methods like `.str.startswith(..)` or `.str.len(..)`, or comparison
 operators like `==`, should result in default `int64` and `bool` data types).
 
-Because the current original `StringDtype` implementations already use `pd.NA`
-and return masked integer and boolean arrays in operations, a new variant of the
+Because the original `StringDtype` implementations already use `pd.NA` and
+return masked integer and boolean arrays in operations, a new variant of the
 existing dtypes that uses `NaN` and default data types is needed.
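
The difference in missing value behaviour that this hunk describes can be seen with the existing opt-in dtype. The sketch below shows only the current `pd.NA`-based `StringDtype` next to `object` dtype. It does not show the proposed default dtype, whose exact behaviour depends on the pandas version. Variable names are illustrative.

```python
import pandas as pd

# Opt-in StringDtype: missing values are pd.NA, and operations return
# masked (nullable) integer and boolean arrays.
na_ser = pd.Series(["apple", None], dtype="string")
na_len = na_ser.str.len()      # nullable integer result
na_eq = na_ser == "apple"      # nullable boolean result

# object dtype: the length result falls back to float64 so it can hold NaN.
obj_ser = pd.Series(["apple", None], dtype=object)
obj_len = obj_ser.str.len()

print(na_len.dtype, na_eq.dtype, obj_len.dtype)
```

The proposal asks for a string dtype whose results use the default (non-masked) data types, which is why a new variant is needed rather than reuse of the `pd.NA` one.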
 
 ### Object-dtype "fallback" implementation
@@ -175,7 +176,7 @@ To avoid introducing a new string dtype while other discussions and changes are
 in flux (eventually making pyarrow a required dependency? adopting `pd.NA` as
 the default missing value sentinel? using the new NumPy 2.0 capabilities?), we
 could also delay introducing a default string dtype until there is more clarity
-for those other discussions.
+in those other discussions.
 
 However:
 
@@ -184,11 +185,12 @@ However:
 significant part of the user base that has PyArrow installed) in performance.
 2. In case we eventually transition to use `pd.NA` as the default missing value
 sentinel, we will need a migration path for _all_ our data types, and thus
-the challenges around this will not be unique to the string dtype.
+the challenges around this will not be unique to the string dtype and
+therefore not a reason to delay this.
 
 ### Why not use the existing StringDtype with `pd.NA`?
 
-Because adding even more variants of the string dtype will make things only more
+Wouldn't adding even more variants of the string dtype will make things only more
 confusing? Indeed, this proposal unfortunately introduces more variants of the
 string dtype. However, the reason for this is to ensure the actual default user
 experience is _less_ confusing, and the new string dtype fits better with the
