 - Created: May 3, 2024
 - Status: Under discussion
-- Discussion:
+- Discussion: https://github.com/pandas-dev/pandas/pull/58551
 - Author: [Joris Van den Bossche](https://github.com/jorisvandenbossche)
 - Revision: 1
@@ -71,10 +71,11 @@ data type in pandas that is not backed by Python objects.
 After acceptance of PDEP-10, two aspects of the proposal have been under
 reconsideration:
 
-- Based on user feedback, it has been considered to relax the new `pyarrow`
-  requirement to not be a _hard_ runtime dependency. In addition, NumPy 2.0 can
-  potentially reduce the need to make PyArrow a required dependency specifically
-  for a dedicated pandas string dtype.
+- Based on user feedback (mostly around installation complexity and size), it
+  has been considered to relax the new `pyarrow` requirement to not be a _hard_
+  runtime dependency. In addition, NumPy 2.0 could in the future potentially
+  reduce the need to make PyArrow a required dependency specifically for a
+  dedicated pandas string dtype.
 - The PDEP did not consider the usage of the experimental `pd.NA` as a
   consequence of adopting one of the existing implementations of the
   `StringDtype`.
@@ -105,6 +106,9 @@ To be able to move forward with a string data type in pandas 3.0, this PDEP prop
 4. We update installation guidelines to clearly encourage users to install
    pyarrow for the default user experience.
 
+Those string dtypes enabled by default will then no longer be considered as
+experimental.
+
 ### Default inference of a string dtype
 
 By default, pandas will infer this new string dtype for string data (when
@@ -141,15 +145,17 @@ existing dtypes that uses `NaN` and default data types is needed.
 
 To avoid a hard dependency on PyArrow for pandas 3.0, this PDEP proposes to keep
 a "fallback" option in case PyArrow is not installed. The original `StringDtype`
-backed by a numpy object-dtype array of Python strings can be used for this, and
-only need minor updates to follow the above-mentioned missing value semantics
+backed by a numpy object-dtype array of Python strings can be mostly reused for
+this (adding a new variant of the dtype) and a new `StringArray` subclass only
+needs minor changes to follow the above-mentioned missing value semantics
 ([GH-58451](https://github.com/pandas-dev/pandas/pull/58451)).
 
 For pandas 3.0, this is the most realistic option given this implementation is
 already available for a long time. Beyond 3.0, we can still explore further
-improvements such as using nanoarrow or NumPy 2.0, but at that point that is an
-implementation detail that should not have a direct impact on users (except for
-performance).
+improvements such as using NumPy 2.0 ([GH-58503](https://github.com/pandas-dev/pandas/issues/58503))
+or nanoarrow ([GH-58552](https://github.com/pandas-dev/pandas/issues/58552)),
+but at that point that is an implementation detail that should not have a
+direct impact on users (except for performance).
 
 ### Naming
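A note on the fallback described in the last hunk (use PyArrow when it is installed, otherwise reuse the object-dtype implementation): the selection rule could be sketched roughly as below. This is only an illustration, not code from the PR. `choose_string_storage` is a hypothetical helper, and the return values `"pyarrow"` and `"python"` are assumed here to mirror the storage names that the existing `StringDtype` already accepts.

```python
import importlib.util
from typing import Optional


def choose_string_storage(pyarrow_available: Optional[bool] = None) -> str:
    """Hypothetical helper, not pandas API.

    Illustrates the rule "use PyArrow if it is installed, otherwise fall
    back to the object-dtype based implementation".
    """
    if pyarrow_available is None:
        # find_spec returns None when the top-level package cannot be found,
        # so this checks availability without importing pyarrow itself.
        pyarrow_available = importlib.util.find_spec("pyarrow") is not None
    return "pyarrow" if pyarrow_available else "python"


# Result depends on whether pyarrow is installed in the current environment.
print(choose_string_storage())
```

The optional argument exists only so that both branches can be exercised without installing or uninstalling PyArrow; how pandas itself detects the optional dependency is not shown in this diff.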