Commit 83812e1

TomAugspurger authored and jreback committed
API: Infer extension types in array (#29799)
1 parent 23bb61b commit 83812e1

9 files changed, +161 -64 lines changed
doc/source/user_guide/integer_na.rst

Lines changed: 27 additions & 9 deletions
@@ -25,8 +25,7 @@ numbers.

 Pandas can represent integer data with possibly missing values using
 :class:`arrays.IntegerArray`. This is an :ref:`extension types <extending.extension-types>`
-implemented within pandas. It is not the default dtype for integers, and will not be inferred;
-you must explicitly pass the dtype into :meth:`array` or :class:`Series`:
+implemented within pandas.

 .. ipython:: python

@@ -50,24 +49,43 @@ NumPy array.
 You can also pass the list-like object to the :class:`Series` constructor
 with the dtype.

-.. ipython:: python
+.. warning::

-   s = pd.Series([1, 2, np.nan], dtype="Int64")
-   s
+   Currently :meth:`pandas.array` and :meth:`pandas.Series` use different
+   rules for dtype inference. :meth:`pandas.array` will infer a nullable-
+   integer dtype

-By default (if you don't specify ``dtype``), NumPy is used, and you'll end
-up with a ``float64`` dtype Series:
+   .. ipython:: python

-.. ipython:: python
+      pd.array([1, None])
+      pd.array([1, 2])
+
+   For backwards-compatibility, :class:`Series` infers these as either
+   integer or float dtype
+
+   .. ipython:: python
+
+      pd.Series([1, None])
+      pd.Series([1, 2])

-   pd.Series([1, 2, np.nan])
+   We recommend explicitly providing the dtype to avoid confusion.
+
+   .. ipython:: python
+
+      pd.array([1, None], dtype="Int64")
+      pd.Series([1, None], dtype="Int64")
+
+   In the future, we may provide an option for :class:`Series` to infer a
+   nullable-integer dtype.

 Operations involving an integer array will behave similar to NumPy arrays.
 Missing values will be propagated, and the data will be coerced to another
 dtype if needed.

 .. ipython:: python

+   s = pd.Series([1, 2, None], dtype="Int64")
+
    # arithmetic
    s + 1
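
The inference difference described in this documentation change can be checked with a short script. This is a minimal sketch, assuming pandas 1.0 or later is installed; the variable names are illustrative and not part of the commit.

```python
import pandas as pd

# pd.array infers a nullable integer dtype when no dtype is given.
inferred = pd.array([1, None])
print(inferred.dtype)                  # Int64

# Series keeps its NumPy-based inference for backwards compatibility.
series_with_missing = pd.Series([1, None])
print(series_with_missing.dtype)       # float64

# Passing the dtype explicitly makes both constructors agree.
explicit_array = pd.array([1, None], dtype="Int64")
s = pd.Series([1, 2, None], dtype="Int64")
print(explicit_array.dtype, s.dtype)   # Int64 Int64

# Arithmetic propagates the missing value instead of coercing to float.
print((s + 1).isna().tolist())         # [False, False, True]
```

Passing `dtype="Int64"` is the approach the updated guide recommends, since it does not depend on which constructor's inference rules apply.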

doc/source/whatsnew/v1.0.0.rst

Lines changed: 32 additions & 1 deletion
@@ -303,6 +303,38 @@ The following methods now also correctly output values for unobserved categories

    df.groupby(["cat_1", "cat_2"], observed=False)["value"].count()

+:meth:`pandas.array` inference changes
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+:meth:`pandas.array` now infers pandas' new extension types in several cases (:issue:`29791`):
+
+1. String data (including missing values) now returns a :class:`arrays.StringArray`.
+2. Integer data (including missing values) now returns a :class:`arrays.IntegerArray`.
+3. Boolean data (including missing values) now returns the new :class:`arrays.BooleanArray`
+
+*pandas 0.25.x*
+
+.. code-block:: python
+
+   >>> pd.array(["a", None])
+   <PandasArray>
+   ['a', None]
+   Length: 2, dtype: object
+
+   >>> pd.array([1, None])
+   <PandasArray>
+   [1, None]
+   Length: 2, dtype: object
+
+
+*pandas 1.0.0*
+
+.. ipython:: python
+
+   pd.array(["a", None])
+   pd.array([1, None])
+
+As a reminder, you can specify the ``dtype`` to disable all inference.

 By default :meth:`Categorical.min` now returns the minimum instead of np.nan
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -408,7 +440,6 @@ Other API changes
 - :meth:`Series.dropna` has dropped its ``**kwargs`` argument in favor of a single ``how`` parameter.
   Supplying anything else than ``how`` to ``**kwargs`` raised a ``TypeError`` previously (:issue:`29388`)
 - When testing pandas, the new minimum required version of pytest is 5.0.1 (:issue:`29664`)
--


 .. _whatsnew_1000.api.documentation:
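
The whatsnew entry in this file notes that specifying the ``dtype`` disables inference. A minimal sketch of that, assuming pandas 1.0 or later and NumPy are installed (the variable names are illustrative only):

```python
import numpy as np
import pandas as pd

# An explicit NumPy dtype bypasses extension-type inference,
# so the values are stored with exactly the dtype requested.
as_object = pd.array([1, None], dtype=object)
as_int32 = pd.array([1, 2], dtype=np.dtype("int32"))

print(as_object.dtype)  # object
print(as_int32.dtype)   # int32
```

This is the escape hatch for code that relied on the pre-1.0 return types of `pd.array`.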

pandas/_libs/lib.pyx

Lines changed: 1 addition & 1 deletion
@@ -1313,7 +1313,7 @@ def infer_dtype(value: object, skipna: bool = True) -> str:

     elif isinstance(val, str):
         if is_string_array(values, skipna=skipna):
-            return 'string'
+            return "string"

     elif isinstance(val, bytes):
         if is_bytes_array(values, skipna=skipna):

pandas/core/construction.py

Lines changed: 57 additions & 41 deletions
@@ -94,10 +94,19 @@ def array(
         :class:`pandas.Period`         :class:`pandas.arrays.PeriodArray`
         :class:`datetime.datetime`     :class:`pandas.arrays.DatetimeArray`
         :class:`datetime.timedelta`    :class:`pandas.arrays.TimedeltaArray`
+        :class:`int`                   :class:`pandas.arrays.IntegerArray`
+        :class:`str`                   :class:`pandas.arrays.StringArray`
+        :class:`bool`                  :class:`pandas.arrays.BooleanArray`
         ============================== =====================================

         For all other cases, NumPy's usual inference rules will be used.

+        .. versionchanged:: 1.0.0
+
+           Pandas infers nullable-integer dtype for integer data,
+           string dtype for string data, and nullable-boolean dtype
+           for boolean data.
+
     copy : bool, default True
         Whether to copy the data, even if not necessary. Depending
         on the type of `data`, creating the new array may require
@@ -154,14 +163,6 @@ def array(
     ['a', 'b']
     Length: 2, dtype: str32

-    Or use the dedicated constructor for the array you're expecting, and
-    wrap that in a PandasArray
-
-    >>> pd.array(np.array(['a', 'b'], dtype='<U1'))
-    <PandasArray>
-    ['a', 'b']
-    Length: 2, dtype: str32
-
     Finally, Pandas has arrays that mostly overlap with NumPy

       * :class:`arrays.DatetimeArray`
@@ -184,20 +185,28 @@ def array(

     Examples
     --------
-    If a dtype is not specified, `data` is passed through to
-    :meth:`numpy.array`, and a :class:`arrays.PandasArray` is returned.
+    If a dtype is not specified, pandas will infer the best dtype from the values.
+    See the description of `dtype` for the types pandas infers for.

     >>> pd.array([1, 2])
-    <PandasArray>
+    <IntegerArray>
     [1, 2]
-    Length: 2, dtype: int64
+    Length: 2, dtype: Int64

-    Or the NumPy dtype can be specified
+    >>> pd.array([1, 2, np.nan])
+    <IntegerArray>
+    [1, 2, NaN]
+    Length: 3, dtype: Int64

-    >>> pd.array([1, 2], dtype=np.dtype("int32"))
-    <PandasArray>
-    [1, 2]
-    Length: 2, dtype: int32
+    >>> pd.array(["a", None, "c"])
+    <StringArray>
+    ['a', nan, 'c']
+    Length: 3, dtype: string
+
+    >>> pd.array([pd.Period('2000', freq="D"), pd.Period("2000", freq="D")])
+    <PeriodArray>
+    ['2000-01-01', '2000-01-01']
+    Length: 2, dtype: period[D]

     You can use the string alias for `dtype`

@@ -212,29 +221,24 @@ def array(
     [a, b, a]
     Categories (3, object): [a < b < c]

-    Because omitting the `dtype` passes the data through to NumPy,
-    a mixture of valid integers and NA will return a floating-point
-    NumPy array.
+    If pandas does not infer a dedicated extension type a
+    :class:`arrays.PandasArray` is returned.

-    >>> pd.array([1, 2, np.nan])
+    >>> pd.array([1.1, 2.2])
     <PandasArray>
-    [1.0, 2.0, nan]
-    Length: 3, dtype: float64
-
-    To use pandas' nullable :class:`pandas.arrays.IntegerArray`, specify
-    the dtype:
+    [1.1, 2.2]
+    Length: 2, dtype: float64

-    >>> pd.array([1, 2, np.nan], dtype='Int64')
-    <IntegerArray>
-    [1, 2, NaN]
-    Length: 3, dtype: Int64
+    As mentioned in the "Notes" section, new extension types may be added
+    in the future (by pandas or 3rd party libraries), causing the return
+    value to no longer be a :class:`arrays.PandasArray`. Specify the `dtype`
+    as a NumPy dtype if you need to ensure there's no future change in
+    behavior.

-    Pandas will infer an ExtensionArray for some types of data:
-
-    >>> pd.array([pd.Period('2000', freq="D"), pd.Period("2000", freq="D")])
-    <PeriodArray>
-    ['2000-01-01', '2000-01-01']
-    Length: 2, dtype: period[D]
+    >>> pd.array([1, 2], dtype=np.dtype("int32"))
+    <PandasArray>
+    [1, 2]
+    Length: 2, dtype: int32

     `data` must be 1-dimensional. A ValueError is raised when the input
     has the wrong dimensionality.
@@ -246,21 +250,26 @@ def array(
     """
     from pandas.core.arrays import (
         period_array,
+        BooleanArray,
+        IntegerArray,
         IntervalArray,
         PandasArray,
         DatetimeArray,
         TimedeltaArray,
+        StringArray,
     )

     if lib.is_scalar(data):
         msg = "Cannot pass scalar '{}' to 'pandas.array'."
         raise ValueError(msg.format(data))

-    data = extract_array(data, extract_numpy=True)
-
-    if dtype is None and isinstance(data, ABCExtensionArray):
+    if dtype is None and isinstance(
+        data, (ABCSeries, ABCIndexClass, ABCExtensionArray)
+    ):
         dtype = data.dtype

+    data = extract_array(data, extract_numpy=True)
+
     # this returns None for not-found dtypes.
     if isinstance(dtype, str):
         dtype = registry.find(dtype) or dtype
@@ -270,7 +279,7 @@ def array(
         return cls._from_sequence(data, dtype=dtype, copy=copy)

     if dtype is None:
-        inferred_dtype = lib.infer_dtype(data, skipna=False)
+        inferred_dtype = lib.infer_dtype(data, skipna=True)
         if inferred_dtype == "period":
             try:
                 return period_array(data, copy=copy)
@@ -298,7 +307,14 @@ def array(
             # timedelta, timedelta64
             return TimedeltaArray._from_sequence(data, copy=copy)

-        # TODO(BooleanArray): handle this type
+        elif inferred_dtype == "string":
+            return StringArray._from_sequence(data, copy=copy)
+
+        elif inferred_dtype == "integer":
+            return IntegerArray._from_sequence(data, copy=copy)
+
+        elif inferred_dtype == "boolean":
+            return BooleanArray._from_sequence(data, copy=copy)

     # Pandas overrides NumPy for
     # 1. datetime64[ns]
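
The dispatch added to `array()` in this file can be illustrated with a pure-Python sketch. Everything below is a simplified stand-in written for illustration: `is_missing`, `infer_kind`, `ARRAY_FOR_KIND`, and `choose_array` are hypothetical names, not pandas functions, and the real `lib.infer_dtype` handles many more cases. The sketch only shows why inferring with `skipna=True` lets data containing missing values reach the new string, integer, and boolean branches.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing (a simplification)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def infer_kind(values, skipna=True):
    """Toy stand-in for lib.infer_dtype that returns a coarse label."""
    if skipna:
        values = [v for v in values if not is_missing(v)]
    if not values:
        return "empty"
    # Exact type checks keep bool (a subclass of int) separate from int.
    for label, typ in (("boolean", bool), ("string", str), ("integer", int)):
        if all(type(v) is typ for v in values):
            return label
    return "other"


# Hypothetical lookup table; the commit itself uses an if/elif chain.
ARRAY_FOR_KIND = {
    "string": "StringArray",
    "integer": "IntegerArray",
    "boolean": "BooleanArray",
}


def choose_array(values):
    """Return the name of the array class the sketch would pick."""
    return ARRAY_FOR_KIND.get(infer_kind(values, skipna=True), "PandasArray")


print(choose_array([1, None]))       # IntegerArray
print(choose_array(["a", None]))     # StringArray
print(choose_array([True, None]))    # BooleanArray
print(choose_array([1.1, 2.2]))      # PandasArray
# Without skipping missing values, the string data is not recognised.
print(infer_kind(["a", None, "c"], skipna=False))  # other
```

In the sketch, as in the commit, anything that does not match a dedicated branch falls through to the NumPy-backed `PandasArray`.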

pandas/tests/arrays/test_array.py

Lines changed: 33 additions & 6 deletions
@@ -19,14 +19,18 @@
     "data, dtype, expected",
     [
         # Basic NumPy defaults.
-        ([1, 2], None, PandasArray(np.array([1, 2]))),
+        ([1, 2], None, pd.arrays.IntegerArray._from_sequence([1, 2])),
         ([1, 2], object, PandasArray(np.array([1, 2], dtype=object))),
         (
             [1, 2],
             np.dtype("float32"),
             PandasArray(np.array([1.0, 2.0], dtype=np.dtype("float32"))),
         ),
-        (np.array([1, 2]), None, PandasArray(np.array([1, 2]))),
+        (
+            np.array([1, 2], dtype="int64"),
+            None,
+            pd.arrays.IntegerArray._from_sequence([1, 2]),
+        ),
         # String alias passes through to NumPy
         ([1, 2], "float32", PandasArray(np.array([1, 2], dtype="float32"))),
         # Period alias
@@ -113,6 +117,20 @@
         # IntegerNA
         ([1, None], "Int16", integer_array([1, None], dtype="Int16")),
         (pd.Series([1, 2]), None, PandasArray(np.array([1, 2], dtype=np.int64))),
+        # String
+        (["a", None], "string", pd.arrays.StringArray._from_sequence(["a", None])),
+        (
+            ["a", None],
+            pd.StringDtype(),
+            pd.arrays.StringArray._from_sequence(["a", None]),
+        ),
+        # Boolean
+        ([True, None], "boolean", pd.arrays.BooleanArray._from_sequence([True, None])),
+        (
+            [True, None],
+            pd.BooleanDtype(),
+            pd.arrays.BooleanArray._from_sequence([True, None]),
+        ),
         # Index
         (pd.Index([1, 2]), None, PandasArray(np.array([1, 2], dtype=np.int64))),
         # Series[EA] returns the EA
@@ -139,15 +157,15 @@ def test_array(data, dtype, expected):
 def test_array_copy():
     a = np.array([1, 2])
     # default is to copy
-    b = pd.array(a)
+    b = pd.array(a, dtype=a.dtype)
     assert np.shares_memory(a, b._ndarray) is False

     # copy=True
-    b = pd.array(a, copy=True)
+    b = pd.array(a, dtype=a.dtype, copy=True)
     assert np.shares_memory(a, b._ndarray) is False

     # copy=False
-    b = pd.array(a, copy=False)
+    b = pd.array(a, dtype=a.dtype, copy=False)
     assert np.shares_memory(a, b._ndarray) is True



@@ -211,6 +229,15 @@ def test_array_copy():
             np.array([1, 2], dtype="m8[us]"),
             pd.arrays.TimedeltaArray(np.array([1000, 2000], dtype="m8[ns]")),
         ),
+        # integer
+        ([1, 2], pd.arrays.IntegerArray._from_sequence([1, 2])),
+        ([1, None], pd.arrays.IntegerArray._from_sequence([1, None])),
+        # string
+        (["a", "b"], pd.arrays.StringArray._from_sequence(["a", "b"])),
+        (["a", None], pd.arrays.StringArray._from_sequence(["a", None])),
+        # Boolean
+        ([True, False], pd.arrays.BooleanArray._from_sequence([True, False])),
+        ([True, None], pd.arrays.BooleanArray._from_sequence([True, None])),
     ],
 )
 def test_array_inference(data, expected):
@@ -241,7 +268,7 @@ def test_array_inference_fails(data):
 @pytest.mark.parametrize("data", [np.array([[1, 2], [3, 4]]), [[1, 2], [3, 4]]])
 def test_nd_raises(data):
     with pytest.raises(ValueError, match="PandasArray must be 1-dimensional"):
-        pd.array(data)
+        pd.array(data, dtype="int64")


 def test_scalar_raises():

pandas/tests/dtypes/test_inference.py

Lines changed: 7 additions & 2 deletions
@@ -732,12 +732,17 @@ def test_string(self):
     def test_unicode(self):
         arr = ["a", np.nan, "c"]
         result = lib.infer_dtype(arr, skipna=False)
+        # This currently returns "mixed", but it's not clear that's optimal.
+        # This could also return "string" or "mixed-string"
         assert result == "mixed"

         arr = ["a", np.nan, "c"]
         result = lib.infer_dtype(arr, skipna=True)
-        expected = "string"
-        assert result == expected
+        assert result == "string"
+
+        arr = ["a", "c"]
+        result = lib.infer_dtype(arr, skipna=False)
+        assert result == "string"

     @pytest.mark.parametrize(
         "dtype, missing, skipna, expected",
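
The behaviour these tests pin down can also be reproduced through the public alias `pandas.api.types.infer_dtype`. A minimal sketch, assuming pandas and NumPy are installed; the expected labels in the comments are the ones asserted by the tests in this commit:

```python
import numpy as np
from pandas.api.types import infer_dtype

with_missing = ["a", np.nan, "c"]

# Counting the NaN as a value makes the data look mixed.
mixed_label = infer_dtype(with_missing, skipna=False)
print(mixed_label)    # mixed

# Skipping missing values leaves only strings.
string_label = infer_dtype(with_missing, skipna=True)
print(string_label)   # string

# With no missing values, skipna makes no difference.
clean_label = infer_dtype(["a", "c"], skipna=False)
print(clean_label)    # string
```

This is why `array()` switches to `skipna=True` in this commit: with `skipna=False`, string data containing a missing value would be labelled "mixed" and never reach the `StringArray` branch.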
