Skip to content

Backport PR #31591 on branch 1.0.x #32192

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Feb 23, 2020
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 5 additions & 5 deletions doc/source/user_guide/boolean.rst
Original file line number Diff line number Diff line change
Expand Up @@ -20,8 +20,9 @@ Nullable Boolean Data Type
Indexing with NA values
-----------------------

pandas does not allow indexing with NA values. Attempting to do so
will raise a ``ValueError``.
pandas allows indexing with ``NA`` values in a boolean array, which are treated as ``False``.

.. versionchanged:: 1.0.2

.. ipython:: python
:okexcept:
Expand All @@ -30,12 +31,11 @@ will raise a ``ValueError``.
mask = pd.array([True, False, pd.NA], dtype="boolean")
s[mask]

The missing values will need to be explicitly filled with True or False prior
to using the array as a mask.
If you would prefer to keep the ``NA`` values you can manually fill them with ``fillna(True)``.

.. ipython:: python

s[mask.fillna(False)]
s[mask.fillna(True)]

.. _boolean.kleene:

Expand Down
12 changes: 10 additions & 2 deletions doc/source/user_guide/indexing.rst
Original file line number Diff line number Diff line change
Expand Up @@ -59,7 +59,7 @@ of multi-axis indexing.
slices, **both** the start and the stop are included, when present in the
index! See :ref:`Slicing with labels <indexing.slicing_with_labels>`
and :ref:`Endpoints are inclusive <advanced.endpoints_are_inclusive>`.)
* A boolean array
* A boolean array (any ``NA`` values will be treated as ``False``).
* A ``callable`` function with one argument (the calling Series or DataFrame) and
that returns valid output for indexing (one of the above).

Expand All @@ -75,7 +75,7 @@ of multi-axis indexing.
* An integer e.g. ``5``.
* A list or array of integers ``[4, 3, 0]``.
* A slice object with ints ``1:7``.
* A boolean array.
* A boolean array (any ``NA`` values will be treated as ``False``).
* A ``callable`` function with one argument (the calling Series or DataFrame) and
that returns valid output for indexing (one of the above).

Expand Down Expand Up @@ -374,6 +374,14 @@ For getting values with a boolean array:
df1.loc['a'] > 0
df1.loc[:, df1.loc['a'] > 0]

NA values in a boolean array propogate as ``False``:

.. versionchanged:: 1.0.2

mask = pd.array([True, False, True, False, pd.NA, False], dtype="boolean")
mask
df1[mask]

For getting a value explicitly:

.. ipython:: python
Expand Down
29 changes: 27 additions & 2 deletions doc/source/whatsnew/v1.0.2.rst
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,33 @@ Fixed regressions

.. ---------------------------------------------------------------------------

Indexing with Nullable Boolean Arrays
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Previously indexing with a nullable Boolean array containing ``NA`` would raise a ``ValueError``, however this is now permitted with ``NA`` being treated as ``False``. (:issue:`31503`)

.. ipython:: python

s = pd.Series([1, 2, 3, 4])
mask = pd.array([True, True, False, None], dtype="boolean")
s
mask

*pandas 1.0.0-1.0.1*

.. code-block:: python

>>> s[mask]
Traceback (most recent call last):
...
ValueError: cannot mask with array containing NA / NaN values

*pandas 1.0.2*

.. ipython:: python

s[mask]

.. _whatsnew_102.bug_fixes:

Bug fixes
Expand All @@ -45,8 +72,6 @@ Bug fixes
- Using ``pd.NA`` with :meth:`DataFrame.to_json` now correctly outputs a null value instead of an empty object (:issue:`31615`)
- Fixed bug in parquet roundtrip with nullable unsigned integer dtypes (:issue:`31896`).



**Experimental dtypes**

- Fix bug in :meth:`DataFrame.convert_dtypes` for columns that were already using the ``"string"`` dtype (:issue:`31731`).
Expand Down
4 changes: 3 additions & 1 deletion pandas/core/arrays/datetimelike.py
Original file line number Diff line number Diff line change
Expand Up @@ -520,7 +520,9 @@ def __getitem__(self, key):
if com.is_bool_indexer(key):
# first convert to boolean, because check_array_indexer doesn't
# allow object dtype
key = np.asarray(key, dtype=bool)
if is_object_dtype(key):
key = np.asarray(key, dtype=bool)

key = check_array_indexer(self, key)
if key.all():
key = slice(0, None, None)
Expand Down
7 changes: 1 addition & 6 deletions pandas/core/common.py
Original file line number Diff line number Diff line change
Expand Up @@ -124,24 +124,19 @@ def is_bool_indexer(key: Any) -> bool:
check_array_indexer : Check that `key` is a valid array to index,
and convert to an ndarray.
"""
na_msg = "cannot mask with array containing NA / NaN values"
if isinstance(key, (ABCSeries, np.ndarray, ABCIndex)) or (
is_array_like(key) and is_extension_array_dtype(key.dtype)
):
if key.dtype == np.object_:
key = np.asarray(values_from_object(key))

if not lib.is_bool_array(key):
na_msg = "Cannot mask with non-boolean array containing NA / NaN values"
if isna(key).any():
raise ValueError(na_msg)
return False
return True
elif is_bool_dtype(key.dtype):
# an ndarray with bool-dtype by definition has no missing values.
# So we only need to check for NAs in ExtensionArrays
if is_extension_array_dtype(key.dtype):
if np.any(key.isna()):
raise ValueError(na_msg)
return True
elif isinstance(key, list):
try:
Expand Down
14 changes: 6 additions & 8 deletions pandas/core/indexers.py
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,7 @@
from pandas.core.dtypes.common import (
is_array_like,
is_bool_dtype,
is_extension_array_dtype,
is_integer_dtype,
is_list_like,
)
Expand Down Expand Up @@ -333,14 +334,11 @@ def check_array_indexer(array: AnyArrayLike, indexer: Any) -> Any:
...
IndexError: Boolean index has wrong length: 3 instead of 2.

A ValueError is raised when the mask cannot be converted to
a bool-dtype ndarray.
NA values in a boolean array are treated as False.

>>> mask = pd.array([True, pd.NA])
>>> pd.api.indexers.check_array_indexer(arr, mask)
Traceback (most recent call last):
...
ValueError: Cannot mask with a boolean indexer containing NA values
array([ True, False])

A numpy boolean mask will get passed through (if the length is correct):

Expand Down Expand Up @@ -392,10 +390,10 @@ def check_array_indexer(array: AnyArrayLike, indexer: Any) -> Any:

dtype = indexer.dtype
if is_bool_dtype(dtype):
try:
if is_extension_array_dtype(dtype):
indexer = indexer.to_numpy(dtype=bool, na_value=False)
else:
indexer = np.asarray(indexer, dtype=bool)
except ValueError:
raise ValueError("Cannot mask with a boolean indexer containing NA values")

# GH26658
if len(indexer) != len(array):
Expand Down
7 changes: 5 additions & 2 deletions pandas/core/indexing.py
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,7 @@
is_iterator,
is_list_like,
is_numeric_dtype,
is_object_dtype,
is_scalar,
is_sequence,
)
Expand Down Expand Up @@ -2319,10 +2320,12 @@ def check_bool_indexer(index: Index, key) -> np.ndarray:
"the indexed object do not match)."
)
result = result.astype(bool)._values
else:
# key might be sparse / object-dtype bool, check_array_indexer needs bool array
elif is_object_dtype(key):
# key might be object-dtype bool, check_array_indexer needs bool array
result = np.asarray(result, dtype=bool)
result = check_array_indexer(index, result)
else:
result = check_array_indexer(index, result)

return result

Expand Down
9 changes: 6 additions & 3 deletions pandas/tests/arrays/categorical/test_indexing.py
Original file line number Diff line number Diff line change
Expand Up @@ -240,14 +240,17 @@ def test_mask_with_boolean(index):


@pytest.mark.parametrize("index", [True, False])
def test_mask_with_boolean_raises(index):
def test_mask_with_boolean_na_treated_as_false(index):
# https://github.com/pandas-dev/pandas/issues/31503
s = Series(range(3))
idx = Categorical([True, False, None])
if index:
idx = CategoricalIndex(idx)

with pytest.raises(ValueError, match="NA / NaN"):
s[idx]
result = s[idx]
expected = s[idx.fillna(False)]

tm.assert_series_equal(result, expected)


@pytest.fixture
Expand Down
20 changes: 11 additions & 9 deletions pandas/tests/extension/base/getitem.py
Original file line number Diff line number Diff line change
Expand Up @@ -158,21 +158,23 @@ def test_getitem_boolean_array_mask(self, data):
result = pd.Series(data)[mask]
self.assert_series_equal(result, expected)

def test_getitem_boolean_array_mask_raises(self, data):
def test_getitem_boolean_na_treated_as_false(self, data):
# https://github.com/pandas-dev/pandas/issues/31503
mask = pd.array(np.zeros(data.shape, dtype="bool"), dtype="boolean")
mask[:2] = pd.NA
mask[2:4] = True

msg = (
"Cannot mask with a boolean indexer containing NA values|"
"cannot mask with array containing NA / NaN values"
)
with pytest.raises(ValueError, match=msg):
data[mask]
result = data[mask]
expected = data[mask.fillna(False)]

self.assert_extension_array_equal(result, expected)

s = pd.Series(data)

with pytest.raises(ValueError):
s[mask]
result = s[mask]
expected = s[mask.fillna(False)]

self.assert_series_equal(result, expected)

@pytest.mark.parametrize(
"idx",
Expand Down
84 changes: 84 additions & 0 deletions pandas/tests/extension/base/setitem.py
Original file line number Diff line number Diff line change
Expand Up @@ -93,6 +93,90 @@ def test_setitem_iloc_scalar_multiple_homogoneous(self, data):
df.iloc[10, 1] = data[1]
assert df.loc[10, "B"] == data[1]

@pytest.mark.parametrize(
"mask",
[
np.array([True, True, True, False, False]),
pd.array([True, True, True, False, False], dtype="boolean"),
pd.array([True, True, True, pd.NA, pd.NA], dtype="boolean"),
],
ids=["numpy-array", "boolean-array", "boolean-array-na"],
)
def test_setitem_mask(self, data, mask, box_in_series):
arr = data[:5].copy()
expected = arr.take([0, 0, 0, 3, 4])
if box_in_series:
arr = pd.Series(arr)
expected = pd.Series(expected)
arr[mask] = data[0]
self.assert_equal(expected, arr)

def test_setitem_mask_raises(self, data, box_in_series):
# wrong length
mask = np.array([True, False])

if box_in_series:
data = pd.Series(data)

with pytest.raises(IndexError, match="wrong length"):
data[mask] = data[0]

mask = pd.array(mask, dtype="boolean")
with pytest.raises(IndexError, match="wrong length"):
data[mask] = data[0]

def test_setitem_mask_boolean_array_with_na(self, data, box_in_series):
mask = pd.array(np.zeros(data.shape, dtype="bool"), dtype="boolean")
mask[:3] = True
mask[3:5] = pd.NA

if box_in_series:
data = pd.Series(data)

data[mask] = data[0]

assert (data[:3] == data[0]).all()

@pytest.mark.parametrize(
"idx",
[[0, 1, 2], pd.array([0, 1, 2], dtype="Int64"), np.array([0, 1, 2])],
ids=["list", "integer-array", "numpy-array"],
)
def test_setitem_integer_array(self, data, idx, box_in_series):
arr = data[:5].copy()
expected = data.take([0, 0, 0, 3, 4])

if box_in_series:
arr = pd.Series(arr)
expected = pd.Series(expected)

arr[idx] = arr[0]
self.assert_equal(arr, expected)

@pytest.mark.parametrize(
"idx, box_in_series",
[
([0, 1, 2, pd.NA], False),
pytest.param(
[0, 1, 2, pd.NA], True, marks=pytest.mark.xfail(reason="GH-31948")
),
(pd.array([0, 1, 2, pd.NA], dtype="Int64"), False),
(pd.array([0, 1, 2, pd.NA], dtype="Int64"), False),
],
ids=["list-False", "list-True", "integer-array-False", "integer-array-True"],
)
def test_setitem_integer_with_missing_raises(self, data, idx, box_in_series):
arr = data.copy()

# TODO(xfail) this raises KeyError about labels not found (it tries label-based)
# for list of labels with Series
if box_in_series:
arr = pd.Series(data, index=[tm.rands(4) for _ in range(len(data))])

msg = "Cannot index with an integer indexer containing NA values"
with pytest.raises(ValueError, match=msg):
arr[idx] = arr[0]

@pytest.mark.parametrize("as_callable", [True, False])
@pytest.mark.parametrize("setter", ["loc", None])
def test_setitem_mask_aligned(self, data, as_callable, setter):
Expand Down
42 changes: 42 additions & 0 deletions pandas/tests/extension/test_numpy.py
Original file line number Diff line number Diff line change
Expand Up @@ -396,6 +396,48 @@ def test_setitem_scalar_key_sequence_raise(self, data):
# Failed: DID NOT RAISE <class 'ValueError'>
super().test_setitem_scalar_key_sequence_raise(data)

# TODO: there is some issue with PandasArray, therefore,
# skip the setitem test for now, and fix it later (GH 31446)

@skip_nested
@pytest.mark.parametrize(
"mask",
[
np.array([True, True, True, False, False]),
pd.array([True, True, True, False, False], dtype="boolean"),
],
ids=["numpy-array", "boolean-array"],
)
def test_setitem_mask(self, data, mask, box_in_series):
super().test_setitem_mask(data, mask, box_in_series)

@skip_nested
def test_setitem_mask_raises(self, data, box_in_series):
super().test_setitem_mask_raises(data, box_in_series)

@skip_nested
@pytest.mark.parametrize(
"idx",
[[0, 1, 2], pd.array([0, 1, 2], dtype="Int64"), np.array([0, 1, 2])],
ids=["list", "integer-array", "numpy-array"],
)
def test_setitem_integer_array(self, data, idx, box_in_series):
super().test_setitem_integer_array(data, idx, box_in_series)

@skip_nested
@pytest.mark.parametrize(
"idx, box_in_series",
[
([0, 1, 2, pd.NA], False),
pytest.param([0, 1, 2, pd.NA], True, marks=pytest.mark.xfail),
(pd.array([0, 1, 2, pd.NA], dtype="Int64"), False),
(pd.array([0, 1, 2, pd.NA], dtype="Int64"), False),
],
ids=["list-False", "list-True", "integer-array-False", "integer-array-True"],
)
def test_setitem_integer_with_missing_raises(self, data, idx, box_in_series):
super().test_setitem_integer_with_missing_raises(data, idx, box_in_series)

@skip_nested
def test_setitem_slice(self, data, box_in_series):
super().test_setitem_slice(data, box_in_series)
Expand Down
Loading