BUG (?): Some rolling window calculations do not work on Int64Dtype Series containing pd.NA #44291

Closed
2 of 3 tasks
tamargrey opened this issue Nov 2, 2021 · 1 comment
Labels
Bug Needs Triage Issue that has not been reviewed by a pandas team member

tamargrey commented Nov 2, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import pandas as pd

s = pd.Series([1,2,3,pd.NA,5], dtype='Int64')

# Raises DataError: No numeric types to aggregate
s.rolling(3).max()

# count works
s.rolling(3).count()

Issue Description

When running a rolling window calculation on an Int64Dtype Series that contains pd.NA, many of the available calculations raise the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/.pyenv/versions/3.8.2/envs/test-env/lib/python3.8/site-packages/pandas/core/window/rolling.py in _prep_values(self, values)
    322             try:
--> 323                 values = ensure_float64(values)
    324             except (ValueError, TypeError) as err:

pandas/_libs/algos_common_helper.pxi in pandas._libs.algos.ensure_float64()

~/.pyenv/versions/3.8.2/envs/test-env/lib/python3.8/site-packages/pandas/core/arrays/masked.py in __array__(self, dtype)
    334         """
--> 335         return self.to_numpy(dtype=dtype)
    336 

~/.pyenv/versions/3.8.2/envs/test-env/lib/python3.8/site-packages/pandas/core/arrays/masked.py in to_numpy(self, dtype, copy, na_value)
    291             ):
--> 292                 raise ValueError(
    293                     f"cannot convert to '{dtype}'-dtype NumPy array "

ValueError: cannot convert to 'float64'-dtype NumPy array with missing values. Specify an appropriate 'na_value' for this dtype.

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
~/.pyenv/versions/3.8.2/envs/test-env/lib/python3.8/site-packages/pandas/core/window/rolling.py in _apply_series(self, homogeneous_func, name)
    403         try:
--> 404             values = self._prep_values(obj._values)
    405         except (TypeError, NotImplementedError) as err:

~/.pyenv/versions/3.8.2/envs/test-env/lib/python3.8/site-packages/pandas/core/window/rolling.py in _prep_values(self, values)
    324             except (ValueError, TypeError) as err:
--> 325                 raise TypeError(f"cannot handle this type -> {values.dtype}") from err
    326 

TypeError: cannot handle this type -> Int64

The above exception was the direct cause of the following exception:

DataError                                 Traceback (most recent call last)
<ipython-input-11-7b94d5aa8fee> in <module>
      1 import pandas as pd
      2 s = pd.Series([1,2,3,None,5], dtype='Int64')
----> 3 s.rolling(3).max()

~/.pyenv/versions/3.8.2/envs/test-env/lib/python3.8/site-packages/pandas/core/window/rolling.py in max(self, engine, engine_kwargs, *args, **kwargs)
   1764     ):
   1765         nv.validate_rolling_func("max", args, kwargs)
-> 1766         return super().max(*args, engine=engine, engine_kwargs=engine_kwargs, **kwargs)
   1767 
   1768     @doc(

~/.pyenv/versions/3.8.2/envs/test-env/lib/python3.8/site-packages/pandas/core/window/rolling.py in max(self, engine, engine_kwargs, *args, **kwargs)
   1261             )
   1262         window_func = window_aggregations.roll_max
-> 1263         return self._apply(window_func, name="max", **kwargs)
   1264 
   1265     def min(

~/.pyenv/versions/3.8.2/envs/test-env/lib/python3.8/site-packages/pandas/core/window/rolling.py in _apply(self, func, name, numba_cache_key, **kwargs)
    543 
    544         if self.method == "single":
--> 545             return self._apply_blockwise(homogeneous_func, name)
    546         else:
    547             return self._apply_tablewise(homogeneous_func, name)

~/.pyenv/versions/3.8.2/envs/test-env/lib/python3.8/site-packages/pandas/core/window/rolling.py in _apply_blockwise(self, homogeneous_func, name)
    417         """
    418         if self._selected_obj.ndim == 1:
--> 419             return self._apply_series(homogeneous_func, name)
    420 
    421         obj = self._create_data(self._selected_obj)

~/.pyenv/versions/3.8.2/envs/test-env/lib/python3.8/site-packages/pandas/core/window/rolling.py in _apply_series(self, homogeneous_func, name)
    404             values = self._prep_values(obj._values)
    405         except (TypeError, NotImplementedError) as err:
--> 406             raise DataError("No numeric types to aggregate") from err
    407 
    408         result = homogeneous_func(values)

DataError: No numeric types to aggregate 

I've only shown this behavior for pandas.core.window.rolling.Rolling.max, but many of the other pandas.core.window.rolling.Rolling methods raise the same error. One method that does not raise it is count, which seems to handle null values differently from the other methods, so that may be related.
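To see which methods are affected on a given installation, a rough probe like the one below may help. It is only a sketch: the method list is an illustrative subset of the public Rolling API rather than an exhaustive one, the `outcomes` dictionary is a name introduced here, and the results depend on the pandas version in use.

```python
import pandas as pd

s = pd.Series([1, 2, 3, pd.NA, 5], dtype="Int64")

# Illustrative subset of Rolling methods; extend as needed.
methods = ["count", "sum", "mean", "median", "min", "max", "std", "var"]

outcomes = {}
for name in methods:
    try:
        # Call e.g. s.rolling(3).max() by name.
        getattr(s.rolling(3), name)()
        outcomes[name] = "ok"
    except Exception as err:
        # Record only the exception type, e.g. DataError on affected versions.
        outcomes[name] = type(err).__name__

for name, outcome in outcomes.items():
    print(f"{name:>6}: {outcome}")
```

On an affected version, count should report "ok" while max reports the exception type from the traceback above.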

My current workaround is to convert Int64 columns to float64 before calling Series.rolling.
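A minimal sketch of that workaround, assuming a plain astype("float64") cast is acceptable for the data (pd.NA becomes NaN, and the result comes back as float64 rather than Int64). The variable names are only for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3, pd.NA, 5], dtype="Int64")

# Cast the nullable integer Series to float64; pd.NA is converted to NaN.
as_float = s.astype("float64")

# The rolling calculation now runs on an ordinary float64 Series.
result = as_float.rolling(3).max()
print(result)
```

Any window that includes the missing value yields NaN, because min_periods defaults to the window size for a fixed-size window.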

Expected Behavior

I would expect the behavior to match that of a rolling window calculation on a 'float64' dtype Series containing NaNs.

# Using 'float64' works
s = pd.Series([1,2,3,None,5], dtype='float64')
s.rolling(3).max()

In this case, the result is as follows:

0    NaN
1    NaN
2    3.0
3    NaN
4    NaN
dtype: float64

Installed Versions

INSTALLED VERSIONS

commit : 945c9ed
python : 3.8.2.final.0
python-bits : 64
OS : Darwin
OS-release : 20.5.0
Version : Darwin Kernel Version 20.5.0: Sat May 8 05:10:33 PDT 2021; root:xnu-7195.121.3~9/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.4
numpy : 1.21.2
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.2
setuptools : 41.2.0
Cython : 0.29.17
pytest : 6.0.1
hypothesis : None
sphinx : 3.2.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.1 (dt dec pq3 ext lo64)
jinja2 : 2.11.3
IPython : 7.18.1
pandas_datareader: None
bs4 : 4.9.3
bottleneck : None
fsspec : 2021.06.0
fastparquet : 0.5.0
gcsfs : None
matplotlib : 3.4.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 5.0.0
pyxlsb : None
s3fs : None
scipy : 1.7.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : 0.53.0

@tamargrey tamargrey added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 2, 2021
@mroeschke
Member

Thanks for the report.

This has been addressed on our development branch and will be fixed in v1.4.0 (likely December 2021); see #43174.
