Skip to content

Allow rolling API to accept BaseIndexer subclass #2

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
48 changes: 48 additions & 0 deletions pandas/_libs/_window.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,48 @@
import abc
from typing import Dict, Sequence, Tuple

import numpy as np

BeginEnd = Tuple[np.ndarray, np.ndarray]

# TODO: Refactor MockFixedWindowIndexer, FixedWindowIndexer,
# VariableWindowIndexer to also have `get_window_bounds` methods that
# only calculates start & stop

# TODO: Currently, when win_type is specified, it calls a special routine,
# `roll_window`, while None win_type ops dispatch to specific methods.
# Consider consolidating?


class BaseIndexer(abc.ABC):

def __init__(self, index, offset, keys):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whatever you can type would be great

# TODO: The alternative is for the `rolling` API to accept
# index, offset, and keys as keyword arguments
self.index = index
self.offset = offset
self.keys = keys # type: Sequence[np.ndarray]

@classmethod
@abc.abstractmethod
def get_window_bounds(cls, **kwargs: Dict) -> BeginEnd:
"""
Compute the bounds of a window.

Users should subclass this class to implement a custom method
to calculate window bounds

Parameters
----------
kwargs
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we explicity set the kwargs?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want to allow indexer authors to pass arbitrary data here.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scratch that, these should be explicit.

A dictionary of keyword arguments obtained from the top level
rolling API e.g. min_periods, win_type

Returns
-------
BeginEnd
A tuple of ndarray[int64]s, indicating the boundaries of each
window

"""
return NotImplemented
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't need to return anything.

21 changes: 18 additions & 3 deletions pandas/core/window.py
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,7 @@
import numpy as np

import pandas._libs.window as libwindow
import pandas._libs._window as libwindow_refactor
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so I would either do this directly in .window or name this something non-private (as its already private). Note that if say refactoring needs to be done to make things cleaner we can do pre-cursors to master.

from pandas.compat._optional import import_optional_dependency
from pandas.compat.numpy import function as nv
from pandas.util._decorators import Appender, Substitution, cache_readonly
Expand Down Expand Up @@ -451,14 +452,19 @@ class Window(_Window):

Parameters
----------
window : int, or offset
window : int, offset, or BaseIndexer subclass
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ideally type as much as possible (again I know that things are not typed now), maybe makes sense to do a pre-cursor to type what we have now

Size of the moving window. This is the number of observations used for
calculating the statistic. Each window will be a fixed size.

If its an offset then this will be the time period of each window. Each
window will be a variable sized based on the observations included in
the time-period. This is only valid for datetimelike indexes. This is
new in 0.19.0

If a BaseIndexer subclass is passed, calculates the window boundaries
based on the defined ``get_window_bounds`` method. Additional rolling
keyword arguments, namely `min_periods`, `center`, `win_type`, and
`closed` will be passed to `get_window_bounds`.
min_periods : int, default None
Minimum number of observations in window required to have a value
(otherwise result is NA). For a window that is specified by an offset,
Expand Down Expand Up @@ -599,7 +605,8 @@ def validate(self):
super().validate()

window = self.window
if isinstance(window, (list, tuple, np.ndarray)):
if isinstance(window, (list, tuple, np.ndarray,
libwindow_refactor.BaseIndexer)):
pass
elif is_integer(window):
if window <= 0:
Expand Down Expand Up @@ -661,6 +668,11 @@ def _pop_args(win_type, arg_names, kwargs):
win_type = _validate_win_type(self.win_type, kwargs)
# GH #15662. `False` makes symmetric window, rather than periodic.
return sig.get_window(win_type, window, False).astype(float)
elif isinstance(window, libwindow_refactor.BaseIndexer):
return window.get_window_span(win_type=self.win_type,
min_periods=self.min_periods,
center=self.center,
closed=self.closed)

def _apply_window(self, mean=True, **kwargs):
"""
Expand Down Expand Up @@ -1628,7 +1640,8 @@ def validate(self):
# min_periods must be an integer
if self.min_periods is None:
self.min_periods = 1

elif isinstance(self.window, libwindow_refactor.BaseIndexer):
pass
elif not is_integer(self.window):
raise ValueError("window must be an integer")
elif self.window < 0:
Expand Down Expand Up @@ -2605,6 +2618,8 @@ def _get_center_of_mass(comass, span, halflife, alpha):


def _offset(window, center):
# TODO: (MATT) If the window is a BaseIndexer subclass,
# we need to pass in the materialized window
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the type of window here? It looks like it can be a sequence or an integer?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. After a light audit, I anticipate the materialized window may be passed here and other times and Indexer class might be passed here.

Overall I think this routine is for label formatting.

if not is_integer(window):
window = len(window)
offset = (window - 1) / 2. if center else 0
Expand Down
53 changes: 52 additions & 1 deletion pandas/tests/test_window.py
Original file line number Diff line number Diff line change
Expand Up @@ -9,11 +9,13 @@
import pytest

from pandas.errors import UnsupportedFunctionCall
import pandas._libs._window as libwindow_refactor
import pandas.util._test_decorators as td

import pandas as pd
from pandas import (
DataFrame, Index, Series, Timestamp, bdate_range, concat, isna, notna)
DataFrame, Index, Series, Timestamp, bdate_range, concat, date_range,
isna, notna)
from pandas.core.base import DataError, SpecificationError
from pandas.core.sorting import safe_sort
import pandas.core.window as rwindow
Expand Down Expand Up @@ -54,6 +56,38 @@ def arithmetic_win_operators(request):
return request.param


@pytest.fixture(params=['right', 'left', 'both', 'neither'])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some of this seems a good candidate for a pre-cursor PR to master (the fixtures)

def closed(request):
return request.param


@pytest.fixture(params=[True, False])
def center(request):
return request.param


@pytest.fixture(params=[None, 1])
def min_periods(request):
return request.param


@pytest.fixture
def dummy_custom_indexer():

class DummyIndexer(libwindow_refactor.BaseIndexer):

def __init__(self, index, offset, keys):
super().__init__(index, offset, keys)

def get_window_bounds(self, **kwargs):
pass

idx = date_range('2019', freq='D', periods=3)
offset = offsets.BusinessDay(1)
keys = ['A']
return DummyIndexer(index=idx, offset=offset, keys=keys)


class Base:

_nan_locs = np.arange(20, 40)
Expand Down Expand Up @@ -4154,3 +4188,20 @@ def test_rolling_cov_offset(self):

expected2 = ss.rolling(3, min_periods=1).cov()
tm.assert_series_equal(result, expected2)


Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not averse to actually putting the tests in a separate file (may make review easier).

so I think a pre-cursor PR to move pandas/tests/test_window.py -> pandas/tests/window/test_window.py makes sense (then we can add on things in an easy namespace)

class TestCustomIndexer:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class isn't necessary. Since we have pytest there's really no reason not to use top-level-function-style tests.


def test_custom_indexer_validates(self,
dummy_custom_indexer,
win_types,
closed,
min_periods,
center):
# Test passing a BaseIndexer subclass does not raise validation errors
s = Series(range(10))
s.rolling(dummy_custom_indexer,
win_type=win_types,
center=center,
min_periods=min_periods,
closed=closed)