ENH - Index set operation modifications to address issue #23525 #23538

Merged
merged 75 commits into from
May 21, 2019
Commits
c2cf269
ENH - first pass at modifying set operations on indexes. Dont ignore …
sds9995 Nov 7, 2018
435e50f
Merge branch 'master' into enh/index_setops
sds9995 Nov 8, 2018
4922fd3
BUG - account for empty index + non-monotonic index, and dont try to …
sds9995 Nov 8, 2018
5e528a1
TST - update existing tests to account for cross type index joins bei…
sds9995 Nov 8, 2018
cdaa5b0
ENH - incompatibility checks and incompatible type unions
sds9995 Nov 9, 2018
40d57ec
TST - update datetime union tets, add tests for inconsistent unions
sds9995 Nov 9, 2018
11fd041
CLN - refactor union -> _union
sds9995 Nov 11, 2018
8f0ace3
TST - add tests for categrorical index, and compatible inconsistent p…
sds9995 Nov 11, 2018
8364c2e
BUG - union -> _union in overriden _union methods
sds9995 Nov 11, 2018
ab329a9
TST - update test_operator raised exception
sds9995 Nov 11, 2018
93486ad
CLN - pep8 line adherence
sds9995 Nov 11, 2018
e435e4c
ENH - reverse polarity of compatibility check and add docstrings
sds9995 Nov 13, 2018
b9787b8
Merge branch 'master' into enh/index_setops
sds9995 Nov 13, 2018
2241b65
TST - add test fixture for index factories and use in test_setops
sds9995 Nov 13, 2018
4daf360
ENH - cast difference result to original dtype to match other index b…
sds9995 Nov 14, 2018
6e5a52b
TST - update interval setop test to account for difference now return…
sds9995 Nov 14, 2018
d344e11
CLN - remove unnecceary code from test
sds9995 Nov 14, 2018
b339bd1
CLN - reorganize some code to make it more readable
sds9995 Nov 14, 2018
85e2db7
CLN - pep8 adherence
sds9995 Nov 29, 2018
cf34960
CLN - pep8 adherence
sds9995 Nov 29, 2018
7150c22
BUG - fix function name
sds9995 Nov 29, 2018
fbb3743
Merge branch 'master' into enh/index_setops
sds9995 Dec 1, 2018
5aa41f6
BUG - fix numeric index compatibility
sds9995 Dec 1, 2018
02d7a3b
BUG - actually fix numeric compatibilty check, with passing index tests
sds9995 Dec 1, 2018
558e182
DOC - initial whatsnew
sds9995 Dec 2, 2018
706f973
ENH - no longer consider category indexes containing different catego…
sds9995 Dec 4, 2018
2ccab59
TST/CLN - no longer need new index_factory fixture and make code more…
sds9995 Dec 4, 2018
c70f1c0
CLN - make code more readable
sds9995 Dec 5, 2018
edb7e9c
CLN - pep8 adherence
sds9995 Dec 5, 2018
84bfbda
Merge branch 'master' into enh/index_setops
sds9995 Dec 5, 2018
aba75fe
DOC - fix whatsnew entry
sds9995 Dec 5, 2018
fc9f138
BUG - chagne object dtype index construction
sds9995 Dec 5, 2018
69cce99
Merge branch 'master' into enh/index_setops
sds9995 Dec 6, 2018
fdfc7d7
CLN/BUG - clean according to failed pandas-dev style checks
sds9995 Dec 6, 2018
42ca70e
CLN - fix imports with isort
sds9995 Dec 7, 2018
5b25645
CLN - refactor tests and remove overriden public union methods
sds9995 Dec 8, 2018
9b1ee7f
Merge branch 'master' into enh/index_setops
sds9995 Dec 8, 2018
fdf9b71
CLN - make code more efficient and cleanup whatsnew
sds9995 Dec 8, 2018
1de3cc8
Merge branch 'master' into enh/index_setops
sds9995 Jan 1, 2019
8ed1093
DOC - fix ipython code block
sds9995 Jan 1, 2019
77ca3a3
DOC - fix whatsnew code blocks again
sds9995 Jan 2, 2019
5921038
CLN - clean up some code, tests and docs
sds9995 Jan 3, 2019
3b94e3b
CLN - reorganize some code and add TODOs
sds9995 Jan 9, 2019
fd4510e
CLN - remove trailing whitespace
ms7463 Jan 14, 2019
345eec1
Merge branch 'master' into enh/index_setops
sds9995 Jan 14, 2019
265a7ee
Merge branch 'enh/index_setops' of https://github.com/ArtinSarraf/pan…
sds9995 Jan 14, 2019
5de3d57
CLN - fix import order
sds9995 Jan 15, 2019
6d82621
CLN - code cleanup, remove unneccesary operations
sds9995 Jan 17, 2019
0af8a24
Merge branch 'master' into enh/index_setops
sds9995 Jan 21, 2019
5a87715
CLN - apply error messages to both statements
sds9995 Jan 21, 2019
a4f9e78
TST - add regex queries
sds9995 Jan 23, 2019
c3c0caa
Merge branch 'master' into enh/index_setops
sds9995 Feb 11, 2019
0bcbdf4
BUG - fix default sort arg
sds9995 Feb 11, 2019
c410625
BUG - remove print
sds9995 Feb 11, 2019
6bb054f
TST/DOC - move to new whatsnew and use local fixture for tests
sds9995 Feb 12, 2019
aea731c
DOC - minor update to get tests to rerun
ms7463 Feb 13, 2019
b5938fc
Merge branch 'master' into enh/index_setops
sds9995 Feb 28, 2019
25452fc
Merge branch 'enh/index_setops' of https://github.com/ArtinSarraf/pan…
sds9995 Feb 28, 2019
0b97a79
Merge branch 'master' into enh/index_setops
ms7463 Mar 1, 2019
bf11c6f
Merge branch 'master' into enh/index_setops
sds9995 Mar 1, 2019
6fd941d
Merge branch 'enh/index_setops' of https://github.com/ArtinSarraf/pan…
sds9995 Mar 1, 2019
32037b5
DOC - fix docstrings and whatsnew
sds9995 Mar 2, 2019
8870006
Merge branch 'master' into enh/index_setops
sds9995 Mar 11, 2019
1d12bc9
DOC - update docstring
sds9995 Mar 11, 2019
92f6707
TST - use tm.assert_index_equal
sds9995 Mar 12, 2019
fbf3242
Merge branch 'master' into enh/index_setops
sds9995 Mar 20, 2019
38d9f74
TST - parametrize union tests
sds9995 Mar 21, 2019
b9e7b18
Merge branch 'master' into enh/index_setops
sds9995 Mar 21, 2019
69aaa93
DOC - add docstring
sds9995 Mar 21, 2019
b57160a
Merge branch 'master' into enh/index_setops
sds9995 Mar 28, 2019
daa1287
Merge branch 'master' into enh/index_setops
sds9995 May 15, 2019
54898c1
CLN/TST - fix super method calls and add error msg
sds9995 May 15, 2019
fa839a9
TST - add Timestamp to regexand fix import sorting
sds9995 May 15, 2019
a36f475
CLN - minor style updates
sds9995 May 16, 2019
b840f49
Merge branch 'master' into enh/index_setops
sds9995 May 21, 2019
27 changes: 27 additions & 0 deletions doc/source/whatsnew/v0.25.0.rst
@@ -154,6 +154,33 @@ returned if all the columns were dummy encoded, and a :class:`DataFrame` otherwi
Providing any ``SparseSeries`` or ``SparseDataFrame`` to :func:`concat` will
cause a ``SparseSeries`` or ``SparseDataFrame`` to be returned, as before.

.. _whatsnew_0250.api_breaking.incompatible_index_unions:

Incompatible Index Type Unions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When performing :func:`Index.union` operations between objects of incompatible dtypes,
the result will be a base :class:`Index` of dtype ``object``. This behavior holds true for
unions between :class:`Index` objects that previously would have been prohibited. The dtype
of empty :class:`Index` objects will now be evaluated before performing union operations
rather than simply returning the other :class:`Index` object. :func:`Index.union` can now be
considered commutative, such that ``A.union(B) == B.union(A)`` (:issue:`23525`).

*Previous Behavior*:

.. code-block:: ipython

    In [1]: pd.period_range('19910905', periods=2).union(pd.Int64Index([1, 2, 3]))
    ...
    ValueError: can only call with other PeriodIndex-ed objects

    In [2]: pd.Index([], dtype=object).union(pd.Index([1, 2, 3]))
    Out[2]: Int64Index([1, 2, 3], dtype='int64')

*New Behavior*:

.. ipython:: python

    pd.period_range('19910905', periods=2).union(pd.Int64Index([1, 2, 3]))
    pd.Index([], dtype=object).union(pd.Index([1, 2, 3]))

``DataFrame`` groupby ffill/bfill no longer return group labels
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
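To make the commutativity claim in the whatsnew entry concrete, here is a small sketch in plain Python. It is a toy model and not pandas code: lists stand in for Index objects, the "dtype" is simply the common element type, and `toy_dtype` and `toy_union` are hypothetical helpers invented for this illustration.

```python
# Toy model only: plain lists stand in for Index objects. Not pandas code.

def toy_dtype(values):
    """Return the single element type of `values`, or `object` if mixed or empty."""
    kinds = {type(v) for v in values}
    return kinds.pop() if len(kinds) == 1 else object


def toy_union(left, right):
    """Union two lists; the result "dtype" is `object` unless both sides agree."""
    left_dtype, right_dtype = toy_dtype(left), toy_dtype(right)
    dtype = left_dtype if left_dtype is right_dtype else object
    seen, merged = set(), []
    for value in list(left) + list(right):
        if value not in seen:
            seen.add(value)
            merged.append(value)
    return merged, dtype


strings, numbers = ["x", "y"], [1, 2]
forward, forward_dtype = toy_union(strings, numbers)
backward, backward_dtype = toy_union(numbers, strings)
# Same elements and same result dtype regardless of operand order.
print(set(forward) == set(backward), forward_dtype is backward_dtype)
```

In this sketch an empty operand is treated as `object`, so its type is taken into account rather than being ignored, which mirrors the second example in the whatsnew entry. Element order still depends on operand order here; only the set of elements and the resulting type are symmetric.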
104 changes: 87 additions & 17 deletions pandas/core/indexes/base.py
@@ -20,11 +20,10 @@
ensure_categorical, ensure_int64, ensure_object, ensure_platform_int,
is_bool, is_bool_dtype, is_categorical, is_categorical_dtype,
is_datetime64_any_dtype, is_datetime64tz_dtype, is_dtype_equal,
is_dtype_union_equal, is_extension_array_dtype, is_float, is_float_dtype,
is_hashable, is_integer, is_integer_dtype, is_interval_dtype, is_iterator,
is_list_like, is_object_dtype, is_period_dtype, is_scalar,
is_signed_integer_dtype, is_timedelta64_dtype, is_unsigned_integer_dtype,
pandas_dtype)
is_extension_array_dtype, is_float, is_float_dtype, is_hashable,
is_integer, is_integer_dtype, is_interval_dtype, is_iterator, is_list_like,
is_object_dtype, is_period_dtype, is_scalar, is_signed_integer_dtype,
is_timedelta64_dtype, is_unsigned_integer_dtype, pandas_dtype)
import pandas.core.dtypes.concat as _concat
from pandas.core.dtypes.generic import (
ABCDataFrame, ABCDateOffset, ABCDatetimeArray, ABCIndexClass,
@@ -2262,6 +2261,47 @@ def _get_reconciled_name_object(self, other):
return self._shallow_copy(name=name)
return self

    def _union_incompatible_dtypes(self, other, sort):
        """
        Casts this and other index to object dtype to allow the formation
        of a union between incompatible types.

        Parameters
        ----------
        other : Index or array-like
        sort : False or None, default False
            Whether to sort the resulting index.

            * False : do not sort the result.
            * None : sort the result, except when `self` and `other` are equal
              or when the values cannot be compared.

        Returns
        -------
        Index
        """
        this = self.astype(object, copy=False)
        # cast to Index for when `other` is list-like
        other = Index(other).astype(object, copy=False)
        return Index.union(this, other, sort=sort).astype(object, copy=False)

    def _is_compatible_with_other(self, other):
        """
        Check whether this and the other dtype are compatible with each other.
        Meaning a union can be formed between them without needing to be cast
        to dtype object.

        Parameters
        ----------
        other : Index or array-like

        Returns
        -------
        bool
        """
        return (type(self) is type(other)
                and is_dtype_equal(self.dtype, other.dtype))

def _validate_sort_keyword(self, sort):
if sort not in [None, False]:
raise ValueError("The 'sort' keyword only takes the values of "
@@ -2271,6 +2311,11 @@ def union(self, other, sort=None):
"""
Form the union of two Index objects.

If the Index objects are incompatible, both Index objects will be
cast to dtype('object') first.

.. versionchanged:: 0.25.0

Parameters
----------
other : Index or array-like
@@ -2300,30 +2345,54 @@ def union(self, other, sort=None):
Examples
--------

Union matching dtypes

>>> idx1 = pd.Index([1, 2, 3, 4])
>>> idx2 = pd.Index([3, 4, 5, 6])
>>> idx1.union(idx2)
Int64Index([1, 2, 3, 4, 5, 6], dtype='int64')

Union mismatched dtypes

>>> idx1 = pd.Index(['a', 'b', 'c', 'd'])
>>> idx2 = pd.Index([1, 2, 3, 4])
>>> idx1.union(idx2)
Index(['a', 'b', 'c', 'd', 1, 2, 3, 4], dtype='object')
"""
self._validate_sort_keyword(sort)
self._assert_can_do_setop(other)
other = ensure_index(other)

if len(other) == 0 or self.equals(other):
if not self._is_compatible_with_other(other):
return self._union_incompatible_dtypes(other, sort=sort)

return self._union(other, sort=sort)

def _union(self, other, sort):
"""
Specific union logic should go here. In subclasses, union behavior
should be overridden here rather than in `self.union`.

Parameters
----------
other : Index or array-like
sort : False or None, default False
Whether to sort the resulting index.

* False : do not sort the result.
* None : sort the result, except when `self` and `other` are equal
or when the values cannot be compared.

Returns
-------
Index
"""

if not len(other) or self.equals(other):
return self._get_reconciled_name_object(other)

if len(self) == 0:
if not len(self):
return other._get_reconciled_name_object(self)

# TODO: is_dtype_union_equal is a hack around
# 1. buggy set ops with duplicates (GH #13432)
# 2. CategoricalIndex lacking setops (GH #10186)
# Once those are fixed, this workaround can be removed
if not is_dtype_union_equal(self.dtype, other.dtype):
this = self.astype('O')
other = other.astype('O')
return this.union(other, sort=sort)

# TODO(EA): setops-refactor, clean all this up
if is_period_dtype(self) or is_datetime64tz_dtype(self):
lvals = self._ndarray_values
@@ -2370,6 +2439,7 @@ def union(self, other, sort=None):
def _wrap_setop_result(self, other, result):
return self._constructor(result, name=get_op_result_name(self, other))

# TODO: standardize return type of non-union setops type(self vs other)
def intersection(self, other, sort=False):
"""
Form the intersection of two Index objects.
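The split between the public `union` and the private `_union` hook in the base.py changes above can be sketched as a template-method pattern. This is a minimal illustration under stated assumptions, not the pandas implementation: `ToyIndex` and `ToyRangeIndex` are hypothetical classes, dtypes are plain strings, and sorting and validation are left out.

```python
class ToyIndex:
    """Hypothetical stand-in for Index; it only models the union dispatch."""

    def __init__(self, values, dtype):
        self.values = list(values)
        self.dtype = dtype

    def _is_compatible_with_other(self, other):
        # Same class and same dtype: no cast to object is needed.
        return type(self) is type(other) and self.dtype == other.dtype

    def _union_incompatible_dtypes(self, other):
        # Cast both sides to the generic "object" dtype, then union them.
        this = ToyIndex(self.values, "object")
        that = ToyIndex(other.values, "object")
        return this._union(that)

    def union(self, other):
        # Public entry point: check compatibility, then dispatch.
        if not self._is_compatible_with_other(other):
            return self._union_incompatible_dtypes(other)
        return self._union(other)

    def _union(self, other):
        # Generic logic. Subclasses override this hook, not `union`.
        merged = list(dict.fromkeys(self.values + other.values))
        return type(self)(merged, self.dtype)


class ToyRangeIndex(ToyIndex):
    def _union(self, other):
        # Specialised path for this subclass: keep the result sorted.
        merged = sorted(set(self.values) | set(other.values))
        return type(self)(merged, self.dtype)


same = ToyRangeIndex([3, 1], "int64").union(ToyRangeIndex([2, 1], "int64"))
mixed = ToyRangeIndex([3, 1], "int64").union(ToyIndex(["a"], "object"))
print(type(same).__name__, same.values)
print(type(mixed).__name__, mixed.dtype, mixed.values)
```

Because the compatibility check lives in one public method, a subclass that overrides only `_union` cannot accidentally skip the fallback to the object dtype.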
34 changes: 4 additions & 30 deletions pandas/core/indexes/datetimes.py
@@ -451,35 +451,9 @@ def _formatter_func(self):
# --------------------------------------------------------------------
# Set Operation Methods

def union(self, other, sort=None):
"""
Specialized union for DatetimeIndex objects. If combine
overlapping ranges with the same DateOffset, will be much
faster than Index.union

Parameters
----------
other : DatetimeIndex or array-like
sort : bool or None, default None
Whether to sort the resulting Index.

* None : Sort the result, except when

1. `self` and `other` are equal.
2. `self` or `other` has length 0.
3. Some values in `self` or `other` cannot be compared.
A RuntimeWarning is issued in this case.

* False : do not sort the result

.. versionadded:: 0.25.0

Returns
-------
y : Index or DatetimeIndex
"""
self._validate_sort_keyword(sort)
self._assert_can_do_setop(other)
def _union(self, other, sort):
if not len(other) or self.equals(other) or not len(self):
return super()._union(other, sort=sort)

if len(other) == 0 or self.equals(other) or len(self) == 0:
return super().union(other, sort=sort)
@@ -495,7 +469,7 @@ def union(self, other, sort=None):
if this._can_fast_union(other):
return this._fast_union(other, sort=sort)
else:
result = Index.union(this, other, sort=sort)
result = Index._union(this, other, sort=sort)
if isinstance(result, DatetimeIndex):
# TODO: we shouldn't be setting attributes like this;
# in all the tests this equality already holds
26 changes: 12 additions & 14 deletions pandas/core/indexes/interval.py
@@ -964,19 +964,6 @@ def insert(self, loc, item):
new_right = self.right.insert(loc, right_insert)
return self._shallow_copy(new_left, new_right)

def _as_like_interval_index(self, other):
self._assert_can_do_setop(other)
other = ensure_index(other)
if not isinstance(other, IntervalIndex):
msg = ('the other index needs to be an IntervalIndex too, but '
'was type {}').format(other.__class__.__name__)
raise TypeError(msg)
elif self.closed != other.closed:
msg = ('can only do set operations between two IntervalIndex '
'objects that are closed on the same side')
raise ValueError(msg)
return other

def _concat_same_dtype(self, to_concat, name):
"""
assert that we all have the same .closed
@@ -1092,7 +1079,17 @@ def overlaps(self, other):

def _setop(op_name, sort=None):
def func(self, other, sort=sort):
other = self._as_like_interval_index(other)
self._assert_can_do_setop(other)
other = ensure_index(other)
if not isinstance(other, IntervalIndex):
result = getattr(self.astype(object), op_name)(other)
if op_name in ('difference',):
result = result.astype(self.dtype)
return result
elif self.closed != other.closed:
msg = ('can only do set operations between two IntervalIndex '
'objects that are closed on the same side')
raise ValueError(msg)

# GH 19016: ensure set op will not return a prohibited dtype
subtypes = [self.dtype.subtype, other.dtype.subtype]
@@ -1114,6 +1111,7 @@ def func(self, other, sort=sort):

return type(self).from_tuples(result, closed=self.closed,
name=result_name)

return func

@property
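The `_setop` factory in interval.py builds each set operation from its name and, after this change, falls back to a generic operation when `other` is not an IntervalIndex instead of raising. A rough sketch of that shape follows. It assumes a hypothetical `ToyIntervalIndex` that stores `(left, right)` tuples in a Python set, and it uses a plain `set` as a stand-in for an object-dtype result; none of these names come from pandas.

```python
class ToyIntervalIndex:
    """Hypothetical stand-in: intervals are (left, right) tuples in a set."""

    def __init__(self, intervals, closed="right"):
        self.intervals = set(intervals)
        self.closed = closed

    def _setop(op_name):
        # Factory: returns a method that applies the named set operation.
        def func(self, other):
            if not isinstance(other, ToyIntervalIndex):
                # Generic fallback instead of raising TypeError.
                result = getattr(self.intervals, op_name)(set(other))
                if op_name == "difference":
                    # A difference only contains our own elements, so it can
                    # be wrapped back into the original type.
                    return ToyIntervalIndex(result, closed=self.closed)
                return result
            if self.closed != other.closed:
                raise ValueError("can only do set operations between two "
                                 "objects that are closed on the same side")
            result = getattr(self.intervals, op_name)(other.intervals)
            return ToyIntervalIndex(result, closed=self.closed)

        return func

    union = _setop("union")
    intersection = _setop("intersection")
    difference = _setop("difference")


left = ToyIntervalIndex([(0, 1), (1, 2)])
right = ToyIntervalIndex([(1, 2), (2, 3)])
print(sorted(left.union(right).intervals))
print(sorted(left.difference([(0, 1)]).intervals))
```

Defining the factory once keeps the type check and the `closed` check identical across all generated operations.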
8 changes: 8 additions & 0 deletions pandas/core/indexes/numeric.py
@@ -9,6 +9,7 @@
is_bool, is_bool_dtype, is_dtype_equal, is_extension_array_dtype, is_float,
is_integer_dtype, is_scalar, needs_i8_conversion, pandas_dtype)
import pandas.core.dtypes.concat as _concat
from pandas.core.dtypes.generic import ABCInt64Index, ABCRangeIndex
from pandas.core.dtypes.missing import isna

from pandas.core import algorithms
@@ -221,6 +222,13 @@ def _assert_safe_casting(cls, data, subarr):
raise TypeError('Unsafe NumPy casting, you must '
'explicitly cast')

    def _is_compatible_with_other(self, other):
        return (
            super()._is_compatible_with_other(other)
            or all(isinstance(obj, (ABCInt64Index, ABCRangeIndex))
                   for obj in [self, other])
        )


Int64Index._add_numeric_methods()
Int64Index._add_logical_methods()
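The numeric override above widens the base compatibility rule so that two different integer-backed index classes can still be unioned without a cast to object. The sketch below shows the same "base rule OR both operands belong to an allowed group" logic. All class names are hypothetical, and the two integer classes are modelled as siblings purely for illustration.

```python
class ToyIndex:
    def __init__(self, dtype):
        self.dtype = dtype

    def _is_compatible_with_other(self, other):
        # Base rule: identical class and identical dtype.
        return type(self) is type(other) and self.dtype == other.dtype


class ToyNumericIndex(ToyIndex):
    def _is_compatible_with_other(self, other):
        # Base rule first; otherwise accept any pairing of the two
        # integer-backed classes defined below.
        return (
            super()._is_compatible_with_other(other)
            or all(isinstance(obj, (ToyInt64Index, ToyRangeIndex))
                   for obj in (self, other))
        )


class ToyInt64Index(ToyNumericIndex):
    pass


class ToyRangeIndex(ToyNumericIndex):
    pass


class ToyFloat64Index(ToyNumericIndex):
    pass


ints = ToyInt64Index("int64")
print(ints._is_compatible_with_other(ToyRangeIndex("int64")))
print(ints._is_compatible_with_other(ToyFloat64Index("float64")))
```

Note that the sketch calls `isinstance` on the objects themselves rather than on `type(obj)`, which is the conventional way to test class membership.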
12 changes: 8 additions & 4 deletions pandas/core/indexes/period.py
@@ -791,6 +791,11 @@ def join(self, other, how='left', level=None, return_indexers=False,
"""
self._assert_can_do_setop(other)

if not isinstance(other, PeriodIndex):
return self.astype(object).join(other, how=how, level=level,
return_indexers=return_indexers,
sort=sort)

result = Int64Index.join(self, other, how=how, level=level,
return_indexers=return_indexers,
sort=sort)
@@ -807,10 +812,9 @@ def intersection(self, other, sort=False):
def _assert_can_do_setop(self, other):
super()._assert_can_do_setop(other)

if not isinstance(other, PeriodIndex):
raise ValueError('can only call with other PeriodIndex-ed objects')

if self.freq != other.freq:
# *Can't* use PeriodIndexes of different freqs
# *Can* use PeriodIndex/DatetimeIndex
if isinstance(other, PeriodIndex) and self.freq != other.freq:
msg = DIFFERENT_FREQ.format(cls=type(self).__name__,
own_freq=self.freqstr,
other_freq=other.freqstr)
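The period.py change narrows validation: a non-PeriodIndex operand no longer raises, and only two PeriodIndex objects with different frequencies are rejected. A minimal sketch of that narrowed check is below, assuming a hypothetical `ToyPeriodIndex` whose values are plain strings and whose generic result is a plain list.

```python
class ToyPeriodIndex:
    """Hypothetical stand-in for PeriodIndex: values plus a frequency tag."""

    def __init__(self, values, freq):
        self.values = list(values)
        self.freq = freq

    def _assert_can_do_setop(self, other):
        # Only a same-class operand with a different frequency is rejected.
        if isinstance(other, ToyPeriodIndex) and self.freq != other.freq:
            raise ValueError(
                "frequency mismatch: {} vs {}".format(self.freq, other.freq))

    def union(self, other):
        self._assert_can_do_setop(other)
        if not isinstance(other, ToyPeriodIndex):
            # Any other operand is handled as generic objects.
            return list(dict.fromkeys(self.values + list(other)))
        merged = list(dict.fromkeys(self.values + other.values))
        return ToyPeriodIndex(merged, self.freq)


monthly = ToyPeriodIndex(["2019-01", "2019-02"], "M")
print(monthly.union([1, 2]))
```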
10 changes: 4 additions & 6 deletions pandas/core/indexes/range.py
@@ -470,7 +470,7 @@ def _extended_gcd(self, a, b):
old_t, t = t, old_t - quotient * t
return old_r, old_s, old_t

def union(self, other, sort=None):
def _union(self, other, sort):
"""
Form the union of two Index objects and sort if possible

@@ -490,9 +490,8 @@ def union(self, other, sort=None):
-------
union : Index
"""
Contributor: is this needed any longer?

Contributor Author: see related comment in DatetimeIndex module

self._assert_can_do_setop(other)
if len(other) == 0 or self.equals(other) or len(self) == 0:
return super().union(other, sort=sort)
if not len(other) or self.equals(other) or not len(self):
return super()._union(other, sort=sort)

if isinstance(other, RangeIndex) and sort is None:
start_s, step_s = self._start, self._step
@@ -530,8 +529,7 @@ def union(self, other, sort=None):
(start_s + step_o >= start_o) and
(end_s - step_o <= end_o)):
return RangeIndex(start_r, end_r + step_o, step_o)

return self._int64index.union(other, sort=sort)
return self._int64index._union(other, sort=sort)

@Appender(_index_shared_docs['join'])
def join(self, other, how='left', level=None, return_indexers=False,
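The range.py hunk keeps a fast path that returns another RangeIndex when the union of two ranges is itself a range, and otherwise defers to the Int64Index `_union`. The idea can be sketched with Python's built-in `range`. This is a deliberately simplified version that handles only one case (equal positive steps, aligned starts, no gap); `toy_range_union` is a hypothetical helper and does not reproduce the full set of conditions in the diff.

```python
def toy_range_union(a, b):
    """Return a `range` equal to the union of `a` and `b`, or None.

    Simplified sketch: only non-empty ranges with the same positive step.
    """
    if not a or not b or a.step != b.step or a.step <= 0:
        return None
    step = a.step
    if (a.start - b.start) % step != 0:
        # Elements would interleave, so this simple rule does not apply.
        return None
    if a[-1] + step < b[0] or b[-1] + step < a[0]:
        # There is a gap between the two ranges.
        return None
    return range(min(a[0], b[0]), max(a[-1], b[-1]) + step, step)


print(toy_range_union(range(0, 5), range(3, 8)))
print(toy_range_union(range(0, 4, 2), range(8, 12, 2)))
```

A `None` result corresponds to the fallback branch, where a caller would compute the union element by element instead.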
21 changes: 3 additions & 18 deletions pandas/core/indexes/timedeltas.py
@@ -329,24 +329,9 @@ def astype(self, dtype, copy=True):
return Index(result.astype('i8'), name=self.name)
return DatetimeIndexOpsMixin.astype(self, dtype, copy=copy)

def union(self, other):
"""
Specialized union for TimedeltaIndex objects. If combine
overlapping ranges with the same DateOffset, will be much
faster than Index.union

Parameters
----------
other : TimedeltaIndex or array-like

Returns
-------
y : Index or TimedeltaIndex
"""
self._assert_can_do_setop(other)

def _union(self, other, sort):
if len(other) == 0 or self.equals(other) or len(self) == 0:
return super().union(other)
return super()._union(other, sort=sort)

if not isinstance(other, TimedeltaIndex):
try:
@@ -358,7 +343,7 @@ def union(self, other):
if this._can_fast_union(other):
return this._fast_union(other)
else:
result = Index.union(this, other)
result = Index._union(this, other, sort=sort)
if isinstance(result, TimedeltaIndex):
if result.freq is None:
result.freq = to_offset(result.inferred_freq)