
Conversation

toobaz (Member) commented May 4, 2017

The first commit is just #16213.

The others reorganize the code for Index.append() a bit.

@toobaz toobaz changed the title Index append ref Index append refactoring May 4, 2017
@toobaz toobaz changed the title Index append refactoring Index.append() refactoring May 4, 2017
codecov bot commented May 4, 2017

Codecov Report

Merging #16236 into master will decrease coverage by 0.03%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16236      +/-   ##
==========================================
- Coverage   90.24%   90.21%   -0.04%     
==========================================
  Files         164      164              
  Lines       50894    50920      +26     
==========================================
+ Hits        45930    45938       +8     
- Misses       4964     4982      +18
Flag Coverage Δ
#multiple 88% <100%> (-0.02%) ⬇️
#single 40.3% <65%> (-0.1%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/category.py 98.48% <100%> (ø) ⬆️
pandas/core/indexes/base.py 95.69% <100%> (-0.06%) ⬇️
pandas/core/indexes/range.py 92.69% <100%> (+0.57%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/indexes/datetimelike.py 94.88% <0%> (-1.78%) ⬇️
pandas/core/indexes/interval.py 91.82% <0%> (-0.77%) ⬇️
pandas/core/common.py 90.68% <0%> (-0.35%) ⬇️
pandas/core/frame.py 97.58% <0%> (-0.1%) ⬇️
pandas/_version.py 44.65% <0%> (+1.9%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2002da3...5d734ad. Read the comment docs.

codecov bot commented May 4, 2017

Codecov Report

Merging #16236 into master will decrease coverage by 0.02%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16236      +/-   ##
==========================================
- Coverage   91.01%   90.98%   -0.03%     
==========================================
  Files         162      162              
  Lines       49567    49565       -2     
==========================================
- Hits        45111    45099      -12     
- Misses       4456     4466      +10
Flag Coverage Δ
#multiple 88.77% <100%> (-0.01%) ⬇️
#single 40.25% <68.57%> (-0.06%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/base.py 95.87% <100%> (-0.06%) ⬇️
pandas/core/dtypes/concat.py 98.28% <100%> (+0.2%) ⬆️
pandas/core/indexes/range.py 92.18% <100%> (-0.63%) ⬇️
pandas/core/indexes/category.py 98.53% <100%> (ø) ⬆️
pandas/core/indexes/datetimelike.py 96.66% <100%> (ø) ⬆️
pandas/core/indexes/interval.py 92.61% <100%> (ø) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.72% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d0d28fe...554ee79. Read the comment docs.

@jreback jreback added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label May 4, 2017
return res

@classmethod
def _concat(self, to_concat):
Contributor

so things like this should be handled in pandas.core.dtypes.concat, which is our general "put things of possibly different dtype together" module. There are already routines there that handle Series, so this should fit in without much work.

Member Author

Having Index concatenation as an Index method looks definitely cleaner to me; anyway OK, it's certainly not much work.

Contributor

and then you have the same exact problem in that the code lives in multiple unrelated places. Please use the pattern I suggest.

@toobaz toobaz force-pushed the index_append_ref branch from 842466e to 012a88b Compare May 5, 2017 15:59


def _concat_indexes(to_concat, default=None):
"""
Contributor

so this is good, but can you use _concat_compat which basically does all of the combination logic, rather than have it live IN the Index classes themselves? (you would simply wrap it to return an Index, rather than a Series as is done elsewhere).

My goal is to centralize all of this code, and NOT decentralize it to the Index whenever possible.
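
(For reference, a minimal sketch of the wrapping pattern suggested here, assuming the central helper is the _concat_compat routine in pandas.core.dtypes.concat named above; the wrapper name below is illustrative and not code from this PR:)

# Hypothetical sketch: the dtype-combination logic stays in
# pandas.core.dtypes.concat, and the Index layer only rebuilds an Index
# (rather than a Series) from the result. Later pandas versions expose the
# helper as concat_compat instead of _concat_compat.
import pandas as pd
from pandas.core.dtypes import concat as _concat

def _concat_indexes_via_compat(to_concat, name=None):
    values = _concat._concat_compat([idx._values for idx in to_concat])
    return pd.Index(values, name=name)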

# if name is None, _create_from_codes sets self.name
result.name = name
return result
compat = [to_concat[0]._is_dtype_compat(c) for c in to_concat]
Contributor

this is replicating what _concat_categorical is doing

return super(RangeIndex, self).join(other, how, level, return_indexers,
sort)

@classmethod
Contributor
@jreback jreback May 6, 2017

even this could live in dtypes.concat. It would deal in slices. Again you could have a simple wrapper to return an Index.

dti2 = dti.tz_convert(None)
tm.assert_numpy_array_equal(dti2.asi8, dti.asi8)

@pytest.mark.xfail(reason='See gh-16234')
Contributor

?

jreback (Contributor) commented Jun 10, 2017

can you rebase and update?

jreback (Contributor) commented Jul 26, 2017

Got lost, sorry. Please rebase / update.

toobaz (Member Author) commented Jul 27, 2017

My goal is to centralize all of this code, and NOT decentralize it to the Index whenever possible.

My fear is that it is difficult to move code that makes clever use of overrides into dtypes.concat; on the other hand, we lose all the benefits of inheritance if we don't make clever use of overrides (we already have too much code with sequences of if is_some_dtype(idx): _call_specialized_method_for_such_dtype, which harms readability).

Anyway, I will rebase and try to move and replace with wrappers as much logic as I can.

pep8speaks commented Aug 10, 2017

Hello @toobaz! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on August 21, 2017 at 15:24 Hours UTC

toobaz (Member Author) commented Aug 10, 2017

OK, this new version is much simpler, and still removes all code duplication from #16213.

start = obj._start
step = obj._step
stop = obj._stop if next is None else next
return indexes[0].__class__(start, stop, step)
Member Author

This is ugly... but the only alternative I see (an import inside the function) is uglier.

Contributor

its ok here

Member
@jorisvandenbossche jorisvandenbossche Aug 21, 2017

One alternative would be to return the (start, stop, step) and do the construction in RangeIndex._append_same_dtype (then also the rename there is not needed)

Ah no, this wouldn't play nice with the case when no range index is returned, ignore this

toobaz (Member Author) commented Aug 16, 2017

@jreback ping

Contributor
@jreback jreback left a comment

looks pretty good, some stylistic comments. I think your reorg is fine.

return isinstance(arr, ABCCategorical) or is_categorical_dtype(arr)


def is_range(arr):
Contributor

this is not similar to the other methods, which detect a type. A RangeIndex is not a type.

Member Author

This is only called in get_dtype_kinds, see below, so I could have just used isinstance(arr, ABCRangeIndex). But I tried to be consistent with is_categorical and is_sparse... what exactly is a "type"?!

Contributor

you are conflating a type with an Index. is_categorical will detect a categorical type, which could be a Categorical (or dtype == 'category'); a CategoricalIndex happens to be of this type as well.

However RangeIndex is simply an Index subclassing Int64Index. It's not a type (its dtype is int64). Types can be the dtype of an Index.

Member Author

In [3]: pd.Series([1,2,3]).to_sparse().dtype
Out[3]: dtype('int64')

... still,

In [4]: _concat.get_dtype_kinds([pd.Series([1,2,3]).to_sparse()])
Out[4]: {'sparse'}

... so the "type" returned by get_dtype_kinds is already not the dtype, and not even the dtype.kind.

But anyway, I don't particularly care about changing get_dtype_kinds. We just need a method which can tell us whether two indexes can be natively concatenated: this is currently get_dtype_kinds, but I can write a new one if you prefer.

# if to_concat contains different tz,
# the result must be object dtype
typ = str(arr.dtype)
elif is_range(arr):
Contributor

this is very strange. why are you adding this here

Contributor

you are directly calling the routine (and you don't handle typ='range' anywhere), so not sure this is even hit

Member Author

This is required so that Index._concat distinguishes between RangeIndex and Int64Index, as they should not be treated as equal (e.g. when appending). The 'range' kind is indeed an arbitrary label.

Contributor

ok I see, then you should be explicit about testing what you need, e.g. ABCInt64Index (rather than adding a helper function which serves no other purpose)

Member Author

Index._concat has no idea of the specific types of indexes, and rightly so... it uses get_dtype_kinds just to test whether different types of indexes are being concatenated.

Contributor

you can simply check isinstance(arr, ABCRangeIndex); you are special-casing this, so I don't find this a problem. We don't have a special 'type' for this index, so is_int64_dtype would not work here.

Member Author

I'm fine with using isinstance(arr, ABCRangeIndex) in get_dtype_kinds... but I'm not special-casing Index._concat: I'm only overriding RangeIndex._append_same_dtype. So I need get_dtype_kinds to (also) distinguish RangeIndex. Isn't the point of get_dtype_kinds precisely to distinguish stuff which can be concatenated "natively" together?!

Member Author
@toobaz toobaz Aug 20, 2017

We don't have a special 'type' for this index so is_int64_dtype would not work here.

I had missed this, and hence maybe our misunderstanding: pd.RangeIndex(3).dtype is dtype('int64')... which makes sense, but is not what we want _concat.get_dtype_kinds to consider.

Member

Yep, if it is for a single call, I would just do the isinstance(arr, ABCRangeIndex) here instead of defining the new function
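
(As an illustration of that suggestion, an abbreviated, hypothetical sketch of the branch under discussion; the real get_dtype_kinds handles many more kinds, such as categorical, sparse and datetime-with-tz:)

# Hypothetical, simplified sketch of the RangeIndex special case being discussed.
from pandas.core.dtypes.generic import ABCRangeIndex

def get_dtype_kinds_sketch(arrays):
    typs = set()
    for arr in arrays:
        if isinstance(arr, ABCRangeIndex):
            # RangeIndex reports dtype int64, so a dtype check alone cannot
            # distinguish it from Int64Index; an explicit isinstance check
            # makes 'range' mix only with other RangeIndexes.
            typs.add('range')
        else:
            typs.add(str(arr.dtype))
    return typs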



def _concat_indexes_same_dtype_rangeindex(indexes):

Contributor

can you add a comment about what this is doing / guarantees? An example would be nice as well.
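
(Roughly, the guarantee in question: if all inputs are RangeIndex and they line up into a single arithmetic progression, the result stays a RangeIndex; otherwise it falls back to a plain integer index. A small illustrative session, assuming a pandas version that includes this optimization; the exact fallback class has varied between Int64Index and Index across versions:)

import pandas as pd

a = pd.RangeIndex(0, 3)     # 0, 1, 2
b = pd.RangeIndex(3, 6)     # 3, 4, 5
c = pd.RangeIndex(10, 12)   # 10, 11

# Consecutive ranges with the same step fuse into a single RangeIndex:
a.append(b)                 # RangeIndex(start=0, stop=6, step=1)

# Non-consecutive ranges cannot be fused, so the result falls back to int64:
a.append(c)                 # e.g. Int64Index([0, 1, 2, 10, 11], dtype='int64')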


Member
@jorisvandenbossche jorisvandenbossche left a comment

Looks good! Some minor comments

return CategoricalIndex._append_same_dtype(self, to_concat, name)
return self._concat(to_concat, name)

def _concat(self, to_concat, name):
Member

can you call this _append? (then it is more in line with _append_same_dtype)

Member Author

Actually, I think it would make more sense to change _append_same_dtype to _concat_same_dtype (also in IntervalIndex, DatetimeIndex, CategoricalIndex), since it already disregards self (it is conceptually a @classmethod). Shall I proceed?

Member

since it is only used by append, I prefer using append in the name, but no strong feelings

Member Author

You are right that it's currently used only by append, but usually you expect x.append(y) to concatenate x to y or to the elements of y; instead this only concatenates the elements of y. So, since you don't object, I will go with my proposal.

Member

instead this only concatenates elements of y

in the end it is used to concatenate y to x as well; it is just that append passes things to this helper function like that. So it is still only used for append.

Member Author

So it is still only used for append.

Sure, I don't object to that. We can agree it is a concat operation used to implement appending: the switch happens when append(self, other) does to_concat = [self] + list(other).
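
(For illustration, the switch described above written out with a toy example; simplified, since the real append also validates its inputs and resolves the result name:)

import pandas as pd

x = pd.Index([1, 2])
others = [pd.Index([3]), pd.Index([4, 5])]

# Inside append(self, other), the switch to a pure concat operation is this step:
to_concat = [x] + list(others)

# ...after which a single helper concatenates all elements of to_concat:
x.append(others)            # Int64Index/Index([1, 2, 3, 4, 5], dtype='int64'), depending on version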



def _concat_indexes_same_dtype_rangeindex(indexes):
# Concatenates multiple RangeIndex instances. All members of "indexes" must
# be of type RangeIndex; result will be RangeIndex if possible, Int64Index
Member

You can put this in a 'normal' docstring using """

return result


def _concat_indexes_same_dtype_rangeindex(indexes):
Member

maybe _concat_rangeindex_same_dtype? (a little bit shorter and also clear, I think)

non_consecutive = ((step != obj._step and len(obj) > 1) or
(next is not None and obj._start != next))
if non_consecutive:
# Not nice... but currently what happens in NumericIndex:
Member

is this comment needed?

Member Author
@toobaz toobaz Aug 21, 2017

I see it as a reminder: I would have liked to use

return Int64Index._append_same_dtype([ix.astype(int) for ix in indexes])

... but then, numeric indexes currently do not special-case _append_same_dtype, so we end up calling _concat_index_asobject anyway.

But I can remove it, and just take a note of this TODO.

Member

I see, OK to keep it then, but maybe make it a bit more informative (or remove the 'not nice', and just say that it is what is used by Int64Index._append_same_dtype)

(I also don't think there would be that much of a gain from making a special-cased one for integers)


non_consecutive = ((step != obj._step and len(obj) > 1) or
(next is not None and obj._start != next))
if non_consecutive:
# Int64Index._append_same_dtype([ix.astype(int) for ix in indexes])
Contributor

can you explain this

Member Author

See #17307

Member

Merged this, you can update this line in the other PR if we merge that

@jreback jreback added this to the 0.21.0 milestone Aug 21, 2017
jreback (Contributor) commented Aug 21, 2017

lgtm. Just explain that a bit more where I indicated.

@jorisvandenbossche jorisvandenbossche changed the title Index.append() refactoring CLN: Index.append() refactoring Aug 22, 2017
@jorisvandenbossche jorisvandenbossche merged commit 2f00159 into pandas-dev:master Aug 22, 2017
@toobaz toobaz deleted the index_append_ref branch August 22, 2017 08:42