-
-
Notifications
You must be signed in to change notification settings - Fork 19.3k
CLN: Index.append() refactoring #16236
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CLN: Index.append() refactoring #16236
Conversation
Codecov Report
@@ Coverage Diff @@
## master #16236 +/- ##
==========================================
- Coverage 90.24% 90.21% -0.04%
==========================================
Files 164 164
Lines 50894 50920 +26
==========================================
+ Hits 45930 45938 +8
- Misses 4964 4982 +18
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #16236 +/- ##
==========================================
- Coverage 91.01% 90.98% -0.03%
==========================================
Files 162 162
Lines 49567 49565 -2
==========================================
- Hits 45111 45099 -12
- Misses 4456 4466 +10
Continue to review full report at Codecov.
|
pandas/core/indexes/base.py
Outdated
| return res | ||
|
|
||
| @classmethod | ||
| def _concat(self, to_concat): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so things like this should be handled in pandas.core.dtypes.concat, which is our general: put these things of possibly different dtype together. There are already routines there that handle Series so this should fit in w/o much work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having Index concatenation as an Index method looks to me definitely cleaner, anyway OK, certainly not much work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and then you have the same exact problem in that the code lives in multiple unrelaled places. Please use the pattern I suggest.
|
|
||
|
|
||
| def _concat_indexes(to_concat, default=None): | ||
| """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so this is good, but can you use _concat_compat which basically does all of the combination logic, rather than have it live IN the Index classes themselves? (you would simply wrap it to return an Index, rather than a Series as is done elsewhere).
My goal is to centralize all of this code, and NOT decentralize it to the Index whenever possible.
pandas/core/indexes/category.py
Outdated
| # if name is None, _create_from_codes sets self.name | ||
| result.name = name | ||
| return result | ||
| compat = [to_concat[0]._is_dtype_compat(c) for c in to_concat] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is replicating what _concat_categorical is doing
pandas/core/indexes/range.py
Outdated
| return super(RangeIndex, self).join(other, how, level, return_indexers, | ||
| sort) | ||
|
|
||
| @classmethod |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
even this could live in dtypes.concat. It would deal in slices. Again you could have a simple wrapper to return an Index.
| dti2 = dti.tz_convert(None) | ||
| tm.assert_numpy_array_equal(dti2.asi8, dti.asi8) | ||
|
|
||
| @pytest.mark.xfail(reason='See gh-16234') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
|
can you rebase and update? |
|
got lost sorry.. pls rebase / update. |
My fear is that it is difficult to move to Anyway, I will rebase and try to move and replace with wrappers as much logic as I can. |
012a88b to
059c948
Compare
|
Hello @toobaz! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on August 21, 2017 at 15:24 Hours UTC |
059c948 to
c47e081
Compare
|
OK, this new version is much simpler, and still removes all code duplication from #16213 . |
| start = obj._start | ||
| step = obj._step | ||
| stop = obj._stop if next is None else next | ||
| return indexes[0].__class__(start, stop, step) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is ugly... but the only alternative I see (an import inside the function) is uglier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
its ok here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One alternative would be is to return the (start, stop, step) and do the construction in RangeIndex._append_same_dtype (then also the rename there is not needed)
Ah no, this wouldn't play nice with the case when no range index is returned, ignore this
c47e081 to
d5c7d77
Compare
|
@jreback ping |
jreback
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks pretty good, some stylistc comments. I think your reorg is fine.
pandas/core/dtypes/common.py
Outdated
| return isinstance(arr, ABCCategorical) or is_categorical_dtype(arr) | ||
|
|
||
|
|
||
| def is_range(arr): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not similar to the other methods, which detect a type. a RangeIndex is not a type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is only called in get_dtype_kinds, see below, so I could have just used isinstance(arr, ABCRangeIndex). But I tried to be coherent with is_categorical and is_sparse... what exactly is a "type"?!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are conflating a type with an Index. is_categorical will detect a categorical type which could be a Categorical (or the dtype=='category'), a CategoricalIndex happens to be of this type as well.
However RangeIndex is simply an Index subclassing Int64Index. its not a type (its dtype is int64). types can be the dtype of an Index.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In [3]: pd.Series([1,2,3]).to_sparse().dtype
Out[3]: dtype('int64')... still,
In [4]: _concat.get_dtype_kinds([pd.Series([1,2,3]).to_sparse()])
Out[4]: {'sparse'}... so the type returned by get_dtype_kinds is already not the dtype, and not even the dtype.kind.
But anyway, I don't particularly care about changing get_dtype_kinds. We just need a method which can tell us whether two indexes can be natively concatenated: this is currently get_dtype_kinds, but I can write a new one if you prefer.
pandas/core/dtypes/concat.py
Outdated
| # if to_concat contains different tz, | ||
| # the result must be object dtype | ||
| typ = str(arr.dtype) | ||
| elif is_range(arr): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is very strange. why are you adding this here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are directly calling the routine (and you don't handle typ='range' anywhere), so not sure this is even hit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is required so that Index._concat distinguishes between RangeIndex and Int64Index - as they should not be treated as equal (e.g. when appending). The range is indeed an arbitrary label.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok I see, then you should be explicit about testing what you need, e.g. ABCInt64Inde,
(rather than adding a helper function which serves no other purpose)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Index._concat has no idea of the specific types of indexes, and rightly so... it uses get_dtype_kinds just to test whether different types of indexes are being concatenated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can simply check isinstance(arr, ABCRangeIndex), you are special casing this so I don't find this a problem. We don't have a special 'type' for this index so is_int64_dtype would not work here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with using isinstance(arr, ABCRangeIndex) in get_dtype_kinds... but I'm not special casing Index._concat: I'm only overwriting RangeIndex._append_same_dtype. So I need get_dtype_kinds to (also) distinguish RangeIndex. Isn't the point of get_dtype_kinds precisely to distinguish stuff which can be concatenated "natively" together?!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't have a special 'type' for this index so is_int64_dtype would not work here.
I had missed this, and hence maybe our misunderstanding: pd.RangeIndex(3).dtype is dtype('int64')... which makes sense, but is not what we want _concat.get_dtype_kinds to consider.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, if it is for a single call, I would just do the isinstance(arr, ABCRangeIndex) here instead of defining the new function
|
|
||
|
|
||
| def _concat_indexes_same_dtype_rangeindex(indexes): | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a comment about what this is doing / guarantees and example would be nice as well.
| start = obj._start | ||
| step = obj._step | ||
| stop = obj._stop if next is None else next | ||
| return indexes[0].__class__(start, stop, step) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
its ok here
6b93601 to
ce04aed
Compare
jorisvandenbossche
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Some minor comments
| return CategoricalIndex._append_same_dtype(self, to_concat, name) | ||
| return self._concat(to_concat, name) | ||
|
|
||
| def _concat(self, to_concat, name): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you call this _append ? (then it is more in line with _append_same_dtype)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I think it would make more sense to change _append_same_dtype to _concat_same_dtype (also in IntervalIndex, DatetimeIndex, CategoryIndex), since it already disregards self (it is conceptually a @classmethod). Shall I proceed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since it is only use by append, I prefer using append in the name, but no strong feelings
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right that it's currently used only by append, but usually you expect x.append(y) to concatenate x to y or to elements of y; instead this only concatenates elements of y. So since you don't object I will go with my proposal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead this only concatenates elements of y
in the end it is used to concatenate both y to x, just that this is passed like that in append to this helper function. So it is still only used for append.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So it is still only used for append.
Sure, I don't object to that. We can agree it is a concat operation used to implement appending: the switch happens when append(self, other) does to_concat = [self] + list(other).
pandas/core/dtypes/concat.py
Outdated
| # if to_concat contains different tz, | ||
| # the result must be object dtype | ||
| typ = str(arr.dtype) | ||
| elif is_range(arr): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, if it is for a single call, I would just do the isinstance(arr, ABCRangeIndex) here instead of defining the new function
pandas/core/dtypes/concat.py
Outdated
|
|
||
| def _concat_indexes_same_dtype_rangeindex(indexes): | ||
| # Concatenates multiple RangeIndex instances. All members of "indexes" must | ||
| # be of type RangeIndex; result will be RangeIndex if possible, Int64Index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can put this in a 'normal' docstring using """
pandas/core/dtypes/concat.py
Outdated
| return result | ||
|
|
||
|
|
||
| def _concat_indexes_same_dtype_rangeindex(indexes): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe _concat_rangeindex_same_dtype ? (little bit shorter and also clear I think)
pandas/core/dtypes/concat.py
Outdated
| non_consecutive = ((step != obj._step and len(obj) > 1) or | ||
| (next is not None and obj._start != next)) | ||
| if non_consecutive: | ||
| # Not nice... but currently what happens in NumericIndex: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this comment needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see it as a reminder: I would have liked to use
return Int64Index._append_same_dtype([ix.astype(int) for ix in indexes])... but then, numeric indexes currently do not special-case _append_same_dtype, so we end up calling _concat_index_asobject anyway.
But I can remove it, and just take a note of this TODO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, OK to keep it then, but maybe make it a bit more informative (or remove the 'not nice ', just say that it is what is used by Int64Index._append_same_dtype)
(I also don't think it would give that much of gain for making a special cased one of integers)
| start = obj._start | ||
| step = obj._step | ||
| stop = obj._stop if next is None else next | ||
| return indexes[0].__class__(start, stop, step) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One alternative would be is to return the (start, stop, step) and do the construction in RangeIndex._append_same_dtype (then also the rename there is not needed)
Ah no, this wouldn't play nice with the case when no range index is returned, ignore this
ce04aed to
a8969f7
Compare
a8969f7 to
852dde0
Compare
852dde0 to
554ee79
Compare
| non_consecutive = ((step != obj._step and len(obj) > 1) or | ||
| (next is not None and obj._start != next)) | ||
| if non_consecutive: | ||
| # Int64Index._append_same_dtype([ix.astype(int) for ix in indexes]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you explain this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #17307
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Merged this, you can update this line in the other PR if we merge that
|
lgtm. just if you can explain that where I indicated a bit more. |
git diff upstream/master --name-only -- '*.py' | flake8 --diffThe first commit is just #16213 .
The others reorganize a bit the code for
Index.append().