
Conversation

toobaz (Member) commented May 4, 2017

The first commit is just #16213.

The others reorganize the code for Index.append() a bit.

@toobaz toobaz changed the title Index append ref Index append refactoring May 4, 2017
@toobaz toobaz changed the title Index append refactoring Index.append() refactoring May 4, 2017
codecov bot commented May 4, 2017

Codecov Report

Merging #16236 into master will decrease coverage by 0.03%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16236      +/-   ##
==========================================
- Coverage   90.24%   90.21%   -0.04%     
==========================================
  Files         164      164              
  Lines       50894    50920      +26     
==========================================
+ Hits        45930    45938       +8     
- Misses       4964     4982      +18
Flag Coverage Δ
#multiple 88% <100%> (-0.02%) ⬇️
#single 40.3% <65%> (-0.1%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/category.py 98.48% <100%> (ø) ⬆️
pandas/core/indexes/base.py 95.69% <100%> (-0.06%) ⬇️
pandas/core/indexes/range.py 92.69% <100%> (+0.57%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/indexes/datetimelike.py 94.88% <0%> (-1.78%) ⬇️
pandas/core/indexes/interval.py 91.82% <0%> (-0.77%) ⬇️
pandas/core/common.py 90.68% <0%> (-0.35%) ⬇️
pandas/core/frame.py 97.58% <0%> (-0.1%) ⬇️
pandas/_version.py 44.65% <0%> (+1.9%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2002da3...5d734ad. Read the comment docs.

codecov bot commented May 4, 2017

Codecov Report

Merging #16236 into master will decrease coverage by 0.02%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16236      +/-   ##
==========================================
- Coverage   91.01%   90.98%   -0.03%     
==========================================
  Files         162      162              
  Lines       49567    49565       -2     
==========================================
- Hits        45111    45099      -12     
- Misses       4456     4466      +10
Flag Coverage Δ
#multiple 88.77% <100%> (-0.01%) ⬇️
#single 40.25% <68.57%> (-0.06%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/base.py 95.87% <100%> (-0.06%) ⬇️
pandas/core/dtypes/concat.py 98.28% <100%> (+0.2%) ⬆️
pandas/core/indexes/range.py 92.18% <100%> (-0.63%) ⬇️
pandas/core/indexes/category.py 98.53% <100%> (ø) ⬆️
pandas/core/indexes/datetimelike.py 96.66% <100%> (ø) ⬆️
pandas/core/indexes/interval.py 92.61% <100%> (ø) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.72% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d0d28fe...554ee79. Read the comment docs.

@jreback jreback added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label May 4, 2017
return res

@classmethod
def _concat(self, to_concat):
Contributor

so things like this should be handled in pandas.core.dtypes.concat, which is our general "put things of possibly different dtype together" module. There are already routines there that handle Series, so this should fit in without much work.

Member Author

Having Index concatenation as an Index method looks definitely cleaner to me; anyway OK, it's certainly not much work.

Contributor

and then you have the same exact problem in that the code lives in multiple unrelated places. Please use the pattern I suggest.

@toobaz toobaz force-pushed the index_append_ref branch from 842466e to 012a88b Compare May 5, 2017 15:59


def _concat_indexes(to_concat, default=None):
"""
Contributor

so this is good, but can you use _concat_compat which basically does all of the combination logic, rather than have it live IN the Index classes themselves? (you would simply wrap it to return an Index, rather than a Series as is done elsewhere).

My goal is to centralize all of this code, and NOT decentralize it to the Index whenever possible.
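
(For reference, a minimal sketch of the wrapping pattern suggested here, assuming the central helper is the _concat_compat routine in pandas.core.dtypes.concat named above; the wrapper name below is illustrative and not code from this PR:)

# Hypothetical sketch: the dtype-combination logic stays in
# pandas.core.dtypes.concat, and the Index layer only rebuilds an Index
# (rather than a Series) from the result. Later pandas versions expose the
# helper as concat_compat instead of _concat_compat.
import pandas as pd
from pandas.core.dtypes import concat as _concat

def _concat_indexes_via_compat(to_concat, name=None):
    values = _concat._concat_compat([idx._values for idx in to_concat])
    return pd.Index(values, name=name)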

# if name is None, _create_from_codes sets self.name
result.name = name
return result
compat = [to_concat[0]._is_dtype_compat(c) for c in to_concat]
Contributor

this is replicating what _concat_categorical is doing

return super(RangeIndex, self).join(other, how, level, return_indexers,
sort)

@classmethod
Contributor
@jreback jreback May 6, 2017

even this could live in dtypes.concat. It would deal in slices. Again you could have a simple wrapper to return an Index.

dti2 = dti.tz_convert(None)
tm.assert_numpy_array_equal(dti2.asi8, dti.asi8)

@pytest.mark.xfail(reason='See gh-16234')
Contributor

?

jreback (Contributor) commented Jun 10, 2017

can you rebase and update?

jreback (Contributor) commented Jul 26, 2017

Got lost, sorry. Please rebase / update.

toobaz (Member Author) commented Jul 27, 2017

My goal is to centralize all of this code, and NOT decentralize it to the Index whenever possible.

My fear is that it is difficult to move code that makes clever use of overrides into dtypes.concat; on the other hand, we lose all the benefits of inheritance if we don't make clever use of overrides (we already have too much code with sequences of if is_some_dtype(idx): _call_specialized_method_for_such_dtype, which harms readability).

Anyway, I will rebase and try to move and replace with wrappers as much logic as I can.

pep8speaks commented Aug 10, 2017

Hello @toobaz! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on August 21, 2017 at 15:24 Hours UTC

toobaz (Member Author) commented Aug 10, 2017

OK, this new version is much simpler, and still removes all code duplication from #16213.

start = obj._start
step = obj._step
stop = obj._stop if next is None else next
return indexes[0].__class__(start, stop, step)
Member Author

This is ugly... but the only alternative I see (an import inside the function) is uglier.

Contributor

its ok here

Member
@jorisvandenbossche jorisvandenbossche Aug 21, 2017

One alternative would be to return the (start, stop, step) and do the construction in RangeIndex._append_same_dtype (then also the rename there is not needed)

Ah no, this wouldn't play nice with the case when no range index is returned, ignore this

toobaz (Member Author) commented Aug 16, 2017

@jreback ping

Contributor
@jreback jreback left a comment

looks pretty good, some stylistic comments. I think your reorg is fine.

return isinstance(arr, ABCCategorical) or is_categorical_dtype(arr)


def is_range(arr):
Contributor

this is not similar to the other methods, which detect a type. A RangeIndex is not a type.

Member Author

This is only called in get_dtype_kinds, see below, so I could have just used isinstance(arr, ABCRangeIndex). But I tried to be consistent with is_categorical and is_sparse... what exactly is a "type"?!

Contributor

you are conflating a type with an Index. is_categorical will detect a categorical type, which could be a Categorical (or dtype == 'category'); a CategoricalIndex happens to be of this type as well.

However RangeIndex is simply an Index subclassing Int64Index. It's not a type (its dtype is int64). Types can be the dtype of an Index.

Member Author

In [3]: pd.Series([1,2,3]).to_sparse().dtype
Out[3]: dtype('int64')

... still,

In [4]: _concat.get_dtype_kinds([pd.Series([1,2,3]).to_sparse()])
Out[4]: {'sparse'}

... so the "type" returned by get_dtype_kinds is already not the dtype, and not even the dtype.kind.

But anyway, I don't particularly care about changing get_dtype_kinds. We just need a method which can tell us whether two indexes can be natively concatenated: this is currently get_dtype_kinds, but I can write a new one if you prefer.

# if to_concat contains different tz,
# the result must be object dtype
typ = str(arr.dtype)
elif is_range(arr):
Contributor

this is very strange. why are you adding this here

Contributor

you are directly calling the routine (and you don't handle typ='range' anywhere), so not sure this is even hit

Member Author

This is required so that Index._concat distinguishes between RangeIndex and Int64Index, as they should not be treated as equal (e.g. when appending). The 'range' kind is indeed an arbitrary label.

Contributor

ok I see, then you should be explicit about testing what you need, e.g. ABCInt64Index (rather than adding a helper function which serves no other purpose)

Member Author

Index._concat has no idea of the specific types of indexes, and rightly so... it uses get_dtype_kinds just to test whether different types of indexes are being concatenated.

Contributor

you can simply check isinstance(arr, ABCRangeIndex); you are special-casing this, so I don't find this a problem. We don't have a special 'type' for this index, so is_int64_dtype would not work here.

Member Author

I'm fine with using isinstance(arr, ABCRangeIndex) in get_dtype_kinds... but I'm not special-casing Index._concat: I'm only overriding RangeIndex._append_same_dtype. So I need get_dtype_kinds to (also) distinguish RangeIndex. Isn't the point of get_dtype_kinds precisely to distinguish stuff which can be concatenated "natively" together?!

Member Author
@toobaz toobaz Aug 20, 2017

We don't have a special 'type' for this index so is_int64_dtype would not work here.

I had missed this, and hence maybe our misunderstanding: pd.RangeIndex(3).dtype is dtype('int64')... which makes sense, but is not what we want _concat.get_dtype_kinds to consider.

Member

Yep, if it is for a single call, I would just do the isinstance(arr, ABCRangeIndex) here instead of defining the new function
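
(As an illustration of that suggestion, an abbreviated, hypothetical sketch of the branch under discussion; the real get_dtype_kinds handles many more kinds, such as categorical, sparse and datetime-with-tz:)

# Hypothetical, simplified sketch of the RangeIndex special case being discussed.
from pandas.core.dtypes.generic import ABCRangeIndex

def get_dtype_kinds_sketch(arrays):
    typs = set()
    for arr in arrays:
        if isinstance(arr, ABCRangeIndex):
            # RangeIndex reports dtype int64, so a dtype check alone cannot
            # distinguish it from Int64Index; an explicit isinstance check
            # makes 'range' mix only with other RangeIndexes.
            typs.add('range')
        else:
            typs.add(str(arr.dtype))
    return typs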



def _concat_indexes_same_dtype_rangeindex(indexes):

Contributor

can you add a comment about what this is doing / guarantees? An example would be nice as well.
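
(Roughly, the guarantee in question: if all inputs are RangeIndex and they line up into a single arithmetic progression, the result stays a RangeIndex; otherwise it falls back to a plain integer index. A small illustrative session, assuming a pandas version that includes this optimization; the exact fallback class has varied between Int64Index and Index across versions:)

import pandas as pd

a = pd.RangeIndex(0, 3)     # 0, 1, 2
b = pd.RangeIndex(3, 6)     # 3, 4, 5
c = pd.RangeIndex(10, 12)   # 10, 11

# Consecutive ranges with the same step fuse into a single RangeIndex:
a.append(b)                 # RangeIndex(start=0, stop=6, step=1)

# Non-consecutive ranges cannot be fused, so the result falls back to int64:
a.append(c)                 # e.g. Int64Index([0, 1, 2, 10, 11], dtype='int64')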


Member
@jorisvandenbossche jorisvandenbossche left a comment

Looks good! Some minor comments

return CategoricalIndex._append_same_dtype(self, to_concat, name)
return self._concat(to_concat, name)

def _concat(self, to_concat, name):
Member

can you call this _append? (then it is more in line with _append_same_dtype)

Member Author

Actually, I think it would make more sense to change _append_same_dtype to _concat_same_dtype (also in IntervalIndex, DatetimeIndex, CategoricalIndex), since it already disregards self (it is conceptually a @classmethod). Shall I proceed?

Member

since it is only used by append, I prefer using append in the name, but no strong feelings

Member Author

You are right that it's currently used only by append, but usually you expect x.append(y) to concatenate x to y or to the elements of y; instead this only concatenates the elements of y. So, since you don't object, I will go with my proposal.

Member

instead this only concatenates elements of y

in the end it is used to concatenate y to x as well; it is just that append passes things to this helper function like that. So it is still only used for append.

Member Author

So it is still only used for append.

Sure, I don't object to that. We can agree it is a concat operation used to implement appending: the switch happens when append(self, other) does to_concat = [self] + list(other).
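
(For illustration, the switch described above written out with a toy example; simplified, since the real append also validates its inputs and resolves the result name:)

import pandas as pd

x = pd.Index([1, 2])
others = [pd.Index([3]), pd.Index([4, 5])]

# Inside append(self, other), the switch to a pure concat operation is this step:
to_concat = [x] + list(others)

# ...after which a single helper concatenates all elements of to_concat:
x.append(others)            # Int64Index/Index([1, 2, 3, 4, 5], dtype='int64'), depending on version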



def _concat_indexes_same_dtype_rangeindex(indexes):
# Concatenates multiple RangeIndex instances. All members of "indexes" must
# be of type RangeIndex; result will be RangeIndex if possible, Int64Index
Member

You can put this in a 'normal' docstring using """

return result


def _concat_indexes_same_dtype_rangeindex(indexes):
Member

maybe _concat_rangeindex_same_dtype? (a little bit shorter and also clear, I think)

non_consecutive = ((step != obj._step and len(obj) > 1) or
(next is not None and obj._start != next))
if non_consecutive:
# Not nice... but currently what happens in NumericIndex:
Member

is this comment needed?

Member Author
@toobaz toobaz Aug 21, 2017

I see it as a reminder: I would have liked to use

return Int64Index._append_same_dtype([ix.astype(int) for ix in indexes])

... but then, numeric indexes currently do not special-case _append_same_dtype, so we end up calling _concat_index_asobject anyway.

But I can remove it, and just take a note of this TODO.

Member

I see, OK to keep it then, but maybe make it a bit more informative (or remove the 'not nice', and just say that it is what is used by Int64Index._append_same_dtype)

(I also don't think there would be that much of a gain from making a special-cased one for integers)


non_consecutive = ((step != obj._step and len(obj) > 1) or
(next is not None and obj._start != next))
if non_consecutive:
# Int64Index._append_same_dtype([ix.astype(int) for ix in indexes])
Contributor

can you explain this

Member Author

See #17307

Member

Merged this, you can update this line in the other PR if we merge that

@jreback jreback added this to the 0.21.0 milestone Aug 21, 2017
jreback (Contributor) commented Aug 21, 2017

lgtm. Just explain that a bit more where I indicated.

@jorisvandenbossche jorisvandenbossche changed the title Index.append() refactoring CLN: Index.append() refactoring Aug 22, 2017
@jorisvandenbossche jorisvandenbossche merged commit 2f00159 into pandas-dev:master Aug 22, 2017
@toobaz toobaz deleted the index_append_ref branch August 22, 2017 08:42