BUG: Allow assignment by indexing with duplicate column names #12498

Closed
wants to merge 1 commit into from

Conversation

gfyoung
Member

@gfyoung gfyoung commented Mar 1, 2016

Closes #12344, in which assigning to one column of a DataFrame with duplicate column names caused all columns with the same name to be reassigned. The bug was located here.

When you try to index into the DataFrame using .iloc, pandas will find the corresponding column name (or key) first before setting that key with the given value. Unfortunately, since all of your columns have the same key name, pandas ends up choosing all of the columns corresponding to that name.
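
The label lookup described above can be reproduced without any assignment. This is a minimal sketch, assuming a small frame with illustrative values (not the frame from the linked issue): looking up a duplicated label returns every column that carries it, which is why setting by that label overwrites all of them.

```python
import numpy as np
import pandas as pd

# Illustrative frame: three columns that all share the label "Same".
df = pd.DataFrame(np.arange(9).reshape(3, 3), columns=["Same"] * 3)

# Translate position 0 into its label, as the buggy code path did.
label = df.columns[0]

# With duplicate labels, the label lookup yields a DataFrame holding all
# matching columns, while the positional lookup yields a single Series.
by_label = df[label]
by_position = df.iloc[:, 0]

print(type(by_label).__name__, by_label.shape)
print(type(by_position).__name__, by_position.shape)
```

The label alone cannot distinguish the first column from the other two, so any fix has to keep working with the integer position.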

This PR introduces a _set_item_by_index method for DataFrame objects that allows you to bypass that issue by using the indices of the columns to set the columns themselves whenever there are duplicates involved.

' Same': np.nan},
index=[0, 1, 2])

df.columns = [c.strip() for c in df.columns]

Member

Could we use a more direct method for constructing this test case? I found this setup very confusing (using index to do implicit broadcasting across rows).

For the record, this is the resulting dataframe:

     Same   Same  Same
0     NaN    NaN     1
1     NaN    NaN     1
2     NaN    NaN     1

Member Author

Done. Would you mind also cancelling my first Travis build then (it hasn't started yet but fixing this test added a new build)?

@gfyoung gfyoung force-pushed the dup_name_corrupt branch from d550378 to e070e80 Compare March 1, 2016 02:32
@TomAugspurger
Contributor

I canceled https://travis-ci.org/pydata/pandas/builds/112754541, hopefully that was the correct one.

@gfyoung
Member Author

gfyoung commented Mar 1, 2016

@TomAugspurger : Yep, that's the right one! The one on top should always be the latest build. Thanks!

@gfyoung gfyoung force-pushed the dup_name_corrupt branch 2 times, most recently from c11051a to 8ff3e9a Compare March 1, 2016 06:22
@jreback
Contributor

jreback commented Mar 1, 2016

you are adding WAY too much machinery. This is a pretty simple fix, though I don't have time at the moment to look.

@jreback
Contributor

jreback commented Mar 1, 2016

this should be handled in internals.

@jreback jreback added the Indexing Related to indexing on series/frames, not to indexes themselves label Mar 1, 2016
@gfyoung
Member Author

gfyoung commented Mar 1, 2016

@jreback : yes, it is currently being handled there, but I probably took a slightly more circuitous route than needed to get there. I'll see what I can simplify.

@gfyoung gfyoung force-pushed the dup_name_corrupt branch from 8ff3e9a to 027069e Compare March 1, 2016 18:20
@gfyoung
Member Author

gfyoung commented Mar 2, 2016

Simplified the internal machinery to route directly to internals.py, and Travis is happy. Should be good to merge if there is nothing else.

@jreback
Contributor

jreback commented Mar 2, 2016

still needs some work. you are adding a function. use existing functionality. This just makes it much harder on future readers.

@gfyoung
Member Author

gfyoung commented Mar 2, 2016

@jreback : I don't quite see how the existing functionality can properly handle duplicate columns in this context IMHO. The new functionality I have added is just five lines of code here. The rest of it is refactoring. Besides essentially taking the internals of the function I wrote and placing them in the location where the function is called, I'm not sure what else you mean by "using existing functionality".

@gfyoung gfyoung force-pushed the dup_name_corrupt branch 4 times, most recently from 2a4a4ec to 08bef91 Compare March 3, 2016 22:20
@gfyoung
Member Author

gfyoung commented Mar 3, 2016

If someone could cancel this build #18021 (it's an old build), that would be great. Thanks!

@jreback
Contributor

jreback commented Mar 3, 2016

this still needs work. too much specific if/thening

@gfyoung
Member Author

gfyoung commented Mar 3, 2016

What do you mean "too much specific if/thening"?

index = indexer[info_axis]
target = self.obj[item_labels[index]]

# Duplicate columns

Contributor

this is too much specific if thening. you are trying to catch a very specific case and not solving it in a more general way.

@jreback
Contributor

jreback commented Mar 3, 2016

diff --git a/pandas/core/indexing.py b/pandas/core/indexing.py
index f0f5507..0313bb2 100644
--- a/pandas/core/indexing.py
+++ b/pandas/core/indexing.py
@@ -541,7 +541,7 @@ class _NDFrameIndexer(object):
                 if (len(indexer) > info_axis and
                         is_integer(indexer[info_axis]) and
                         all(is_null_slice(idx) for i, idx in enumerate(indexer)
-                            if i != info_axis)):
+                            if i != info_axis) and item_labels.is_unique ):
                     self.obj[item_labels[indexer[info_axis]]] = value
                     return

This is a more idiomatic way of creating the data

df = pd.DataFrame(np.arange(9).reshape(3,3).T)
df.columns = list('AAA')

df.iloc[:, 0] = df.iloc[:, 0].fillna(df.iloc[:, 1])

also need to test mixed setting (e.g. use int/float/string block)

The .fillna doesn't matter here at all.
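
A test along the lines suggested above might look like the following sketch. The replacement values and the mixed-dtype frame are assumptions made for illustration, not the test that was merged, and the expected behaviour assumes a pandas version that includes the fix.

```python
import numpy as np
import pandas as pd

# Single-dtype frame with duplicate labels, built as suggested above.
df = pd.DataFrame(np.arange(9).reshape(3, 3).T)
df.columns = list("AAA")

# Set only the first column by position (values are illustrative).
df.iloc[:, 0] = np.array([10, 11, 12])

# Mixed int/float/string frame with duplicate labels (illustrative data).
mixed = pd.DataFrame({"a": [1, 2, 3], "b": [1.5, 2.5, 3.5], "c": ["x", "y", "z"]})
mixed.columns = list("AAA")
mixed.iloc[:, 0] = np.array([10, 11, 12])

# Only the first column of each frame should have changed.
print(df)
print(mixed)
```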

@gfyoung gfyoung force-pushed the dup_name_corrupt branch from fa6a78a to 7265d29 Compare March 6, 2016 10:57
@gfyoung
Member Author

gfyoung commented Mar 6, 2016

@jreback : Alright, I definitely did not understand what you meant by "using existing functionality". Thanks for the patch! I think I was too focused on resolving the issue within that block itself and didn't quite pay attention to the other conditionals listed underneath it.

@gfyoung
Member Author

gfyoung commented Mar 6, 2016

@jreback : Tests are passing with the patch you provided. Should be good to merge now.

@jreback jreback added this to the 0.18.0 milestone Mar 6, 2016
@jreback jreback added the Bug label Mar 6, 2016
@jreback jreback closed this in a174898 Mar 6, 2016
@jreback
Contributor

jreback commented Mar 6, 2016

thanks

Labels
Bug Indexing Related to indexing on series/frames, not to indexes themselves
Development

Successfully merging this pull request may close these issues.

DataFrame.fillna corrupts columns with duplicated names
4 participants