BUG: Allow assignment by indexing with duplicate column names #12498

gfyoung · 2016-03-01T02:11:13Z

Addresses a bug in which assigning to a column of a DataFrame with duplicate column names reassigned every column sharing that name. The bug was located here.

When you try to index into the DataFrame using .iloc, pandas will find the corresponding column name (or key) first before setting that key with the given value. Unfortunately, since all of your columns have the same key name, pandas ends up choosing all of the columns corresponding to that name.
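The ambiguity described above can be reproduced with a small sketch. The frame and variable names below are illustrative and not taken from the PR; the point is only that a label lookup on a duplicated name returns every matching column, while a positional lookup returns one.

```python
import numpy as np
import pandas as pd

# Illustrative frame: three columns that all share the name 'A'.
df = pd.DataFrame(np.arange(9).reshape(3, 3), columns=list("AAA"))

# Translating position 0 into its label loses the position:
label = df.columns[0]        # 'A'

# Looking the label up again matches every column with that name ...
by_label = df[label]         # a DataFrame holding all three columns

# ... whereas the positional lookup still refers to a single column.
by_position = df.iloc[:, 0]  # a Series holding only the first column

print(type(by_label).__name__, by_label.shape)
print(type(by_position).__name__, by_position.shape)
```

Any assignment that is routed through `label` rather than through the position therefore has no way to tell the three columns apart.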

This PR introduces a _set_item_by_index method for DataFrame objects that allows you to bypass that issue by using the indices of the columns to set the columns themselves whenever there are duplicates involved.

shoyer · 2016-03-01T02:20:50Z

pandas/tests/frame/test_nonunique_indexes.py

+                           '  Same': np.nan},
+                          index=[0, 1, 2])
+
+        df.columns = [c.strip() for c in df.columns]


Could we use a more direct method for constructing this test case? I found this setup very confusing (using index to do implicit broadcasting across rows).

For the record, this is the resulting dataframe:

   Same  Same  Same
0   NaN   NaN     1
1   NaN   NaN     1
2   NaN   NaN     1

Done. Would you mind also cancelling my first Travis build then (it hasn't started yet but fixing this test added a new build)?

TomAugspurger · 2016-03-01T02:48:03Z

I canceled https://travis-ci.org/pydata/pandas/builds/112754541, hopefully that was the correct one.

gfyoung · 2016-03-01T02:49:31Z

@TomAugspurger : Yep, that's the right one! The one on top should always be the latest build. Thanks!

jreback · 2016-03-01T12:46:27Z

you are adding WAY too much machinery. This is a pretty simple fix, though I don't have time at the moment to look.

jreback · 2016-03-01T12:46:58Z

this should be handled in internals.

gfyoung · 2016-03-01T13:22:05Z

@jreback : yes, it is currently being handled there, but I probably took a slightly more circuitous route than needed to get there. I'll see what I can simplify.

gfyoung · 2016-03-02T03:43:43Z

Simplified the internal machinery to route directly to internals.py, and Travis is happy. Should be good to merge if there is nothing else.

jreback · 2016-03-02T12:50:03Z

still needs some work. you are adding a function. use existing functionality. This just makes it much harder on future readers.

gfyoung · 2016-03-02T13:32:31Z

@jreback : I don't quite see how the existing functionality can properly handle duplicate columns in this context IMHO. The new functionality I have added is just five lines of code here. The rest of it is refactoring. Besides essentially taking the internals of the function I wrote and placing them in the location where the function is called, I'm not sure what else you mean by "using existing functionality".

gfyoung · 2016-03-03T22:21:19Z

If someone could cancel this build #18021 (it's an old build), that would be great. Thanks!

jreback · 2016-03-03T22:25:47Z

this still needs work. too much specific if/thening

gfyoung · 2016-03-03T22:28:32Z

What do you mean "too much specific if/thening"?

jreback · 2016-03-03T22:30:07Z

pandas/core/indexing.py

+                    index = indexer[info_axis]
+                    target = self.obj[item_labels[index]]
+
+                    # Duplicate columns


this is too much specific if thening. you are trying to catch a very specific case and not solving it in a more general way.

jreback · 2016-03-03T22:44:28Z

diff --git a/pandas/core/indexing.py b/pandas/core/indexing.py
index f0f5507..0313bb2 100644
--- a/pandas/core/indexing.py
+++ b/pandas/core/indexing.py
@@ -541,7 +541,7 @@ class _NDFrameIndexer(object):
                 if (len(indexer) > info_axis and
                         is_integer(indexer[info_axis]) and
                         all(is_null_slice(idx) for i, idx in enumerate(indexer)
-                            if i != info_axis)):
+                            if i != info_axis) and item_labels.is_unique):
                     self.obj[item_labels[indexer[info_axis]]] = value
                     return
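
The effect of the added `item_labels.is_unique` condition can be sketched outside pandas. The function `can_assign_by_label` and the local `is_null_slice` below are hypothetical stand-ins written for this illustration (the real check lives inline in `_NDFrameIndexer`); only `pandas.api.types.is_integer` and `Index.is_unique` are actual pandas API.

```python
import pandas as pd
from pandas.api.types import is_integer


def is_null_slice(obj):
    """Local stand-in for the pandas-internal helper: True for a bare ``:``."""
    return (
        isinstance(obj, slice)
        and obj.start is None
        and obj.stop is None
        and obj.step is None
    )


def can_assign_by_label(indexer, info_axis, item_labels):
    """Hypothetical restatement of the patched condition.

    The label-based shortcut is only safe when the indexer selects a single
    integer position on the info axis, takes everything on the other axes,
    and the labels are unique.
    """
    return (
        len(indexer) > info_axis
        and is_integer(indexer[info_axis])
        and all(is_null_slice(idx) for i, idx in enumerate(indexer)
                if i != info_axis)
        and item_labels.is_unique
    )


indexer = (slice(None), 0)  # the indexer behind df.iloc[:, 0]
unique_result = can_assign_by_label(indexer, 1, pd.Index(list("ABC")))
duplicate_result = can_assign_by_label(indexer, 1, pd.Index(list("AAA")))
print(unique_result, duplicate_result)
```

With unique labels the shortcut is still taken; with duplicated labels the condition fails and the assignment falls through to the general positional code path.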

This is a more idiomatic way of creating the data:

import numpy as np
import pandas as pd

# Three integer columns that all share the name 'A'
df = pd.DataFrame(np.arange(9).reshape(3, 3).T)
df.columns = list('AAA')

df.iloc[:, 0] = df.iloc[:, 0].fillna(df.iloc[:, 1])

also need to test mixed setting (e.g. use int/float/string block)

The .fillna doesn't matter here at all.
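
A rough sketch of what such a test could look like follows, covering both a single-dtype frame and a mixed int/float/string frame. The helper `assign_first_column` and the sample values are assumptions made for this illustration, not the test that was merged; the check assumes a pandas version that includes this fix.

```python
import numpy as np
import pandas as pd


def assign_first_column(df, values):
    """Hypothetical helper: set the first column by position on a copy."""
    out = df.copy()
    out.iloc[:, 0] = values
    return out


# Single-dtype frame with duplicate column names.
ints = pd.DataFrame(np.arange(9).reshape(3, 3).T)
ints.columns = list("AAA")

# Mixed int/float/string frame, also with duplicate column names.
mixed = pd.DataFrame({"a": [1, 2, 3],
                      "b": [1.5, 2.5, 3.5],
                      "c": ["x", "y", "z"]})
mixed.columns = list("AAA")

new_values = np.array([10, 20, 30])
for frame in (ints, mixed):
    result = assign_first_column(frame, new_values)
    # Only the first column should change ...
    assert result.iloc[:, 0].tolist() == [10, 20, 30]
    # ... and the other columns with the same name should be untouched.
    assert result.iloc[:, 1].tolist() == frame.iloc[:, 1].tolist()
    assert result.iloc[:, 2].tolist() == frame.iloc[:, 2].tolist()
```

A plain array is used as the right-hand side here because, as noted above, the `.fillna` call is incidental to the bug.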

Closes gh-12344.

gfyoung · 2016-03-06T10:58:07Z

@jreback : Alright, I definitely did not understand what you meant by "using existing functionality". Thanks for the patch! I think I was too focused on resolving the issue within that block itself and didn't quite pay attention to the other conditionals listed underneath it.

gfyoung · 2016-03-06T11:39:04Z

@jreback : Tests are passing with the patch you provided. Should be good to merge now.

jreback · 2016-03-06T15:28:50Z

thanks

shoyer reviewed Mar 1, 2016

gfyoung force-pushed the dup_name_corrupt branch from d550378 to e070e80 Compare March 1, 2016 02:32

gfyoung force-pushed the dup_name_corrupt branch 2 times, most recently from c11051a to 8ff3e9a Compare March 1, 2016 06:22

jreback added the Indexing label (Related to indexing on series/frames, not to indexes themselves) Mar 1, 2016

gfyoung force-pushed the dup_name_corrupt branch from 8ff3e9a to 027069e Compare March 1, 2016 18:20

gfyoung force-pushed the dup_name_corrupt branch 4 times, most recently from 2a4a4ec to 08bef91 Compare March 3, 2016 22:20

jreback reviewed Mar 3, 2016

TomAugspurger mentioned this pull request Mar 4, 2016

Value assignment by iloc overwrites multiple concat columns #12528

Closed

gfyoung force-pushed the dup_name_corrupt branch 3 times, most recently from 698805f to fa6a78a Compare March 6, 2016 10:40

BUG: Allow assignment by indexing with duplicate column names

7265d29

Closes gh-12344.

gfyoung force-pushed the dup_name_corrupt branch from fa6a78a to 7265d29 Compare March 6, 2016 10:57

jreback added this to the 0.18.0 milestone Mar 6, 2016

jreback added the Bug label Mar 6, 2016

jreback closed this in a174898 Mar 6, 2016

gfyoung deleted the dup_name_corrupt branch March 6, 2016 15:43

jreback mentioned this pull request Jun 11, 2016

Setting NaN values on DataFrame with non-unique column names #13423

Closed
