Skip to content

ENH: Add suffixes argument for pd.concat #29669

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

charlesdong1991
Copy link
Member

@charlesdong1991 charlesdong1991 commented Nov 17, 2019

@charlesdong1991 charlesdong1991 changed the title Add suffixes argument for pd.concat ENH: Add suffixes argument for pd.concat Nov 17, 2019
@@ -37,6 +41,7 @@ def concat(
names=None,
verify_integrity: bool = False,
sort=None,
suffixes=None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add the type?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure!

x : renamed column name
"""
if x in to_rename and suffix is not None:
return "{x}{suffix}".format(x=x, suffix=suffix)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: f string?

Copy link
Member Author

@charlesdong1991 charlesdong1991 Nov 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

if not isinstance(suffixes, tuple):
raise ValueError(
"Invalid type {t} is assigned to suffixes, only"
" <class 'tuple'> is allowed.".format(t=type(suffixes))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer str(tuple) + f string - also mind adding a test case for this particular code path?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i guess i already have a test case? or you mean something else?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have one for the ValueError below this using pytest.raises but not for this error?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, I didn't get your point. What do you mean by have one for the ValueError below this? I might misunderstand your words, could you please bear with me and clarify a bit? I had a test called test_concat_suffixes_type ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh yep thanks that test func covers it. I initially searched the test script for "Invalid type" as I assumed you would match on that - apologies look good

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh, i see, i used the other part for match. thanks for your comments!

if len(to_rename) == 0 or suffixes is None:
return objs

if not isinstance(suffixes, tuple):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this isinstance check maybe be moved to the start of the function? Might be clearer.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i thought so as well, and then decided to leave it here because of two reasons:

  1. this will only be used if there is overlapping in column names.
  2. the default is None, so default is used, then directly return original objs without post-processing and checking below.
    Does it make any sense? I am very open to suggestions!

@alimcmaster1
Copy link
Member

Thanks for the PR @charlesdong1991 ! Left a few comment above

@alimcmaster1 alimcmaster1 added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Enhancement labels Nov 17, 2019
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the implementation needs quite some work this is adding a lot of non-shared code with the merge path. I am also not quite sure of the usecase (yes I opened the original issue). Can you show why this is actually needed?

@charlesdong1991
Copy link
Member Author

charlesdong1991 commented Nov 18, 2019

thanks for your comment, @jreback

I initially opened an issue #29615 for this, and later found you opened the issue already long ago. The usecase I am having right now is I have several tables with exactly same format, but they have different values in each field since let's say they are representing the data from different years, e.g

df1 = pd.DataFrame({"a": [1, 2, 3], "b": [2, 3, 4]})
df2 = pd.DataFrame({"a": [2, 3, 4], "b": [3, 4, 5]})
df3 = pd.DataFrame({"a": [4, 5, 6], "b": [6, 7, 8]})

And I would like to put them together horizontally so as to do further analysis. The easiest way in this case would be pd.concat([df1, df2, df3], axis=1), but after concatenation, the dataframe contains duplicated column names, and it would be very useful and convenient to have a suffixes argument to distinguish those as merge has. And this is my initial motivation to create this PR.

if self._is_series:

# when _is_series is True, objs are actually column Index
overlap_cols = list(objs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you actually need to convert to a list here? I think leaving it as an Index should still work the same way?

# when _is_series is True, objs are actually column Index
overlap_cols = list(objs)
else:
overlap_cols = chain.from_iterable([obj.columns for obj in objs])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need the list comprehension and could just do a generator expression instead.

overlap_cols = list(objs)
else:
overlap_cols = chain.from_iterable([obj.columns for obj in objs])
to_rename = [col for col, cnt in Counter(overlap_cols).items() if cnt > 1]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe make this a set comprehension since this is only used for x in to_rename lookups.

as is with duplicated column names.

This has no effect if there is no overlapping column names or if axis=0.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add .. versionadded:: 1.0.0?

if not isinstance(suffixes, tuple):
raise ValueError(
f"Invalid type {type(suffixes)} is assigned to suffixes, only "
f"'tuple' is allowed."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: I don't think you need the second line to be and f-string

"equal to number of suffixes"
)

def renamer(x, suffix):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like there's basically an identical definition of this function in core/reshape/merge.py, so would be nice to be able to reuse this. A little tricky in that this is a nested function, so maybe can do as a follow-up once things are set in stone.

overlap_cols = chain.from_iterable([obj.columns for obj in objs])
to_rename = [col for col, cnt in Counter(overlap_cols).items() if cnt > 1]

if len(to_rename) == 0 or suffixes is None:
Copy link
Member

@jschendel jschendel Nov 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe a little more pythonic to check the boolness of to_rename: len(to_rename) == 0 --> not to_rename


for obj, suffix in zip(objs, suffixes):
col_renamer = partial(renamer, suffix=suffix)
obj.columns = _transform_index(obj.columns, col_renamer)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this line is causing the original dataframes to be modified as well:

In [1]: import pandas as pd; pd.__version__
Out[1]: '0.26.0.dev0+947.gc8570707c'

In [2]: df1 = pd.DataFrame({'A': list('ab'), 'B': [0, 1]})

In [3]: df2 = pd.DataFrame({'A':list('ac'), 'C': [100, 200]})

In [4]: df3 = pd.concat([df1, df2], axis=1, suffixes=('_x', '_y'))

In [5]: df1.columns
Out[5]: Index(['A_x', 'B'], dtype='object')

In [6]: df1
Out[6]: 
  A_x  B
0   a  0
1   b  1

In [7]: df2.columns
Out[7]: Index(['A_y', 'C'], dtype='object')

In [8]: df2
Out[8]: 
  A_y    C
0   a  100
1   c  200

Root cause could be occurring elsewhere though, but it's an issue nonetheless. Would be nice to add a test for the above.

[
(
[pd.Series([1, 2], name="a"), pd.Series([2, 3], name="a")],
("_x", "_y"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also add tests for if dupe suffixes are specified, e.g. ("_x", "_x")?

Suffix to apply to overlapping column names for each concatenated object
respectively. If the length of suffixes does not match with number of
concatenated objects, an error will raise. If None, the output will remain
as is with duplicated column names.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to replicate the behavior of merge in regards to suffixes=(False, False)?

pandas/pandas/core/frame.py

Lines 203 to 206 in e246c3b

suffixes : tuple of (str, str), default ('_x', '_y')
Suffix to apply to overlapping column names in the left and right
side, respectively. To raise an exception on overlapping columns use
(False, False).

@WillAyd
Copy link
Member

WillAyd commented Nov 19, 2019

Been thinking this over and while this is a nice effort I think I am -1 on adding this to the concat API. It feels intuitive for merge where you are essentially taking a union of columns, but with pd.concat you are just stacking things.

Seems like if anything this de-duplication would really be a Index method (though not sure that’s worth adding to the API either)

@charlesdong1991
Copy link
Member Author

Many thanks for taking your time and giving me very nice reviews @jschendel and I appreciate a lot!

However, since I see @WillAyd and @jreback have concerns if this feature should be added to pd.concat or not, and i agree that the change is mainly doing a de-duplication on column Index by adding a suffix since pd.concat is just stacking. The only small benefit of having this feature might be if having many tables with identical names, stacking them together while renaming with suffix can be easier.

I am open to advices on the opinions on if it's worth to have this enhancement, if it is, I will start code changes based on @jschendel 's nice reviews.

@WillAyd
Copy link
Member

WillAyd commented Dec 17, 2019

Thanks for the contribution @charlesdong1991 but based on feedback I don't think we have an appetite for this one

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: add suffixes arg to concat
5 participants