-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
ENH: Add suffixes argument for pd.concat #29669
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Add suffixes argument for pd.concat #29669
Conversation
pandas/core/reshape/concat.py
Outdated
@@ -37,6 +41,7 @@ def concat( | |||
names=None, | |||
verify_integrity: bool = False, | |||
sort=None, | |||
suffixes=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add the type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure!
pandas/core/reshape/concat.py
Outdated
x : renamed column name | ||
""" | ||
if x in to_rename and suffix is not None: | ||
return "{x}{suffix}".format(x=x, suffix=suffix) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: f string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
pandas/core/reshape/concat.py
Outdated
if not isinstance(suffixes, tuple): | ||
raise ValueError( | ||
"Invalid type {t} is assigned to suffixes, only" | ||
" <class 'tuple'> is allowed.".format(t=type(suffixes)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prefer str(tuple)
+ f string - also mind adding a test case for this particular code path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i guess i already have a test case? or you mean something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have one for the ValueError below this using pytest.raises but not for this error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, I didn't get your point. What do you mean by have one for the ValueError below this
? I might misunderstand your words, could you please bear with me and clarify a bit? I had a test called test_concat_suffixes_type
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh yep thanks that test func covers it. I initially searched the test script for "Invalid type" as I assumed you would match on that - apologies look good
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahh, i see, i used the other part for match
. thanks for your comments!
if len(to_rename) == 0 or suffixes is None: | ||
return objs | ||
|
||
if not isinstance(suffixes, tuple): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this isinstance
check maybe be moved to the start of the function? Might be clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i thought so as well, and then decided to leave it here because of two reasons:
- this will only be used if there is overlapping in column names.
- the default is
None
, so default is used, then directly return originalobjs
without post-processing and checking below.
Does it make any sense? I am very open to suggestions!
Thanks for the PR @charlesdong1991 ! Left a few comment above |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the implementation needs quite some work this is adding a lot of non-shared code with the merge path. I am also not quite sure of the usecase (yes I opened the original issue). Can you show why this is actually needed?
thanks for your comment, @jreback I initially opened an issue #29615 for this, and later found you opened the issue already long ago. The usecase I am having right now is I have several tables with exactly same format, but they have different values in each field since let's say they are representing the data from different years, e.g
And I would like to put them together horizontally so as to do further analysis. The easiest way in this case would be |
if self._is_series: | ||
|
||
# when _is_series is True, objs are actually column Index | ||
overlap_cols = list(objs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you actually need to convert to a list
here? I think leaving it as an Index
should still work the same way?
# when _is_series is True, objs are actually column Index | ||
overlap_cols = list(objs) | ||
else: | ||
overlap_cols = chain.from_iterable([obj.columns for obj in objs]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you need the list comprehension and could just do a generator expression instead.
overlap_cols = list(objs) | ||
else: | ||
overlap_cols = chain.from_iterable([obj.columns for obj in objs]) | ||
to_rename = [col for col, cnt in Counter(overlap_cols).items() if cnt > 1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe make this a set
comprehension since this is only used for x in to_rename
lookups.
as is with duplicated column names. | ||
|
||
This has no effect if there is no overlapping column names or if axis=0. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add .. versionadded:: 1.0.0
?
if not isinstance(suffixes, tuple): | ||
raise ValueError( | ||
f"Invalid type {type(suffixes)} is assigned to suffixes, only " | ||
f"'tuple' is allowed." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: I don't think you need the second line to be and f-string
"equal to number of suffixes" | ||
) | ||
|
||
def renamer(x, suffix): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like there's basically an identical definition of this function in core/reshape/merge.py
, so would be nice to be able to reuse this. A little tricky in that this is a nested function, so maybe can do as a follow-up once things are set in stone.
overlap_cols = chain.from_iterable([obj.columns for obj in objs]) | ||
to_rename = [col for col, cnt in Counter(overlap_cols).items() if cnt > 1] | ||
|
||
if len(to_rename) == 0 or suffixes is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe a little more pythonic to check the boolness of to_rename
: len(to_rename) == 0
--> not to_rename
|
||
for obj, suffix in zip(objs, suffixes): | ||
col_renamer = partial(renamer, suffix=suffix) | ||
obj.columns = _transform_index(obj.columns, col_renamer) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this line is causing the original dataframes to be modified as well:
In [1]: import pandas as pd; pd.__version__
Out[1]: '0.26.0.dev0+947.gc8570707c'
In [2]: df1 = pd.DataFrame({'A': list('ab'), 'B': [0, 1]})
In [3]: df2 = pd.DataFrame({'A':list('ac'), 'C': [100, 200]})
In [4]: df3 = pd.concat([df1, df2], axis=1, suffixes=('_x', '_y'))
In [5]: df1.columns
Out[5]: Index(['A_x', 'B'], dtype='object')
In [6]: df1
Out[6]:
A_x B
0 a 0
1 b 1
In [7]: df2.columns
Out[7]: Index(['A_y', 'C'], dtype='object')
In [8]: df2
Out[8]:
A_y C
0 a 100
1 c 200
Root cause could be occurring elsewhere though, but it's an issue nonetheless. Would be nice to add a test for the above.
[ | ||
( | ||
[pd.Series([1, 2], name="a"), pd.Series([2, 3], name="a")], | ||
("_x", "_y"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also add tests for if dupe suffixes
are specified, e.g. ("_x", "_x")
?
Suffix to apply to overlapping column names for each concatenated object | ||
respectively. If the length of suffixes does not match with number of | ||
concatenated objects, an error will raise. If None, the output will remain | ||
as is with duplicated column names. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to replicate the behavior of merge in regards to suffixes=(False, False)
?
Lines 203 to 206 in e246c3b
suffixes : tuple of (str, str), default ('_x', '_y') | |
Suffix to apply to overlapping column names in the left and right | |
side, respectively. To raise an exception on overlapping columns use | |
(False, False). |
Been thinking this over and while this is a nice effort I think I am -1 on adding this to the concat API. It feels intuitive for merge where you are essentially taking a union of columns, but with pd.concat you are just stacking things. Seems like if anything this de-duplication would really be a Index method (though not sure that’s worth adding to the API either) |
Many thanks for taking your time and giving me very nice reviews @jschendel and I appreciate a lot! However, since I see @WillAyd and @jreback have concerns if this feature should be added to I am open to advices on the opinions on if it's worth to have this enhancement, if it is, I will start code changes based on @jschendel 's nice reviews. |
Thanks for the contribution @charlesdong1991 but based on feedback I don't think we have an appetite for this one |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff