ENH: Add suffixes argument for pd.concat #29669

charlesdong1991 · 2019-11-17T10:15:30Z

- closes #21791 (ENH: add suffixes arg to concat), xref #29615 (ENH: Add suffixes for pd.concat when axis=1)
- tests added / passed
- passes black pandas
- passes git diff upstream/master -u -- "*.py" | flake8 --diff
- whatsnew entry

alimcmaster1 · 2019-11-17T15:06:01Z

pandas/core/reshape/concat.py

@@ -37,6 +41,7 @@ def concat(
    names=None,
    verify_integrity: bool = False,
    sort=None,
+    suffixes=None,


Could we add the type?

alimcmaster1 · 2019-11-17T15:07:29Z

pandas/core/reshape/concat.py

+            x : renamed column name
+            """
+            if x in to_rename and suffix is not None:
+                return "{x}{suffix}".format(x=x, suffix=suffix)


NIT: f string?

alimcmaster1 · 2019-11-17T15:14:41Z

pandas/core/reshape/concat.py

+        if not isinstance(suffixes, tuple):
+            raise ValueError(
+                "Invalid type {t} is assigned to suffixes, only"
+                " <class 'tuple'> is allowed.".format(t=type(suffixes))


Prefer str(tuple) + f string - also mind adding a test case for this particular code path?

I guess I already have a test case? Or do you mean something else?

You have one for the ValueError below this using pytest.raises but not for this error?

Sorry, I didn't get your point. What do you mean by "have one for the ValueError below this"? I might be misunderstanding you, so could you please bear with me and clarify a bit? I have a test called test_concat_suffixes_type.

Ahh yep, thanks, that test func covers it. I initially searched the test script for "Invalid type" as I assumed you would match on that. Apologies, looks good.

Ahh, I see, I used the other part of the message for match. Thanks for your comments!

alimcmaster1 · 2019-11-17T15:15:35Z

pandas/core/reshape/concat.py

+        if len(to_rename) == 0 or suffixes is None:
+            return objs
+
+        if not isinstance(suffixes, tuple):


Should this isinstance check maybe be moved to the start of the function? Might be clearer.

I thought so as well, but decided to leave it here for two reasons:

1. The check is only needed if there are overlapping column names.
2. The default is None, and when the default is used the original objs are returned directly, without the post-processing and checks below.

Does that make sense? I am very open to suggestions!
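The trade-off discussed here can be sketched in plain Python. This is only an illustration under stated assumptions: the function names are hypothetical, columns are modelled as lists of strings rather than pandas objects, and only the tuple check is taken from the diff. With the late check, a wrongly typed suffixes value passes silently whenever no column names overlap; with the early check it is always rejected.

```python
def check_suffixes(suffixes):
    """Raise if suffixes is not a tuple (same rule as the check in the diff)."""
    if not isinstance(suffixes, tuple):
        raise ValueError(
            f"Invalid type {type(suffixes)} is assigned to suffixes, "
            "only 'tuple' is allowed."
        )


def _apply(columns_per_obj, to_rename, suffixes):
    # Append each object's suffix to the overlapping labels only.
    return [
        [f"{col}{suffix}" if col in to_rename else col for col in cols]
        for cols, suffix in zip(columns_per_obj, suffixes)
    ]


def rename_late_check(columns_per_obj, to_rename, suffixes=None):
    # Order used in the PR: return early, validate only when renaming happens.
    if not to_rename or suffixes is None:
        return columns_per_obj
    check_suffixes(suffixes)
    return _apply(columns_per_obj, to_rename, suffixes)


def rename_early_check(columns_per_obj, to_rename, suffixes=None):
    # Order suggested in the review: validate first, then decide what to do.
    if suffixes is not None:
        check_suffixes(suffixes)
    if not to_rename or suffixes is None:
        return columns_per_obj
    return _apply(columns_per_obj, to_rename, suffixes)


no_overlap = [["a"], ["b"]]

# A list instead of a tuple goes unnoticed because nothing overlaps.
late_result = rename_late_check(no_overlap, set(), suffixes=["_x", "_y"])

try:
    rename_early_check(no_overlap, set(), suffixes=["_x", "_y"])
    early_raised = False
except ValueError:
    early_raised = True
```

Which behaviour is preferable is a design choice the thread leaves open.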

alimcmaster1 · 2019-11-17T15:16:14Z

Thanks for the PR @charlesdong1991! Left a few comments above.

jreback

The implementation needs quite some work; this is adding a lot of code that is not shared with the merge path. I am also not quite sure of the use case (yes, I opened the original issue). Can you show why this is actually needed?

charlesdong1991 · 2019-11-18T14:00:04Z

thanks for your comment, @jreback

I initially opened issue #29615 for this, and later found that you had already opened an issue long ago. My use case right now is that I have several tables with exactly the same format but different values in each field, because, let's say, they represent data from different years, e.g.

df1 = pd.DataFrame({"a": [1, 2, 3], "b": [2, 3, 4]})
df2 = pd.DataFrame({"a": [2, 3, 4], "b": [3, 4, 5]})
df3 = pd.DataFrame({"a": [4, 5, 6], "b": [6, 7, 8]})

I would like to put them together horizontally for further analysis. The easiest way in this case would be pd.concat([df1, df2, df3], axis=1), but after concatenation the dataframe contains duplicated column names. It would be very useful and convenient to have a suffixes argument to distinguish them, as merge has. This was my initial motivation for creating this PR.
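For comparison, the same three frames can already be told apart with existing pandas calls. The sketch below is not from the PR; the suffix strings and the key names are made up for illustration. Note that `DataFrame.add_suffix` renames every column, not only the overlapping ones, so it is not an exact equivalent of the proposed argument.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3], "b": [2, 3, 4]})
df2 = pd.DataFrame({"a": [2, 3, 4], "b": [3, 4, 5]})
df3 = pd.DataFrame({"a": [4, 5, 6], "b": [6, 7, 8]})
frames = [df1, df2, df3]

# Option 1: suffix all columns of each frame before concatenating.
suffixed = pd.concat(
    [df.add_suffix(f"_{i}") for i, df in enumerate(frames, start=1)],
    axis=1,
)

# Option 2: keep the original names and add an outer column level via keys.
keyed = pd.concat(frames, axis=1, keys=["y1", "y2", "y3"])

# The inputs are left untouched by both options.
unchanged = list(df1.columns)
```

With `keys`, the result has MultiIndex columns, so `keyed["y2"]` selects the second table's columns under their original names.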

jschendel · 2019-11-18T18:11:17Z

pandas/core/reshape/concat.py

+        if self._is_series:
+
+            # when _is_series is True, objs are actually column Index
+            overlap_cols = list(objs)


Do you actually need to convert to a list here? I think leaving it as an Index should still work the same way?

jschendel · 2019-11-18T18:12:09Z

pandas/core/reshape/concat.py

+            # when _is_series is True, objs are actually column Index
+            overlap_cols = list(objs)
+        else:
+            overlap_cols = chain.from_iterable([obj.columns for obj in objs])


I don't think you need the list comprehension and could just do a generator expression instead.

jschendel · 2019-11-18T18:21:12Z

pandas/core/reshape/concat.py

+            overlap_cols = list(objs)
+        else:
+            overlap_cols = chain.from_iterable([obj.columns for obj in objs])
+        to_rename = [col for col, cnt in Counter(overlap_cols).items() if cnt > 1]


Maybe make this a set comprehension since this is only used for x in to_rename lookups.
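The three suggestions on this hunk (no list conversion, a generator expression, a set comprehension) combine roughly as follows. This is a standalone sketch, not the PR's code: `Frame` is a hypothetical stand-in for any object with a `.columns` attribute.

```python
from collections import Counter, namedtuple
from itertools import chain

# Hypothetical stand-in for a DataFrame: only the .columns attribute matters.
Frame = namedtuple("Frame", "columns")

objs = [Frame(["a", "b"]), Frame(["a", "c"]), Frame(["d"])]

# Generator expression instead of a list comprehension: no intermediate list.
all_cols = chain.from_iterable(obj.columns for obj in objs)

# Set comprehension: to_rename is only used for membership tests.
to_rename = {col for col, cnt in Counter(all_cols).items() if cnt > 1}
```

An empty set is falsy, so a later `if not to_rename` check works unchanged.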

jschendel · 2019-11-18T18:44:01Z

pandas/core/reshape/concat.py

+        as is with duplicated column names.
+
+        This has no effect if there is no overlapping column names or if axis=0.
+


Can you add .. versionadded:: 1.0.0?

jschendel · 2019-11-18T18:55:44Z

pandas/core/reshape/concat.py

+        if not isinstance(suffixes, tuple):
+            raise ValueError(
+                f"Invalid type {type(suffixes)} is assigned to suffixes, only "
+                f"'tuple' is allowed."


nitpick: I don't think you need the second line to be an f-string

jschendel · 2019-11-18T19:33:29Z

pandas/core/reshape/concat.py

+                "equal to number of suffixes"
+            )
+
+        def renamer(x, suffix):


It looks like there's basically an identical definition of this function in core/reshape/merge.py, so would be nice to be able to reuse this. A little tricky in that this is a nested function, so maybe can do as a follow-up once things are set in stone.
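One way such reuse could look is a module-level function that takes `to_rename` explicitly instead of closing over it, bound per object with `functools.partial`. The name `rename_overlapping` and its signature are hypothetical; neither merge.py nor this PR defines it in this form, and the fall-through `return label` is assumed from how the renamer is used.

```python
from functools import partial


def rename_overlapping(label, suffix, to_rename):
    """Hypothetical shared helper: suffix a label only if it overlaps."""
    if label in to_rename and suffix is not None:
        return f"{label}{suffix}"
    return label


to_rename = {"a"}

# Bind the per-object suffix once, then map over that object's labels.
left = partial(rename_overlapping, suffix="_x", to_rename=to_rename)
right = partial(rename_overlapping, suffix="_y", to_rename=to_rename)

left_cols = [left(col) for col in ["a", "b"]]
right_cols = [right(col) for col in ["a", "c"]]
```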

jschendel · 2019-11-18T19:39:16Z

pandas/core/reshape/concat.py

+            overlap_cols = chain.from_iterable([obj.columns for obj in objs])
+        to_rename = [col for col, cnt in Counter(overlap_cols).items() if cnt > 1]
+
+        if len(to_rename) == 0 or suffixes is None:


maybe a little more pythonic to check the boolness of to_rename: len(to_rename) == 0 --> not to_rename

jschendel · 2019-11-18T21:18:10Z

pandas/core/reshape/concat.py

+
+        for obj, suffix in zip(objs, suffixes):
+            col_renamer = partial(renamer, suffix=suffix)
+            obj.columns = _transform_index(obj.columns, col_renamer)


I think this line is causing the original dataframes to be modified as well:

In [1]: import pandas as pd; pd.__version__
Out[1]: '0.26.0.dev0+947.gc8570707c'

In [2]: df1 = pd.DataFrame({'A': list('ab'), 'B': [0, 1]})

In [3]: df2 = pd.DataFrame({'A':list('ac'), 'C': [100, 200]})

In [4]: df3 = pd.concat([df1, df2], axis=1, suffixes=('_x', '_y'))

In [5]: df1.columns
Out[5]: Index(['A_x', 'B'], dtype='object')

In [6]: df1
Out[6]:
  A_x  B
0   a  0
1   b  1

In [7]: df2.columns
Out[7]: Index(['A_y', 'C'], dtype='object')

In [8]: df2
Out[8]:
  A_y    C
0   a  100
1   c  200

Root cause could be occurring elsewhere though, but it's an issue nonetheless. Would be nice to add a test for the above.
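The difference between mutating and non-mutating renames can be reproduced outside the PR branch. In this sketch both helper names are hypothetical: the first mirrors the in-place `.columns` assignment flagged above, and the second uses `DataFrame.rename`, which returns a new object by default. Whether the PR would have adopted this fix is not stated in the thread.

```python
import pandas as pd


def add_suffix_inplace(obj, to_rename, suffix):
    # Assigning to .columns changes the caller's DataFrame.
    obj.columns = [f"{c}{suffix}" if c in to_rename else c for c in obj.columns]
    return obj


def add_suffix_copy(obj, to_rename, suffix):
    # rename() returns a new DataFrame and leaves the input untouched.
    return obj.rename(columns=lambda c: f"{c}{suffix}" if c in to_rename else c)


df1 = pd.DataFrame({"A": list("ab"), "B": [0, 1]})

renamed = add_suffix_copy(df1, {"A"}, "_x")
original_after_copy = list(df1.columns)

add_suffix_inplace(df1, {"A"}, "_x")
original_after_inplace = list(df1.columns)
```

A regression test for the issue could assert that the input frames' columns are unchanged after the call, as `original_after_copy` is here.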

jschendel · 2019-11-18T21:21:12Z

pandas/tests/reshape/test_concat.py

+    [
+        (
+            [pd.Series([1, 2], name="a"), pd.Series([2, 3], name="a")],
+            ("_x", "_y"),


Can you also add tests for if dupe suffixes are specified, e.g. ("_x", "_x")?
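What such a test would need to pin down can be seen in a plain-Python model of the renaming step. The `apply_suffixes` helper is hypothetical, with columns modelled as lists of strings. With repeated suffixes the output labels are still duplicated, and the thread does not settle whether that should raise or be accepted.

```python
from collections import Counter


def apply_suffixes(columns_per_obj, suffixes):
    """Hypothetical model: suffix only labels that occur more than once."""
    counts = Counter(col for cols in columns_per_obj for col in cols)
    to_rename = {col for col, n in counts.items() if n > 1}
    return [
        [f"{col}{suffix}" if col in to_rename else col for col in cols]
        for cols, suffix in zip(columns_per_obj, suffixes)
    ]


distinct = apply_suffixes([["a"], ["a"]], ("_x", "_y"))
dupes = apply_suffixes([["a"], ["a"]], ("_x", "_x"))
```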

jschendel · 2019-11-18T21:39:55Z

pandas/core/reshape/concat.py

+        Suffix to apply to overlapping column names for each concatenated object
+        respectively. If the length of suffixes does not match with number of
+        concatenated objects, an error will raise. If None, the output will remain
+        as is with duplicated column names.


Do we want to replicate the behavior of merge in regards to suffixes=(False, False)?

pandas/pandas/core/frame.py

Lines 203 to 206 in e246c3b

suffixes : tuple of (str, str), default ('_x', '_y')
    Suffix to apply to overlapping column names in the left and right
    side, respectively. To raise an exception on overlapping columns use
    (False, False).
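The merge behaviour referred to in the quoted docstring can be checked with a small example. The frames and column names are made up for illustration; only the `suffixes` values come from the docstring.

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2], "v": [10, 20]})
right = pd.DataFrame({"key": [1, 2], "v": [30, 40]})

# Default suffixes disambiguate the overlapping non-key column.
default_cols = list(pd.merge(left, right, on="key").columns)

# (False, False) turns the overlap into an error instead.
try:
    pd.merge(left, right, on="key", suffixes=(False, False))
    raised = False
except ValueError:
    raised = True
```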

WillAyd · 2019-11-19T04:05:25Z

Been thinking this over and while this is a nice effort I think I am -1 on adding this to the concat API. It feels intuitive for merge where you are essentially taking a union of columns, but with pd.concat you are just stacking things.

Seems like if anything this de-duplication would really be a Index method (though not sure that’s worth adding to the API either)
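As a rough illustration of what label de-duplication at the Index level might mean, here is a plain-Python sketch. The `dedupe_labels` function, its numbering scheme, and the `sep` parameter are all assumptions; the thread does not specify any such method.

```python
from collections import Counter


def dedupe_labels(labels, sep="_"):
    """Hypothetical helper: append a running number to repeated labels."""
    totals = Counter(labels)
    seen = Counter()
    out = []
    for label in labels:
        if totals[label] > 1:
            seen[label] += 1
            out.append(f"{label}{sep}{seen[label]}")
        else:
            out.append(label)
    return out


deduped = dedupe_labels(["a", "b", "a", "c", "a"])
```

The result could be assigned to a copy of a frame's columns after a plain `pd.concat(..., axis=1)`.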

charlesdong1991 · 2019-11-19T07:05:51Z

Many thanks for taking the time to give me such nice reviews, @jschendel. I appreciate it a lot!

However, I see that @WillAyd and @jreback have concerns about whether this feature should be added to pd.concat, and I agree that the change mainly de-duplicates the column Index by adding a suffix, since pd.concat is just stacking. The only small benefit of this feature might be that, with many tables sharing identical column names, stacking them together while renaming with suffixes becomes easier.

I am open to advice on whether this enhancement is worth having. If it is, I will start code changes based on @jschendel's nice reviews.

WillAyd · 2019-12-17T17:54:29Z

Thanks for the contribution @charlesdong1991 but based on feedback I don't think we have an appetite for this one

charlesdong1991 added 5 commits December 3, 2018 17:43

remove \n from docstring

7e461a1

fix conflicts

1314059

Merge remote-tracking branch 'upstream/master'

8bcb313

Merge remote-tracking branch 'upstream/master' into add_suffixes_concat

a4f9d14

Add suffixes argument for pd.concat

13a2930

charlesdong1991 changed the title ~~Add suffixes argument for pd.concat~~ ENH: Add suffixes argument for pd.concat Nov 17, 2019

charlesdong1991 added 7 commits November 17, 2019 11:22

Add whatsnew note

901a21b

fix linting

fd64695

fix linting

b7e97d6

Merge remote-tracking branch 'upstream/master' into add_suffixes_concat

22e3250

fix linting

fd53b09

fix tests

1252f7b

add docstring

233adee

alimcmaster1 requested changes Nov 17, 2019

alimcmaster1 added the Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) and Enhancement labels Nov 17, 2019

charlesdong1991 added 2 commits November 17, 2019 16:59

code change on comments

f7d3d59

fix error message

c857070

jreback requested changes Nov 18, 2019

jschendel reviewed Nov 18, 2019

WillAyd closed this Dec 17, 2019

WillAyd mentioned this pull request Dec 17, 2019

ENH: add suffixes arg to concat #21791

Closed
