Skip to content

ENH: try to preserve the dtype on combine_first for the case where the two DataFrame objects have the same columns #39051

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Jan 15, 2021
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
22 changes: 20 additions & 2 deletions pandas/core/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -6425,7 +6425,9 @@ def combine(
# convert_objects just in case
return self._constructor(result, index=new_index, columns=new_columns)

def combine_first(self, other: DataFrame) -> DataFrame:
def combine_first(
self, other: DataFrame, preserve_dtypes: bool = False
) -> DataFrame:
"""
Update null elements with value in the same location in `other`.

Expand All @@ -6438,6 +6440,11 @@ def combine_first(self, other: DataFrame) -> DataFrame:
other : DataFrame
Provided DataFrame to use to fill null values.

preserve_dtypes : bool, default False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we do not want to add a flag for this. simply change it.

Please add some examples for this behaivor

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking that maybe it is a good idea to keep the current behavior as default, and provide the new behavior as an option

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no its better to just fix this, you can add a whatsnew note in 1.3. everywhere else we cast to common dtypes, this should be no different.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm running into some failed tests that exceed my understand of the lib. Is it expected that if a Series is constructed from a list of None then the result of this combined with some other Series should have the latter's dtype (coercing to the respective NaN value)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

list of None -> object, so combined -> object

try to preserve the column dtypes after combining

.. versionadded:: 1.2.1

Returns
-------
DataFrame
Expand Down Expand Up @@ -6482,7 +6489,18 @@ def combiner(x, y):

return expressions.where(mask, y_values, x_values)

return self.combine(other, combiner, overwrite=False)
combined = self.combine(other, combiner, overwrite=False)

if preserve_dtypes:
dtypes = {
col: find_common_type([self.dtypes[col], other.dtypes[col]])
for col in self.columns.intersection(other.columns)
}

if dtypes:
combined = combined.astype(dtypes)

return combined

def update(
self,
Expand Down
43 changes: 43 additions & 0 deletions pandas/tests/frame/methods/test_combine_first.py
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,10 @@ def test_combine_first_mixed(self):
combined = f.combine_first(g)
tm.assert_frame_equal(combined, exp)

exp = DataFrame({"A": list("abab"), "B": [0, 1, 0, 1]}, index=[0, 1, 5, 6])
combined = f.combine_first(g, preserve_dtypes=True)
tm.assert_frame_equal(combined, exp)

def test_combine_first(self, float_frame):
# disjoint
head, tail = float_frame[:5], float_frame[5:]
Expand Down Expand Up @@ -363,9 +367,16 @@ def test_combine_first_int(self):
expected_12 = DataFrame({"a": [0, 1, 3, 5]}, dtype="float64")
tm.assert_frame_equal(result_12, expected_12)

result_12 = df1.combine_first(df2, preserve_dtypes=True)
expected_12 = DataFrame({"a": [0, 1, 3, 5]})
tm.assert_frame_equal(result_12, expected_12)

result_21 = df2.combine_first(df1)
expected_21 = DataFrame({"a": [1, 4, 3, 5]}, dtype="float64")
tm.assert_frame_equal(result_21, expected_21)

result_21 = df2.combine_first(df1, preserve_dtypes=True)
expected_21 = DataFrame({"a": [1, 4, 3, 5]})
tm.assert_frame_equal(result_21, expected_21)

@pytest.mark.parametrize("val", [1, 1.0])
Expand Down Expand Up @@ -439,3 +450,35 @@ def test_combine_first_with_nan_multiindex():
index=mi_expected,
)
tm.assert_frame_equal(res, expected)


def test_combine_preserve_dtypes():
a = Series(["a", "b"], index=range(2))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add the issue number as a comment

b = Series(range(2), index=range(2))
f = DataFrame({"A": a, "B": b})

c = Series(["a", "b"], index=range(5, 7))
b = Series(range(-1, 1), index=range(5, 7))
g = DataFrame({"B": b, "C": c})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: can you avoid 1-letter variable names? makes it harder to grep for things


exp = DataFrame(
{
"A": ["a", "b", np.nan, np.nan],
"B": [0.0, 1.0, -1.0, 0.0],
"C": [np.nan, np.nan, "a", "b"],
},
index=[0, 1, 5, 6],
)
combined = f.combine_first(g)
tm.assert_frame_equal(combined, exp)

exp = DataFrame(
{
"A": ["a", "b", np.nan, np.nan],
"B": [0, 1, -1, 0],
"C": [np.nan, np.nan, "a", "b"],
},
index=[0, 1, 5, 6],
)
combined = f.combine_first(g, preserve_dtypes=True)
tm.assert_frame_equal(combined, exp)