feat: df.join lsuffix and rsuffix support #1857

Open
wants to merge 21 commits into base: main

Genesis929
Collaborator

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> 🦕

@product-auto-label product-auto-label bot added size: m Pull request size is medium. api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. labels Jun 26, 2025
@Genesis929 Genesis929 added the owlbot:run Add this label to trigger the Owlbot post processor. label Jun 26, 2025
@gcf-owl-bot gcf-owl-bot bot removed the owlbot:run Add this label to trigger the Owlbot post processor. label Jun 26, 2025
@product-auto-label product-auto-label bot added size: l Pull request size is large. and removed size: m Pull request size is medium. labels Jun 26, 2025
@Genesis929 Genesis929 added the owlbot:run Add this label to trigger the Owlbot post processor. label Jun 26, 2025
@gcf-owl-bot gcf-owl-bot bot removed the owlbot:run Add this label to trigger the Owlbot post processor. label Jun 26, 2025
@Genesis929 Genesis929 marked this pull request as ready for review June 26, 2025 19:48
@Genesis929 Genesis929 requested review from a team as code owners June 26, 2025 19:48
@Genesis929 Genesis929 requested a review from tswast June 26, 2025 19:48
@Genesis929 Genesis929 added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Jun 26, 2025
@bigframes-bot bigframes-bot removed the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Jun 26, 2025
["string_col", "int64_col", "int64_too"]
].rename(columns={"int64_too": "int64_col"})
pd_result = pd_df_a.join(pd_df_b, how=how, lsuffix="_l", rsuffix="_r")
print(pd_result)
Collaborator

Remove leftover print() statements.

PS. Adding --pdb to your pytest command line arguments makes dropping into a debugger to inspect variables really easy. https://docs.pytest.org/en/stable/how-to/failures.html#dropping-to-pdb-on-failures

Collaborator Author

Thanks!

Comment on lines +2469 to +2470
if how == "cross":
    return
Collaborator

I think it'd be worth adding a test that ValueError is not raised for this condition with a cross join.

Collaborator Author

Cross join actually raises a different error; added a match for it.

@Genesis929 Genesis929 requested a review from tswast July 7, 2025 20:37
f"bigframes_left_col_name_{i}" if col_name != on else on_col_name
for i, col_name in enumerate(left_col_original_names)
]
left.columns = pandas.Index(left_col_temp_names)
Collaborator

This seems dangerous. We haven't made a copy of self, so I'm uncomfortable with mutating it. If we must do this, then please either:

  1. make a copy of self first
  2. or add a finally block that resets the names back to the originals in case anything went wrong.

I prefer (1) since it's less likely to cause problems if we're in a multi-threaded environment.
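
For illustration, the two options could look roughly like the sketch below. It uses a hypothetical `Frame` stand-in that only tracks column names, not the actual bigframes DataFrame, and the helper names `rename_on_copy` and `rename_with_reset` are made up for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Frame:
    """Hypothetical stand-in for a DataFrame; it only tracks column names."""

    columns: List[str] = field(default_factory=list)

    def copy(self) -> "Frame":
        # Copy the list so the new object shares no mutable state with self.
        return Frame(columns=list(self.columns))


def rename_on_copy(frame: Frame, temp_names: List[str]) -> Frame:
    """Option 1: copy first, then assign temporary names to the copy only."""
    left = frame.copy()
    left.columns = list(temp_names)
    return left


def rename_with_reset(
    frame: Frame, temp_names: List[str], work: Callable[[Frame], None]
) -> None:
    """Option 2: mutate in place, but always restore the names afterwards."""
    original = list(frame.columns)
    frame.columns = list(temp_names)
    try:
        work(frame)
    finally:
        # Runs even if work() raises, so the caller's names are restored.
        frame.columns = original
```

With option 1 the caller's object is never touched, so there is no window in which another thread could observe the temporary names. Option 2 restores the names on failure but still exposes them while `work` runs.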

Collaborator Author

Updated to use left = self.copy().

f"bigframes_left_idx_name_{i}" for i in range(len(left_idx_original_names))
]
if left._has_index:
left.index.names = left_idx_names_in_cols
Collaborator

Same here. Mutating the index is dangerous. Can we avoid this?

Collaborator Author

We need to avoid duplicate names when joining, or the column reordering won't work, so with the current join logic we can't avoid this.
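
A rough sketch of the temporary-name round trip described here, in plain Python. `make_temp_names` is a hypothetical helper, the prefixes are borrowed from the diff fragments in this thread, and the special handling of the `on` column is left out.

```python
from typing import Dict, List, Sequence, Tuple


def make_temp_names(
    original_names: Sequence[str], prefix: str
) -> Tuple[List[str], Dict[str, str]]:
    """Build unique positional names plus a mapping back to the originals."""
    temp_names = [f"{prefix}_{i}" for i in range(len(original_names))]
    restore = dict(zip(temp_names, original_names))
    return temp_names, restore


# Both sides contain "b", so the original names would collide after a join.
left_temp, left_restore = make_temp_names(["a", "b"], "bigframes_left_col_name")
right_temp, right_restore = make_temp_names(["b", "c"], "bigframes_right_col_name")

# The temporary names are unique across both sides, so the combined columns
# can be reordered by name without ambiguity.
combined = left_temp + right_temp
assert len(set(combined)) == len(combined)

# Once reordering is done, map every temporary name back to its original.
restore = {**left_restore, **right_restore}
restored = [restore[name] for name in combined]
```

Because the temporary names are derived from position rather than from the original labels, they stay unique even when one side already contains duplicate column names.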

f"bigframes_right_col_name_{i}"
for i in range(len(right_col_original_names))
]
right.columns = pandas.Index(right_col_temp_names)
Collaborator

Same here.

Collaborator Author

We need to avoid duplicate names when joining, or the column reordering won't work, so with the current join logic we can't avoid this.

right_columns,
lsuffix: str = "",
rsuffix: str = "",
extra_col: typing.Optional[str] = None,
Collaborator

Please add a docstring explaining this extra_col parameter and when it is intended to be used.

Collaborator Author

Added.

final_col_names.append(f"{col_name}{rsuffix}")
else:
final_col_names.append(col_name)
self.columns = pandas.Index(final_col_names)
Collaborator

I think we should only be modifying self if we're doing an inplace operation, right? Why is self getting changed? Can we avoid this?

Collaborator Author

In this function, self is actually combined_df, so it should be safe. Changed to self.copy() for additional safety.
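
One way to sidestep the question of mutating self is to compute the final names in a pure function and assign them to a copy. The sketch below assumes the semantics are modeled on pandas' `DataFrame.join` (only overlapping names get a suffix, and an overlap with no suffix is an error). `resolve_join_column_names` is a hypothetical name, not the helper in this PR, and the example column names are illustrative.

```python
from typing import List, Sequence


def resolve_join_column_names(
    left_columns: Sequence[str],
    right_columns: Sequence[str],
    lsuffix: str = "",
    rsuffix: str = "",
) -> List[str]:
    """Return output column names, suffixing only names present on both sides."""
    overlap = set(left_columns) & set(right_columns)
    if overlap and not lsuffix and not rsuffix:
        # Assumed to mirror pandas, which rejects overlaps without a suffix.
        raise ValueError(
            f"columns overlap but no suffix specified: {sorted(overlap)}"
        )

    final_col_names: List[str] = []
    for col_name in left_columns:
        if col_name in overlap:
            final_col_names.append(f"{col_name}{lsuffix}")
        else:
            final_col_names.append(col_name)
    for col_name in right_columns:
        if col_name in overlap:
            final_col_names.append(f"{col_name}{rsuffix}")
        else:
            final_col_names.append(col_name)
    return final_col_names


names = resolve_join_column_names(
    ["string_col", "int64_col"],
    ["int64_col", "bool_col"],
    lsuffix="_l",
    rsuffix="_r",
)
# names == ["string_col", "int64_col_l", "int64_col_r", "bool_col"]
```

Since the function takes and returns plain lists, it can be unit tested without a session, and the caller decides which object receives the new names.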

bf_df_b = scalars_df_index.dropna()[
["string_col", "int64_col", "int64_too"]
].rename(columns={"int64_too": "int64_col"})
bf_result = bf_df_a.join(bf_df_b, how=how, lsuffix="_l", rsuffix="_r").to_pandas()
Collaborator

Can we add some checks that bf_df_a's column names and index names didn't get modified?

@Genesis929 Genesis929 requested a review from tswast July 28, 2025 20:09
Labels
api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. size: l Pull request size is large.