Commit 10c0446

feat: ensure Series.str.len() can get length of array columns (#497)
1 parent d51fa84 commit 10c0446

2 files changed: 20 additions and 2 deletions

tests/system/conftest.py

Lines changed: 0 additions & 2 deletions
@@ -357,8 +357,6 @@ def nested_pandas_df() -> pd.DataFrame:
         DATA_DIR / "nested.jsonl",
         lines=True,
     )
-    tests.system.utils.convert_pandas_dtypes(df, bytes_col=True)
-
     df = df.set_index("rowindex")
     return df
 
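
For context, the fixture above loads a JSON Lines file and indexes it by `rowindex`. The sketch below is a rough illustration of what pandas does with an array-valued field when reading JSON Lines. It is not the project's fixture. The two inline records are made up as a stand-in for `nested.jsonl`, whose real contents are not shown in this commit, and treating `event_sequence` as a list of strings is an assumption.

```python
import io

import pandas as pd

# Hypothetical stand-in for nested.jsonl; the real file is not shown in this commit.
JSONL = (
    '{"rowindex": 0, "event_sequence": ["start", "click", "end"]}\n'
    '{"rowindex": 1, "event_sequence": []}\n'
)

# lines=True parses one JSON object per line, as the fixture does.
df = pd.read_json(io.StringIO(JSONL), lines=True)
df = df.set_index("rowindex")

# Array-valued fields come back as plain Python lists in an object column.
print(df["event_sequence"].dtype)
print(type(df.loc[0, "event_sequence"]))
```

With this input the array field stays a column of Python lists, which is the shape the pandas side of the new test relies on when it calls `.str.len()`.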

tests/system/small/operations/test_strings.py

Lines changed: 20 additions & 0 deletions
@@ -181,6 +181,26 @@ def test_len(scalars_dfs):
     )
 
 
+def test_len_with_array_column(nested_df, nested_pandas_df):
+    """
+    Series.str.len() is expected to work on columns containing lists as well as strings.
+
+    See: https://stackoverflow.com/a/41340543/101923
+    """
+    col_name = "event_sequence"
+    bf_series: bigframes.series.Series = nested_df[col_name]
+    bf_result = bf_series.str.len().to_pandas()
+    pd_result = nested_pandas_df[col_name].str.len()
+
+    # One of dtype mismatches to be documented. Here, the `bf_result.dtype` is `Int64` but
+    # the `pd_result.dtype` is `float64`: https://github.com/pandas-dev/pandas/issues/51948
+    assert_series_equal(
+        pd_result.astype(pd.Int64Dtype()),
+        bf_result,
+        check_index_type=False,
+    )
+
+
 def test_lower(scalars_dfs):
     scalars_df, scalars_pandas_df = scalars_dfs
     col_name = "string_col"
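
The pandas half of the new test can be tried without BigQuery. The sketch below uses made-up data, not the test fixture, to show `Series.str.len()` counting list elements and one way a `float64` result can arise: a missing row becomes `NaN`. Whether the real fixture contains missing values is an assumption. The cast to `Int64` mirrors the one applied to `pd_result` in the test.

```python
import pandas as pd

# Made-up data: two lists and one missing value (not the real fixture).
s = pd.Series([["a", "b", "c"], [], None], name="event_sequence")

# On an object column of lists, .str.len() returns the element count per row.
lengths = s.str.len()
print(lengths.dtype)  # float64 here, because the missing row becomes NaN

# Cast to the nullable integer dtype, as the test does before comparing.
nullable = lengths.astype(pd.Int64Dtype())
print(nullable)
```

Casting the pandas result rather than the BigQuery DataFrames result keeps missing rows as `<NA>` on both sides of the comparison.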
