Commit 8ecbea3

Add DataFrame.iter_columns() and simplify (#326)
* add DataFrame.column_iter
* iter_columns instead
* lint
1 parent 21271f5 commit 8ecbea3

4 files changed, with 22 additions and 30 deletions

spec/API_specification/dataframe_api/dataframe_object.py

Lines changed: 12 additions & 11 deletions
@@ -3,7 +3,7 @@
 from typing import TYPE_CHECKING, Any, Literal, NoReturn, Protocol

 if TYPE_CHECKING:
-    from collections.abc import Mapping, Sequence
+    from collections.abc import Iterator, Mapping, Sequence

 from typing_extensions import Self

@@ -275,6 +275,10 @@ def schema(self) -> dict[str, DType]:
         """
         ...

+    def iter_columns(self) -> Iterator[Column]:
+        """Return iterator over columns."""
+        ...
+
     def sort(
         self,
         *keys: str,
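
The specification only declares the signature of `iter_columns`. As a minimal sketch of what a conforming method might look like, the toy classes below yield columns in `column_names` order. `ToyColumn` and `ToyDataFrame` are hypothetical stand-ins invented for illustration, and the ordering guarantee is an assumption, not something this diff states.

```python
from __future__ import annotations

from collections.abc import Iterator
from dataclasses import dataclass


@dataclass
class ToyColumn:
    """Hypothetical stand-in for the standard's Column (illustration only)."""

    name: str
    values: list[float]


class ToyDataFrame:
    """Hypothetical dict-backed frame; not a real implementation of the spec."""

    def __init__(self, data: dict[str, list[float]]) -> None:
        self._data = data

    @property
    def column_names(self) -> list[str]:
        return list(self._data)

    def col(self, name: str) -> ToyColumn:
        return ToyColumn(name, self._data[name])

    def iter_columns(self) -> Iterator[ToyColumn]:
        # Assumption: columns are yielded in the same order as `column_names`.
        for name in self.column_names:
            yield self.col(name)


toy = ToyDataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
names = [col.name for col in toy.iter_columns()]
```

With this shape, `for col in df.iter_columns()` replaces the two-step pattern of looping over `column_names` and calling `col(name)` each time.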
@@ -905,23 +909,20 @@ def persist(self) -> Self:
         .. code-block:: python

             df: DataFrame
-            features = []
             result = df.std() > 0
             result = result.persist()
-            for column_name in df.column_names:
-                if result.col(column_name).get_value(0):
-                    features.append(column_name)
+            features = [col.name for col in df.iter_columns() if col.get_value(0)]

         instead of this:

         .. code-block:: python

             df: DataFrame
-            features = []
-            for column_name in df.column_names:
-                # Do NOT call `persist` on a `DataFrame` within a for-loop!
-                # This may re-trigger the same computation multiple times
-                if df.persist().col(column_name).std() > 0:
-                    features.append(column_name)
+            result = df.std() > 0
+            features = [
+                # Do NOT do this! This will trigger execution of the entire
+                # pipeline for element in the for-loop!
+                col.name for col in df.iter_columns() if col.get_value(0).persist()
+            ]
         """
         ...
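
The cost difference between the two docstring examples can be illustrated with a toy lazy object that counts how often its pipeline runs. `ToyLazyFrame`, `ToyEagerFrame`, and the `executions` counter are hypothetical, invented for this sketch; real implementations of `persist` will differ. Note also that the added "do this" line iterates over `df.iter_columns()`, whereas the removed loop read from `result`; the sketch reads from the persisted result, on the assumption that this is what the example intends.

```python
from __future__ import annotations


class ToyEagerFrame:
    """Hypothetical materialised frame: reads are cheap dictionary lookups."""

    def __init__(self, data: dict[str, list[bool]]) -> None:
        self._data = data

    def value(self, name: str, row: int) -> bool:
        return self._data[name][row]


class ToyLazyFrame:
    """Hypothetical lazy frame that counts how often its pipeline is executed."""

    def __init__(self, data: dict[str, list[bool]]) -> None:
        self._data = data
        self.executions = 0

    @property
    def column_names(self) -> list[str]:
        return list(self._data)

    def persist(self) -> ToyEagerFrame:
        # Stand-in for running the entire upstream pipeline.
        self.executions += 1
        return ToyEagerFrame(dict(self._data))


# Recommended shape: persist once, then read from the persisted result.
once = ToyLazyFrame({"a": [True], "b": [False], "c": [True]})
persisted = once.persist()
features_once = [n for n in once.column_names if persisted.value(n, 0)]

# Discouraged shape: persist inside the loop, once per column.
looped = ToyLazyFrame({"a": [True], "b": [False], "c": [True]})
features_looped = [n for n in looped.column_names if looped.persist().value(n, 0)]
```

Both shapes select the same columns, but in this toy model the second runs the pipeline once per column instead of once overall.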

spec/API_specification/examples/01_standardise_columns.py

Lines changed: 5 additions & 6 deletions
@@ -9,11 +9,10 @@
 def my_dataframe_agnostic_function(df_non_standard: SupportsDataFrameAPI) -> Any:
     df = df_non_standard.__dataframe_consortium_standard__(api_version="2023.09-beta")

-    for column_name in df.column_names:
-        if column_name == "species":
-            continue
-        new_column = df.col(column_name)
-        new_column = (new_column - new_column.mean()) / new_column.std()
-        df = df.assign(new_column.rename(f"{column_name}_scaled"))
+    new_columns = [
+        ((col - col.mean()) / col.std()).rename(f"{col.name}_scaled")
+        for col in df.iter_columns()
+    ]
+    df = df.assign(*new_columns)

     return df.dataframe
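
One behavioural difference is visible in this hunk: the removed loop skipped the `species` column, while the added comprehension scales every column. If the skip is still wanted, a filter clause in the comprehension would restore it. The sketch below shows that shape in plain Python; `ToyColumn`, `scale`, and the `sepal_length` column are hypothetical and made up for illustration, not the standard's API or the example's real data.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class ToyColumn:
    """Hypothetical named column holding plain Python values."""

    name: str
    values: list


def scale(col: ToyColumn) -> ToyColumn:
    # (x - mean) / std, using the sample standard deviation
    # (an arbitrary choice for this sketch).
    mu = mean(col.values)
    sigma = stdev(col.values)
    return ToyColumn(f"{col.name}_scaled", [(v - mu) / sigma for v in col.values])


columns = [
    ToyColumn("sepal_length", [1.0, 2.0, 3.0]),
    ToyColumn("species", ["setosa", "setosa", "virginica"]),
]

# The `if` clause plays the role of the removed `continue`.
new_columns = [scale(col) for col in columns if col.name != "species"]
```

Without the filter, `scale` would be applied to the string-valued column and fail in this toy model, which is roughly what the original `continue` guarded against.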

spec/API_specification/examples/04_datatypes.py

Lines changed: 1 addition & 5 deletions
@@ -12,11 +12,7 @@ def main(df_raw: SupportsDataFrameAPI) -> SupportsDataFrameAPI:
     df = df_raw.__dataframe_consortium_standard__(api_version="2023-11.beta").persist()
     pdx = df.__dataframe_namespace__()
     df = df.select(
-        *[
-            col_name
-            for col_name in df.column_names
-            if isinstance(df.col(col_name).dtype, pdx.Int64)
-        ],
+        *[col.name for col in df.iter_columns() if isinstance(col.dtype, pdx.Int64)],
     )
     arr = df.to_array()
     arr = some_array_function(arr)

spec/design_topics/execution_model.md

Lines changed: 4 additions & 8 deletions
@@ -11,17 +11,13 @@ not be supported in some cases.
 For example, let's consider the following:
 ```python
 df: DataFrame
-features = []
-for column_name in df.column_names:
-    if df.col(column_name).std() > 0:
-        features.append(column_name)
-return features
+features = [col.name for col in df.iter_columns() if col.std() > 0]
 ```
-If `df` is a lazy dataframe, then the call `df.col(column_name).std() > 0` returns
+If `df` is a lazy dataframe, then the call `col.std() > 0` returns
 a (ducktyped) Python boolean scalar. No issues so far. Problem is,
-what happens when `if df.col(column_name).std() > 0` is called?
+what happens when `if col.std() > 0` is called?

-Under the hood, Python will call `(df.col(column_name).std() > 0).__bool__()` in
+Under the hood, Python will call `(col.std() > 0).__bool__()` in
 order to extract a Python boolean. This is a problem for "lazy" implementations,
 as the laziness needs breaking in order to evaluate the above.
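
The `__bool__` hook described above can be observed directly. In this sketch, `LazyBool` is a hypothetical lazy scalar invented for illustration; it records each time it is forced to materialise, which is the point at which a lazy implementation would have to execute its pipeline.

```python
from __future__ import annotations

from collections.abc import Callable


class LazyBool:
    """Hypothetical lazy boolean scalar wrapping a deferred computation."""

    def __init__(self, compute: Callable[[], bool]) -> None:
        self._compute = compute
        self.forced = 0  # number of times the value was materialised

    def __bool__(self) -> bool:
        # Python calls this for `if x:`, `bool(x)`, and comprehension filters.
        self.forced += 1
        return self._compute()


flag = LazyBool(lambda: 2.5 > 0)
forced_before = flag.forced  # building the object runs nothing

if flag:  # implicitly calls flag.__bool__()
    branch_taken = True
else:
    branch_taken = False
forced_after = flag.forced
```

In a comprehension such as `[col.name for col in ... if col.std() > 0]`, this hook fires once per column, so each iteration would break laziness separately. An implementation could also choose to raise inside `__bool__` rather than compute; this text does not say which behaviour is expected.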
