Add a fill_nan method to dataframe and column (#167)
Conversation
MarcoGorelli left a comment:
looks good (barring docs build error)
Addresses half of data-apisgh-142 (`fill_null` is more complex, and not included here).
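For context, a minimal usage sketch of the `fill_nan` method being added here. This is hypothetical: `ns` stands in for a standard-compliant implementation namespace, and `column_from_sequence` is the constructor discussed later in this thread.

```python
# Sketch under assumptions: `ns` is a hypothetical module exposing the draft API.
col = ns.column_from_sequence([1.0, float("nan"), 3.0], dtype="float64")

filled = col.fill_nan(0.0)          # NaN values replaced with the float 0.0
as_missing = col.fill_nan(ns.null)  # or turned into missing values via `null`
```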
It's green now. I had to do the …
MarcoGorelli left a comment:
looks good to me
| """ | ||
| ... | ||
|
|
||
| def fill_nan(self, value: float | 'null', /): |
A bit unrelated to this PR, but having `null` be typed differently feels like an anti-pattern here. It differentiates between a float scalar (which is implicitly nullable based on our current scalar definition) and a null scalar.
> float scalar (which is implicitly nullable based on our current scalar definition)
We don't have numpy-style scalars (i.e., instances of a dtype) though? That's why we need a separate `null` object, so that one can construct a column containing nulls with `column_from_sequence([1.5, 2.5, null, 4.5])`.
We could add dtype instances and specify that `null` derives from `float`, but that seems like a huge can of worms for no gain at all. And I think the consensus from what we learned from numpy is that array scalars were a major design mistake.
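For illustration, one common way to realize such a standalone `null` object in Python is a singleton sentinel. A minimal sketch (the `_Null` class name is hypothetical, not from the spec):

```python
class _Null:
    """Sentinel type whose single instance represents a missing value."""

    _instance = None

    def __new__(cls) -> "_Null":
        # Always hand back the same object so `x is null` identity checks work.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __repr__(self) -> str:
        return "null"


null = _Null()  # e.g. column_from_sequence([1.5, 2.5, null, 4.5], dtype='float64')
```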
nitpick: we can construct a column with `column_from_sequence([1.5, 2.5, null, 4.5], dtype='float64')` (just pointing this out because it's early days, and I wouldn't want someone to see this and get confused)
> And I think the consensus from what we learned from numpy is that array scalars were a major design mistake.
Yes, but I think the thing that was agreed as the correct path forward was 0-d arrays, which we don't have on the DataFrame side. Those 0-d arrays are strongly typed and don't have to deal with nulls.
The issue that I see is that someone could do something like:

```python
my_int_column = column_from_sequence([1, 2, None, 4], dtype='int32')

# Yields a ducktyped `null` scalar that is int32 typed. Is this `int` type
# or `null` type from a typing perspective?
max_my_int_column = my_int_column.max(skip_null=False)

# Does this work if the max is `null`?
my_float_column = column_from_sequence([1.5, 2.5, max_my_int_column, 4.5], dtype='float64')
```
For example, PyArrow handles this by having an explicit NULL type (https://arrow.apache.org/docs/python/generated/pyarrow.null.html#pyarrow.null) and presumably has its underlying APIs and compute explicitly handle mixing NULL-typed scalars/columns with other typed scalars/columns.
Maybe we just need an explicit NULL type, and then `'null'` here refers to a scalar of type NULL?
> Maybe we just need an explicit `NULL` type

We do have exactly that already: docs for `null`. The only reason the type annotation is `'null'` rather than `null` is to avoid some circular import and Sphinx weirdness.
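As an illustration of that workaround (a sketch around the PR's signature, not the spec source): with lazily evaluated annotations, the string `'null'` never has to be importable at module load time:

```python
from __future__ import annotations  # annotations stay as unevaluated strings

class Column:
    def fill_nan(self, value: float | 'null', /) -> Column:  # return type assumed
        """Replace NaN values with `value`; `'null'` is a forward reference."""
        ...
```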
That is an object for a null scalar, as opposed to a NULL data type. I.e., if columns could be typed NULL, a null-valued scalar extracted from such a column would have type NULL, whereas a null-valued scalar extracted from a float64 column has type float64.
It feels counter-intuitive that Columns are type-erased (i.e. just a `Column` class and no `Int32Column`, `Float32Column`, etc.) but the scalars contained within Columns are not.
Either way, this should go into a new issue instead of this PR. Just the typing felt a bit funky to me here.
I'll open a new issue for discussion and approve this.
Thanks, a new issue sounds good for this. I had not thought before about a need for a `null` dtype; if there is one we should indeed consider it.
> For example, PyArrow handles this by having an explicit `NULL` type

Small clarification here: while pyarrow indeed has a "null" data type, we also have type-specific null scalars for each data type. And so in your specific example, the `max_my_int_column` would actually be an int32 scalar (with the value of "null"), and not a scalar of the null data type.
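To make that distinction concrete, a small pyarrow illustration (standard pyarrow API, shown only to ground the point about typed null scalars):

```python
import pyarrow as pa

# The dedicated null data type: an array of this type holds only nulls.
null_arr = pa.array([None, None])
print(null_arr.type)                   # null

# A missing value in a typed column stays typed: this is an Int32Scalar
# whose validity flag is False, not a scalar of the null data type.
s = pa.scalar(None, type=pa.int32())
print(type(s).__name__)                # Int32Scalar
print(s.is_valid)                      # False
```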
Co-authored-by: Marco Edward Gorelli <[email protected]>
MarcoGorelli left a comment:
all good, thanks - @kkraus14 any further comments or good to go?
This now has three approvals, so I'll get it in. Thanks all!
Follow-up to data-apisgh-167, which added `fill_nan`, and closes data-apisgh-142.
Addresses half of gh-142 (`fill_null` is more complex, and not included here). Note: this is reviewable now, but should be merged after gh-157, which introduces the `null` object.