Skip to content

Conversation

dsaxton
Copy link
Member

@dsaxton dsaxton commented Feb 20, 2020

This comparison val == val happens in a lot of these groupby operations but only seems to raise here in the presence of NA. Are we just always / mostly converting to numpy beforehand in the other cases?

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool thanks for PR

@@ -887,7 +889,7 @@ def group_last(rank_t[:, :] out,
for j in range(K):
val = values[i, j]

if val == val:
if (val is not NA) and (val == val):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think can use the checknull function from missing instead

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just not checknull(val) for the whole condition? I think there was an issue with checknull not catching decimal.Decimal("nan") but I could try to add that here

@pytest.mark.parametrize("method", ["first", "last"])
def test_first_last_with_na_object(method):
# https://github.com/pandas-dev/pandas/issues/32123
groups = pd.DataFrame({"a": [1, 1, 2, 2], "b": [1, 2, 3, pd.NA]}).groupby("a")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should also use nulls_fixture here after #31799

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you merge master? This should be available now

@jbrockmendel
Copy link
Member

#31227 turned up a bunch of other places where having pd.NA in columns/index is going to break the world. the is_matching_na implemented there might be useful here

if method == "first":
values = {"b": [1, 3]}
else:
values = {"b": [2, 3]}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This raises a question (that perhaps should be handled in another issue).
GroupBy.last() returns [2, 3] however GroupBy.tail(1) returns [2, pd.NA].
Is this intended behaviour? it is consistent with pandas 0.25, but undocumented.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that last gives the last non-null value but that doesn't seem to be well-documented like you say (probably does warrant a separate issue)

@simonjayhawkins simonjayhawkins added Bug NA - MaskedArrays Related to pd.NA and nullable extension arrays labels Feb 21, 2020
@simonjayhawkins simonjayhawkins added this to the 1.0.2 milestone Feb 21, 2020
@simonjayhawkins
Copy link
Member

@dsaxton can you merge master to resolve conflicts

@@ -530,3 +530,23 @@ def test_nth_nan_in_grouper(dropna):
)

tm.assert_frame_equal(result, expected)


@pytest.mark.parametrize("method", ["first", "last"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you also test .nth(0) and .nth(-1) which are the same results (except the nth -1 will have the null as the result

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a test for nth with nulls; I gave it its own test rather than putting it with first / last to avoid too much awkward if / else branching inside the test

@pytest.mark.parametrize("method", ["first", "last"])
def test_first_last_with_na_object(method, nulls_fixture):
# https://github.com/pandas-dev/pandas/issues/32123
groups = pd.DataFrame({"a": [1, 1, 2, 2], "b": [1, 2, 3, nulls_fixture]}).groupby(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also move this belowt est_first_last_nth (it may actually be simpler to modify that test though)

@jreback jreback merged commit 20a84a5 into pandas-dev:master Feb 23, 2020
@jreback
Copy link
Contributor

jreback commented Feb 23, 2020

thanks @dsaxton

@jreback
Copy link
Contributor

jreback commented Feb 23, 2020

@meeseeksdev backport to 1.0.x

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Feb 23, 2020
@dsaxton dsaxton deleted the grpby-nth branch February 23, 2020 15:07
jreback pushed a commit that referenced this pull request Feb 23, 2020
roberthdevries pushed a commit to roberthdevries/pandas that referenced this pull request Mar 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug NA - MaskedArrays Related to pd.NA and nullable extension arrays
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: GroupBy.first fails with pd.NA on Series with object dtype
6 participants