-
-
Notifications
You must be signed in to change notification settings - Fork 8
Initial support for missing data #56
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Updated to match updates in the upstream numpy PR. I also bumped the API versions for the other dtypes in anticipation of the upstream numpy PR getting merged. The changes to the pre-commit configuration are to resolve issues I was seeing from the existing
|
stringdtype/stringdtype/src/dtype.c
Outdated
int a_is_null = ss_isnull(ss_a); | ||
int b_is_null = ss_isnull(ss_b); | ||
if (a_is_null || b_is_null) { | ||
// numpy sorts NaNs to the end of the array | ||
// pandas sorts NAs to the end as well | ||
// so we follow that behavior here | ||
if (!b_is_null) { | ||
return 1; | ||
} | ||
else if (!a_is_null) { | ||
return -1; | ||
} | ||
return 0; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can this be simplified, since if either a or b is null we don't need to test the other.
if (a_is_null) {
return 1;
}
if (b_is_null) {
return -1;
}
return strcmp(ss_a->buf, ss_b->buf);
if (s1->len == s2->len && strncmp(s1->buf, s2->buf, s1->len) == 0) { | ||
s1 = (ss *)in1; | ||
s2 = (ss *)in2; | ||
if (ss_isnull(s1) || ss_isnull(s2)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking that it might be good if NA
values were considered equal, but it looks like numpy doesn't consider nan
values equal either, so this seems consistent with that. 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pandas would consider NA == NA -> NA
but that requires an NA boolean. I suspect that would be the right thing in principle, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had a few minor suggestions - thanks for adding this!
The only thing that might be good to add is a test to make sure that NA-values get sorted to the end. It would also be fine to add this later too.
Darn looks like the new wheel hasn't gotten uploaded yet. @seberg any chance you can kick the numpy wheel builder? |
Merging to unbreak the tests. If anyone has additional comments I'm happy to address in a followup. |
Currently we have two ways of representing an empty string:
This PR repurposes the second struct layout to represent missing data, using a new
stringdtype.NA
type. This eliminated the need for theload_string
helper but necessitated a newss_isnull
helper that I've added to the static string header.Before this PR
np.zeros
depended on the second layout being translated to an empty string bygetitem
, so repurposing that layout to mean missing data means we need additional API surface in the dtype API to specify custom zero-filling logic. See numpy/numpy#23591 for that, this PR depends on that numpy PR being merged first.To avoid churning through a large number of 1-byte mallocs whenever
np.zeros
is called, I now return a static empty string if a user requests an empty string, and added a check inssfree
to avoid freeing that static string.I've decided it makes sense for
np.empty
to return an array filled with missing data. I could also add additional API in numpy to customize empty-filling if we decide it makes sense to do something else.I still need to make
NAType
behave more likeNaN
or pandas'NA
in case someone accesses the NA scalar and uses it in operations with other strings, but I'm punting on that for a future PR.