Skip to content

Initial support for missing data #56

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Apr 21, 2023
Merged

Conversation

ngoldbaum
Copy link
Member

Currently we have two ways of representing an empty string:

ss empty = {0, ""}
ss null = {0, NULL}

This PR repurposes the second struct layout to represent missing data, using a new stringdtype.NA type. This eliminated the need for the load_string helper but necessitated a new ss_isnull helper that I've added to the static string header.

Before this PR np.zeros depended on the second layout being translated to an empty string by getitem, so repurposing that layout to mean missing data means we need additional API surface in the dtype API to specify custom zero-filling logic. See numpy/numpy#23591 for that, this PR depends on that numpy PR being merged first.

To avoid churning through a large number of 1-byte mallocs whenever np.zeros is called, I now return a static empty string if a user requests an empty string, and added a check in ssfree to avoid freeing that static string.

I've decided it makes sense for np.empty to return an array filled with missing data. I could also add additional API in numpy to customize empty-filling if we decide it makes sense to do something else.

I still need to make NAType behave more like NaN or pandas' NA in case someone accesses the NA scalar and uses it in operations with other strings, but I'm punting on that for a future PR.

@ngoldbaum
Copy link
Member Author

ngoldbaum commented Apr 18, 2023

Updated to match updates in the upstream numpy PR. I also bumped the API versions for the other dtypes in anticipation of the upstream numpy PR getting merged.

The changes to the pre-commit configuration are to resolve issues I was seeing from the existing build directory getting generated by an old version of meson locally:

ERROR: Build directory has been generated with Meson version 0.64.0, which is incompatible with the current version 1.0.1

Comment on lines 234 to 242
int a_is_null = ss_isnull(ss_a);
int b_is_null = ss_isnull(ss_b);
if (a_is_null || b_is_null) {
// numpy sorts NaNs to the end of the array
// pandas sorts NAs to the end as well
// so we follow that behavior here
if (!b_is_null) {
return 1;
}
else if (!a_is_null) {
return -1;
}
return 0;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can this be simplified, since if either a or b is null we don't need to test the other.

if (a_is_null) {
    return 1;
}
if (b_is_null) {
    return -1;
}
return strcmp(ss_a->buf, ss_b->buf);

if (s1->len == s2->len && strncmp(s1->buf, s2->buf, s1->len) == 0) {
s1 = (ss *)in1;
s2 = (ss *)in2;
if (ss_isnull(s1) || ss_isnull(s2)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking that it might be good if NA values were considered equal, but it looks like numpy doesn't consider nan values equal either, so this seems consistent with that. 👍

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pandas would consider NA == NA -> NA but that requires an NA boolean. I suspect that would be the right thing in principle, though.

Copy link
Contributor

@peytondmurray peytondmurray left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had a few minor suggestions - thanks for adding this!

The only thing that might be good to add is a test to make sure that NA-values get sorted to the end. It would also be fine to add this later too.

@ngoldbaum
Copy link
Member Author

Darn looks like the new wheel hasn't gotten uploaded yet. @seberg any chance you can kick the numpy wheel builder?

@ngoldbaum
Copy link
Member Author

Merging to unbreak the tests. If anyone has additional comments I'm happy to address in a followup.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants