Initial support for missing data #56

ngoldbaum · 2023-04-14T20:44:27Z

Currently we have two ways of representing an empty string:

ss empty = {0, ""}
ss null = {0, NULL}

This PR repurposes the second struct layout to represent missing data, using a new stringdtype.NA type. This eliminates the need for the load_string helper but necessitates a new ss_isnull helper, which I've added to the static string header.
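A minimal sketch of how the two layouts can be told apart, assuming the struct holds a length followed by a buffer pointer (the `len` and `buf` names appear in the diff hunks below; the `size_t` type, the exact body of `ss_isnull`, and the `ss_isempty` helper are assumptions for illustration, not the code in this PR):

```c
#include <assert.h>
#include <stddef.h>

/* Assumed mirror of the static string struct: a length and a buffer. */
typedef struct {
    size_t len;
    char *buf;
} ss;

/* Missing data: zero length and a NULL buffer, i.e. {0, NULL}. */
static int ss_isnull(const ss *s) {
    assert(s != NULL);
    return s->len == 0 && s->buf == NULL;
}

/* Hypothetical helper: empty string is zero length with a real buffer, i.e. {0, ""}. */
static int ss_isempty(const ss *s) {
    assert(s != NULL);
    return s->len == 0 && s->buf != NULL;
}
```

With this split, callers no longer need a loader that papers over the NULL buffer; they check `ss_isnull` first and only then touch `buf`.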

Before this PR, np.zeros depended on the second layout being translated to an empty string by getitem, so repurposing that layout to mean missing data means we need additional API surface in the dtype API to specify custom zero-filling logic. See numpy/numpy#23591 for that; this PR depends on that numpy PR being merged first.
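To illustrate why zeroed memory is no longer enough, here is a rough sketch of a fill loop that walks a strided buffer and writes an empty string into each element. It does not use the real numpy dtype API: `fill_with_empty`, `EMPTY_BUF`, and the struct field types are hypothetical, and the sketch assumes a platform where a NULL pointer is all-bits-zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumed mirror of the static string struct. */
typedef struct {
    size_t len;
    char *buf;
} ss;

/* Hypothetical shared buffer standing in for the empty string. */
static char EMPTY_BUF[1] = "";

static int ss_isnull(const ss *s) {
    return s->len == 0 && s->buf == NULL;
}

/* Hypothetical fill loop: visit n items that are stride bytes apart
 * and overwrite each one with the empty-string layout {0, ""}. */
static void fill_with_empty(char *data, ptrdiff_t stride, size_t n) {
    assert(data != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        ss *item = (ss *)data;
        item->len = 0;
        item->buf = EMPTY_BUF;
        data += stride;
    }
}
```

Memory that has only been zeroed matches the `{0, NULL}` layout, so every element reads as missing until a loop like this one has run.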

To avoid churning through a large number of 1-byte mallocs whenever np.zeros is called, I now return a static empty string if a user requests an empty string, and added a check in ssfree to avoid freeing that static string.
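The static-empty-string idea can be sketched as follows. The names `ss_new`, `ss_release`, and `EMPTY_BUF` are made up for this example and are not the functions in the static string header; the point is only that the allocator hands out one shared buffer for empty strings and the free routine recognizes it by address.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Assumed mirror of the static string struct. */
typedef struct {
    size_t len;
    char *buf;
} ss;

/* One shared, statically allocated buffer for every empty string. */
static char EMPTY_BUF[1] = "";

/* Hypothetical constructor: returns 0 on success, -1 on allocation failure. */
static int ss_new(ss *out, const char *data, size_t len) {
    assert(out != NULL);
    if (len == 0) {
        /* No malloc for empty strings: point at the shared buffer. */
        out->len = 0;
        out->buf = EMPTY_BUF;
        return 0;
    }
    char *buf = malloc(len + 1);
    if (buf == NULL) {
        return -1;
    }
    memcpy(buf, data, len);
    buf[len] = '\0';
    out->len = len;
    out->buf = buf;
    return 0;
}

/* Hypothetical destructor: never frees the shared empty buffer. */
static void ss_release(ss *s) {
    assert(s != NULL);
    if (s->buf != NULL && s->buf != EMPTY_BUF) {
        free(s->buf);
    }
    /* Leave the struct in the missing-data layout {0, NULL}. */
    s->len = 0;
    s->buf = NULL;
}
```

Comparing against the address of the shared buffer keeps the free path cheap and means a zero-filled array of any size costs no heap allocations.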

I've decided it makes sense for np.empty to return an array filled with missing data. I could also add additional API in numpy to customize empty-filling if we decide it makes sense to do something else.

I still need to make NAType behave more like NaN or pandas' NA in case someone accesses the NA scalar and uses it in operations with other strings, but I'm punting on that for a future PR.

ngoldbaum · 2023-04-18T17:34:22Z

Updated to match updates in the upstream numpy PR. I also bumped the API versions for the other dtypes in anticipation of the upstream numpy PR getting merged.

The changes to the pre-commit configuration are to resolve issues I was seeing from the existing build directory getting generated by an old version of meson locally:

ERROR: Build directory has been generated with Meson version 0.64.0, which is incompatible with the current version 1.0.1

peytondmurray · 2023-04-20T21:57:11Z

stringdtype/stringdtype/src/dtype.c

+    int a_is_null = ss_isnull(ss_a);
+    int b_is_null = ss_isnull(ss_b);
+    if (a_is_null || b_is_null) {
+        // numpy sorts NaNs to the end of the array
+        // pandas sorts NAs to the end as well
+        // so we follow that behavior here
+        if (!b_is_null) {
+            return 1;
+        }
+        else if (!a_is_null) {
+            return -1;
+        }
+        return 0;
+    }


I think this can be simplified, since if either a or b is null we don't need to test the other.

    if (a_is_null) {
        return 1;
    }
    if (b_is_null) {
        return -1;
    }
    return strcmp(ss_a->buf, ss_b->buf);
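For reference, a self-contained sketch of a comparator that keeps the behavior of the diff above, written so it can be passed to `qsort`. The struct definition and `ss_isnull` body are assumptions, `compare_na_last` is a made-up name, and the `strcmp` fallback assumes NUL-terminated buffers as in the suggestion. One difference worth noting: when both inputs are null, the diff returns 0 while the shortened form returns 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Assumed mirror of the static string struct. */
typedef struct {
    size_t len;
    char *buf;
} ss;

/* Returns exactly 0 or 1, which the subtraction below relies on. */
static int ss_isnull(const ss *s) {
    return s->len == 0 && s->buf == NULL;
}

/* Three-way compare that sorts missing values to the end. */
static int compare_na_last(const void *pa, const void *pb) {
    const ss *a = pa;
    const ss *b = pb;
    int a_is_null = ss_isnull(a);
    int b_is_null = ss_isnull(b);
    assert(a_is_null == 0 || a_is_null == 1);
    if (a_is_null || b_is_null) {
        /* only a null: 1, only b null: -1, both null: 0 */
        return a_is_null - b_is_null;
    }
    return strcmp(a->buf, b->buf);
}
```

Returning 0 for two nulls keeps the comparator symmetric, which is what sorting routines generally expect of a comparison function.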

peytondmurray · 2023-04-21T05:20:12Z

stringdtype/stringdtype/src/umath.c

-        if (s1->len == s2->len && strncmp(s1->buf, s2->buf, s1->len) == 0) {
+        s1 = (ss *)in1;
+        s2 = (ss *)in2;
+        if (ss_isnull(s1) || ss_isnull(s2)) {


I was thinking that it might be good if NA values were considered equal, but it looks like numpy doesn't consider nan values equal either, so this seems consistent with that. 👍

Pandas would consider NA == NA -> NA but that requires an NA boolean. I suspect that would be the right thing in principle, though.
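The NaN-like equality discussed here can be sketched as below. The diff hunk is cut off after the null check, so the `return 0` branch is an assumption based on the review comment, and `ss_equal` is a hypothetical name rather than the ufunc loop in umath.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed mirror of the static string struct. */
typedef struct {
    size_t len;
    char *buf;
} ss;

static int ss_isnull(const ss *s) {
    return s->len == 0 && s->buf == NULL;
}

/* Equality in the style of NaN: a missing value equals nothing,
 * not even another missing value. */
static int ss_equal(const ss *s1, const ss *s2) {
    assert(s1 != NULL && s2 != NULL);
    if (ss_isnull(s1) || ss_isnull(s2)) {
        return 0;
    }
    return s1->len == s2->len && strncmp(s1->buf, s2->buf, s1->len) == 0;
}
```

A pandas-style result of NA for `NA == NA` would need a three-valued return type instead of the plain boolean used here.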

peytondmurray

Had a few minor suggestions - thanks for adding this!

The only thing that might be good to add is a test to make sure that NA-values get sorted to the end. It would also be fine to add this later too.

ngoldbaum · 2023-04-21T16:19:59Z

Darn looks like the new wheel hasn't gotten uploaded yet. @seberg any chance you can kick the numpy wheel builder?

ngoldbaum · 2023-04-21T18:11:16Z

Merging to unbreak the tests. If anyone has additional comments I'm happy to address in a followup.

peytondmurray reviewed Apr 20, 2023

peytondmurray reviewed Apr 20, 2023

peytondmurray reviewed Apr 20, 2023

peytondmurray reviewed Apr 21, 2023

peytondmurray approved these changes Apr 21, 2023

ngoldbaum added 4 commits April 21, 2023 10:05

Initial support for missing data

ba1aa5e

refactor zero-filling to use a traversal loop

92d368c

bump dtype API version for all dtypes

1f5c191

Respond to review comments

7edf3e0

ngoldbaum force-pushed the missing-data branch from 0251321 to 7edf3e0 Compare April 21, 2023 16:05

Specify that dtypes are not pure python in meson config

ed191fc

Add explanatory comment for setitem with NA branch

493b258

ngoldbaum merged commit 6537f42 into numpy:main Apr 21, 2023

ngoldbaum mentioned this pull request Apr 26, 2023

Pandas string dtype needs from NumPy - prototyping & plan of attack pandas-dev/pandas#47884

Open

seberg Apr 24, 2023
