Skip to content

make getitem return str #55

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Apr 11, 2023
Merged

make getitem return str #55

merged 4 commits into from
Apr 11, 2023

Conversation

ngoldbaum
Copy link
Member

@ngoldbaum ngoldbaum commented Apr 6, 2023

This makes getitem return a str instead of StringScalar. This is substantially faster. Before:

In [9]: %%timeit
   ...: for val in arr:
   ...:     pass
   ...: 
126 ms ± 1.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

After:

In [12]: %%timeit
    ...: for val in arr:
    ...:     pass
    ...: 
30.9 ms ± 954 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Here's how I created arr in the above test:

import os
os.environ["NUMPY_EXPERIMENTAL_DTYPE_API"] = "1"
import string
import numpy as np
from stringdtype import StringDType
RANDS_CHARS = np.array(list(string.ascii_letters + string.digits), dtype=(np.str_, 1))
arr = np.random.choice(RANDS_CHARS, size=10 * 100_000, replace=True).view('U10').astype(StringDType())

The speed difference is due to avoiding an additional unnecessary copy from the string returned by PyUnicode_FromString to the StringScalar instance. In principle I could still return a StringScalar and retain this speed, but I think I'd effectively need to re-implement the low-level string filling implementation that PyUnicode_FromString uses, and since StringScalar is a trivial subclass that only exists because there's already a scalar in numpy for the python str builtin, that seems not worth the trouble.

Happy to hear about alternate ideas.

@ngoldbaum
Copy link
Member Author

It occurs to me that another way to handle this that avoids copying would be to make StringScalar keep a str instance as an attribute and simulate the API of str for compatibility with e.g. np.char. I'm not sure if the extra indirection would be costly, will investigate.

@ngoldbaum ngoldbaum changed the title Implement StringScalar in C, make getitem return str make getitem return str Apr 10, 2023
@ngoldbaum
Copy link
Member Author

I looked at making StringScalar wrap an internal string, but that was substantially slower than just making getitem return a str instead of StringScalar.

@seberg besides it being awkward that you won't be able to go from a scalar you get back from __getitem__ to a StringDType array, do you see any other downsides to this?

@seberg
Copy link
Member

seberg commented Apr 11, 2023

Dunno, I don't mind round-tripping not working in general, at that point there may not even be a point of not using str as type. The only problem would be not to do (or deciding to explicitly do) change that array(["a"]) would return this dtype.

For CPython you could reach into the internals and construct the unicode to avoid all overheads. Here most of the overhead may actually be calling through Python, but in general the copy and object creation would matter.

@ngoldbaum
Copy link
Member Author

For CPython you could reach into the internals and construct the unicode to avoid all overheads.

Yeah, effectively this is re-implementing PyUnicode_FromString, which I'd prefer not to do unless I really have to, since the only API available to do this is the low-level character by character API in CPython. I'm going to add a comment to that effect in case we need to come back to this, but right now my plan is to advocate for this dtype to become the default string dtype in numpy in the future, at which point all this becomes moot.

* buffer from val_obj to the StringScalar we'd like to return. In
* principle we could avoid this by making a C function like
* PyUnicode_FromStringAndSize that fills a StringScalar instead of a
* str. For now (4-11-23) we are punting on that with the expectation that
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* str. For now (4-11-23) we are punting on that with the expectation that
* str. For now (2023-04-11) we are punting on that with the expectation that

just joking, but ISO dates all the way!

This is half-way at the place where str is the scalar. But lets try how it goes!

@ngoldbaum ngoldbaum merged commit d98210c into numpy:main Apr 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants