make getitem return str #55

ngoldbaum · 2023-04-06T17:37:29Z

This makes getitem return a str instead of StringScalar. This is substantially faster. Before:

In [9]: %%timeit
   ...: for val in arr:
   ...:     pass
   ...: 
126 ms ± 1.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

After:

In [12]: %%timeit
    ...: for val in arr:
    ...:     pass
    ...: 
30.9 ms ± 954 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Here's how I created arr in the above test:

import os
os.environ["NUMPY_EXPERIMENTAL_DTYPE_API"] = "1"
import string
import numpy as np
from stringdtype import StringDType
RANDS_CHARS = np.array(list(string.ascii_letters + string.digits), dtype=(np.str_, 1))
arr = np.random.choice(RANDS_CHARS, size=10 * 100_000, replace=True).view('U10').astype(StringDType())

The speed difference is due to avoiding an additional unnecessary copy from the string returned by PyUnicode_FromString to the StringScalar instance. In principle I could still return a StringScalar and retain this speed, but I think I'd effectively need to re-implement the low-level string filling implementation that PyUnicode_FromString uses, and since StringScalar is a trivial subclass that only exists because there's already a scalar in numpy for the python str builtin, that seems not worth the trouble.

Happy to hear about alternate ideas.

ngoldbaum · 2023-04-06T18:24:16Z

It occurs to me that another way to handle this that avoids copying would be to make StringScalar keep a str instance as an attribute and simulate the API of str for compatibility with e.g. np.char. I'm not sure if the extra indirection would be costly, will investigate.

ngoldbaum · 2023-04-10T16:22:26Z

I looked at making StringScalar wrap an internal string, but that was substantially slower than just making getitem return a str instead of StringScalar.

@seberg besides it being awkward that you won't be able to go from a scalar you get back from __getitem__ to a StringDType array, do you see any other downsides to this?

seberg · 2023-04-11T06:57:25Z

Dunno, I don't mind round-tripping not working in general, at that point there may not even be a point of not using str as type. The only problem would be not to do (or deciding to explicitly do) change that array(["a"]) would return this dtype.

For CPython you could reach into the internals and construct the unicode to avoid all overheads. Here most of the overhead may actually be calling through Python, but in general the copy and object creation would matter.

ngoldbaum · 2023-04-11T14:46:24Z

For CPython you could reach into the internals and construct the unicode to avoid all overheads.

Yeah, effectively this is re-implementing PyUnicode_FromString, which I'd prefer not to do unless I really have to, since the only API available to do this is the low-level character by character API in CPython. I'm going to add a comment to that effect in case we need to come back to this, but right now my plan is to advocate for this dtype to become the default string dtype in numpy in the future, at which point all this becomes moot.

seberg · 2023-04-11T15:03:11Z

stringdtype/stringdtype/src/dtype.c

+     * buffer from val_obj to the StringScalar we'd like to return. In
+     * principle we could avoid this by making a C function like
+     * PyUnicode_FromStringAndSize that fills a StringScalar instead of a
+     * str. For now (4-11-23) we are punting on that with the expectation that


Suggested change

* str. For now (4-11-23) we are punting on that with the expectation that

* str. For now (2023-04-11) we are punting on that with the expectation that

just joking, but ISO dates all the way!

This is half-way at the place where str is the scalar. But lets try how it goes!

Implement StringScalar in C

352add8

Move the scalar back to python

a5fd24e

ngoldbaum changed the title ~~Implement StringScalar in C, make getitem return str~~ make getitem return str Apr 10, 2023

ngoldbaum added 2 commits April 11, 2023 08:52

slightly faster using PyUnicode_FromStringAndSize

1fad632

Add comment explaining why getitem returns a str

15ccbb9

seberg reviewed Apr 11, 2023

View reviewed changes

ngoldbaum merged commit d98210c into numpy:main Apr 11, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

make getitem return str #55

make getitem return str #55

Uh oh!

ngoldbaum commented Apr 6, 2023 •

edited

Loading

Uh oh!

ngoldbaum commented Apr 6, 2023

Uh oh!

ngoldbaum commented Apr 10, 2023

Uh oh!

seberg commented Apr 11, 2023

Uh oh!

ngoldbaum commented Apr 11, 2023

Uh oh!

seberg Apr 11, 2023

Uh oh!

Uh oh!

	* str. For now (4-11-23) we are punting on that with the expectation that
	* str. For now (2023-04-11) we are punting on that with the expectation that

Uh oh!

make getitem return str #55

make getitem return str #55

Uh oh!

Conversation

ngoldbaum commented Apr 6, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ngoldbaum commented Apr 6, 2023

Uh oh!

ngoldbaum commented Apr 10, 2023

Uh oh!

seberg commented Apr 11, 2023

Uh oh!

ngoldbaum commented Apr 11, 2023

Uh oh!

seberg Apr 11, 2023

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ngoldbaum commented Apr 6, 2023 •

edited

Loading