This repository was archived by the owner on Jun 10, 2020. It is now read-only.

ENH: make ndarray generic over dtype #48

Closed
wants to merge 3 commits

Conversation

person142
Member

@person142 commented Mar 31, 2020

Closes https://github.com/numpy/numpy-stubs/issues/7.

Keeping this as a draft as some discussion will be needed. This makes ndarray parameterized over a type variable with a bound of np.generic; more concretely that's things like np.ndarray[np.int64]. Note that since ndarray is not generic at runtime, you have to use the usual tricks to add annotations in practice.
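
One of the usual tricks, as a minimal sketch (my example, not code from this PR): with PEP 563's deferred evaluation, subscripted annotations are never evaluated at runtime, so it doesn't matter that ndarray isn't generic there.

    from __future__ import annotations  # annotations stay unevaluated at runtime

    import numpy as np

    # A checker using these stubs reads this as an int64 array; at runtime
    # the subscript in the annotation is never evaluated, so nothing breaks.
    def total(x: np.ndarray[np.int64]) -> np.int64:
        return x.sum()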

A concrete issue is that it doesn't yet work correctly with subclasses; note that I commented out one of the subclass tests below.

Edit: never mind, I was hitting a stale mypy cache.

On a more philosophical level, there's a question of "is this going to make supporting shapes in the future harder" because presumably that is going to require being generic over shape. One way to handle that would be to make ndarray generic over shape right now, but just do nothing with it. Then all annotations should be ndarray[dtype, Any] in preparation for whatever happens with shapes. Given that it seems pretty unclear what shape support will look like in the future, that might end up being off-base though.

Thoughts?

@@ -326,6 +335,16 @@ class ndarray(_ArrayOrScalarCommon, Iterable, Sized, Container):
) -> None: ...
def dump(self, file: str) -> None: ...
def dumps(self) -> bytes: ...
@overload
def astype(
Member Author

Kind of the common theme here is adding overloads that distinguish a known dtype from a dtype-like, because in the former case we can make much stronger assertions about the output types.
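
Stub-style, the pattern looks roughly like this (illustrative signatures, not the exact ones from this diff):

    from typing import Any, Generic, Type, TypeVar, overload

    import numpy as np

    _D = TypeVar("_D", bound=np.generic)
    _G = TypeVar("_G", bound=np.generic)

    class ndarray(Generic[_D]):
        @overload
        def astype(self, dtype: Type[_G]) -> "ndarray[_G]": ...  # known dtype: precise result
        @overload
        def astype(self, dtype: Any) -> "ndarray[Any]": ...      # merely dtype-like: fall back to Any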

Member Author


Eventually we could handle things like dtype='float64' too by adding an overload for Literal['float64'].
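
Extending the stub-style sketch above, that would look something like the following (hypothetical, not part of this PR; Literal needs Python 3.8's typing, or typing_extensions before that):

    from typing import Any, Generic, Literal, TypeVar, overload

    import numpy as np

    _D = TypeVar("_D", bound=np.generic)

    class ndarray(Generic[_D]):
        @overload
        def astype(self, dtype: Literal["float64"]) -> "ndarray[np.float64]": ...
        @overload
        def astype(self, dtype: Any) -> "ndarray[Any]": ...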

@rgommers
Member

There's a long discussion on shape support in gh-5. Not sure how that relates to ndarray[dtype, Any] though.

@person142
Member Author

There's a long discussion on shape support in gh-5. Not sure how that relates to ndarray[dtype, Any] though.

Oops sorry, I was a little too in my head when I wrote that description. Let me try to elaborate.

  • I'm approaching this with the perspective that the various proposed approaches to shape are impossible to express with current Python typing
    • The closest you could come to shape is maybe with Tuple + Literal (e.g. Tuple[Literal[2], Literal[2]]), but you can't do arithmetic with literals, so I don't think they are the right tool for the job (at least not yet).
    • The absence of Literal arithmetic means even simpler proposals, like just typing the rank, are still out of reach too
  • So upstream work probably needs to be done to get shape working
  • But, dtypes are "easy"
  • To me it seems better to support dtypes now rather than waiting to resolve shape (which could take years!)
  • But we could potentially be setting ourselves up for some backward-compatibility nightmares
    • With just dtypes we have ndarray[D], presumably shapes will make that ndarray[D, S]
    • If we add S, that's a big breaking change where all code that wrote ndarray[D] stops working
    • On the one hand, all of this is experimental and breaking changes should be expected
    • On the other hand, we could maybe be very conservative and make it ndarray[D, S] right now, and just say "well, the S doesn't do anything yet". We would then ask everyone to go through the song and dance of writing ndarray[D, Any] until such a time as we can give the second slot some meaning. (Or introduce a type alias like AnyShapeArray[D] = ndarray[D, Any] and ask people to use that; a sketch follows below.)
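
A sketch of that conservative option (AnyShapeArray is the hypothetical name from the last bullet, not an existing alias):

    from typing import Any, Generic, TypeVar

    import numpy as np

    D = TypeVar("D", bound=np.generic)
    S = TypeVar("S")  # shape slot, deliberately meaningless for now

    class ndarray(Generic[D, S]): ...

    # Users would write AnyShapeArray[np.int64] instead of ndarray[np.int64, Any]:
    AnyShapeArray = ndarray[D, Any]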

Add overloads for when a dtype is passed (versus just dtype-like) and
handle the default float64 dtype for ones/zeros.
@person142 marked this pull request as ready for review April 1, 2020 02:36
@person142
Member Author

Ok, got some overloads for np.ones and np.zeros that get the dtype inferred in some basic cases. We can continue to do better by adding explicit overloads for cases like dtype = int, but at that point we might want to switch to doing codegen (as has been suggested elsewhere).
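
For reference, the shape of those overloads, stub-style (illustrative, not the exact signatures added here; shape is narrowed to int for brevity):

    from typing import Any, Type, TypeVar, overload

    import numpy as np

    _G = TypeVar("_G", bound=np.generic)

    @overload
    def zeros(shape: int, dtype: None = ...) -> np.ndarray[np.float64]: ...  # default dtype
    @overload
    def zeros(shape: int, dtype: Type[_G]) -> np.ndarray[_G]: ...            # known dtype
    @overload
    def zeros(shape: int, dtype: Any) -> np.ndarray[Any]: ...                # merely dtype-like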

Taking this out of draft though; modulo the philosophical issues, I think it's in an OK state.

@seberg
Member

seberg commented Apr 1, 2020

I had missed this PR; Ralf just mentioned it in the community meeting. I would like to make sure I see how this relates to NEP 41/42; in other words, what is "dtype" here?

What does this currently enforce for the DType in ndarray[DType]? E.g. you mentioned using ndarray[np.float32], but that is different from np.ndarray.dtype, which returns a dtype instance.

That is, from a typing (theoretical) point of view I do not think it is typically necessary to distinguish ndarray[np.dtype("<float64")] and ndarray[np.dtype(">float64")] (these indicate different storage of the same type). A more direct implication is strings: normally you should not care about a specific string length, i.e. you just have string as the type. Of course, occasionally having a more specific "strings of length up to 5", which is np.dtype("U5"), could be interesting (although even then you do not care about byte order).
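
For example (runtime behavior you can check today):

    import numpy as np

    # Different storage, same logical type:
    np.dtype("<f8") == np.dtype(">f8")            # False: byte order differs
    np.dtype("<f8").type is np.dtype(">f8").type  # True: both map to np.float64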

Within NEP 41/42, I see these as instances of a class (and also type), making ndarray[DType] where DType (capitalized) is the class. This means that I would like to (possibly sooner rather than later) allow you to write np.dtype[np.float32], etc., to get a DType class representing either little or big endian (or metadata), or strings of any length.

In any case, I am not sure how much this actually changes things here at all. Typing and classes are somewhat different issues. It is maybe more of a heads-up: I think the scalar type hierarchy is good to use for this dtype, but it may not be quite enough on its own. We also want user DTypes, so users must be able to declare their DType class (and/or scalar type) to be a NumPy DType for typing purposes.

About the shape, just curious: is it really not possible to make np.ndarray[DType] an alias for ndarray[DType, Any] when we add support for the shape? I do not really know the typing module as of now, so I have no idea about its limitations.

@person142
Member Author

Warning: I have only vaguely been following the recent dtypes discussion, so apologies in advance for any misunderstandings.

What does this currently enforce for the DType in ndarray[DType]?

Right, so it's not actually a dtype; it's enforcing that it be a subclass of np.generic. The problem is that (until NEP 41/42, IIUC) all dtypes are instances of the same class, so from a nominal typing perspective they are all of the same type. Subclass of np.generic is obviously pretty limiting in terms of what we can type, so many things would have to be ndarray[Any].
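
That limitation is easy to see at runtime (behavior as of this discussion, i.e. before the NEP 41/42 work):

    import numpy as np

    # Every dtype is an instance of the single class np.dtype, so nominal
    # typing sees one type for all of them:
    isinstance(np.dtype(np.float64), np.dtype)  # True
    isinstance(np.dtype(np.int32), np.dtype)    # True

    # Scalars, by contrast, form a real class hierarchy under np.generic:
    issubclass(np.float64, np.generic)  # True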

but that is different from np.ndarray.dtype, which returns a dtype instance.

Oops, I had that wrong; fixed it in ca3fe78.

Within NEP 41/42, I see these as instances of a class (and also type)

That would be amazing from a typing perspective!

In any case, I am not sure how much this actually changes things here at all. Typing and classes are somewhat different issues.

If I am understanding everything correctly (I'm sure that I am not), NEP 41/42 would make it possible to type things in a much better way. For better or worse, typing and subclasses are tightly coupled in Python (in that subclass => subtype), so not having classes for dtypes makes it basically impossible (?) to type them.

There are actually situations like that all over the place in NumPy; e.g., ufuncs are all instances of np.ufunc, so from the nominal typing perspective... they are all the same type. Similarly, things wrapped by f2py aren't functions; they are instances of a fortran object, so they are all of the same type too.
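
For instance, just to illustrate the nominal-typing point:

    import numpy as np

    # Every ufunc is an instance of the same class, whatever its arity:
    type(np.add) is np.ufunc  # True (binary ufunc)
    type(np.sin) is np.ufunc  # True (unary ufunc)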

@seberg
Member

seberg commented Apr 2, 2020

Oh, I had hoped that you could somehow type subtypes of classes without subclasses (say, things like integers of a certain range; parametric types, I guess, if that is the name for it).

UFuncs are an issue with typing, I guess. Internally I was thinking to use something like ufunc.get_implementation(DTypes), i.e. a (multiple) dispatching step finding the correct DType(s). Now from a typing perspective that is pretty nightmarish, I guess, although I am not sure there is any solution to it, since we need to be able to register more loops and dispatch to them?

As to using the scalar hierarchy, I think that is actually fine. It may have some holes in the long run, but I do think we should have a clean scalar_type -> DType mapping using e.g. np.dtype[scalar_type], so at least for the NumPy builtins (which are what is relevant currently) I think it is OK to rely on the scalar hierarchy. (I still have to make up my mind about np.dtype[np.integer], since it is not a concrete type but a superclass/type, but I expect it should just exist.)

So yes, NEP 41/42 should make things better I guess, since you have a class for each type (ignoring things such as string lengths and byte order), and for that you actually mainly need the first tiny step maybe... Although an alternative where all dtypes are classes (or close) may also be possible (I do not have the vision for that, but we need to finish a thought in that direction in the NEP 41 discussion).

Note that we have not yet accepted the NEP; I expect this to happen, but it is not 100%.

EDIT: One example: I do not think it would be possible to type the unit of a (physical) Unit DType, since that should typically be stored on the dtype instance.

@person142
Member Author

I was thinking to use something like ufunc.get_implementation(DTypes), i.e. a (multiple) dispatching step finding the correct DType(s). Now from a typing perspective that is pretty nightmarish, I guess, although I am not sure there is any solution to it, since we need to be able to register more loops and dispatch to them?

I suspect that typing ufuncs will eventually require something like a mypy plugin. A plugin lets you hook into the type-checking process and make modifications, so we could hook into the ufunc call method and augment it with the available loops. (All very theoretical though.)
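
A minimal sketch of that shape (the hook API is real mypy plugin machinery; everything ufunc-specific is hypothetical):

    from typing import Callable, Optional

    from mypy.plugin import MethodContext, Plugin
    from mypy.types import Type


    def _ufunc_call_callback(ctx: MethodContext) -> Type:
        # Hypothetical: inspect ctx.arg_types against the ufunc's known
        # loops and return a more precise type; here we just pass through.
        return ctx.default_return_type


    class UfuncPlugin(Plugin):
        def get_method_hook(
            self, fullname: str
        ) -> Optional[Callable[[MethodContext], Type]]:
            if fullname == "numpy.ufunc.__call__":
                return _ufunc_call_callback
            return None


    def plugin(version: str) -> type:
        return UfuncPlugin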

As to using the scalar hierarchy, I think that is actually fine. It may have some holes in the long run

Yeah, I think that the approach we have to take with typing NumPy is "it's impossible, but we can probably still get something useful". As typing and NumPy evolve I imagine that the situation will improve (and hopefully the current typing experiments will help inform that process).

@seberg
Member

seberg commented Apr 5, 2020

After NEP 41 goes in (assuming it does without major changes), we could put in the np.dtype[np.int64] just after 1.19 (or maybe even before, but I doubt it matters). It's new API, but honestly it feels like a very "right" API to me (how the mapping itself is implemented is a bit of a different issue).

@person142
Member Author

@seberg do you happen to have an (in progress I imagine) branch implementing NEP 41? It might be interesting to see what can be done typing-wise against that branch.

@seberg
Member

seberg commented Apr 13, 2020

@person142 see numpy/numpy#15508, but I have not added anything to make np.dtype[np.int64] work (I can do that though, if just in a separate branch for you to try). I am still discussing a bit with Eric Wieser right now, e.g. whether the scalars can be used directly here.
(Which may be a side point from Eric's point of view, but I see the hierarchy as important, so if it's not part of the DType, then we may have to use the scalar one.)

The main point I have against moving a bit more into the scalar is that I am not sure it is practical to expect scalars to have information used by arrays attached to them. I.e. if you want to make a dtype for an arbitrary Python object, you might have to modify the Python scalar.

In the current NEP 41 design there is a mapping DType class <-> scalar type (which unfortunately also means it's easy to have a reference cycle, which I am not sure matters and probably can be avoided, although with a bit of spit). That mapping is, admittedly, a bit artificial and could be considered useless if we simply expect scalars to carry slightly more weight.
This is also a bit funny, since the DType has all the information necessary to create a scalar object.

@person142
Member Author

Ok, here's an experimental branch using NEP 41:

https://github.com/person142/numpy-stubs/tree/nep-41

(The tests pass when built against @seberg's NumPy branch.) Some takeaways from that:

  • I couldn't get mypy to recognize the dtype metaclass, so instead I faked it by overriding __new__ on dtype. Not sure whether I'm messing something up or whether this is something mypy doesn't know about yet.

  • There's a problem where, since instances of the dtype classes are constructed dynamically by calling dtype, there's no way to let the type stubs know the classes exist. I had to fake it by creating bogus classes like class Float64(dtype): ... and then hard-coding paths like

    class dtype:
        @overload
        def __new__(cls, obj: Type[float64]) -> Float64: ...

    @seberg is the purely dynamic nature of the class creation something that's going to change? (Or maybe I'm misunderstanding something?) (I think np.dtype[<thing>] -> <dtype class> is still going to be too dynamic to capture with types.)

@seberg
Member

seberg commented Apr 18, 2020

Oh, there is no dynamic nature intended. It just seemed like a reasonable way to not overload the namespace with new names and to have a uniform mapping scalar type <-> DType. There is some argument for not having a DType class at all and putting it into the scalar. I am a bit torn on it, since that means putting things into the scalar that I am not sure are good there (i.e. that also somewhat means not supporting making a DType that maps to an existing Python type?).

@seberg
Member

seberg commented Apr 18, 2020

About the metaclass, I do not think it should matter at all? You can just say np.dtype subclass, no?

@person142
Member Author

About the metaclass, I do not think it should matter at all? You can just say np.dtype subclass, no?

Strictly speaking yes. But it would be nice if we could infer that the type of x is ndarray[Float64] in things like:

x = np.array([1], dtype=np.dtype(np.float64))
x = np.array([1], dtype=np.dtype[np.float64])

And it seems that to capture things like that we'll need to bake in some kind of understanding of the dtype metaclass.
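
A hedged sketch of what that baked-in understanding might look like in stubs (hypothetical names; whether mypy actually supports a metaclass __getitem__ like this is exactly the open question):

    from typing import Generic, Type, TypeVar

    import numpy as np

    _S = TypeVar("_S", bound=np.generic)

    class _DTypeMeta(type):
        # Would make np.dtype[np.float64] resolve to the float64 DType class:
        def __getitem__(cls, scalar: Type[_S]) -> Type["dtype[_S]"]: ...

    class dtype(Generic[_S], metaclass=_DTypeMeta): ...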

@person142
Member Author

Something which might be relevant, from the mypy plugin docs:

get_dynamic_class_hook() can be used to allow dynamic class definitions in mypy

(From https://mypy.readthedocs.io/en/latest/extending_mypy.html#current-list-of-plugin-hooks.)
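
Analogous to the ufunc plugin sketch earlier, the minimal shape would be (the hook and context types are real mypy API; the numpy handling is purely hypothetical):

    from typing import Callable, Optional

    from mypy.plugin import DynamicClassDefContext, Plugin


    def _dtype_class_callback(ctx: DynamicClassDefContext) -> None:
        # Hypothetical: register the class that numpy creates dynamically
        # so the checker knows it exists.
        ...


    class DTypePlugin(Plugin):
        def get_dynamic_class_hook(
            self, fullname: str
        ) -> Optional[Callable[[DynamicClassDefContext], None]]:
            if fullname == "numpy.dtype":
                return _dtype_class_callback
            return None


    def plugin(version: str) -> type:
        return DTypePlugin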

@seberg
Member

seberg commented Apr 18, 2020

That get_dynamic_class_hook sounds very promising!? Also, and maybe especially, for ndarray itself?

I would be surprised if mypy doesn't have something for the possible np.dtype[scalar_type] syntax (to be clear: there is nothing fixed about it, I just like it, and if we go the DType class route it seems straightforward to me).
The reason why this should be something mypy can do easily enough, at least in the future, is that Python is even adding this as a classmethod on the type (without the need for a metaclass).
If we have to use the metaclass/type, I guess that is fine; it does show a bit that the metaclass/type route, even if I consider it more an implementation detail, may have some downsides.

@person142
Member Author

That get_dynamic_class_hook sounds very promising!? Also, and maybe especially, for ndarray itself?

Yeah I'm feeling pretty optimistic about what they could let us do. I opened #56 to try out a simple plugin for getting more precise ufunc signatures; if we move forward with that I'll see what we could do in the dtype case.

@seberg
Member

seberg commented Apr 24, 2020

@person142 just a quick note: I expect/hope that the dtypemeta branch will go into master shortly after the branching of 1.19. If it would help you quite a bit to push it into 1.19, that is not totally unrealistic. There is just not a big reason for me to push for that.

@person142
Member Author

If it would help you quite a bit to push it into 1.19, that is not totally unrealistic. There is just not a big reason for me to push for that.

No rush here; I'm fine working from the branch until it's merged. Hopefully I'll have some time this weekend to experiment with writing a plugin using get_dynamic_class_hook.

@person142
Member Author

Closing this; will reopen a reworked version in NumPy focusing on the new dtype classes soon.
