Skip to content

FIX: Accept dtype parameter to ArrayProxy.__array__ #844

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Dec 11, 2019

Conversation

effigies
Copy link
Member

At least back through numpy 1.12, the __array__ protocol has accepted a dtype parameter.

Overall this is poorly documented in numpy (the above link the only place I've seen reference to the dtype parameter), but some manual testing seems to indicate it works, and that its absence produces errors in the cases described in #843. Systematic tests are added to the proxy API. Happy to add more, if we can think of specific cases that might trip us up.

It occurs to me that this largely replaces the get_scaled() method added in #833, with the primary difference being that get_scaled() guarantees not to downcast if it can't assure there will be no overflow. get_scaled() will also make as few copies as possible, while __array__ may make another if get_scaled(dtype).dtype != np.dtype(dtype). If that distinction is not worth drawing, it may be worth removing get_scaled() before it hits a release, and make the __array__ interface THE way to control dtypes beyond get_fdata().

One interesting side effect of this is that mask or atlas data that is known to be <=255, but the scale factors are not guaranteed to be slope = 1, inter = 0 can be retrieved with minimal up-casting via:

np.array(atlas.dataobj, np.uint8)
# or 
np.uint8(atlas.dataobj)

This is of course quite dangerous, and with the current implementation, non-integer data will be happily coerced into meaningless garbage. Validating coercion or using safer casting rules could be a good idea, but that would also break the following equality by causing the RHS to raise an exception:

np.uint8(np.array(atlas.dataobj)) == np.array(atlas.dataobj, np.uint8)

All of your thoughts are very welcome.

Fixes #843.
Affects #842.
Follow-up to #833.

cc @jeromedockes @kchawla-pi @adelavega @rmarkello @matthew-brett

@codecov
Copy link

codecov bot commented Nov 19, 2019

Codecov Report

Merging #844 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #844      +/-   ##
==========================================
- Coverage   90.02%   90.01%   -0.01%     
==========================================
  Files          98       98              
  Lines       12452    12441      -11     
  Branches     2190     2191       +1     
==========================================
- Hits        11210    11199      -11     
  Misses        890      890              
  Partials      352      352
Impacted Files Coverage Δ
nibabel/ecat.py 88.14% <100%> (-0.13%) ⬇️
nibabel/minc1.py 90.75% <100%> (-0.26%) ⬇️
nibabel/dataobj_images.py 95.52% <100%> (-0.26%) ⬇️
nibabel/arrayproxy.py 100% <100%> (ø) ⬆️
nibabel/parrec.py 91.87% <100%> (+0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fcc5448...3351d16. Read the comment docs.

@effigies
Copy link
Member Author

I'm inclined to drop get_scaled() in favor of a dtyped __array__. Any thoughts?

@adelavega
Copy link

The upside of get_scaled is that its a bit clearer what it idoes. Using __array__ directly seems a bit more opaque, unless well documented. But it seems this is not something you necessarily want to encourage doing anyways (versus the standard way of accessing, which is get_fdata), so maybe that's fine.

If so, then I don't see much need for get_scaled either.

@effigies
Copy link
Member Author

I'd be happy to add a more complete description of the dataobj to the docs, if that would help. Right now, it's a pretty small thing here: https://nipy.org/nibabel/images_and_memory.html#use-the-array-proxy-instead-of-get-fdata

Just to concisely summarize: np.array(img.dataobj) (and np.asarray and np.asanyarray) is a valid way of getting information out of the object that's guaranteed to have had scaling applied to it, but the dtype is ambiguous. np.array(img.dataobj, dtype=np.int16) is (with this) very precise, although it may involve unsafe casting.

img.dataobj.get_scaled() is somewhere in between, with bounded ambiguity. Which might still be useful. If I say get_scaled(dtype='uint16'), I know I have something to which I can safely do anything I could do to a uint16.

IDK if that affects your thinking here.

@rmarkello
Copy link
Contributor

In my mind this is similar to the recent pandas (v0.25.0) transition away from DataFrame.get_values() (now deprecated) to np.asarray(DataFrame) and/or DataFrame.to_numpy(), both of which allow unsafe casting of the sort you describe.

I'd personally be fine to see get_scaled() replaced entirely by __array__, but if you're really worried about the casting you could also add a small check (e.g., confirm whether the provided dtype would result in potentially unsafe casting and throw a warning so that people are aware of what they're doing). That way it would still return whatever datatype was requested so the equality you proposed in the initial post would pass, but you're not being too prescriptive.

@effigies
Copy link
Member Author

Okay. I've dropped get_scaled() for now, and will remove it from the API changes in 3.0. It can always be added back, if it's useful enough.

@effigies effigies force-pushed the fix/arraylike_dtype branch from 43c23d6 to 70987fb Compare December 6, 2019 00:04
@effigies
Copy link
Member Author

Any further comments? I'm inclined to do a second RC after this, to give a bit more testing time.

@effigies effigies merged commit f1e1008 into nipy:master Dec 11, 2019
@effigies effigies deleted the fix/arraylike_dtype branch December 11, 2019 13:16
@effigies effigies added this to the 3.0.0 milestone Jan 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ArrayProxy.__array__ should support dtype argument
3 participants