Fixing ArraySequence functionalities #811
Conversation
Codecov Report
@@ Coverage Diff @@
## master #811 +/- ##
==========================================
- Coverage 90.32% 90.14% -0.19%
==========================================
Files 96 98 +2
Lines 12192 12400 +208
Branches 2136 2177 +41
==========================================
+ Hits 11013 11178 +165
- Misses 834 872 +38
- Partials 345 350 +5
Continue to review full report at Codecov.
Hi @MarcCote, thanks for this. I'll try to review today or tomorrow. To be clear, are you targeting next week's 2.5.1 release, or November's 3.0 release (release candidate at the end of October)? If you're going for 2.5.1, could you rebase onto the …? Note that 2.5.1 supports Python 2.7 and 3.4, while nibabel 3.0 will be Python 3.5+.
Other reviewers are of course very welcome... As this will affect @nipy/team-dipy the most, your input would be particularly appreciated.
Wow, that's great! It should solve all the issues we had with it. Thanks!
I think the only "risky" thing left is when a Tractogram is created with an ArraySequence view, which could be externally modified (or the tractogram could modify a part of the original ArraySequence). However, I think it is reasonable to let the user manage this.
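A rough illustration of the aliasing risk being described (this is not code from the PR; it assumes, as discussed in this thread, that slicing an ArraySequence returns a view sharing the internal buffer):

import numpy as np
from nibabel.streamlines import ArraySequence

seq = ArraySequence(np.arange(90.0).reshape((10, 3, 3)))
view = seq[:5]                # assumed to share seq's internal buffer
view[0][:] = 0                # editing through the view ...
assert np.all(seq[0] == 0)    # ... also changes the original sequence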
@skoudoro @effigies it makes more sense to target Nibabel 3.0.
@MarcCote Sounds good. I'll push off my review until after the 2.5.1 release, then, just because I'm a bit slammed. @skoudoro How would you feel about being the primary reviewer on this one? I suspect some back-and-forth with Dipy use cases will be the most productive way to refine this. When you're satisfied I can come back for a final round.
@MarcCote I think if we want to stick close to numpy, we shouldn't warn when slicing. Once a Tractogram is created, the user is responsible for it.
I have a few comments, but mostly I'm a bit concerned that the logic of what makes a copy, what makes a view, and thus the effects of various operations on unrelated objects is not predictable in an obvious way. I'm not steeped in this API, though, so there may be a logic I'm not really seeing.
Would it be too much to ask for a brief tutorial? I think that will make it easier to think through the API, as well as generally useful for new users.
@@ -53,6 +54,35 @@ def update_seq(self, arr_seq):
        arr_seq._lengths = np.array(self.lengths)


def _define_operators(cls):
Would it make sense to make this an abstract class, instead of a decorator? Right now this produces a pretty uninformative pydoc entry:
> pydoc nibabel/streamlines/array_sequence.py
...
| __gt__ lambda self, value
|
| __iadd__ lambda self, value
|
| __idiv__ lambda self, value
|
| __ifloordiv__ lambda self, value
|
| __imul__ lambda self, value
...
I'm not sure how I can do it with an abstract class exactly. Do you mean defining all the methods one by one in the abstract class, then making ArraySequence subclass it? If that's the case, I might as well define them directly in ArraySequence. Also, I was trying to keep the number of new lines to a minimum.
Yeah, I guess if it's only ever one class, that's one thing. Perhaps we can change them to proper methods and give them reasonable docstrings.
I've looked around at numbers and numpy to see if there was a base class we could use, but it doesn't look like it.
@effigies So, I'm going to remove the decorator in favor of explicitly listing all the operators. Sounds good?
That's fine with me.
To be clear, though, what I was suggesting in my last comment was replacing the lambda with a full function, so lambda self: self.op(op) would become def fn(self): return self._op(op), and then you could add a docstring with fn.__doc__ = ... before assigning it.
But whatever's the least work is fine.
+1 for using def fn(self): return self._op(op). Not a big fan of lambda, but it is a personal preference...
Oh, gotcha. That makes sense.
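For readers following along, here is a minimal sketch of the approach agreed on above: a named function with a proper __name__ and a docstring borrowed from numpy, instead of a lambda. It assumes a _op(op, value, inplace) helper like the one quoted later in this thread; it is not the exact PR code.

import numpy as np

def _wrap(cls, op, inplace=False, unary=False):
    # Attach a named special method to `cls` that delegates to `cls._op`.
    def fn_unary_op(self):
        return self._op(op)

    def fn_binary_op(self, value):
        return self._op(op, value, inplace=inplace)

    fn = fn_unary_op if unary else fn_binary_op
    fn.__name__ = op                              # shows up properly in pydoc
    fn.__doc__ = getattr(np.ndarray, op).__doc__  # reuse numpy's docstring
    setattr(cls, op, fn)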
@effigies done. Here's what it looks like now.
NAME
array_sequence
CLASSES
builtins.object
ArraySequence
class ArraySequence(builtins.object)
| ArraySequence(iterable=None, buffer_size=4)
|
| Sequence of ndarrays having variable first dimension sizes.
|
| This is a container that can store multiple ndarrays where each ndarray
| might have a different first dimension size but a *common* size for the
| remaining dimensions.
|
| More generally, an instance of :class:`ArraySequence` of length $N$ is
| composed of $N$ ndarrays of shape $(d_1, d_2, ... d_D)$ where $d_1$
| can vary in length between arrays but $(d_2, ..., d_D)$ have to be the
| same for every ndarray.
|
| Methods defined here:
|
| __abs__(self)
| abs(self)
|
| __add__(self, value)
| Return self+value.
|
| __and__(self, value)
| Return self&value.
|
| __eq__(self, value)
| Return self==value.
|
| __floordiv__(self, value)
| Return self//value.
|
| __ge__(self, value)
| Return self>=value.
|
| __getitem__(self, idx)
| Get sequence(s) through standard or advanced numpy indexing.
|
| Parameters
| ----------
| idx : int or slice or list or ndarray
| If int, index of the element to retrieve.
| If slice, use slicing to retrieve elements.
| If list, indices of the elements to retrieve.
| If ndarray with dtype int, indices of the elements to retrieve.
| If ndarray with dtype bool, only retrieve selected elements.
|
| Returns
                      stacklevel=2)
        return self.get_data()

    def get_data(self):
Are there cases where someone might want read-only access to the original data, without performing a copy?
Since it's not always possible to return a view on just the data belonging to a sliced/indexed ArraySequence, I decided to always make it return a copy.
I know it is bad practice, but I'm expecting Dipy's methods to deal directly with ._data using ._offsets and ._lengths.
That said, maybe there exists a solution but I fail to see it. I'm happy to consider alternatives.
Agreed, I do not think that many users would want read-only access to the original data.
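To make the copy semantics concrete, a small sketch (assuming get_data() always returns a copy, as described above):

import numpy as np
from nibabel.streamlines import ArraySequence

seq = ArraySequence(np.arange(90.0).reshape((10, 3, 3)))
sub = seq[::2]                    # non-contiguous selection: no single flat view exists
data = sub.get_data()             # assumed to be a concatenated *copy* of the selected rows
data[:] = 0                       # mutating the copy ...
assert seq.get_data().sum() != 0  # ... leaves the original sequence untouched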
        pts = self.streamlines._data[start:end]
        self.streamlines.data[start:end] = apply_affine(affine, pts)
        for i in range(len(self.streamlines)):
            self.streamlines[i] = apply_affine(affine, self.streamlines[i])
Does this not cause a significant slowdown, to only update one streamline at a time?
It didn't show up when running benchmark_streamlines.py, but to be fair that benchmark might not be realistic. I don't have any large tractogram file I could use to benchmark it. Do you know where I could easily get one? (Or maybe @skoudoro knows?)
I suppose you already downloaded one. You can look in your home folder: .dipy/bundle_atlas_hcp842/Atlas_80_Bundles/whole_brain/whole_brain_MNI.trk
And you checked that this was fine?
It's a bit slower, but at least it doesn't have side effects.
Benchmark: loading whole_brain_MNI.trk five times.
Old: Loaded 144,678 streamlines 5 times in 16.57
New: Loaded 144,678 streamlines 5 times in 19.61
Edit: this is with the improved TRK loading, see the last commit in this PR.
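For reference, a rough sketch of the timing harness behind numbers like these (purely illustrative; the file path is the Dipy atlas file mentioned above and may need adjusting):

import time
import nibabel as nib

fname = ".dipy/bundle_atlas_hcp842/Atlas_80_Bundles/whole_brain/whole_brain_MNI.trk"
start = time.time()
for _ in range(5):
    trk = nib.streamlines.load(fname)   # TrkFile wrapping a Tractogram
elapsed = time.time() - start
print("Loaded {:,} streamlines 5 times in {:.2f}".format(
    len(trk.tractogram.streamlines), elapsed))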
    check_tractogram(tractogram[::2],
                     streamlines=[s * scaling for s in DATA['streamlines'][::2]],
                     data_per_streamline=DATA['tractogram'].data_per_streamline[::2],
                     data_per_point=DATA['tractogram'].data_per_point[::2])
This feels very strange, but to be clear, is the following the intended semantics?
sliced_tractogram = tractogram[::2] # A "view" tractogram
transformed_tractogram = sliced_tractogram.apply_affine(affine)
assert transformed_tractogram is sliced_tractogram
assert np.array_equal(tractogram[::2].data, sliced_tractogram.data)
Kinda, but Tractogram doesn't have a .data property. LazyTractogram does have one, but it returns a generator over TractogramItem, not an ndarray. In the test, I'm also looking at data_per_point, which is a dictionary of ArraySequence.
@effigies thanks for the review. I'll see what I can do about writing up a small tutorial to showcase the API. I should have done that way before :P.
@MarcCote What's the status here? I'll be feature freezing 3.0.x when the release candidate comes out. 3.1.0 should be February or March next year. I'm okay delaying the release candidate if we have a short-ish time frame (<2 weeks) to get this in. I don't have the bandwidth to guide this one in in that time, so if you want to aim for 3.0, could we find a volunteer from Dipy to be the primary reviewer?
Really sorry for the delay @effigies, it is on my to-do list, and I will spend time tomorrow to do it.
Really nice work @MarcCote, and thank you for that!
During my review, while playing with your PR, I faced something tricky with slicing. This case fails:
import numpy as np
from nibabel.streamlines import ArraySequence

streamlines_a = ArraySequence(np.arange(900).reshape((50, 6, 3)))
streamlines_a[0:4] = streamlines_a[5:9]
Generates the following error:
TypeError Traceback (most recent call last)
<ipython-input-52-785f3cdc5232> in <module>
----> 1 streamlines_a[0:2] = streamlines_a[11:13]
~\Devel\nibabel\nibabel\streamlines\array_sequence.py in __setitem__(self, idx, elements)
435 if len(lengths) != elements.total_nb_rows:
436 msg = "Trying to set {} sequences with {} sequences."
--> 437 raise TypeError(msg.format(len(lengths), elements.total_nb_rows))
438
439 for o1, l1, o2, l2 in zip(offsets, lengths, elements._offsets, elements._lengths):
TypeError: Trying to set 4 sequences with 24 sequences.
I still need to play with your PR.
Here are some more comments. Now I will check whether it breaks DIPY, and look at StatefulTractogram.
    setattr(cls, name,
            lambda self, value: self._op(op, value, inplace=inplace))

    for op in ["__iadd__", "__isub__", "__imul__", "__idiv__",
You can remove __idiv__ since we target Python 3 and this operator is only for Python 2.
"__ifloordiv__", "__itruediv__", "__ior__"]: | ||
_wrap(cls, op, inplace=True) | ||
|
||
for op in ["__add__", "__sub__", "__mul__", "__div__", |
Same comment as above for __div__; it can be removed.
        else:
            args = [] if value is None else [value]  # Dealing with unary and binary ops.
            for o1, l1 in zip(seq._offsets, seq._lengths):
                seq._data[o1:o1 + l1] = getattr(seq._data[o1:o1 + l1], op)(*args)
seq._data[o1:o1 + l1] will fail with __truediv__. I think you should check whether it is a division operator; _data then needs to be cast to float. It is really tricky.
The following example will fail:

import numpy as np
import nibabel as nib

streamlines_a = nib.streamlines.ArraySequence(np.arange(900).reshape((50, 6, 3)))
print(streamlines_a._data.dtype)
test = streamlines_a / streamlines_a  # TypeError: No loop matching the specified signature and casting was found for ufunc true_divide

The result of streamlines_a / streamlines_a will not be an integer array but a floating-point array. However, you reassign it to the initial array, which is an integer array. This is why it fails.
What happens if there is the point [0, 0, 0] in your division?
Good catch.
@skoudoro the dtype now changes depending on the arithmetic operation and the provided value. Regarding division by 0, it will produce NaN (i.e. it follows Numpy's convention).
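A small sketch of the kind of dtype handling being described (not the PR code; the probing trick and variable names are illustrative):

import numpy as np

data = np.arange(900).reshape((-1, 3))    # integer buffer, like seq._data above
value = data                              # right-hand operand of the division

# Let numpy's promotion rules pick the output dtype (int / int -> float64)
# instead of writing the result back into the integer buffer.
out_dtype = (np.zeros(1, data.dtype) / np.ones(1, np.asarray(value).dtype)).dtype
out = np.empty(data.shape, dtype=out_dtype)

with np.errstate(divide='ignore', invalid='ignore'):
    out[:] = data / value                 # 0 / 0 -> nan, x / 0 -> inf (numpy's convention)
print(out.dtype)                          # float64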
Hi guys, what's the status here? Does this feel like something that will be ready to merge in the next week or so?
I'm working on it right now. I'll ping @skoudoro when I've addressed all of the points raised.
Hello @MarcCote, thank you for updating! Cheers! There are no style issues detected in this Pull Request. 🍻 (Comment last updated at 2019-11-13 15:37:35 UTC)
Also, it seems I didn't finish completely the …
    setattr(cls, op, fn_unary_op if unary else fn_binary_op)
    fn = getattr(cls, op)
    fn.__name__ = op
    fn.__doc__ = getattr(np.ndarray, op).__doc__
Beautiful! I love this _wrap function 😄
Yes! It turned out nicely.
I just finished the …
I just realized that indexing a column is counter-intuitive. I'm planning on forcing the use of a full slice (':') for the second dimension (i.e. the points dimension). What do you think?

from nibabel.streamlines import ArraySequence
seq = ArraySequence([[(1, 2, 3)],
                     [(4, 5, 6), (7, 8, 9)],
                     [(10, 11, 12), (13, 14, 15), (16, 17, 18)]
                     ])

# Before
seq[:, 2]  # Retrieve the third coordinate of every point of every sequence.
> ArraySequence([array([3]), array([6, 9]), array([12, 15, 18])])

# After. We force the use of a full slice on the second dimension.
seq[:, 2]
> TypeError: Only full slice (':') is allowed for the second dimension.
seq[:, :, 2]  # Dimensions: 1) sequences, 2) points, 3) common shape data.
> ArraySequence([array([3]), array([6, 9]), array([12, 15, 18])])
OK, I am not personally against that, but I would like @ppoulin91's and @frheault's opinions. My only concern is that I would prefer you force the full slicing either all the time or never; I do not like the idea of having it only for a specific case.
Minor comments, mostly from looking at Coverage.
It would always be forced with my next commit. The example I showed was "before" and "after" the change.
@effigies @matthew-brett so we should do the following?

@property
@deprecate_with_version("'ArraySequence.data' property is deprecated.\n"
                        "Please use the 'ArraySequence.get_data()' method instead",
                        '3.0', '4.0')
def data(self):
    """ Elements in this array sequence. """
    return self._data

instead of …
Remove Python version checks. Refactor unit tests for arithmetic operators.
That's what I was thinking. There might be room in the property manifesto rules to make it a read-only view, which could achieve some of your goal without performing a copy behind people's backs. Just one man's opinion...
I like that idea. Something like that?

view = self._data
view.setflags(write=False)
return view
view = self._data.view()
view.setflags(write=False)
return view
Oops. There you go.
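To spell out the difference between the two snippets above, here is a standalone numpy illustration (not PR code):

import numpy as np

data = np.arange(6.0)

# First snippet: `view = self._data` is just another name for the same array,
# so clearing the write flag would also make the original buffer read-only.
alias = data
alias.setflags(write=False)
# data[0] = 1.0  # would raise ValueError: assignment destination is read-only
data.setflags(write=True)      # restore writability for the second variant

# Second snippet: `.view()` creates a new array object sharing the buffer,
# so the write flag can be cleared on the view alone.
view = data.view()
view.setflags(write=False)
data[0] = 42.0                 # the original stays writable
# view[0] = 0.0  # would still raise ValueError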
@skoudoro How's this looking on your end?
Overall, this looks good to me! But I need to do some tests one last time, so I will come back to you by tomorrow evening. Is that OK?
Sounds good.
Hi @MarcCote, we are really close! There is a small issue. Below is the code to reproduce it:
…
and here is the error:
…
@skoudoro This is also true with numpy. Since your ArraySequence is composed of integers, you can't do a true division in-place. Any suggestions on how I should properly deal with that?
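For reference, the plain-numpy behaviour being referred to (illustrative only):

import numpy as np

a = np.arange(6)        # integer array
try:
    a /= 2              # in-place true division cannot cast the float64 result to int64
except TypeError as exc:
    print(exc)          # numpy raises a (subclass of) TypeError here

a //= 2                 # in-place floor division stays integer, so it works
b = a / 2               # non-in-place true division returns a new float64 array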
@skoudoro Thanks for reviewing. I'll have another look later today. @matthew-brett Any further concerns?
Oh, @ppoulin91, I forgot to tag you. Do you have any further comments?
@@ -5,6 +5,9 @@

import numpy as np

from nibabel.deprecated import deprecate_with_version
Suggested change:
- from nibabel.deprecated import deprecate_with_version
+ from ..deprecated import deprecate_with_version
I had a quick check with @frheault. We didn't follow everything in the last weeks, but everything looks good to us!
@effigies I can move the last commit to a separate PR if needed. That commit is about speeding up the loading of TRK files.
Alright, thanks for the contribution, @MarcCote, and the reviews, @skoudoro, @ppoulin91, @frheault! In it goes.
This PR aims to address the issues related to exposing ArraySequence's internal data (e.g. as discussed in #729 and dipy/dipy#1956 - this post exactly, and my original answer to those issues can be found here).
There are three parts to this PR (I've made separate commits for convenience).
1. ArraySequence.data will now return a copy of the data. This way, altering the data of a sliced/indexed tractogram will not affect the original data.
2. ArraySequence now supports some Python operators to modify its internal data (e.g. +=, -=, ...). Also, assigning/overwriting sequences can be done using arr_seq[idx] = arr_seq2 (provided the shapes are compatible).
3. Tractogram.apply_affine is fixed: calling that function on a sliced/indexed Tractogram object would previously result in modifying all streamlines (i.e. even those not selected).

If you think it is missing some Python operators that might be useful, let me know. A short sketch below illustrates these three points.
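The following sketch is illustrative only; it assumes the semantics described in this PR rather than reproducing its code:

import numpy as np
from nibabel.streamlines import ArraySequence, Tractogram

seq = ArraySequence(np.arange(90.0).reshape((10, 3, 3)))

# 1. The data accessor returns a copy, so mutating it does not corrupt the sequence.
data_copy = seq.get_data()
data_copy[:] = 0
assert seq.get_data().sum() != 0

# 2. Arithmetic operators and sequence assignment act on the internal data.
doubled = seq * 2                # returns a new ArraySequence
seq[0:2] = seq[2:4]              # overwrite sequences of compatible shape

# 3. apply_affine on a sliced tractogram only transforms the selected streamlines.
t = Tractogram(seq, affine_to_rasmm=np.eye(4))
t[::2].apply_affine(np.diag([2.0, 2.0, 2.0, 1.0]))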
@skoudoro
NB: I haven't looked extensively, but this PR would affect Dipy a bit. For instance, this part (https://github.com/nipy/dipy/blob/master/dipy/tracking/streamlinespeed.pyx#L104-L115) should have been using ._data like the other functions in that file. Also, some of @frheault's StatefulTractogram code might be affected.