Fixing ArraySequence functionalities #811


Merged 8 commits on Nov 13, 2019
189 changes: 185 additions & 4 deletions nibabel/streamlines/array_sequence.py
@@ -5,6 +5,8 @@

import numpy as np

from ..deprecated import deprecate_with_version

MEGABYTE = 1024 * 1024


@@ -53,6 +55,37 @@ def update_seq(self, arr_seq):
        arr_seq._lengths = np.array(self.lengths)


def _define_operators(cls):
Member

Would it make sense to make this an abstract class, instead of a decorator? Right now this produces a pretty uninformative pydoc entry:

> pydoc nibabel/streamlines/array_sequence.py
...
     |  __gt__ lambda self, value
     |  
     |  __iadd__ lambda self, value
     |  
     |  __idiv__ lambda self, value
     |  
     |  __ifloordiv__ lambda self, value
     |  
     |  __imul__ lambda self, value
...

Contributor Author

I'm not sure exactly how I would do it with an abstract class. Do you mean defining all the methods one by one in the abstract class, then making ArraySequence subclass it? If that's the case, I might as well define them directly in ArraySequence. Also, I was trying to keep the number of new lines to a minimum.

Member

Yeah, I guess if it's only ever one class, that's one thing. Perhaps we can change them to proper methods and give them reasonable docstrings.

I've looked around at numbers and numpy to see if there was a base class we could use, but it doesn't look like it.

Contributor Author

@effigies So, I'm going to remove the decorator in favor of explicitly listing all the operators. Sounds good?

Member (@effigies, Nov 4, 2019)

That's fine with me.

To be clear, though, what I was suggesting in my last comment was replacing the lambda with a full function, so `lambda self: self._op(op)` would become `def fn(self): return self._op(op)`, and then you could add a docstring with `fn.__doc__ = ...` before assigning it.

But whatever's the least work is fine.
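
For reference, a minimal sketch of that suggestion (the helper name _wrap_binary is hypothetical and assumes a class that provides an `_op` helper; the actual implementation appears as `_wrap` further down in this diff):

import numpy as np

def _wrap_binary(cls, op):
    """ Attach a named binary operator to ``cls`` instead of a bare lambda. """
    def fn(self, value):
        return self._op(op, value)
    fn.__name__ = op
    # Reuse numpy's docstring so pydoc shows e.g. "Return self+value."
    fn.__doc__ = getattr(np.ndarray, op).__doc__
    setattr(cls, op, fn)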

Member

+1 for using `def fn(self): return self._op(op)`. Not a big fan of lambda, but it is a personal preference...

Contributor Author

Oh, gotcha. That makes sense.

Contributor Author

@effigies done. Here's what it looks like now.

NAME
    array_sequence

CLASSES
    builtins.object
        ArraySequence
    
    class ArraySequence(builtins.object)
     |  ArraySequence(iterable=None, buffer_size=4)
     |  
     |  Sequence of ndarrays having variable first dimension sizes.
     |  
     |  This is a container that can store multiple ndarrays where each ndarray
     |  might have a different first dimension size but a *common* size for the
     |  remaining dimensions.
     |  
     |  More generally, an instance of :class:`ArraySequence` of length $N$ is
     |  composed of $N$ ndarrays of shape $(d_1, d_2, ... d_D)$ where $d_1$
     |  can vary in length between arrays but $(d_2, ..., d_D)$ have to be the
     |  same for every ndarray.
     |  
     |  Methods defined here:
     |  
     |  __abs__(self)
     |      abs(self)
     |  
     |  __add__(self, value)
     |      Return self+value.
     |  
     |  __and__(self, value)
     |      Return self&value.
     |  
     |  __eq__(self, value)
     |      Return self==value.
     |  
     |  __floordiv__(self, value)
     |      Return self//value.
     |  
     |  __ge__(self, value)
     |      Return self>=value.
     |  
     |  __getitem__(self, idx)
     |      Get sequence(s) through standard or advanced numpy indexing.
     |      
     |      Parameters
     |      ----------
     |      idx : int or slice or list or ndarray
     |          If int, index of the element to retrieve.
     |          If slice, use slicing to retrieve elements.
     |          If list, indices of the elements to retrieve.
     |          If ndarray with dtype int, indices of the elements to retrieve.
     |          If ndarray with dtype bool, only retrieve selected elements.
     |      
     |      Returns

""" Decorator which adds support for some Python operators. """
def _wrap(cls, op, inplace=False, unary=False):

def fn_unary_op(self):
return self._op(op)

def fn_binary_op(self, value):
return self._op(op, value, inplace=inplace)

setattr(cls, op, fn_unary_op if unary else fn_binary_op)
fn = getattr(cls, op)
fn.__name__ = op
fn.__doc__ = getattr(np.ndarray, op).__doc__
Member (@skoudoro, Nov 5, 2019)

Beautiful! I love this _wrap function 😄

Contributor Author

Yes! It turned out nicely.


for op in ["__add__", "__sub__", "__mul__", "__mod__", "__pow__",
"__floordiv__", "__truediv__", "__lshift__", "__rshift__",
"__or__", "__and__", "__xor__"]:
_wrap(cls, op=op, inplace=False)
_wrap(cls, op="__i{}__".format(op.strip("_")), inplace=True)

for op in ["__eq__", "__ne__", "__lt__", "__le__", "__gt__", "__ge__"]:
_wrap(cls, op)

for op in ["__neg__", "__abs__", "__invert__"]:
_wrap(cls, op, unary=True)

return cls
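
As a quick illustration of what the decorator enables on the ArraySequence class defined just below (a usage sketch, not part of the diff):

import numpy as np

seq = ArraySequence([np.arange(3), np.arange(5)])
doubled = seq * 2     # binary operator -> returns a new ArraySequence
seq += 1              # in-place operator -> modifies seq._data directly
mask = seq > 2        # comparison -> ArraySequence holding boolean data
negated = -seq        # unary operator, via the wrapped __neg__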


@_define_operators
class ArraySequence(object):
""" Sequence of ndarrays having variable first dimension sizes.

@@ -116,9 +149,42 @@ def total_nb_rows(self):
        return np.sum(self._lengths)

    @property
    @deprecate_with_version("'ArraySequence.data' property is deprecated.\n"
                            "Please use the 'ArraySequence.get_data()' method instead",
                            '3.0', '4.0')
    def data(self):
        """ Elements in this array sequence. """
        return self._data
        view = self._data.view()
        view.setflags(write=False)
        return view

    def get_data(self):
Member

Are there cases where someone might want read-only access to the original data, without performing a copy?

Contributor Author

Since it's not always possible to return a view on just the data belonging to a sliced/indexed ArraySequence, I decided to always make it return a copy.

I know it is bad practice but I'm expecting Dipy's methods to deal directly with ._data using ._offsets and ._lengths.

That said, maybe there exists a solution but I fail to see it. I'm happy to consider alternatives.

Member

Agreed, I do not think many users would want read-only access to the original data.

""" Returns a *copy* of the elements in this array sequence.

Notes
-----
To modify the data on this array sequence, one can use
in-place mathematical operators (e.g., `seq += ...`) or the use
assignment operator (i.e, `seq[...] = value`).
"""
return self.copy()._data
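
A usage sketch (not part of the diff) contrasting the deprecated property with the new method:

import numpy as np

seq = ArraySequence([np.arange(3), np.arange(5)])

view = seq.data          # emits a deprecation warning; the returned view is read-only
data = seq.get_data()    # an independent copy of the underlying data
data += 10               # leaves `seq` untouched

seq += 10                # in-place operators modify the sequence itself,
seq[0] = np.zeros(3)     # as does the assignment operator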

    def _check_shape(self, arrseq):
        """ Check whether this array sequence is compatible with another. """
        msg = "cannot perform operation - array sequences have different"
        if len(self._lengths) != len(arrseq._lengths):
            msg += " lengths: {} vs. {}."
            raise ValueError(msg.format(len(self._lengths), len(arrseq._lengths)))

        if self.total_nb_rows != arrseq.total_nb_rows:
            msg += " amount of data: {} vs. {}."
            raise ValueError(msg.format(self.total_nb_rows, arrseq.total_nb_rows))

        if self.common_shape != arrseq.common_shape:
            msg += " common shape: {} vs. {}."
            raise ValueError(msg.format(self.common_shape, arrseq.common_shape))

        return True
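
For instance (a sketch, not part of the diff), a binary operation between incompatible sequences is rejected by `_check_shape`:

import numpy as np

a = ArraySequence([np.arange(3), np.arange(5)])
b = ArraySequence([np.arange(3)])

try:
    a + b
except ValueError as e:
    print(e)  # cannot perform operation - array sequences have different lengths: 2 vs. 1.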

    def _get_next_offset(self):
        """ Offset in ``self._data`` at which to write next rowelement """
@@ -320,7 +386,7 @@ def __getitem__(self, idx):
            seq._lengths = self._lengths[off_idx]
            return seq

        if isinstance(off_idx, list) or is_ndarray_of_int_or_bool(off_idx):
        if isinstance(off_idx, (list, range)) or is_ndarray_of_int_or_bool(off_idx):
            # Fancy indexing
            seq._offsets = self._offsets[off_idx]
            seq._lengths = self._lengths[off_idx]
@@ -329,6 +395,116 @@ def __getitem__(self, idx):
        raise TypeError("Index must be either an int, a slice, a list of int"
                        " or a ndarray of bool! Not " + str(type(idx)))

    def __setitem__(self, idx, elements):
        """ Set sequence(s) through standard or advanced numpy indexing.

        Parameters
        ----------
        idx : int or slice or list or ndarray
            If int, index of the element to set.
            If slice, use slicing to set elements.
            If list, indices of the elements to set.
            If ndarray with dtype int, indices of the elements to set.
            If ndarray with dtype bool, only set selected elements.
        elements : ndarray or :class:`ArraySequence`
            Data that will overwrite selected sequences.
            If `idx` is an int, `elements` is expected to be an ndarray.
            Otherwise, `elements` is expected to be an :class:`ArraySequence`
            object.
        """
        if isinstance(idx, (numbers.Integral, np.integer)):
            start = self._offsets[idx]
            self._data[start:start + self._lengths[idx]] = elements
            return

        if isinstance(idx, tuple):
            off_idx = idx[0]
            data = self._data.__getitem__((slice(None),) + idx[1:])
        else:
            off_idx = idx
            data = self._data

        if isinstance(off_idx, slice):  # Standard list slicing
            offsets = self._offsets[off_idx]
            lengths = self._lengths[off_idx]

        elif isinstance(off_idx, (list, range)) or is_ndarray_of_int_or_bool(off_idx):
            # Fancy indexing
            offsets = self._offsets[off_idx]
            lengths = self._lengths[off_idx]

        else:
            raise TypeError("Index must be either an int, a slice, a list of int"
                            " or a ndarray of bool! Not " + str(type(idx)))

        if is_array_sequence(elements):
            if len(lengths) != len(elements):
                msg = "Trying to set {} sequences with {} sequences."
                raise ValueError(msg.format(len(lengths), len(elements)))

            if sum(lengths) != elements.total_nb_rows:
                msg = "Trying to set {} points with {} points."
                raise ValueError(msg.format(sum(lengths), elements.total_nb_rows))

            for o1, l1, o2, l2 in zip(offsets, lengths, elements._offsets, elements._lengths):
                data[o1:o1 + l1] = elements._data[o2:o2 + l2]

        elif isinstance(elements, numbers.Number):
            for o1, l1 in zip(offsets, lengths):
                data[o1:o1 + l1] = elements

        else:  # Try to iterate over it.
            for o1, l1, element in zip(offsets, lengths, elements):
                data[o1:o1 + l1] = element
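
A usage sketch (not part of the diff) covering the three right-hand-side cases handled above: another ArraySequence, a scalar, or any iterable of arrays.

import numpy as np

seq = ArraySequence([np.arange(3), np.arange(5), np.arange(2)])

seq[0] = np.ones(3)                                       # int index, ndarray value
seq[[1, 2]] = ArraySequence([np.zeros(5), np.zeros(2)])   # fancy indexing, ArraySequence value
seq[:2] = 7                                               # slice, scalar broadcast over each sequence
seq[::2] = [np.full(3, -1), np.full(2, -1)]               # slice, iterable of arrays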

    def _op(self, op, value=None, inplace=False):
        """ Applies some operator to this array sequence.

        This handles both unary and binary operators with a scalar or another
        array sequence. Operations are performed directly on the underlying
        data, or on a copy of it, depending on the value of `inplace`.

        Parameters
        ----------
        op : str
            Name of the Python operator (e.g., `"__add__"`).
        value : scalar or :class:`ArraySequence`, optional
            If None, the operator is assumed to be unary.
            Otherwise, that value is used in the binary operation.
        inplace : bool, optional
            If False, the operation is done on a copy of this array sequence.
            Otherwise, this array sequence gets modified directly.
        """
        seq = self if inplace else self.copy()

        if is_array_sequence(value) and seq._check_shape(value):
            elements = zip(seq._offsets, seq._lengths,
                           self._offsets, self._lengths,
                           value._offsets, value._lengths)

            # Change seq.dtype to match the operation resulting type.
            o0, l0, o1, l1, o2, l2 = next(elements)
            tmp = getattr(self._data[o1:o1 + l1], op)(value._data[o2:o2 + l2])
            seq._data = seq._data.astype(tmp.dtype)
            seq._data[o0:o0 + l0] = tmp

            for o0, l0, o1, l1, o2, l2 in elements:
                seq._data[o0:o0 + l0] = getattr(self._data[o1:o1 + l1], op)(value._data[o2:o2 + l2])

        else:
            args = [] if value is None else [value]  # Dealing with unary and binary ops.
            elements = zip(seq._offsets, seq._lengths, self._offsets, self._lengths)

            # Change seq.dtype to match the operation resulting type.
            o0, l0, o1, l1 = next(elements)
            tmp = getattr(self._data[o1:o1 + l1], op)(*args)
            seq._data = seq._data.astype(tmp.dtype)
            seq._data[o0:o0 + l0] = tmp

            for o0, l0, o1, l1 in elements:
                seq._data[o0:o0 + l0] = getattr(self._data[o1:o1 + l1], op)(*args)

        return seq
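
A sketch (not part of the diff) of the dtype promotion above: the result of the first chunk decides the output dtype before the remaining chunks are written.

import numpy as np

seq = ArraySequence([np.arange(3), np.arange(5)])   # integer data

result = seq / 2.0               # __truediv__ -> _op("__truediv__", 2.0)
print(result.get_data().dtype)   # float64: the copy was cast via astype(tmp.dtype)
print(seq.get_data().dtype)      # original integer dtype: inplace=False leaves `seq` alone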

    def __iter__(self):
        if len(self._lengths) != len(self._offsets):
            raise ValueError("ArraySequence object corrupted:"
@@ -371,7 +547,7 @@ def load(cls, filename):
        return seq


def create_arraysequences_from_generator(gen, n):
def create_arraysequences_from_generator(gen, n, buffer_sizes=None):
    """ Creates :class:`ArraySequence` objects from a generator yielding tuples

    Parameters
@@ -381,8 +557,13 @@ def create_arraysequences_from_generator(gen, n):
        array sequences.
    n : int
        Number of :class:`ArraySequence` objects to create.
    buffer_sizes : list of float, optional
        Sizes (in Mb) for each ArraySequence's buffer.
    """
    seqs = [ArraySequence() for _ in range(n)]
    if buffer_sizes is None:
        buffer_sizes = [4] * n

    seqs = [ArraySequence(buffer_size=size) for size in buffer_sizes]
    for data in gen:
        for i, seq in enumerate(seqs):
            if data[i].nbytes > 0: