Skip to content

Selecting on data_per_streamline #729

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
arokem opened this issue Feb 26, 2019 · 6 comments
Closed

Selecting on data_per_streamline #729

arokem opened this issue Feb 26, 2019 · 6 comments

Comments

@arokem
Copy link
Member

arokem commented Feb 26, 2019

I am trying to use the data_per_streamline attribute to select streamlines from within a Tractogram object. I am running into the following behavior.

I initialize a very simple tractogram object. The first streamline has the data_per_streamline foo key set to 0 and the other two streamlines have this set to 1:

import numpy as np 
import nibabel.streamlines as sl 
  
tgram = sl.Tractogram([np.array([[0, 0, 0], [1, 1, 1]]), 
                       np.array([[2, 2, 2], [3, 3, 3], [4, 4, 4]]), 
                       np.array([[5, 5, 5], [6, 6, 6]])], 
                       data_per_streamline={'foo': [[0],[1],[1]]}) 

I can find the right streamlines using np.where:

idx = np.where(tgram.data_per_streamline['foo'] == 1)[0] 
idx
array([1, 2])

It even looks like it's doing the right thing when I use this for indexing:

tgram[idx].streamlines                                                   
ArraySequence([array([[2, 2, 2],
       [3, 3, 3],
       [4, 4, 4]]), array([[5, 5, 5],
       [6, 6, 6]])])

But under the hood:

tgram[idx].streamlines._data                                             
array([[0, 0, 0],
       [1, 1, 1],
       [2, 2, 2],
       [3, 3, 3],
       [4, 4, 4],
       [5, 5, 5],
       [6, 6, 6]])

Why is it still holding onto all of the data? Is there some way to drop the data for the first (not selected) streamline?

@arokem
Copy link
Member Author

arokem commented Feb 26, 2019

In case you are wondering, also:

tgram.streamlines[idx]._data   
array([[0, 0, 0],
       [1, 1, 1],
       [2, 2, 2],
       [3, 3, 3],
       [4, 4, 4],
       [5, 5, 5],
       [6, 6, 6]])

@arokem
Copy link
Member Author

arokem commented Feb 26, 2019

Surely, I am not supposed to do the following?

new_tgram = sl.Tractogram((tgram.streamlines._data[tgram.streamlines._offsets[i]:tgram.streamlines._offsets[i]+tgram.streamlines._offsets[i]+1] for i in idx),
                           data_per_streamline={k: [tgram.data_per_streamline[k][i] for i in idx] for k in tgram.data_per_streamline.keys()})

Though it does seem to have the expected behavior (at least in this case).

arokem added a commit to arokem/pyAFQ-legacy that referenced this issue Feb 26, 2019
@effigies
Copy link
Member

Just to be clear, it looks like you're in the situation where you load a large number of streamlines, want to make a copy of a subset, and release the memory for all of the rest? From a quick glance through the code, this use case wasn't really coded for. Perhaps it would be best to add a method to allow for making a pruned copy?

I'll note that the ArraySequence.data property simply returns _data without applying offsets.

def data(self):
""" Elements in this array sequence. """
return self._data

If that's a bug, it may be that, after fixing, you could do something like the following:

new_tgram = sl.Tractogram(tgram.streamlines.data)

I'll rope @MarcCote in again, in hopes he can shine some light here. I'm only familiar with this API from a few recent PRs.

@MarcCote
Copy link
Contributor

That was the intended behavior but I agree it might be unexpected when looking under the hood. The advanced indexing on ArraySequence is a lazy operation to avoid creating more memory if we can. If you want to prune the data, you should call .copy() on the ArraySequence, like so

In [5]: tgram[idx].streamlines.copy().data
Out[5]: 
array([[2, 2, 2],
       [3, 3, 3],
       [4, 4, 4],
       [5, 5, 5],
       [6, 6, 6]])

I think we want with that approach (i.e. explicitly calling .copy()) so users are aware that operation will be creating additional memory. I'm open to suggestions to make it clearer to the user?

@arokem
Copy link
Member Author

arokem commented Feb 28, 2019

Thanks! That actually seems rather reasonable.

More generally, I think that a documentation tutorial on streamlines would be useful to have. Maybe that's something I could do at the DIPY workshop in a couple of weeks.

@MarcCote
Copy link
Contributor

MarcCote commented Mar 1, 2019

That would be awesome. I'd be happy to review it. Sadly, I can't make the Dipy workshop but maybe I can set some time aside and join remotely.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants