Tractography Data Format #942

Open
francopestilli opened this issue Jul 30, 2020 · 99 comments

@francopestilli

It would be terrific to start a conversation about an agreed-upon data format for tractography.
@arokem @Garyfallidis

@mrneont

mrneont commented Jul 30, 2020

Hi-

I am interested in this. I have worked on the tractography tools in AFNI (with Ziad Saad).

I imagine FSL developers would be interested, as would Frank Yeh @frankyeh of DSI-Studio and the developers of MRtrix3.

Thanks,
Paul Taylor

@francopestilli
Author

Yep! We should have this as an open discussion.

@frankyeh

I'd love to see a new format standard. The TRK file has given me a lot of headaches and limited a lot of possible extensions. DSI Studio will surely support any open standard for tractography.

@effigies
Member

effigies commented Jul 30, 2020

While I don't necessarily think that a new tractography data format must be constrained by the NiBabel API, @MarcCote did put a fair bit of time and thought into the streamlines API. It might be worth thinking through whether this API is sufficient or if it's missing something.

Part of my thinking is that once a sufficient API is settled on, we can turn that around to quickly prototype more-or-less efficient implementations.

For what it's worth, I recently explored the API a bit and tried to summarize a bit in the following:

https://github.com/effigies/nh2020-nibabel/blob/ef3addf947004ca8f5610f34e767a578c4934c09/NiBabel.py#L821-L911
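As a rough sketch of what round-tripping through that API looks like today (nibabel.streamlines calls only; the bundle_id field is just an illustrative use of data_per_streamline, not an existing convention):

import numpy as np
import nibabel as nib

# Two streamlines, already in world (RASMM) coordinates.
streamlines = [np.array([[0., 0., 0.], [1., 1., 1.], [2., 2., 2.]], dtype=np.float32),
               np.array([[0., 0., 0.], [0., 1., 2.]], dtype=np.float32)]

# Per-streamline metadata, e.g. a bundle label attached to each streamline.
tractogram = nib.streamlines.Tractogram(
    streamlines,
    data_per_streamline={'bundle_id': np.array([[1.], [2.]], dtype=np.float32)},
    affine_to_rasmm=np.eye(4))

nib.streamlines.save(tractogram, 'example.trk')

# Lazy loading iterates streamlines without reading the whole file into memory.
trk = nib.streamlines.load('example.trk', lazy_load=True)
for s in trk.tractogram.streamlines:
    print(s.shape)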

@MarcCote
Contributor

MarcCote commented Aug 4, 2020

I totally agree with @effigies. Adding more people to the discussion. @frheault @jchoude

@francopestilli
Author

@MarcCote @frheault @jchoude @frankyeh great!
@effigies partial data loading and efficiency will be critical for the format.

@mrneont

mrneont commented Aug 4, 2020

Trying to include a few other potentially interested people: @bjeurissen @jdtournier @neurolabusc
(Not finding an obvious FSL-tracking contact via github ID-- could someone else please help with that?)

@mrneont

mrneont commented Aug 4, 2020

@effigies -- would be great to try that API with a demo.

Some functionality we value is keeping track of tracts as bundles-- if we put in some N>2 targets, we often care about any pairwise connections amongst those as separate bundles, because we are basically using tractography to parcellate the WM skeleton. Does that labeling/identifying of groups of tracts exist there?

@effigies
Member

effigies commented Aug 4, 2020

The FMRIB contacts I know are @eduff and @pauldmccarthy. They might be able to point us to the right people...

@neurolabusc

Before creating a new format, it might be worth considering the existing formats and see if any could be enhanced or improved, similar to the way that NIfTI maintained Analyze compatibility while directly addressing the weaknesses. The popular formats seem to be:

Unlike triangulated meshes, tractography can not benefit from indexing and stripping, so the existing formats all seem to be pretty similar (all describe node-to-node straight lines, not splines).

I concur with @mrneont that it is nice to have the ability to describe tracks as bundles. I think this makes TRK the most attractive choice (some other formats are not well documented, so they may also support this feature).

Perhaps @frankyeh can expand on his frustrations with TRK. What are the limitations? Likewise, perhaps everyone (including @francopestilli who started this thread) can provide a wish list for desired features.

@Garyfallidis
Member

Hi all,

And thank you for bringing this up. I have to say this is a recurrent topic; every one or two years it re-emerges.

I suggest that before you do anything else you study what is already developed in the existing API in nibabel.

Marc-Alex and I worked quite a bit to support all the basic needs: accessing bundles fast, adding properties, etc. We have already implemented a fast version that can load/save tracks to npz (a NumPy format), which you can use if you have big data.

For me the main decision that requires feedback from the community is the formatting technology. Do you want to save the end result using JSON, HDF5, glTF or something else? If we can decide on that, then we are set. The work to study previous formats is already mostly done, at least on my side.

Nonetheless, see also a recent paper on a new format called TRAKO:
https://arxiv.org/pdf/2004.13630.pdf

@frheault

frheault commented Aug 4, 2020

It is important to mention that no matter the file format, the main problems around standardization will remain.
Most people have a lot of trouble with TRK because you can mess up the space, the header, or both.
But the same is true for VTK/TCK: one can always write the TCK data wrong and/or provide the wrong NIfTI as a reference for the transformation.

No matter the new format, the same difficulties will remain. There are a thousand ways to write a TRK wrong, and many people write it wrong and read it wrong too, yet it can still work in their software. I think I was added due to my contribution to DIPY (StatefulTractogram). No matter the new format, I think that as long as people can have header attributes such as:

  • Space (VOX, VOXMM, RASMM)
  • Origin (of voxel: CORNER or CENTER)
  • Affine (vox2rasmm or its inverse)
  • Dimensions (shape of the diffusion volume from which the tractogram was computed)
  • Voxel size (for verification against the affine)
  • Voxel order (for verification against the affine)

I think I will be happy. For example, in my own code I used @MarcCote's API to write an HDF5 format in which the length, offset, and data of one or multiple 'tractograms' are saved, so I can easily read any of these tractograms (I use it for connectomics, but it could also be used for bundles), and one could do the same to read any particular streamlines. As long as the attributes listed earlier are available, anything can be done after that.
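For illustration, a minimal sketch of pulling every attribute on that list from the reference NIfTI with nibabel (the dict layout itself is hypothetical; the point is that nothing beyond the reference image is needed):

import nibabel as nib

img = nib.load('dwi.nii.gz')  # the diffusion volume the tractogram was computed from

header_attributes = {
    'space': 'RASMM',                                     # VOX, VOXMM or RASMM
    'origin': 'CENTER',                                   # CORNER or CENTER of voxel
    'affine': img.affine.tolist(),                        # vox2rasmm
    'dimensions': list(img.shape[:3]),                    # grid shape
    'voxel_sizes': [float(z) for z in img.header.get_zooms()[:3]],  # cross-check with affine
    'voxel_order': ''.join(nib.aff2axcodes(img.affine)),  # e.g. 'RAS'
}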

Also, if a new format is added to DIPY and it is StatefulTractogram-friendly, it can easily be converted back to the other commonly supported formats (TCK, TRK, VTK/FIB, DPY). If these efforts lead to more efficient reading for computation, I think there is no problem with supporting more formats. If the goal is to reduce confusion in how to write/read the format, I believe a new format would never help: the unstructured (not grid-like) nature of a tractogram makes it harder, since the header and the data are not tied together when it comes to spatial coherence.

PS: I personally think TRK is fine; everything is in the header. The problem is the variety of ways people can write/read it wrong, which makes support across tools and labs pretty difficult. However, I think the strict approach to reading/writing in DIPY was beneficial in the long term. Short term, sure, maybe half of the users hate me on some level, but I think a strict TRK (or at least a TCK always paired with the NIfTI that generated it) is superior to a lot of formats, just in terms of available info, not for large-scale computation, visualisation, and fancy reading/writing.

@arokem
Member

arokem commented Aug 5, 2020

At the risk of somewhat rerouting the conversation, I guess since this has come up, this might also be a good time to discuss whether it would be beneficial to "upstream" the StatefulTractogram that @frheault has implemented in DIPY into nibabel. It's been "incubating" in DIPY since April 2019 (dipy/dipy#1812), with a few fixes and improvements along the way (e.g., dipy/dipy#2013, dipy/dipy#1997) that have followed rather closely the continuous improvements that have happened in nibabel. When Francois originally implemented this, we discussed the possibility of eventually moving the SFT object implementation here (see @effigies comment here about that: dipy/dipy#1812 (comment)). As someone who has been using SFT quite extensively in my own work, I find it rather helpful (where previously I was struggling with all the things that @frheault mentioned above). So I think that there could be a broader use for it. Just to echo @frheault's comment: a benefit of that would be that you could move between SFT-compliant formats with less fear of messing things up. I guess the question is one of timing and of potential future evolution of the SFT object. What are your thoughts, @frheault (and others, of course)?

And to bring the discussion back to its start -- @francopestilli -- I am curious to hear: what needs are not currently addressed and prompted your original post here? Are you looking to store information that is not in the currently-supported formats? Or is there some limitation on performance that needs to be addressed?

@arokem
Member

arokem commented Aug 5, 2020

Oh - just saw the mailing list messages and now understand where this originated (here: https://mail.python.org/pipermail/neuroimaging/2020-July/002161.html, and in other parts of that thread). Sorry: hard to stay on top of everything...

@frankyeh

frankyeh commented Aug 5, 2020 via email

@jdtournier

I'm looping Rob Smith (@Lestropie) into this conversation; this is something we've discussed many times in the past.

I've not had a chance to look into all the details here, but here's just a few of my unsolicited thoughts on the matter, for what they're worth:

  • The issue of loading a 10M streamline tractogram into memory is in my opinion independent of the file format - it's about internal in-memory data representation, and as shown by the TRK handling mentioned above, different implementations can handle the same format very differently.

  • Simplicity: essential if it's to be accepted as a standard. It should be relatively easy to code up import/export routines from any language, without relying on external tooling. As also mentioned by @frheault, there are lots of ways of storing these data wrong, no matter the format, so it's important to minimise any unnecessary complexities in the format, and be explicit about the conventions used. For example, the tck format used in MRtrix stores vertices in world coordinates and has only 2 required elements in its header: a datatype specifier (almost invariably Float32LE), and an offset to the start of the binary data (a minimal example header is sketched just after this list). I can't think of anything else that would be classed as necessary here (though of course in practice there's lots of additional useful information that we want to store in the header).

  • Space efficiency: these are likely to contain very large amounts of data, and these should take no more space than is strictly necessary to store the information. The type of geometry and/or data layout is known in advance, so I don't think it makes sense to try to use more generic container formats like VTK or HDF5 - these will likely require more space to describe the geometry / layout. Text formats are also likely to be inefficient from that point of view.

  • Load/store efficiency: loading large amounts of data will take even longer if the data need to be converted, especially to/from text. Ideally it should be possible to read() / write() the data into/out of memory in one go, and even better, memory-map it and access it directly. This implies storing in IEEE floating-point format, most likely little-endian since that's the native format on all CPUs in common use. We could discuss whether to store in single or double precision, but I don't expect there will be many applications where we need to store vertex locations with 64 bits of precision - in fact, I wouldn't be surprised if the discussion goes the other way, with the possibility of using 16 or 24 bit floats instead (though these would require conversion and could potentially slow down the IO).

  • Independence: I think it's critical that the format is standalone, and independent of any external reference. Having to supply the coordinate system for the tractogram by way of a user-supplied image would in my opinion massively expand the scope for mistakes. I don't mind so much if the necessary information is encoded in the header, as suggested by @frheault above - but I don't see that it adds a great deal to simply storing the data in world coordinates directly. I do appreciate that it probably matches the way data are processed in many packages, where everything is performed in voxel space. In MRtrix, everything is performed in real space, and the fODFs / eigenvectors are stored relative to world coordinates also, so there's no further conversion necessary - I appreciate not all packages work that way. And for full disclosure: we (MRtrix) would have a vested interest here since storing in anything other than world coordinates would probably mean more work for our applications.

  • Extensibility: we routinely add lots more information in our tractogram headers as the need arises, and I expect there will be many applications where the ability to store additional information more loosely will be useful. A standard format should allow for this, and also allow for additional entries in the header to become part of the official standard if & when their use becomes commonplace.
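To make the simplicity point concrete, a minimal, hand-written tck header might look like this (illustrative only; the offset in the file: line must point at the first byte of binary data, which here falls immediately after the END line):

mrtrix tracks
datatype: Float32LE
count: 2
file: . 58
END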

So that's my 2 cents on what I would like a standard tractography format to look like. You'll note that I've more or less described the tck format, and yes, there's a fairly obvious conflict of interest here... 😁

However, there's clearly features that the tck format doesn't support that others do, though I've not yet felt the need to use them. The ability to group streamlines within a tractogram is interesting. I would personally find it simpler to define a folder hierarchy to encode this type of information: it uses standard filesystem semantics, allows human-readable names for the different groups, and allows each group to have its own header entries if required. Assuming the header is relatively compact, it also shouldn't take up a lot more storage than otherwise. And it allows applications to use simple load/store routines to handle single tractogram files, and trivially build on them to handle these more complex structures as the need arises. Others may (and no doubt will) disagree with this...

Another issue that hasn't been discussed so far is the possibility of storing additional per-streamline or per-vertex information. That's not currently something that can be done with the tck format, though it may be possible with others. This is actually the main topic of conversation within the MRtrix team. We currently store this type of information using separate files, both because our tck format isn't designed to handle it (probably the main reason, to be fair), but also because it avoids needless duplication in cases where several bits of information need to be stored (this was also one of the motivations for our fixel directory format). For example, if we want to store the sampled FA value for every vertex, we could store it in the file, but what happens if we also want to sample the MD, AD, RD, fODF amplitude, etc.? It's more efficient to store just the sampled values separately alongside the tractogram, than to duplicate the entire tractogram for each measure just so it can reside in the same file. Alternatively, we could allow the format to store multiple values per vertex, but then we'd need additional complexity in the header to encode which value corresponds to what - something that's inherently handled by the filesystem if these values are stored separately. And on top of that, a format that allows for these per-streamline and/or per-vertex information would necessarily be more complex, increasing the scope for incorrect implementations, etc. Again, this is a likely to be a topic where opinions vary widely (including within our own team), but my preference here is again to keep the file format simple and uncluttered, and rely on the filesystem to store/encode additional information: it's simpler, more flexible, leverages existing tools and concepts that everyone understands, and avoids the need for additional tools to produce, inspect and manipulate these more complex datasets.

OK, that's the end of my mind dump. Sorry if it's a bit long-winded...

@eduff

eduff commented Aug 5, 2020

Probably the best person to bring in regarding tractography formats from FMRIB would be Saad Jbabdi (or possibly Michiel Cottaar @MichielCottaar).

@neurolabusc

I think @jdtournier did a nice job of describing the tradeoffs, and I also agree that IEEE-754 float16 (e.g. GLhalf) probably provides more than sufficient precision, which would improve space and load/store efficiency. Thanks @Garyfallidis for noting the TRAKO format. It clearly achieves good space and load efficiency. It seems weak on the simplicity and store-efficiency metrics: on my machine (which already has a lot of Python packages installed) the example code required installing an additional 110 MB of Python packages, and the current JavaScript code only decodes data. So, at the moment, it is a promising proof of concept, but not without tradeoffs. @jdtournier also makes an interesting case that perhaps scalars should be stored in files separate from the tractography file (e.g. NIfTI volumes).

@Garyfallidis
Member

@jdtournier the support of tck is already available in nibabel and dipy. If your claim is that we should just use tck, then the answer is that many labs are not satisfied with the tck format. If they were fine with it, we would just use tck.

The effort here is to find a format that will be useful to most software tools. Nonetheless, if you look into the current implementation you will see that the tractograms are always loaded in world coordinates. But the advantage here is that you could have those stored in a different original space in the format. As for storing other metrics, I think we still need that information because a) many labs use such a feature, and b) if you store the data in other files then you always have to interpolate, and perhaps the interpolation used is not trivial. Also, you could have metrics that are not related to standard maps such as FA; for example, you could have curvature saved for each point of the streamline. Would you prefer curvature being saved as a NIfTI file? That would not make sense, right?

@Garyfallidis
Member

My suggestion to move forward is that @frheault, who has studied multiple file formats and found their similarities and differences, writes down the specifications of the new format and sends them over to the different labs and tools for approval and suggestions. It is important to show that in nibabel we have done the required work to study all, or at least most, of what is out there and that the initial effort comes with some form of consensus. I hope, @frheault, that you will accept to lead such a task. And also thank you for your tremendous effort to make some sense of this world of tracks. Of course we will need the help of all of us, especially @effigies, @matthew-brett and @MarcCote, but I think you are the right person to finally get this done so we can move on happily as a community.

@neurolabusc

@Garyfallidis, I agree with your view that formats face a Darwinian selection, and therefore popular formats are filling a niche. However, your comment that "If your claim is that we should just use tck, then the answer is that many labs are not satisfied with the tck format. If they were fine with it, we would just use tck" is the fallacy of the converse. Just because popular formats are useful does not mean that unpopular formats are not useful. Consider the as-yet-uncreated format advocated by many in this thread: it is currently used by no one, yet that does not mean it cannot fill a niche. It could be that people simply use an inferior format because their tool does not support a better format, the better format is not well documented, or they are not aware of the advantages of a better format. I think we want a discussion of the technical merits of the available formats and the desired features for a format. @jdtournier provides a nice list of metrics to select between formats.

@jdtournier the challenge I have with tck is that I can not find any documentation for it. My support for this format was based on porting the MATLAB read and write routines. It is unclear if these fully exploit the format as implemented by mrview and other MRtrix tools.

@neurolabusc

@jdtournier per your comment Another issue that hasn't been discussed so far is the possibility of storing additional per-streamline or per-vertex information, in TRK-speak these are properties (per-streamline) and scalars (per-vertex). Several comments have noted this as a benefit of the TRK format. I take your point that voxelwise images (NIfTI, MIF) can provide an alternative method to compute many per-vertex measures, but also @Garyfallidis' concern that these cannot encode all the measures we might want (e.g. curvature).

Maybe I am naive, but when I explored TrackVis, I thought there was a way to save TRK files that would map MNI world space without knowing the dimensions of the corresponding voxel grid:

vox_to_ras = [1 0 0 0.5; 0 1 0 0.5; 0 0 1 0.5; 0 0 0 1]
voxel_order = 'RAS'
image_orientation_patient = [1 0 0; 0 1 0]
invert_x = 0
invert_y = 0
invert_z = 0
swap_xy = 0
swap_yz = 0
swap_zx = 0 

As I recall, I tried this with TrackVis with several artificial datasets that tested the alignment, and this seemed like an unambiguous way to map files nicely.

From the discussion so far, I still see TRK as the leading format available using the metrics of @jdtournier. I concur with @frheault that regardless of format, one of the core issues is explicitly defining the spatial transform.

So my question is, what problems do people have with TRK, and what new features does the field need? Can improved compliance, documentation and perhaps tweaks allow TRK to fulfill these needs?

@Garyfallidis
Member

@neurolabusc the main problem with TRK is speed: it takes a long time to load/save big files. But there are also others, for example limitations on what parameters can be saved. @MarcCote and @frheault, can you explain?

@Garyfallidis
Member

Another issue is accessing specific parts of the file. Currently there is no support for fast access to specific bundles or parts of the tractogram. Another issue is memory management: TRK has no support for memory mapping or anything similar. Some of these files are getting too large to load fully into memory, and for some applications it is better to keep them in a memory map.

@francopestilli
Author

Hi folks. I support the comments @Garyfallidis reported above. As the size of tractograms increases, we need a file format that allows partial loading of the data (say, percentages of the streamlines).

@jdtournier

@jdtournier the support of tck is already available in nibabel and dipy. If your claim is that we should just use tck, then the answer is that many labs are not satisfied with the tck format. If they were fine with it, we would just use tck.

OK, obviously my vague attempts at humour have not gone down well. The main point of my message was to provide a list of the criteria that I would consider important for a standard file format. They happen to be mostly embodied in the tck format, perhaps unsurprisingly, and I'm being upfront about the fact that this is likely to be perceived as a conflict of interest - which clearly it has been anyway.

I'm not arguing that tck should become the standard, and clearly the fact that there's a discussion about this means that at least some people don't think it should be either. That's fine, but since I've been invited into the discussion, I thought I'd state my point of view as to what such a format should look like. And yes, I have a problem in articulating that without looking like I'm arguing for the tck format, precisely because the considerations that went into its original design 15 years ago are still in my opinion relevant today.

Nonetheless, if you look into the current implementation you will see that the tractograms are always loaded in world coordinates.

But that's a matter of the software implementation, not the file format, right? Perhaps I'm getting confused here, but if we're discussing a new standard file format for tractography, then it should be independent of any specific software implementation or API. This discussion is taking place on the nibabel repo, which is perhaps why we're getting our wires mixed up. I don't wish to belittle the massive efforts that have gone into this project, but I'd understood this discussion to be project-independent.

But the advantage here is that you could have those stored in a different original space in the format.

I understand that, and I can see the appeal. I can also see the overhead this imposes on implementations to support multiple ways of storing otherwise equivalent information. This is why I would argue, on the grounds of simplicity, that a standard file format should adopt a single, standard coordinate system. Otherwise we'll most likely end up with fragmentation in what the different packages support: some will only handle one type of coordinate system because they haven't been updated to support the others, and will hence produce files that other packages won't be able to handle because they only support a different coordinate system. We could of course mandate that to be compliant, implementations should support all allowed coordinate systems, but I don't think this is necessarily how things would work out in practice. And we can provide tools to handle conversions between these so that these different tools can interoperate regardless, but I'm not sure this would be a massive step forward compared to the current situation.

On the other hand, I appreciate that different projects use different coordinate systems internally, and that therefore picking any one coordinate system as the standard will necessarily place some projects at a disadvantage. I don't see a way around this, other than by your suggestion of allowing the coordinate system to be specified within the format. I don't like the idea, because this means we'd effectively be specifying multiple formats, albeit within the same container. But maybe there is no other way around this.

As for storing other metrics, I think we still need that information because a) many labs use such a feature, and b) if you store the data in other files then you always have to interpolate, and perhaps the interpolation used is not trivial.

OK, there's a misunderstanding here as to what I was talking about. First off, no argument: the many labs that need these features include ours, and we routinely make use of such information. But we don't store it as regular 3D images, that would make no sense in anything but the simplest cases. It wouldn't be appropriate for fODF amplitude, or for any other directional measure, or curvature, as you mention.

What I'm suggesting is that the information is stored as separate files that simply contain the associated per-vertex values, with one-to-one correspondence with the vertices in the main tractography file, in the same order. This is what we refer to in MRtrix as a 'track scalar file' - essentially just a long list of numbers, with the same number of entries as there are streamline vertices. We routinely use them to encode per-vertex p-value, effect size, t-value, etc. when displaying the results of our group-wise analyses, for example.

We also use separate files for per-streamline values (used extensively to store the weights for SIFT2), and these are also just a long list of numbers, one per streamline, in the same order as stored in the main file - and in this case, stored simply as ASCII text.

I'm not sure the specific format we've adopted to store these values is necessarily right or optimal in any sense, I'm only talking about the principle of storing these associated data in separate files, for the reasons I've outlined in my previous post: in my opinion, it's more space-efficient, more transparent, and more flexible than trying to store everything in one file.

I should add that storing the data this way does introduce other limitations, notably if the main tractography files need to be edited in some way (e.g. to extract tracts of interest from a whole-brain tractogram). This then requires special handling to ensure all the relevant associated files are kept consistent with the newly-produced tractography file. This type of consistency issue is a common problem when storing data across separate files, and I'm not sure I've got a good answer here.

In any case, I've set out my point of view, and I look forward to hearing other opinions on the matter.

@frheault

frheault commented Aug 5, 2020

@neurolabusc I think the problems with TRK are related to efficiency when it comes to large datasets.
Selecting only a subset is not optimal (especially if you want a random one), reading is slow, and controlling the size of the float (float16/float32/float64) is not possible. When doing connectomics it is impossible to have a hierarchical file that allows saving the streamlines connecting each pair of regions (the same logic applies to bundles), and a personal gripe is that the header is too complex for most people.

@Garyfallidis Despite all the flaws of tck/trk/vtk, people have been using them for more than a decade, so I think a first iteration should be as simple as possible: a hierarchical HDF5, readable by chunk using data/offset/length (you read offset/length and then know what data to read, then you reconstruct the streamlines as polylines), that can append/delete data in place, supports data_per_streamline and data_per_point (and data_per_group if it is hierarchical), with a StatefulTractogram-compliant header and a strict saving/loading routine to prevent errors.

@jdtournier I don't know if you are familiar with the data/offset/length approach in the ArraySequence of nibabel, but it is a very simple way to store streamlines in 3 arrays of shapes NBR_POINTS x 3, NBR_STREAMLINES, and NBR_STREAMLINES, which I have used in the past with memmap and HDF5 to quickly read specific chunks or do multiprocessing with shared memory. Reconstructing the streamlines is efficient since the point data is mostly contiguous (depending on the chunk size).

Bonus: I think HDF5 lets you specify the datatype of each array, so float16 could be used to reduce the file size. Also, MATLAB and C++ have great HDF5 libraries to help with reading.
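As a rough sketch of that layout with h5py (group and dataset names are hypothetical; the point is only the data/offsets/lengths idea and the per-dataset dtype):

import h5py
import numpy as np

# Flat concatenation of every point, plus per-streamline offsets and lengths.
lengths = np.array([30, 50, 20], dtype=np.uint32)                 # NBR_STREAMLINES
offsets = np.concatenate(([0], np.cumsum(lengths)[:-1])).astype(np.uint64)
points = np.random.rand(int(lengths.sum()), 3)                    # NBR_POINTS x 3

with h5py.File('tractogram.h5', 'w') as f:
    grp = f.create_group('streamlines')
    grp.create_dataset('data', data=points, dtype='float16')      # per-dataset dtype
    grp.create_dataset('offsets', data=offsets)
    grp.create_dataset('lengths', data=lengths)

# Read back only the second streamline, without touching the rest of the file.
with h5py.File('tractogram.h5', 'r') as f:
    start = int(f['streamlines/offsets'][1])
    stop = start + int(f['streamlines/lengths'][1])
    second = f['streamlines/data'][start:stop]                    # shape (50, 3)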

Finally, I agree that storing metrics per point would make an even bigger tractogram, but allowing data per point and per streamline will likely make life easier for a few, while the others can simply do it their own way. I also agree that the data written on disk should be in world space (rasmm) like tck; that should be the default, but with the info needed to convert to tck/trk easily and so on, leaving compatibility intact for a lot of people.

Your list (Simplicity, Space efficiency, Load/store efficiency, Independence, Extensibility) is crucial to think about. I think the header would be much simpler than TRK, but with slightly more info than TCK; I would go for the 4-5 attributes I mentioned earlier, which would be a sweet spot between Simplicity and Independence. As for Extensibility, HDF5 is basically a gigantic hierarchical dictionary, so as long as the mandatory keys are there, adding more data could be done easily; more header info or even a record of processing steps would be possible (if wanted), like in a .mnc file.

However, except for the switch to float16, I think reading/writing is more or less bound to its current limit. Supporting chunked or on-the-fly read/write is nice, but that would not change the speed of reading/writing a whole tractogram.

@jdtournier

@jdtournier the challenge I have with tck is that I can not find any documentation for it.

That's unfortunate, it's documented here. If you'd already come across this page but found it insufficient, please let me know what needs fixing!

@neurolabusc

neurolabusc commented Aug 5, 2020

@Garyfallidis I agree TRK is inherently slow to read. Interleaving the integer "Number of points in this track" with the array of float vertices is a poor design. Much better performance could be achieved if the integers were stored sequentially in one array and the vertices were stored in their own array: one could load the vertices directly into a VBO. This would also allow fast traversal of the file, addressing your second criticism. Both would improve the Load efficiency metric.
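To illustrate why the interleaving hurts, here is a rough sketch of a plain TRK streamline reader (ignoring endian and version details, and assuming n_scalars and n_properties have already been parsed from the fixed 1000-byte header): each track forces a separate small read, because the point count of track i is only known after track i-1 has been consumed.

import numpy as np

def read_trk_streamlines(path, n_scalars=0, n_properties=0):
    streamlines = []
    with open(path, 'rb') as f:
        f.seek(1000)                                     # skip the fixed-size header
        while True:
            raw = f.read(4)
            if len(raw) < 4:
                break
            n = int(np.frombuffer(raw, dtype='<i4')[0])  # points in this track
            pts = np.fromfile(f, dtype='<f4', count=n * (3 + n_scalars))
            streamlines.append(pts.reshape(n, 3 + n_scalars)[:, :3])
            f.seek(4 * n_properties, 1)                  # skip per-track properties
    return streamlines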

@frheault adding support for float16 would improve the Space efficiency and Load efficiency metrics. I am not sure of the use case for float64 vertices, but it would be nice for scalars. I also take your point that the header and spatial transforms could be simplified, improving the Simplicity metric.

While HDF5 has some nice wrappers for some languages, the format itself rates very poorly on the Simplicity metric; I think there are clear criticisms of its complexity. This would introduce some of the same complications as TRAKO, without the space-efficiency benefits of TRAKO. It is odd to criticise the TRK format as complex when it is described in full on a short web page, and then advocate the HDF5 format.

@jdtournier thanks for the documentation. So the MATLAB read/write routines do reveal the full capability. TCK is a minimalistic, elegant format that certainly hits your Simplicity metric, but I can see why some users feel it is too limited for their uses.

It sounds like real progress is being made on identifying the features that are desired and that would be worth the cost of implementing a new format.

@francopestilli
Author

One additional comment here @MarcCote , @frheault and @jchoude

what is the status of this work?
https://www.sciencedirect.com/science/article/abs/pii/S1053811914010635

Should we discuss that here?

@francopestilli
Author

francopestilli commented Aug 8, 2020

Hi @Garyfallidis, can you please invite more folks here to pitch in? The document is still pretty much an empty slate, so we can work together on crafting what we need, right?

@Garyfallidis
Member

Will invite more people after Wednesday, as I am working on a grant deadline right now. The others should be happy to invite more people too. For now I would remove any reference to specific software from the document, to show that you want to hear other voices too; use generic header names, etc. Also, let's forget about backward compatibility with older file formats. This discussion is for a new file format; it does not need to be backwards compatible with anything. If you want to upgrade an existing format, then that discussion should be on the specific project's forum and not here.

@Lestropie

The document as it stands at this moment is pretty .tck-heavy. But this is precisely what I had hoped such a document would expose: that different people were getting increasingly confident about increasingly discordant ideas. What the document needs is clear separation between details on existing formats, a list of desirable features of a new format, and then different conceptual designs each described in full under their own heading. Some potential designs are fundamentally different to others from the very outset and so can't be resolved by debating the minutiae; their proponents need the opportunity to present the idea in full without corrupting independent discussions in the process.

@neurolabusc

@Garyfallidis I have no intention to co-opt this topic. Anyone is free to introduce other ideas. I also agree that we should not be beholden to existing formats; I simply described how I think tck could be extended. I have no prior experience with this format, but it does have some nice properties. @frankyeh espoused working with it. @frheault noted the benefits of having a unified header at the start of the file (which tck has), and that fixed-structure formats like TRK are hard to extend. I am perfectly happy if others want to present different conceptual designs. Indeed, going into this I had felt pretty confident that an offset-based format would prove superior to the restart-based method for storing data used by TCK/TRK. However, to my surprise, my own implementation of such a scheme did not noticeably outperform TCK. So while I am an advocate of treating this as a clean-sheet design, let's extend existing methods if they can be adapted to suit our goals.

I am not beholden to anything I wrote in that Google document. I tried to outline the motivation at the start, which covered the discussion above as seen from my perspective. I then made a concrete attempt to extend TCK for this purpose. This was interesting, as I had not really thought about the fact that the current TCK/TSF files are not explicit regarding the size of the binary data. So at a minimum, extending this format does seem to require additional tags.

I am happy for others to describe completely different formats in that document, or to openly describe the limitations they see with the format I describe. Any format is a compromise, and concrete examples and frank discussion will help the community decide what they need/want.

@neurolabusc

neurolabusc commented Aug 8, 2020

@Lestropie I have updated the document so that it only really describes TCK in the section where I propose how TCK could be extended. Originally, I had simply copied @frankyeh's comment preferring TCK, and I do think this colored the tone to suggest that was the only format being considered. I have changed this to note that some prefer using existing formats to the full extent possible. The remaining references to TCK are my direct copies of @jdtournier's description of the metrics we should look for, where he uses TCK as a reference. I am done saying what I want to in the document, and have done my best to simply describe how TCK could be extended as one possible concrete solution. If anyone wants to help make the document sound more balanced, I am happy for them to do so. Any first draft is always biased by the first author's perspective.

I urge anyone who wants to describe an alternative format to do so. Likewise, if anyone wants to criticize my proposed solution, they should feel free to do so. I added a brief section where I tried to discuss the weaknesses of the format, which I hope helps it read as a more balanced suggestion rather than a firm proposal. I have had no involvement with MRtrix, so my suggestions may seem alien to the developers of that tool.

@jdtournier

I agree that the discussion should not focus on extending a current file format, whether tck or anything else. Furthermore, even if for some reason there is a decision to extend the tck format (which I'm not sure I would support, personally - I think we can do better), I don't think it would be wise to call it that - it would be a different beast, and we'd likely want to make a clean break with past formats (regardless of any underlying similarities).

On top of that, I'm keen to avoid any perception of bias towards any particular software package, and for that reason alone, I think the discussion should steer clear of suggesting extending existing formats. What might be valuable though is listing the features of the various formats currently available, note similarities between formats, and highlight pros & cons of these different features. For example, I think it's important to talk about the pros & cons of a fixed vs extensible header, and note which existing formats support which of these, but the discussion should be about whether we want a fixed or an extensible header, not about extending whichever format happens to currently support that. Even within that discussion, there are further subtleties worthy of discussion that might get overlooked if we focus on existing formats - for example, if we want an extensible header, how should it be stored? key: value pairs? XML? JSON? YAML? Some other DICOM-like binary format...? (only kidding, @neurolabusc!)

So I reckon if we structure that document in terms of:

  • Overall principles (what I had tried to articulate in my original post)
  • Desired technical features (what @francopestilli's list was building towards)
  • Actual Proposal

Then there's actually no reason to mention any particular existing format. It might be helpful to add a section about existing formats, in which we can outline which formats support which features, but I'm not convinced it adds much to the proposal.

And finally, yes, there's no question that we do need wider community buy-in than the few of us who have already participated in the discussion. On that note, would it be an idea to delay this discussion until after the ISMRM, so there's a chance we might have everyone's attention?

@jdtournier

Just one more thought if I may.

I've raised this before, and @Lestropie has also re-iterated the point: in many ways these datasets might be more easily handled as a set of separate files, ideally co-located within a folder - at least conceptually. But ideally, we also want a single file format to maintain integrity, avoid confusion, reduce scope for errors, etc. I think there's a way to have the best of both worlds...

It seems to me what we need is a format for an archive file: a way of storing multiple independent files (corresponding to the header, streamlines data, additional data tables, etc) in the same file. We could devise our own simple container format for that, but I think there's an existing, widely-used option that combines a lot of what we might be after: the good old ZIP format...

This may sound a bit far-fetched, but this is why I think it might be a good idea:

  1. This is how Microsoft stores .docx (and .pptx, etc) documents, using the Office Open XML specification (and similarly OpenDocument / .odt for LibreOffice, or whatever it's called these days) - this is actually what made me think of this. You can extract these documents as a regular ZIP archive and rummage inside the internals of the data. It's actually been really useful on quite a few occasions when I've needed to grab an image from a presentation or document - you can extract the archive, look in the media folder, and trivially find what you need. You'll note they use this format for similar reasons to us: it's like taking a LaTeX document (lots of files) and encapsulating the lot in a single container, while maintaining the file structure.

  2. It's possible to store ZIP files without compression. I've just had a crack at that, and I can see the file data is available as-is within the raw archive. This means it's directly compatible with memory-mapping or rapid binary read/write if we want to. And since it's got a table of contents, it's compatible with random access, etc.

  3. It's directly and trivially compatible with compression. And because each file is compressed independently, it's still simple to list the contents without any actual decompression required. I'm not convinced compression will make much of a difference for these types of data, but at least it's an option that is directly supported within the ZIP spec. Note also that use of compression immediately gets in the way of random access and memory-mapping.

  4. It's easy to inspect and manipulate without dedicated tools (the main benefit from my point of view). If they want to, users can extract the files, modify the contents in place, and ZIP them up when they're done. Some software packages may also choose to operate on the expanded file structure, and users would lose no interoperability since every OS supports creating a ZIP archive.

  5. This leaves us free to store the different tables whichever way we want to. Importantly, this means we can make sure each raw data file (whether streamlines, additional data, etc) is stored using the absolute simplest format we can think of, preferably as raw, header-less, memory-mappable, pure binary data - something that any developer can handle straight away on any platform, without needing to parse anything more than strictly necessary. And headers can be stored as text using whichever format we might settle on.

Why it might not be a good idea:

  1. It's a bit heavier as a container format than we might otherwise use. There's a bit of redundancy in it, and it stores things we might not otherwise care for, like time stamps, etc. That said, we might actually find these things useful in their own right anyway.

  2. Offsets into file contents require a little bit more work to get at (you get an offset into the file's local header, and you need to parse this local header to get the offset to the data itself). However, I suspect any decent ZIP library will give you all the information you need anyway, so it's not likely to be a burden on the developers.

  3. It's not as trivial to code up as we might otherwise like, so developers would be advised to use a third-party library. Thankfully, support already exists in Python, MatLab, and C/C++, amongst others. And we would need third-party libraries to support compression in any case (though many of us will already support gzip compression).

  4. I don't think we can guarantee word alignment - at least I can't find any information about it. This may have performance implications - though in my experience you really have to try very hard to detect much of a difference in practice. If others have different experience with this, I'd like to hear about it.

Anyway, just a thought. I thought I'd share it while it's fresh in my mind...
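For what it's worth, a quick sketch of the idea with Python's zipfile (the member names are hypothetical; the key detail is ZIP_STORED, which keeps each binary blob uncompressed and byte-identical inside the archive):

import json
import zipfile
import numpy as np

positions = np.random.rand(1000, 3).astype(np.float32)
offsets = np.array([0, 400, 750], dtype=np.uint64)

# Write header + raw binary arrays, uncompressed.
with zipfile.ZipFile('tractogram.zip', 'w', compression=zipfile.ZIP_STORED) as zf:
    zf.writestr('header.json', json.dumps({'streamline_count': len(offsets)}))
    zf.writestr('positions.float32', positions.tobytes())
    zf.writestr('offsets.uint64', offsets.tobytes())

# Listing and reading back never requires decompression.
with zipfile.ZipFile('tractogram.zip') as zf:
    print(zf.namelist())
    pos = np.frombuffer(zf.read('positions.float32'), dtype=np.float32).reshape(-1, 3)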

@neurolabusc

neurolabusc commented Aug 11, 2020

Sounds like a good plan. Several NiBabel users have noted very slow tractography load times. In this thread, members of the DIPY team (e.g. @frheault) have suggested investing time into supporting formats that allow random access, memory mapping, and efficient storage of ragged arrays. I have written a simple Python script that allows TCK files to be loaded with precisely these properties. This exploits a feature of TCK that does not exist in other formats (TRK, BFloat, NIML, etc): both the vertex positions and the end-of-streamline signals are saved using an identical number of bytes on disk. Therefore, one can map the vertices directly from disk to memory. It is simple to generate 1D arrays that track the first and last vertex associated with each streamline. This allows the DIPY developers to use existing TCK files as they develop efficient methods for random access, rapidly masking a portion of streamlines, efficient memory usage, etc. This does not end the discussion of a new format; others have noted the limited features of TCK. However, it could help DIPY developers experiment with these features, and help DIPY users in the short term with their existing TCK datasets during the interim while a new format is developed.

The MATLAB and Python code are here; for testing I used the 869 MB TCK file @soichih describes, on a Linux Ryzen 3900X with 64 GB of RAM (all tests generated similar results for ramdisk and SSD). Native code loaded the file in under a second; the MATLAB read_mrtrix_tracks.m function required 7.7 seconds to load the file.

./read_nibabel.py track.tck
track.tck loaded in 33.43 seconds
    
./read_mrtrix_tracks.py track.tck
track.tck loaded in 0.73 seconds
streamlines 675000 vertices (72386074, 3)
first line has 100 vertices X Y Z:
 first [-11.223304 -56.063725 -25.203579]
 final [-26.574024 -41.0987   -30.864029]
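For reference, the core of the memory-mapped read can be sketched in a few lines (assuming a Float32LE TCK file and that the header has already been parsed to obtain the byte offset given by its file: field; the function and variable names are illustrative):

import numpy as np

def memmap_tck_streamlines(path, data_offset):
    triplets = np.memmap(path, dtype='<f4', mode='r', offset=data_offset).reshape(-1, 3)
    # NaN triplets separate streamlines; an Inf triplet terminates the file.
    breaks = np.flatnonzero(~np.isfinite(triplets[:, 0]))
    starts = np.concatenate(([0], breaks[:-1] + 1))
    # Each streamline is a zero-copy view into the memory-mapped file.
    return [triplets[s:e] for s, e in zip(starts, breaks) if e > s]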

@frheault

frheault commented Aug 11, 2020

@jdtournier Do you think that the features I laid out in 4.2 hypothetical format #1 (.tgy) of the GDocs documents could be applied to your suggestion? I really like the simplicity of a file format that is actually a zip container. To me, that's a simpler equivalent to the ASDF container I mentioned earlier, but simpler is better, so I like your suggestion more.

It also makes data streaming much easier since each "memmap" is separate and the different files simply have to follow a naming convention based on the header (a JSON file or equivalent). This is very simple and intuitive, and people that know it is a zip can play inside (but this should remain "rare", just like most people are not aware docx is a zip file).

If we still require strict coherence between files (the length declared in the header is respected, the datatype is checked, etc.), I think this could be a much better approach.

PS: If you like the features in 4.2, we could copy it into 4.3 hypothetical format #2 (.tgy) and slightly modify the comments to reflect the pros/cons you listed.

@jdtournier

jdtournier commented Aug 11, 2020

Do you think that the features I laid out in 4.2 hypothetical format #1 (.tgy) of the GDocs documents could be applied to your suggestion?

I have to admit I'd not had the time to go through that document in detail - but it seems to reiterate many of the themes discussed so far.

I've had a look at your section 4.2, and I really can't see any reason why there would be any incompatibility. As long as there's agreement that the different independent pieces of information will be stored as independent, separate entities, then ZIP will probably work just as well as any other container format that we might propose. Incompatibility would only be an issue if we were interested in using multiplexing containers to interleave the different types of information (as used in multimedia formats), which would allow data streaming - but I don't see that happening based on the current discussion. I don't think using ZIP as the container format influences anything on your list (other than potentially word alignment, if that's deemed important), with the added benefits that:

  • entities are inherently named
  • entities can be organised in a hierarchical structure using folders (for instance, we could define a standard folder to contain all per-vertex data, another for all per-streamline data, etc).
  • are easily manipulated by all kinds of ZIP archive managers - including Windows Explorer, etc.

On that last point, I might disagree with your comment:

people that know it is a zip can play inside (but this should remain "rare", just like most people are not aware docx is a zip file)

I actually think the opposite: this should be made very clear, and users should be encouraged to use this as required. For example, I may have explored dozens of different statistical hypotheses on my tractogram and generated per-vertex t-statistics, z-scores, and FWE-corrected p-values for each of these hypotheses. At some point, I may want to clean all this up, and remove unnecessary tables. We could provide tools to do this, but since it's a ZIP archive, the tools already exist - and probably do the job better than we would.

PS: If you like the features in 4.2, we could copy it into 4.3 hypothetical format #2 (.tgy) and slightly modify the comments to reflect the pros/cons you listed.

You could, but frankly I think it's so compatible it doesn't need its own section. It's a matter of choosing the container format to store the data you mocked up, which is a discussion that's actually missing from that section anyway. ZIP can be mentioned as one way amongst others, there's many ways to concatenate all of these bits of information (as long as the decision is to avoid multiplexing, which I expect will be the case).

One discussion though will relate to how the header is organised and in what format (e.g. JSON, YAML, etc), and where any header information specific to additional data should be stored. E.g. if we want to have a mini-header for the grouping table, should that reside in the main header (which in a ZIP archive, would be a file in its own right), within the grouping table data file itself, or as an additional header file that can easily be identified as specific to the grouping table? Personally, I would tend towards the latter option, but this is all up for discussion - whether we use ZIP or not.

On a different note: where does the extension .tgy come from?

@frheault

frheault commented Aug 11, 2020

@jdtournier Perfect, I won't create a new section. It is true that they are pretty compatible.
I did some tests in Python and it is fairly easy to implement with the zipfile/numpy/json libraries that everyone has.

About the .tgy extension it was simply to put emphasis on "new file format" rather than re-using an old one. It was a suggestion at the same time as a placeholder (tractography -> tgy).

Also, about playing inside the zip: as long as it remains coherent (data-wise and header-wise), it is true that people can do what they want in a file explorer or with code. We should simply plan ahead and make a convention that will prevent "accidents". Something like a filename convention, with data-per-streamline being .dps, data-per-point .dpp and data-per-group .dpg, so the name and type are known instantly, would probably be enough.

@jdtournier

I've added a few comments to the Google document - see what you think. I hope the discussion doesn't get too fragmented with these disparate discussions...

We should simply plan ahead and make a convention that will prevent "accidents". Something like a filename convention, with data-per-streamline being .dps, data-per-point .dpp and data-per-group .dpg, so the name and type are known instantly, would probably be enough.

Yes, such conventions would be needed, and they'd need to be very specific.

On that note, if I can make a suggestion regarding the naming conventions. I've raised the issue of where & how to store the metadata for the additional per-vertex or per-streamline data (or any other data). I think this needs to be stored alongside the additional data themselves somehow, to allow for these data to be easily appended to the file as the need arises. But at the same time, I think if we opt for a clear container format like ZIP, it would make sense to separate metadata and actual data into distinct pure text and pure binary blobs/files respectively. As long as there are clear naming conventions, the metadata for any structure can trivially be located, and even the most basic reader can trivially load the binary data into their own data arrays with little effort.

But depending on what kind of information these mini-headers need to store, we might be able to avoid the need for them altogether. I suspect all we need to know is a human-readable label, the type (per-streamline, per-vertex, per-group, etc), the data type (presumably float16/32/64, int8/16/32/64, and uint8/16/32/64), and the number of items per vertex/streamline/group. I reckon this can all be stored using appropriate naming conventions, in combination with the (known) number of vertices/streamlines/groups: we could have folders to denote the type (e.g. your dpp, dps, etc, or something more explicit), the main part of the filename as the label, and the file extension as the data type. For example, dpp/fa.f32 would be interpreted as containing one or more samples per point in float32 format, and the number of samples per point would be given by the file size divided by (sizeof(float32) * number of points in tractogram). All of the required information is there. The downside is if we need to allow additional information to be stored, we'd need support for that. It could be that we simply make the mini-headers I suggested above optional (e.g. if a file dpp/fa.txt is present, it's understood to correspond to the additional metadata for dpp/fa.f32).
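A sketch of how little code that convention asks of a reader (the extension map and the function are hypothetical, following the dpp/fa.f32 example above):

import zipfile
import numpy as np

DTYPES = {'f16': np.float16, 'f32': np.float32, 'f64': np.float64,
          'i8': np.int8, 'i32': np.int32, 'u8': np.uint8, 'u32': np.uint32}

def load_per_vertex(archive, member, n_vertices):
    # e.g. member = 'dpp/fa.f32': per-vertex float32 values named 'fa'.
    dtype = np.dtype(DTYPES[member.rsplit('.', 1)[1]])
    with zipfile.ZipFile(archive) as zf:
        raw = zf.read(member)
    # The number of values per vertex falls out of the file size.
    per_vertex = len(raw) // (dtype.itemsize * n_vertices)
    return np.frombuffer(raw, dtype=dtype).reshape(n_vertices, per_vertex)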

This is just a thought, and I may be going too far in trying to keep everything as ridiculously simple as possible, but with the ZIP format, we have the opportunity to settle on a file structure whose organisation is really obvious and transparent to everyone. If we can minimise the amount of guesswork / poring over pages of documentation people have to do when trying to parse the data, that's a great thing, IMO.

About the .tgy extension it was simply to put emphasis on "new file format" rather than re-using an old one. It was a suggestion at the same time as a placeholder (tractography -> tgy).

OK, I found the original comment where this first came up, no worries. Though personally, I'm tempted by something like .trx - would stand for tractography exchange, and would naturally be pronounced as 'tracks'... As far as I can tell, currently the extension is only used for Visual Studio test results, hopefully that's a sufficiently different context to avoid clashes. But I appreciate it's not exactly important at this stage...

@frheault

frheault commented Aug 11, 2020

@jdtournier choosing the file extension is the hardest part; .trx looks good (though I prefer pronouncing it 'T-Rex' ;) )

I like your idea of having the filename convention tell us most of the information.
Even the grouping could be done with something as simple as groups/bundle_cst.indices or groups/pairs_16_2035.indices.
I think 0.1-0.5 seconds to go through the directory tree of the zip file and parse it is low enough for the I/O.

This is a good direction; we have broad agreement on the features and the general way to organize things.

Side note for @arokem @Garyfallidis @MarcCote: in Python, I am currently trying to implement a simple organization like @jdtournier suggested, supporting the features I listed, and it works pretty easily with zipfile/numpy/json (super easy, in fact, with zipfile in Python >= 3.8). It is very fast, easily reconstructs a StatefulTractogram, and the code is quite readable.
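
For anyone curious what that looks like in practice, here is a minimal sketch along those lines (the member names header.json and positions.float32 are hypothetical placeholders, and this is not the prototype code itself):

import json
import zipfile
import numpy as np

def peek_tractogram(path):
    # Read the JSON header and the flat position array straight out of the ZIP.
    with zipfile.ZipFile(path) as zf:
        header = json.loads(zf.read("header.json"))
        positions = np.frombuffer(zf.read("positions.float32"),
                                  dtype=np.float32).reshape(-1, 3)
    return header, positions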

@jdtournier

@frheault, glad you're on board - but let's make sure all the relevant stakeholders are on board before taking this much further. We don't want anyone getting the impression decisions have already been taken before they've had a chance to respond!

Even the grouping could be done with something as simple as groups/bundle_cst.indices or groups/pairs_16_2035.indices.

Exactly - though I'd make sure the file extension used is one that has been agreed and unambiguously identifies the data type (.u32, short for uint32, would seem appropriate here).

I think 0.1-0.5 seconds to go through the directory tree of the zip file and parse it is low enough for the I/O.

Would it really take that long...? I expect the directory listing would be read within a matter of milliseconds! You should only need to read the ZIP file's central directory, which is the very last part of the file, is self-contained, and should fit within ~1 kB or so given we're not talking about too many files here.
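
A quick way to convince oneself of this (using a hypothetical archive name):

import time
import zipfile

start = time.perf_counter()
with zipfile.ZipFile("example.trx") as zf:  # hypothetical file
    members = zf.namelist()  # only the central directory at the end of the file is read
print(f"{len(members)} members listed in {time.perf_counter() - start:.4f} s")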

@francopestilli
Author

thanks @jdtournier @frheault! I invited others to pitch in (via email); hopefully they will. It is summer on a COVID-19-infested Earth, somehow. But I am very impressed with the discussion here. I understand many will have to pitch back in before we can start organizing the community's thoughts. Thanks for all the contributions!
@soichih @dPys @sebastientourbier @eduff @ssothro

@frheault

@mdesco is on vacation right now, but I updated him on the side. The features I listed would meet his expectations; if we can achieve what I listed in section 4.2, I think it is safe to say he would be on board.

@jdtournier Yeah, that's true, I forgot ZIP does some magic with its own header. +1 for putting the datatype directly in the indices filename.

Anyone new should focus on the features listed in https://docs.google.com/document/d/1GOOlG42rB7dlJizu2RfaF5XNj_pIaVl_6rtBSUhsgbE/edit

The implementation @jdtournier and I (@frheault) are talking about is mainly related to the features in section 4.2.
The focus should be on features first; if something crucial is missing, it's important to write it down somewhere. Implementation details will depend on features, not the other way around.

@frheault

frheault commented Aug 20, 2020

Hello again, just to re-start some form of discussion about the potential new file format.
In order to make sure the list of features was achievable (in Python) using the nomenclature proposed by @neurolabusc and @jdtournier, I wrote some code that loads/saves the trx files present in the attached archive. I can load -> save -> convert them to tck/trk.

For now it is quite rudimentary, but I think the suggested features could be achieved (at least in Python).
(Only medium.trx contains groups and data_per_group.)

I think this could work pretty well as a format. However, how much of it is supported would be left to each library: if a library does not want to implement any of the advanced features, it is easy to simply read everything into memory and get the streamlines and auxiliary data. Achieving all the features robustly obviously requires more work.

But I can see pretty clearly the list of things to do to achieve the features. However, I don't think it's going to be easy to make them compatible with the nibabel streamlines API (i.e. the append mode 'a' will likely require temporary files, zip or not, because of the memmaps; and a compressed zip will have to be extracted before any reading/writing/appending, which again means temporary files).

Overall, I would say it is a good nomenclature, and the ZIP idea is working pretty well in my early tests. I had to try it out, just in case I was 'accepting' design suggestions that could not be achieved in Python.

Also, for @arokem, @MarcCote and @Garyfallidis: in terms of speed/size (compared to nibabel loading the trk file), without any specific optimization the speedup was a factor of more than 100 for reading fully into memory (8-9x if compressed) and 8-9x for writing from memory (6-7x if compressed). When switching to float16 for the data and uint8 for color, the size reduction factor was 3 (5 if compressed). So this is an interesting start.
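
For readers wondering where the size reduction comes from, here is a rough illustration with hypothetical arrays (not the benchmark code itself):

import numpy as np

positions = np.random.rand(100_000, 3).astype(np.float32) * 200 - 100  # mm
colors = np.random.rand(100_000, 3).astype(np.float32)                 # 0..1

positions_f16 = positions.astype(np.float16)         # half the bytes per coordinate
colors_u8 = np.round(colors * 255).astype(np.uint8)  # a quarter of the bytes per channel

print(positions.nbytes / positions_f16.nbytes)  # 2.0
print(colors.nbytes / colors_u8.nbytes)         # 4.0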

Please do not judge my code, it's only two afternoons' worth of work... But it works; most of it is there to 'parse' the nomenclature.
Code:
https://github.com/frheault/nibabel/blob/trx_testing/nibabel/streamlines/trx.py
Data:
https://drive.google.com/file/d/1W41Ys1sVjFOJlp4BvjGs4jR04J1xGvKl/view?usp=sharing

@neurolabusc

neurolabusc commented Aug 21, 2020

@frheault this looks nice. The code is clean and, to a large degree, the format is self-documenting. I have only a few minor comments, mostly based on my personal style. Take these as a sign of my enthusiasm, not as fatal flaws:

  • data.float16: I think "position.float16" is more descriptive, as this array only contains vertex position data, and all attributes are forms of data.
  • color_x.uint8/color_y.uint8/color_z.uint8: Any reason why vertex position is saved interleaved (AoS) while color is saved as separate planes (SoA)? I can understand the appeal of the planar layout. But it does seem we should be consistent. My own preference would be to define rgb.uint8 and rgba.uint8. Alternatively, to be consistent, shouldn't we store data_x, data_y, data_z? (A small conversion sketch follows this list.)
  • lengths.uint32 and offsets.uint64 are redundant. Given disk performance, I would have thought you would simply save offsets and compute lengths from them. This also removes confusion regarding whether lengths refers to length in world space or to the number of vertices in the streamline. For formats, I dislike redundancy, not only for file size but also for conflict resolution (e.g. which gets precedence; consider the competing s-form and q-form in NIfTI).
  • offsets.uint64: since arrays are always indexed from zero, the first element here is always 0, yet the current storage of offsets does not reveal the index of the last vertex; one needs to infer it from the header (nbr_streamlines) or from the size and data type of the position file. My own preference would be to store the index of the last vertex of each streamline, so the array is fully self-contained (since the first streamline always starts at 0).
  • With regards to performance, it's great that it is fast. I assume the baseline is the current NiBabel routines. If this new format is adopted (potentially with changes to NiBabel's internal structure), I would suggest that support for the older, more limited but still popular formats gets optimized as well, for example with accelerated Python code for TRK and TCK. There is a strong rationale for this format beyond performance, and a clean format like this will enhance performance; however, comparison to the current poorly-tuned NiBabel readers is a straw man. My suspicion is that the internal changes to DiPy needed to support random access and memory mapping can be leveraged to improve the other formats as well, so this is a win-win for both old and new.
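
As a purely illustrative aside (hypothetical arrays, unrelated to any prototype), converting between the two color layouts discussed above is a one-liner in either direction:

import numpy as np

n = 1000  # hypothetical number of vertices
color_x = np.random.randint(0, 256, n, dtype=np.uint8)
color_y = np.random.randint(0, 256, n, dtype=np.uint8)
color_z = np.random.randint(0, 256, n, dtype=np.uint8)

rgb = np.stack([color_x, color_y, color_z], axis=-1)  # three planes -> one n x 3 rgb.uint8 array
x2, y2, z2 = rgb[:, 0], rgb[:, 1], rgb[:, 2]          # and back again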

@frheault

@neurolabusc Thanks for the input! I am planning on adding a first draft of a memory-friendly concatenate today!

  • It's true that data is not the best name; position is much better, and there is also vertices. I will change it to position for clarity, but if anyone has a strong opinion on that name, please tell me.
  • The color_x/y/z was an example that is friendly with MI-Brain, which reads that convention to display colors changing along streamlines for trk. I agree 100% that it would be much better to have an array of nbr_points x 3 unraveled and saved, but I wanted something that mapped directly to trk and MI-Brain for testing. (And an array of nbr_streamlines x 3 for uniform color along streamlines.)
  • I am currently working on that. @Garyfallidis requested something related to lengths, and since I wanted a 1-to-1 mapping with the ArraySequence of nibabel I used it like that. I will likely remove that data, since it is (as you said) obtainable from offsets in a very short time.
  • I agree with the idea; my only concern is that offset is a word with a definition, and by definition the first offset is zero and the last one is the start of the last element. Removing the first one and adding nbr_of_points at the end would not be an offsets array, or would it? Also, the offsets of the ArraySequence in Nibabel, and in VTK, numpy, OpenGL buffers, etc., respect that convention (true offsets, starting at 0). I would follow the most well-known convention and rely on the header for the true size (I have a check for the coherence between the data size and the declared size).
  • Unless there is a well-known, self-describing word for what you mentioned? Is there one, so we don't confuse true offsets with end-of-each-element offsets?
  • Optimization of older formats would be left to the library, but yes, it would be nice if the speed difference could be smaller (at least in Python). Thanks for the optimized code, I will look into it. That's a nice random access feature, but I doubt those formats can be memory-mapped since the data is interleaved (with dpp and dps for trk, and with flags for tck).

@neurolabusc

neurolabusc commented Aug 21, 2020

  • position is explicit regarding what is stored. vertices can have many properties (position, color, normal, etc.)
  • If we are putting the effort into a new format, there is some rationale for being consistent. My own preference would be for rgb, not color_x, color_y, color_z. To play devil's advocate, for colors one presumes that RRR..RGGG..GBBB..B will almost always lead to better compression with the deflate algorithm than RGBRGBRGB..RGB; a nice example of the benefit of byte-interleaving is CTM and BLOSC coupled with deflate. This comes down to size efficiency versus simplicity/consistency. For example, one would get better compression by shuffling the offsets.uint64, as the MSB changes much less than the LSB, but I doubt many here would advocate for that.
  • One option is to store nbr_streamlines+1 values in the offsets array. The first value in the array is 0, the final value is nbr_points. Presumably, this is how most tools would load the data anyway, to avoid range-checking errors. The nth streamline begins at offset[n] and ends at offset[n+1]-1 (a small slicing sketch follows this list). This makes the array self-contained, explicitly solving the fence-post problem. To see the usefulness, note how this satisfies @Garyfallidis' desire for line lengths without requiring the space or time associated with disk storage, while leveraging numpy's built-in vectorization speed. For this example, consider four streamlines with a total of 239 vertices:
import numpy as np
offsets = np.array([0, 22, 95, 151, 239])
lengths = offsets[1:]-offsets[0:-1]
  • Note that my solution for TCK allows memory mapping of the vertex positions. While the TCK data is interleaved, it does not need to be treated as interleaved. This leverages an unusual property of TCK: the primitive restart markers have the same size as the position triplets, so alignment is preserved. As long as you retain the same layout for vertex positions in your new format, DiPy could treat both TCK and TRX as memory-mapped in the same way (albeit, with the start offsets, line i spans offset[i] to offset[i+1]-2, whereas for pure position storage the range is offset[i]..offset[i+1]-1).
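
Continuing that example with hypothetical data, accessing a single streamline then needs no stored lengths at all:

import numpy as np

offsets = np.array([0, 22, 95, 151, 239])              # nbr_streamlines + 1 values
positions = np.random.rand(239, 3).astype(np.float32)  # all vertices, flat

n = 2
streamline_n = positions[offsets[n]:offsets[n + 1]]     # the 56 vertices of streamline 2
lengths = np.diff(offsets)                              # [22, 73, 56, 88]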

@dPys

dPys commented Aug 23, 2020

I'm a bit late to this thread, but just wanted to pitch what I personally feel are important targets for a new format (or an upgrade to an existing one):

  1. As @jdtournier, @effigies, and others have already reiterated, random-access support with minimal I/O. Compression is important too, but I'd personally prioritize the former (i.e. reducing bandwidth for batch processing > saving finalized streamlines more compactly to disk).
  2. Support for seamless spatial transformation and stateful representation.
  3. Intuitive metadata handling at any scale (vertices, streamlines, bundles). Frankly, I really like the ArraySequence semantics of Nibabel/DiPy. If people think it'd be scalable enough, maybe just building upon what's already been achieved in this area would help? Otherwise, more generic provenance info / tracking metadata could probably just be relegated to .json.
  4. Low overhead for incorporating whatever format we decide on into existing applications and workflows
  5. Broad support across platforms and languages. Although in some sense a generic format like HDF5 meets this need exactly, it comes with its own limitations, as others have rightly noted. In my experience, HDF5 may be better suited for memory-intensive odds-and-ends of tractography (e.g. storing and reading reconstructions, especially with parallelized tracking).

My two cents,
@dPys

@frheault

frheault commented Nov 13, 2020

Hello everyone, just to keep the subject alive: I did a small technological survey and tried out different approaches.
The two main contenders were using the Zarr library, suggested by @arokem, and using a file-structure-based organization of memmaps (possibly inside a ZIP file), suggested by @jdtournier. I also tried, on the side, a single gigantic memmap with a fixed-size header, as well as the ASDF library, but the latter ended up being way too complex and counter-intuitive (as well as not respecting a few key "demands").

My preference is for the memmap approach inside a ZIP file, where the architecture is self-explanatory.
I implemented basic functions to support loading/saving/appending/concatenating/copying in Python to showcase the specification shown in the README.
This choice respects the majority of the key requests while being easy enough to maintain/expand in all programming languages.
https://github.com/frheault/tractography_file_format
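
To make the memmap-inside-ZIP idea concrete for anyone following along, here is a minimal sketch (not the prototype linked above); it assumes the member is stored uncompressed, and the file/member names are hypothetical:

import struct
import zipfile
import numpy as np

def memmap_zip_member(path, member, dtype):
    """Memory-map a STORED (uncompressed) member of a ZIP archive as a flat numpy array."""
    with zipfile.ZipFile(path) as zf:
        info = zf.getinfo(member)
        if info.compress_type != zipfile.ZIP_STORED:
            raise ValueError("member must be stored uncompressed to be memory-mapped")
    with open(path, "rb") as f:
        f.seek(info.header_offset)
        local_header = f.read(30)  # fixed-size part of the local file header
        name_len, extra_len = struct.unpack("<HH", local_header[26:30])
    data_offset = info.header_offset + 30 + name_len + extra_len
    count = info.file_size // np.dtype(dtype).itemsize
    return np.memmap(path, dtype=dtype, mode="r", offset=data_offset, shape=(count,))

# e.g. positions = memmap_zip_member("example.trx", "positions.float32", np.float32)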

I propose that we move the discussion to this thread to disentangle the subject from Nibabel and focus on the specifications.
tee-ar-ex/trx-python#1

This is still ongoing work, and there is still discussion to be had, but it is a start. It is important to remember that my implementation is simply to test the idea, not an actual final/usable version. If anyone has ideas that they coded in the past 1-2 months related to this thread, I would be happy to modify the (fresh/empty) GitHub repository to include them.
@neurolabusc @MarcCote @Garyfallidis @frankyeh @mdesco @francopestilli @Lestropie

@skoudoro
Member

Hi Everyone,

This Wednesday, @frheault will lead a discussion about the tractography file format at the DIPY online meeting.

For more information: dipy/dipy#2229 (comment)

Feel free to join in!

@francopestilli
Author

@skoudoro great!

@jchoude

jchoude commented Dec 2, 2020

Hi all,

I know I'm late to the party, even though @frheault bugged me multiple times about it.

I'm still going through the whole thread and the other related documents, but I still want to say that, to me:

  • ease of extensibility (adding various data per point / streamline, etc.) is important in multiple academic and industrial use cases
  • the ability to directly read a portion of the streamlines, without loading everything and all the associated properties, is extremely useful as well (we often track once, post-process N times)
  • if coordinates are stored in real space, storing the affine from the "original" source is not mandatory, but it is really helpful in multiple real-world use cases. It also enables additional validity checks, for example to ensure that users are not trying to relate two distinct datasets.

Disclaimer: some of you know that @frheault and I come from the same lab; I'm just disclosing it so that if my comments agree with his, it isn't seen as a conflict of interest. He did all of this on his own, based only on the knowledge and experience that we have built through the years.

Thanks everyone for all the hard work, I think this will help alleviate some recurring issues we have had over the years!

@baranaydogan

Hi,

Unfortunately, I saw these discussions very late and missed the Dec 2 meeting. Hopefully, it was fruitful.

I believe a new tractogram format addressing the needs of the community would be highly useful. Thanks to all who have been active in this so far.

I would be happy to contribute to the discussions and development as well.

@francopestilli
Author

Hi @baranaydogan, I am very sorry you missed this. It would be great to have you interact on this topic.
