Tractography Data Format #942
Hi - I am interested in this. I have worked on the tractography tools in AFNI (with Ziad Saad). I imagine FSL developers would be interested, as would Frank Yeh @frankyeh of DSI-Studio and the developers of MRtrix3. Thanks,
Yep! We should have this as an open discussion.
I would love to see a new format standard. The TRK file has given me a lot of headaches and limited many possible extensions. DSI Studio will surely support any open standard for tractography.
While I don't necessarily think that a new tractography data format must be constrained by the NiBabel API, @MarcCote did put a fair bit of time and thought into the streamlines API. It might be worth thinking through whether this API is sufficient or if it's missing something. Part of my thinking is that once a sufficient API is settled on, we can turn that around to quickly prototype more-or-less efficient implementations. For what it's worth, I recently explored the API and tried to summarize it in the following:
Trying to include a few other potentially interested people: @bjeurissen @jdtournier @neurolabusc
@effigies -- it would be great to try that API with a demo. Some functionality we value is keeping track of tracts as bundles: if we put in N>2 targets, we often care about any pairwise connections amongst those as separate bundles, because we are basically using tractography to parcellate the WM skeleton. Does that labeling/identifying of groups of tracts exist there?
The FMRIB contacts I know are @eduff and @pauldmccarthy. They might be able to point us to the right people...
Before creating a new format, it might be worth considering the existing formats and seeing whether any could be enhanced or improved, similar to the way that NIfTI maintained Analyze compatibility while directly addressing its weaknesses. The popular formats seem to be:
Unlike triangulated meshes, tractography cannot benefit from indexing and stripping, so the existing formats all seem to be pretty similar (all describe node-to-node straight lines, not splines). I concur with @mrneont that it is nice to have the ability to describe tracks as bundles. I think this makes TRK the most attractive choice (though some other formats are not well documented, so they may also support this feature). Perhaps @frankyeh can expand on his frustrations with TRK. What are the limitations? Likewise, perhaps everyone (including @francopestilli who started this thread) can provide a wish list of desired features.
Hi all, and thank you for bringing this up. I have to say this is a recurrent topic; every one or two years it re-emerges. I suggest, before you do anything else, studying what is already developed in the existing API in NiBabel. Marc-Alex and I worked quite a bit to support all the basic needs: accessing bundles fast, adding properties, etc. Actually, we have already implemented a fast version that can load/save tracks to npz (a NumPy format), which you can use if you have big data. For me, the main decision that requires feedback from the community is the formatting technology. Do you want to save the end result using JSON, HDF5, glTF or something else? If we can decide on that, then we are set. The work to study previous formats is already mostly done, at least on my side. Nonetheless, see also a recent paper for a new format called TRAKO.
It is important to mention that no matter the file format, the main problems when it comes to a standard will remain. There are a thousand ways to write a TRK file wrong, and many tools both write it wrong and read it wrong in a way that still works within their own software. I think I was added due to my contribution to DIPY (StatefulTractogram). No matter the new format, I think that as long as people can have header attributes such as:
I think I will be happy. For example, in my own code I used @MarcCote's API to write an HDF5 format in which the length, offset, and data of one or multiple 'tractograms' are saved. So I can easily read any of these tractograms (I use it for connectomics; it could also be used for bundles), and one could achieve the same to read any particular streamline. But as long as the attributes listed earlier are available, anything can be done after that. Also, if a new format is added to DIPY and it is StatefulTractogram-friendly, it can easily be converted back to other commonly supported formats (TCK, TRK, VTK/FIB, DPY). If these efforts are made to allow more efficient reading for computation, I think there is no problem with supporting more formats. If the goal is to reduce confusion in how to write/read the format, I believe that a new format would never help. The unstructured (non-grid-like) nature of tractograms makes it harder, since the header and data are not fixed together when it comes to spatial coherence. PS: I personally think TRK is fine; everything is in the header. The problem is the variety of ways people can write/read it wrong, making support across tools and labs pretty difficult. However, I think a strict approach to reading/writing in DIPY was beneficial in the long term. Short term, sure, maybe half of the users hate me on some level, but I think a strict TRK (or at least a TCK always accompanied by the NIfTI that generated it) is superior to a lot of formats - just in terms of available info, not for large-scale computation, visualisation and fancy reading/writing.
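For illustration, here is a minimal sketch of such a length/offset/data layout in HDF5; the group and dataset names are hypothetical, not the actual layout from that code:

```python
import numpy as np
import h5py

streamlines = [np.random.rand(n, 3).astype(np.float32) for n in (12, 40, 7)]
lengths = np.array([len(s) for s in streamlines], dtype=np.int64)
offsets = np.concatenate(([0], np.cumsum(lengths)[:-1]))

with h5py.File("tractogram.h5", "w") as f:
    grp = f.create_group("bundle_AF_left")        # one group per 'tractogram'
    grp.create_dataset("data", data=np.vstack(streamlines))  # (N_points, 3)
    grp.create_dataset("offsets", data=offsets)
    grp.create_dataset("lengths", data=lengths)

with h5py.File("tractogram.h5", "r") as f:
    grp = f["bundle_AF_left"]
    i = 1                                         # read only streamline i
    o, l = grp["offsets"][i], grp["lengths"][i]
    sl = grp["data"][o:o + l]                     # partial read from disk
```

The offsets and lengths arrays are tiny compared to the point data, so reading them fully while slicing the data lazily gives cheap random access.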
At the danger of somewhat rerouting the conversation, I guess since this has come up, this might also be a good time to discuss whether it would be beneficial to "upstream" the

And to bring the discussion back to its start -- @francopestilli -- I am curious to hear: what needs are not currently addressed and prompted your original post here? Are you looking to store information that is not in the currently-supported formats? Or is there some limitation on performance that needs to be addressed?
Oh - just saw the mailing list messages and now understand where this originated (here: https://mail.python.org/pipermail/neuroimaging/2020-July/002161.html, and in other parts of that thread). Sorry: hard to stay on top of everything...
The 10-million tractogram mentioned in that original thread is quite a challenge, and I am not sure if a new file format would be the answer. Emanuele already achieved 1 min loading time with TRK (pretty impressive).
I'm looping Rob Smith (@Lestropie) into this conversation; this is something we've discussed many times in the past. I've not had a chance to look into all the details here, but here are just a few of my unsolicited thoughts on the matter, for what they're worth:
So that's my 2 cents on what I would like a standard tractography format to look like. You'll note that I've more or less described the tck format, and yes, there's a fairly obvious conflict of interest here... 😁 However, there are clearly features that the tck format doesn't support that others do, though I've not yet felt the need to use them.

The ability to group streamlines within a tractogram is interesting. I would personally find it simpler to define a folder hierarchy to encode this type of information: it uses standard filesystem semantics, allows human-readable names for the different groups, and allows each group to have its own header entries if required. Assuming the header is relatively compact, it also shouldn't take up a lot more storage than otherwise. And it allows applications to use simple load/store routines to handle single tractogram files, and trivially build on them to handle these more complex structures as the need arises. Others may (and no doubt will) disagree with this...

Another issue that hasn't been discussed so far is the possibility of storing additional per-streamline or per-vertex information. That's not currently something that can be done with the tck format, though it may be possible with others. This is actually the main topic of conversation within the MRtrix team. We currently store this type of information using separate files, both because our tck format isn't designed to handle it (probably the main reason, to be fair), but also because it avoids needless duplication in cases where several bits of information need to be stored (this was also one of the motivations for our fixel directory format). For example, if we want to store the sampled FA value for every vertex, we could store it in the file, but what happens if we also want to sample the MD, AD, RD, fODF amplitude, etc.? It's more efficient to store just the sampled values separately alongside the tractogram than to duplicate the entire tractogram for each measure just so it can reside in the same file. Alternatively, we could allow the format to store multiple values per vertex, but then we'd need additional complexity in the header to encode which value corresponds to what - something that's inherently handled by the filesystem if these values are stored separately. And on top of that, a format that allows for this per-streamline and/or per-vertex information would necessarily be more complex, increasing the scope for incorrect implementations, etc.

Again, this is likely to be a topic where opinions vary widely (including within our own team), but my preference here is again to keep the file format simple and uncluttered, and rely on the filesystem to store/encode additional information: it's simpler, more flexible, leverages existing tools and concepts that everyone understands, and avoids the need for additional tools to produce, inspect and manipulate these more complex datasets. OK, that's the end of my mind dump. Sorry if it's a bit long-winded...
Probably the best person to bring in regarding tractography formats from FMRIB would be Saad Jbabdi (or possibly Michiel Cottaar @MichielCottaar).
I think @jdtournier did a nice job of describing the tradeoffs, and I also agree that IEEE-754 float16 (e.g. GLhalf) probably provides more than sufficient precision, which would improve space and load/store efficiency. Thanks @Garyfallidis for noting the TRAKO format. It clearly achieves good space and load efficiency. It seems weak on the simplicity and store-efficiency metrics - on my machine (which has a lot of Python packages installed) the example code required installing an additional 110 MB of Python packages, and the current JavaScript code only decodes data. So, at the moment, a promising proof of concept, but not without tradeoffs. @jdtournier also makes an interesting case that perhaps scalars should be stored in files (e.g. NIfTI volumes) separate from the tractography file.
@jdtournier support for tck is already available in NiBabel and DIPY. If your claim is that we should just use tck, then the answer is that many labs are not satisfied with the tck format; if they were, we would just use tck. The effort here is to find a format that will be useful to most software tools. Nonetheless, if you look into the current implementation you will see that the tractograms are always loaded in world coordinates. But the advantage here is that you could have those stored in a different original space in the format. As for storing other metrics, I think we still need that information because a) many labs use such a feature, and b) if you store the data in other files then you always have to interpolate, and perhaps the interpolation used is not trivial. Also, you could have metrics that are not related to standard maps such as FA, etc. You could, for example, have curvature saved for each point of the streamline. Would you prefer curvature being saved as a NIfTI file? That would not make sense, right?
My suggestion to move forward is that @frheault, who has studied multiple file formats and found similarities and differences, writes down the specifications of the new format and sends them over to the different labs and tools for approval and suggestions. It is important to show that in NiBabel we have done the required work to study all, or at least most, of what is out there, and that the initial effort comes with some form of consensus. I hope @frheault that you will accept to lead such a task. And also thank you for your tremendous effort to make some sense of this world of tracks. Of course we will need the help of all of us, especially @effigies, @matthew-brett and @MarcCote. But I think you are the right person to finally get this done so we can move on happily as a community.
@Garyfallidis, I agree with your view that formats face a Darwinian selection, and that popular formats are therefore filling a niche. @jdtournier, the challenge I have with tck is that I cannot find any documentation for it. My support for this format was based on porting the Matlab read and write routines. It is unclear if these fully exploit the format as implemented by mrview and other MRtrix tools.
@jdtournier per your comment - maybe I am naive, but when I explored TrackVis, I thought there was a way to save TRK files that would map to MNI world space without knowing the dimensions of the corresponding voxel grid:
As I recall, I tried this in TrackVis with several artificial datasets that tested the alignment, and it seemed like an unambiguous way to map files nicely. From the discussion so far, I still see TRK as the leading format available using @jdtournier's metrics. I concur with @frheault that, regardless of format, one of the core issues is explicitly defining the spatial transform. So my question is: what problems do people have with TRK, and what new features does the field need? Can improved compliance, documentation and perhaps tweaks allow TRK to fulfill these needs?
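As a concrete illustration of that spatial-transform question, here is a hedged sketch of how TRK vertex coordinates can be mapped to world space using the header's vox_to_ras matrix and voxel sizes. This reflects my reading of the spec (including TRK's corner-based origin convention), not a normative reference:

```python
import numpy as np

def trk_voxmm_to_rasmm(points_voxmm, vox_to_ras, voxel_sizes):
    """Map TRK 'voxmm' vertices to world (RAS mm) coordinates.

    points_voxmm : (N, 3) vertices as stored in the TRK file
    vox_to_ras   : (4, 4) affine from the TRK header
    voxel_sizes  : (3,) voxel dimensions from the TRK header
    """
    vox = points_voxmm / voxel_sizes - 0.5    # voxmm -> voxel indices; TRK's
                                              # origin is the corner of the
                                              # first voxel, hence the -0.5
    vox_h = np.c_[vox, np.ones(len(vox))]     # homogeneous coordinates
    return (vox_h @ vox_to_ras.T)[:, :3]      # voxel -> world (RAS mm)
```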
@neurolabusc the main problem with TRK is speed: it takes a long time to load/save big files. But there are also other issues, for example limitations on what parameters can be saved, etc. @MarcCote and @frheault, can you explain?
Another issue is accessing specific parts of the file. Currently there is no support for fast access to specific bundles or parts of the tractogram. Another issue is memory management: TRK has no support for memory mapping or anything similar. Some of these files are getting too large to load fully into memory, and for some applications it is better to keep them in a memory map.
Hi folks. I support the comments @Garyfallidis reported above. As the size of tractograms increases, we need to use a file format that allows partial loading of the data (say, a percentage of the streamlines).
OK, obviously my vague attempts at humour have not gone down well. The main point of my message was to provide a list of the criteria that I would consider important for a standard file format. They happen to be mostly embodied in the tck format, perhaps unsurprisingly, and I'm being upfront about the fact that this is likely to be perceived as a conflict of interest - which clearly it has been anyway. I'm not arguing that tck should become the standard, and clearly the fact that there's a discussion about this means that at least some people don't think it should be either. That's fine, but since I've been invited into the discussion, I thought I'd state my point of view on what such a format should look like. And yes, I have a problem articulating that without looking like I'm arguing for the tck format, precisely because the considerations that went into its original design 15 years ago are still, in my opinion, relevant today.
But that's a matter of the software implementation, not the file format, right? Perhaps I'm getting confused here, but if we're discussing a new standard file format for tractography, then it should be independent of any specific software implementation or API. This discussion is taking place on the nibabel repo, which is perhaps why we're getting our wires mixed up. I don't wish to belittle the massive efforts that have gone into this project, but I'd understood this discussion to be project-independent.
I understand that, and I can see the appeal. I can also see the overhead this imposes on implementations to support multiple ways of storing otherwise equivalent information. This is why I would argue, on the grounds of simplicity, that a standard file format should adopt a single, standard coordinate system. Otherwise we'll most likely end up with fragmentation in what the different packages support: some will only handle one type of coordinate system because they haven't been updated to support the others, and will hence produce files that other packages won't be able to handle because they only support a different coordinate system. We could of course mandate that to be compliant, implementations should support all allowed coordinate systems, but I don't think this is necessarily how things would work out in practice. And we can provide tools to handle conversions between these so that these different tools can interoperate regardless, but I'm not sure this would be a massive step forward compared to the current situation. On the other hand, I appreciate that different projects use different coordinate systems internally, and that therefore picking any one coordinate system as the standard will necessarily place some projects at a disadvantage. I don't see a way around this, other than by your suggestion of allowing the coordinate system to be specified within the format. I don't like the idea, because this means we'd effectively be specifying multiple formats, albeit within the same container. But maybe there is no other way around this.
OK, there's a misunderstanding here as to what I was talking about. First off, no argument: the many labs that need these features include ours, and we routinely make use of such information. But we don't store it as regular 3D images; that would make no sense in anything but the simplest cases. It wouldn't be appropriate for fODF amplitude, or for any other directional measure, or curvature, as you mention. What I'm suggesting is that the information is stored as separate files that simply contain the associated per-vertex values, with one-to-one correspondence with the vertices in the main tractography file, in the same order. This is what we refer to in MRtrix as a 'track scalar file' - essentially just a long list of numbers, with the same number of entries as there are streamline vertices. We routinely use them to encode per-vertex p-value, effect size, t-value, etc. when displaying the results of our group-wise analyses, for example. We also use separate files for per-streamline values (used extensively to store the weights for SIFT2), and these are also just a long list of numbers, one per streamline, in the same order as stored in the main file - and in this case, stored simply as ASCII text.

I'm not sure the specific format we've adopted to store these values is necessarily right or optimal in any sense; I'm only talking about the principle of storing these associated data in separate files, for the reasons I've outlined in my previous post: in my opinion, it's more space-efficient, more transparent, and more flexible than trying to store everything in one file. I should add that storing the data this way does introduce other limitations, notably if the main tractography files need to be edited in some way (e.g. to extract tracts of interest from a whole-brain tractogram). This then requires special handling to ensure all the relevant associated files are kept consistent with the newly-produced tractography file. This type of consistency issue is a common problem when storing data across separate files, and I'm not sure I've got a good answer here. In any case, I've set out my point of view, and I look forward to hearing other opinions on the matter.
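To make the sidecar idea concrete, here is a minimal sketch of the principle; the file names are hypothetical, and this is not the actual MRtrix track-scalar (TSF) or weights format:

```python
import numpy as np

streamlines = [np.random.rand(n, 3).astype(np.float32) for n in (12, 40, 7)]

# Per-vertex sidecar: one flat array with the same vertex count and order
# as the tractogram itself (here, fake FA samples).
fa_per_vertex = [np.random.rand(len(s)).astype(np.float32) for s in streamlines]
np.concatenate(fa_per_vertex).tofile("tracks_FA.bin")

# Per-streamline sidecar (e.g. SIFT2-style weights): plain text, one value
# per streamline, in the same order as the main file.
np.savetxt("tracks_weights.txt", np.random.rand(len(streamlines)))
```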
@neurolabusc I think the problem with TRK is efficiency when it comes to large datasets. @Garyfallidis Despite all the flaws of tck/trk/vtk, people have been using them for more than a decade, so I think a first iteration should be as simple as possible: a hierarchical HDF5, readable by chunk using data/offsets/lengths (you read offsets/lengths and then know what data to read, then you reconstruct streamlines as polylines), that can append/delete data in place, supports data_per_streamline and data_per_point (and data_per_group if it is a hierarchical HDF5), with a StatefulTractogram-compliant header, and with a strict saving/loading routine to prevent errors.

@jdtournier I don't know if you are familiar with the data/offsets/lengths approach in the ArraySequence of NiBabel, but it is a very simple way to store streamlines in 3 arrays of shapes NBR_POINTS x 3, NBR_STREAMLINES, and NBR_STREAMLINES, which I have used in the past with memmap and HDF5 to read specific chunks quickly or do multiprocessing with shared memory. Reconstructing it into streamlines is efficient since the point data is mostly contiguous (depending on the chunk size). Bonus: I think HDF5 arrays can be given their own datatype, so float16 could be used, reducing the file size. Also, Matlab and C++ have great HDF5 libraries to help with the reading. Finally, I agree that storing metrics per point would make an even bigger tractogram, but allowing data per point and per streamline will likely facilitate the life of a few, while the others can simply do it their way.

I also agree that the data should be written on disk in world space (rasmm), as in tck; that should be the default, but with the info to convert to tck/trk easily and so on, leaving compatibility intact for a lot of people. Your list (Simplicity, Space efficiency, Load/store efficiency, Independence, Extensibility) is crucial to think about. I think the header would be much simpler than trk, but carry slightly more info than tck. I would go for the 4-5 attributes I mentioned earlier; that would be a sweet spot between Simplicity and Independence. As for Extensibility, since HDF5 is basically a gigantic hierarchical dictionary, as long as the mandatory keys are there, adding more data could be done easily; more header info, or even keeping track of processing, would be possible (if wanted), like in .mnc files. However, except for the switch to float16, I think reading/writing is kind of bound to its current limit. Supporting chunked or on-the-fly read/write is nice, but that would not change the speed of reading/writing a whole tractogram.
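For what it's worth, here is a rough sketch of that data/offsets/lengths idea using plain memory-mapped files; the file names and dtypes are just assumptions for illustration:

```python
import numpy as np

# Three flat arrays on disk: all points concatenated, plus per-streamline
# offsets and lengths. Any streamline can be rebuilt without a full read.
data = np.memmap("data.float32", dtype=np.float32, mode="r").reshape(-1, 3)
offsets = np.memmap("offsets.int64", dtype=np.int64, mode="r")
lengths = np.memmap("lengths.int32", dtype=np.int32, mode="r")

def get_streamline(i):
    """Read only the points of streamline i from disk."""
    start = offsets[i]
    return np.asarray(data[start:start + lengths[i]])  # (lengths[i], 3)
```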
That's unfortunate - it's documented here. If you'd already come across this page but found it insufficient, please let me know what needs fixing!
@Garyfallidis I agree TRK is inherently slow to read. Interleaving the integer "number of points in this track" in the array of float vertices is a poor design. Much better performance could be achieved if the integers were stored sequentially in one array and the vertices were stored in their own array. One could load the vertices directly into a VBO. This would also allow fast traversal of the file, addressing your second criticism. Both would improve the load-efficiency metric. @frheault adding support for float16 would improve the space- and load-efficiency metrics. I am not sure of the use case for float64 for vertices, but it would be nice for scalars. I also take your point that the header and spatial transforms could be simplified, improving the simplicity metric. While HDF5 has some nice wrappers for some languages, the format itself rates very poorly on the simplicity metric; I think there are clear criticisms of its complexity. This would introduce some of the same complications as TRAKO, without TRAKO's space-efficiency benefits. It is odd to criticise the TRK format as complex when it is simply described on a short web page, and then advocate the HDF5 format. @jdtournier thanks for the documentation - so the Matlab read/write routines do reveal the full capability. TCK is a minimalistic, elegant format that certainly hits your simplicity metric, but I can see why some users feel it is too limited for their uses. It sounds like real progress is being made on the features that are desired, and worth the cost of implementing a new format.
One additional comment here: @MarcCote, @frheault and @jchoude, what is the status of this work? Should we discuss that here?
Hi @Garyfallidis, can you please invite more folks here to pitch in? The document is still pretty much an empty slate, so we can work together on crafting what we need, right?
Will invite more people after Wednesday, as I am working on a grant deadline now. The others should be happy to invite more people. For now, I would remove any reference to specific software from the document, to show that you want to hear other voices too; use generic header names, etc. Also, let's forget about backward compatibility with older file formats. This discussion is for a new file format; it does not need to be backwards compatible with anything. If you want to upgrade an existing format, then that discussion should happen on the specific project's forum, not here.
The document as it stands at this moment is pretty
@Garyfallidis I have no intention to co-opt this topic. Anyone is free to introduce other ideas. I also agree that we should not be beholden to existing formats; I simply described how I think tck could be extended. I have no prior experience with this format, but it does have some nice properties. @frankyeh endorsed working with it. @frheault noted the benefits of having a unified header at the start of the file (which tck has), and that fixed-structure formats like TRK are hard to extend. I am perfectly happy if others want to present different conceptual designs. Indeed, going into this I had felt pretty confident that an offset-based format would prove superior to the restart-based method for storing data used by TCK/TRK. However, to my surprise, my own implementation of such a scheme did not noticeably outperform TCK. So while I am an advocate of thinking about this as a clean-sheet design, let's extend existing methods if they can be adapted to suit our goals. I am not beholden to anything I wrote in that Google document. I tried to outline the motivation at the start, which covered the discussion above as seen from my perspective. I then made a concrete attempt to extend TCK for this purpose. This was interesting, as I had not really thought about the fact that the current TCK/TSF files are not explicit regarding the size of the binary data, so at a minimum, extending this format does seem to require additional tags. I am happy for others to describe completely different formats in that document, or to openly describe the limitations they see with the format I describe. Any format is a compromise, and concrete examples and frank discussion will help the community decide what they need/want.
@Lestropie I have updated the document so that it only really describes TCK in a section where I propose how TCK could be extended. Originally, I had simply copied @frankyeh's comment preferring TCK, and I do think this colored the tone to suggest that was the only format being considered. I have changed this to note that some prefer using existing formats to the fullest extent possible. The remaining references to TCK are my direct copies of @jdtournier's description of the metrics we should look for, where he uses TCK as a reference. I am done saying what I want to in the document, and have done my best to simply describe how TCK could be extended as one possible concrete solution. If anyone wants to help make the document sound more balanced, I am happy for them to do so; any first draft is always biased by the first author's perspective. I urge anyone who wants to describe an alternative format to do so. Likewise, if anyone wants to criticize my proposed solution, they should feel free to do so. I added a brief section where I tried to discuss the weaknesses of the format, which I hope helps this read as a more balanced suggestion rather than a firm proposal. I have had no involvement with MRtrix, so my suggestions may seem alien to the developers of that tool.
I agree that the discussion should not focus on extending a current file format, whether tck or anything else. Furthermore, even if for some reason there is a decision to extend the tck format (which I'm not sure I would support, personally - I think we can do better), I don't think it would be wise to call it that - it would be a different beast, and we'd likely want to make a clean break with past formats (regardless of any underlying similarities). On top of that, I'm keen to avoid any perception of bias towards any particular software package, and for that reason alone, I think the discussion should steer clear of suggesting extending existing formats. What might be valuable though is listing the features of the various formats currently available, note similarities between formats, and highlight pros & cons of these different features. For example, I think it's important to talk about the pros & cons of a fixed vs extensible header, and note which existing formats support which of these, but the discussion should be about whether we want a fixed or an extensible header, not about extending whichever format happens to currently support that. Even within that discussion, there are further subtleties worthy of discussion that might get overlooked if we focus on existing formats - for example, if we want an extensible header, how should it be stored? So I reckon if we structure that document in terms of:
Then there's actually no reason to mention any particular existing format. It might be helpful to add a section about existing formats, in which we can outline which formats support which features, but I'm not convinced it adds much to the proposal. And finally, yes, there's no question that we do need wider community buy-in than the few of us who have already participated in the discussion. On that note, would it be an idea to delay this discussion until after the ISMRM, so there's a chance we might have everyone's attention?
Just one more thought if I may. I've raised this before, and @Lestropie has also reiterated the point: in many ways these datasets might be more easily handled as a set of separate files, ideally co-located within a folder - at least conceptually. But ideally, we also want a single file format to maintain integrity, avoid confusion, reduce scope for errors, etc. I think there's a way to have the best of both worlds... It seems to me what we need is a format for an archive file: a way of storing multiple independent files (corresponding to the header, streamlines data, additional data tables, etc.) in the same file. We could devise our own simple container format for that, but I think there's an existing, widely-used option that combines a lot of what we might be after: the good old ZIP format... This may sound a bit far-fetched, but this is why I think it might be a good idea:
Why it might not be a good idea:
Anyway, just a thought. I thought I'd share it while it's fresh in my mind...
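To make the ZIP idea tangible, here is a toy mock-up; every member name, header field and the archive name here are hypothetical, not a proposal:

```python
import json
import zipfile

import numpy as np

points = np.random.rand(100, 3).astype(np.float32)
header = {"streamline_count": 4, "datatype": "float32", "space": "RASMM"}

# Write: header and binary blobs as independent members of one archive.
with zipfile.ZipFile("tractogram_container.zip", "w") as zf:
    zf.writestr("header.json", json.dumps(header))
    zf.writestr("positions.float32", points.tobytes())

# Read: any standard ZIP tool or library can pull the pieces back out.
with zipfile.ZipFile("tractogram_container.zip", "r") as zf:
    hdr = json.loads(zf.read("header.json"))
    pts = np.frombuffer(zf.read("positions.float32"),
                        dtype=np.float32).reshape(-1, 3)
```

Because it is a plain ZIP archive, `unzip -l`, file managers, and standard libraries in every language can already inspect and extract the pieces.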
Sounds like a good plan. Several NiBabel users have noted very slow tractography load times. In this thread, members of the DIPY team (e.g. @frheault) have suggested investing time into supporting formats that allow random access, memory mapping and efficient storage of ragged arrays. I have written a simple Python script that allows TCK files to be loaded with precisely these properties. It exploits a feature of TCK that does not exist in other formats (TRK, BFloat, NIML, etc.): both the vertex positions and the end-of-streamline signals are saved using an identical number of bytes on disk. Therefore, one can map the vertices directly from disk to memory, and it is simple to generate 1D arrays that track the first and last vertex associated with each streamline. This allows the DIPY developers to use existing TCK files as they develop efficient methods for random access, rapidly masking a portion of streamlines, efficient memory usage, etc. This does not end discussion of a new format - others have noted the limited features of TCK. However, it could help DIPY developers experiment with the features, and help DIPY users with their existing TCK datasets in the interim while a new format is developed. The Matlab and Python code are here; for testing I used the 869 MB TCK file @soichih describes, on a Linux Ryzen 3900X with 64 GB of RAM (all tests generated similar results for RAM disk and SSD). Native code loaded the file in under a second; the Matlab read_mrtrix_tracks.m function required 7.7 seconds to load the file.
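A condensed sketch of that trick - not the actual script linked above - assuming little-endian float32 data, the standard 'file: . <offset>' header line, NaN-triplet streamline delimiters, and a header that fits in 2 kB:

```python
import numpy as np

def memmap_tck(fname):
    """Memory-map a TCK file's vertices and locate streamline boundaries."""
    with open(fname, "rb") as f:
        head = f.read(2048).decode("latin-1")        # text header is small
        offset = int(head.split("file: . ")[1].split()[0])
    raw = np.memmap(fname, dtype="<f4", mode="r", offset=offset)
    triplets = raw.reshape(-1, 3)                    # vertices + delimiters
    ends = np.flatnonzero(np.isnan(triplets[:, 0]))  # NaN rows end a track
    starts = np.r_[0, ends[:-1] + 1]
    return triplets, starts, ends
```

Streamline i is then just triplets[starts[i]:ends[i]], with no parsing pass over the floats themselves.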
@jdtournier Do you think the features I laid out in section 4.2 would be compatible with this? It also makes data streaming much easier, since each "memmap" is separate and the different files simply have to follow a naming convention based on the header (a JSON file or equivalent). This is very simple and intuitive, and people that know it is a zip can play inside (but this should remain "rare", just like most people are not aware that docx is a zip file). If we still require strict coherence between files (length declared in the header is respected, datatype is checked, etc.), I think this could be a much better approach. PS: If you like the features in 4.2, we could copy it into
I have to admit I'd not had the time to go through that document in detail - but it seems to reiterate many of the themes discussed so far. I've had a look at your section 4.2, and I really can't see any reason why there would be any incompatibility. As long as there's agreement that the different independent pieces of information will be stored as independent, separate entities, then ZIP will probably work just as well as any other container format that we might propose. Incompatibility would only be an issue if we were interested in using multiplexing containers to interleave the different types of information (as used in multimedia formats), which would allow data streaming - but I don't see that happening based on the current discussion. I don't think using ZIP as the container format influences anything on your list (other than potentially word alignment, if that's deemed important), with the added benefits that:
On that last point, I might disagree with your comment:
I actually think the opposite: this should be made very clear, and users should be encouraged to use this as required. For example, I may have explored dozens of different statistical hypotheses on my tractogram and generated per-vertex t-statistics, z-scores, and FWE-corrected p-values for each of these hypotheses. At some point, I may want to clean all this up, and remove unnecessary tables. We could provide tools to do this, but since it's a ZIP archive, the tools already exist - and probably do the job better than we would.
You could, but frankly I think it's so compatible it doesn't need its own section. It's a matter of choosing the container format to store the data you mocked up, which is a discussion that's actually missing from that section anyway. ZIP can be mentioned as one way amongst others; there are many ways to concatenate all of these bits of information (as long as the decision is to avoid multiplexing, which I expect will be the case). One discussion, though, will relate to how the header is organised and in what format (e.g. JSON, YAML, etc.), and where any header information specific to additional data should be stored. E.g. if we want to have a mini-header for the grouping table, should that reside in the main header (which in a ZIP archive would be a file in its own right), within the grouping table data file itself, or as an additional header file that can easily be identified as specific to the grouping table? Personally, I would tend towards the latter option, but this is all up for discussion - whether we use ZIP or not. On a different note: where does the extension
@jdtournier Perfect, I won't create a new section. It is true that they are pretty compatible. About the

Also, about playing inside the zip: as long as it remains coherent (data-wise and header-wise), it is true people can do what they want in a file explorer or with code. We should simply plan ahead to make a convention that will prevent "accidents" - something like a filename convention, data-per-streamline being .
I've added a few comments to the Google document - see what you think. I hope the discussion doesn't get too fragmented with these disparate discussions...
Yes, such conventions would be needed, and they'd need to be very specific. On that note, if I can make a suggestion regarding the naming conventions: I've raised the issue of where & how to store the metadata for the additional per-vertex or per-streamline data (or any other data). I think this needs to be stored alongside the additional data themselves somehow, to allow for these data to be easily appended to the file as the need arises. But at the same time, I think if we opt for a clear container format like ZIP, it would make sense to separate metadata and actual data into distinct pure-text and pure-binary blobs/files respectively. As long as there are clear naming conventions, the metadata for any structure can trivially be located, and even the most basic reader can trivially load the binary data into their own data arrays with little effort. But depending on what kind of information these mini-headers need to store, we might be able to avoid the need for them altogether. I suspect all we need to know is a human-readable label, the type (per-streamline, per-vertex, per-group, etc.), the data type (presumably float16/32/64, int8/16/32/64, and uint8/16/32/64), and the number of items per vertex/streamline/group. I reckon this can all be stored using appropriate naming conventions, in combination with the (known) number of vertices/streamlines/groups: we could have folders to denote the type (e.g. your

This is just a thought, and I may be going too far in trying to keep everything as ridiculously simple as possible, but with the ZIP format, we have the opportunity to settle on a file structure whose organisation is really obvious and transparent to everyone. If we can minimise the amount of guesswork / poring over pages of documentation people have to do when trying to parse the data, that's a great thing, IMO.
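A tiny sketch of how a reader could exploit such a convention; the folder name 'dpv' (per-vertex data) and the dtype-as-extension scheme here are hypothetical examples, not agreed names:

```python
import numpy as np

def parse_member(name, raw_bytes):
    """Recover kind, label and values from a name like 'dpv/FA.float32'."""
    kind, fname = name.split("/")        # folder encodes per-vertex/-streamline
    label, dtype = fname.rsplit(".", 1)  # extension encodes the data type
    return kind, label, np.frombuffer(raw_bytes, dtype=np.dtype(dtype))

kind, label, values = parse_member(
    "dpv/FA.float32", np.zeros(5, np.float32).tobytes())
```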
OK, I found the original comment where this first came up - no worries. Though personally, I'm tempted by something like
@jdtournier choosing the file extension is the hardest part! I like your idea of having the filename convention tell us most of the information. This is a good direction; we have great agreement on the features and the general way to organize. Side note for @arokem @Garyfallidis @MarcCote: in Python, I am currently trying to implement a simple organization like @jdtournier suggested, supporting the features I listed, and it works pretty easily with zipfile/numpy/json (super easy, in fact, with zipfile in Python >= 3.8). It is very fast, easily reconstructs a StatefulTractogram, and the code is quite readable.
@frheault, glad you're on board - but let's make sure all the relevant stakeholders are on board before taking this much further. We don't want anyone getting the impression decisions have already been taken before they've had a chance to respond!
Exactly - though I'd make sure the file extension used is one that has been agreed and unambiguously identifies the data type (
"Would be as long as that...?" I expect the directory listing would be read within a matter of milliseconds! You should only need to read the ZIP file's central directory file header, which is the very last part of the file, self-contained, and should fit within ~1 kB or so, given we're not talking about too many files here.
Thanks @jdtournier @frheault. I invited others to pitch in (via email); hopefully they will. It is summer on a COVID-19-infested earth, somehow. But I am very impressed with the discussion here. I understand many will have to pitch back in before we can start organizing the thoughts of the community. Thanks for all the contributions!
@mdesco is on vacation right now, but I updated him on the side. The features I listed would meet his expectations; if we can achieve what I listed in section 4.2, I think it is safe to say he would be on board. @jdtournier Yeah, that's true, I forgot zip does some magic with its own header. +1 for the datatype directly in the indices filename. Anyone new should focus on the features listed in https://docs.google.com/document/d/1GOOlG42rB7dlJizu2RfaF5XNj_pIaVl_6rtBSUhsgbE/edit The implementation @jdtournier and I are talking about is mainly related to the features in section 4.2.
Hello again, just to re-start some form of discussion about the potential new file format. For now, it is quite rudimentary, but I think the suggested features could be achieved (at least in Python). I think this could work pretty well as a format. However, the implementation would be left to the library: if a library wants to support none of the extra features, it is easy to simply read everything into memory and get the streamlines and auxiliary data; to achieve all features robustly requires (obviously) more work. But I can see pretty clearly the list of things to do to achieve the features. However, I don't think it's going to be easy to make them compatible with the

Overall, I would say it is a good nomenclature, and the zip idea is working pretty well in my early tests. I had to try it out, just in case I was 'accepting' design suggestions that could not be achieved in Python. Also, for @arokem, @MarcCote and @Garyfallidis, in terms of speed/size (compared to nibabel loading the

Please do not judge my code - it's only two afternoons' worth... But it works; most of it is to 'parse' the nomenclature.
@frheault this looks nice. The code is clean, and to a large degree the format is self-documenting. I have only a few minor comments, mostly based on my personal style. Take these as a sign of my enthusiasm, not as fatal flaws:
@neurolabusc Thanks for the input! I am planning on adding a first draft of a memory-friendly concatenate today!
I'm a bit late to this thread, but just wanted to pitch what I personally feel are important targets for a new format (or an upgrade to an existing one):
My two cents, |
Hello everyone, just to keep the subject alive: I did a small technological survey and tried out different approaches. My preference is for the memmap approach inside a zip file, where the architecture is self-explanatory. I propose that we move the discussion to this thread to disentangle the subject from NiBabel and focus on the specifications. This is still ongoing work, and there are still discussions to be had, but it is a start. It is important to remember that my implementation is simply to test the idea, not an actual final/usable version. If anyone has ideas that they coded in the past 1-2 months related to this thread, I would be happy to modify the (fresh/empty) GitHub repository to include them.
Hi everyone, this Wednesday @frheault will lead a discussion about the tractography file format at the DIPY online meeting. For more information: dipy/dipy#2229 (comment). Feel free to join in!
@skoudoro great!
Hi all, I know I'm late to the party, even if @frheault bugged me multiple times about it. I'm still going through the whole thread and the other related documents, but I still want to say that, to me,
Disclaimer: some of you know that @frheault and I come from the same lab. I'm just disclosing it so that if my comments agree with his, it isn't seen as a conflict of interest. He did all of this on his own, based only on the knowledge and experience that we built through the years. Thanks everyone for all the hard work; I think this will help alleviate some recurring issues we have had over the years!
Hi, unfortunately I saw these discussions very late and missed the Dec 2 meeting. Hopefully it was fruitful. I believe a new tractogram format addressing the needs of the community would be highly useful. Thanks to all who have been active in this so far. I would be happy to contribute to the discussions and development as well.
Hi @baranaydogan, I am very sorry you missed it. It would be great to have you interact on this topic.
It would be terrific to start a conversation about an agreed-upon data format for tractography.
@arokem @Garyfallidis