Description
Given recent comments in dipy/dipy#2229, and further back in nipy/nibabel#942, I think it is worth dedicating a discussion thread to the issue of data compression.
I think it is fair to say that this particular format proposal has prioritised extensibility and simplicity over compression. But for some users / developers the latter may be the higher priority. So here we can discuss whether the gains in this department from the current TRX proposal are adequate, whether it is possible to bootstrap any kind of explicit data compression into this format proposal, or whether a drastically different proposal would be necessary in order to satisfy such demands.
Please amend or correct as necessary, and comment as you see fit.
Where compression may be possible in TRX:
- Use of float16 for vertex locations -> halves storage relative to float32
- No need for delimiters -> one fewer vertex per streamline compared to delimiter-based formats, though explicitly storing offsets reduces that gain slightly
- Compatible with downsampling, i.e. storing fewer vertices per streamline than were used in generating the streamline; this can be a naive integer factor reduction in vertex count, or based on estimation of error accumulation (e.g. https://www.sciencedirect.com/science/article/pii/S1053811914010635); in either case, for algorithms with a small step size this can be the primary source of storage reduction
- Zipping of the full directory structure (though depending on implementation this may impact IO substantially); gains from this tend to be minimal due to the nature of the data
- Support for more dedicated compression strategies as in-place alternatives for vertex position (or indeed DPP / DPS / DPG) data within TRX.
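To put rough numbers on the first and third points, here is a minimal sketch using only the Python standard library. The toy streamlines and the helper names `pack_vertices` and `downsample` are made up for illustration and are not part of any TRX implementation; the `"e"` format code of `struct` is IEEE 754 half precision.

```python
import struct

def pack_vertices(streamlines, fmt):
    """Pack all vertices as little-endian floats.

    fmt is a struct format code: "f" for float32, "e" for float16.
    """
    flat = [c for s in streamlines for v in s for c in v]
    return struct.pack(f"<{len(flat)}{fmt}", *flat)

def downsample(streamline, factor):
    """Naive integer-factor reduction that always keeps the final vertex."""
    kept = streamline[::factor]
    if (len(streamline) - 1) % factor != 0:
        kept.append(streamline[-1])
    return kept

# Toy data (hypothetical): two straight streamlines with 0.1 mm steps along x
streamlines = [
    [(10.0 + 0.1 * i, 20.0, 30.0) for i in range(101)],
    [(-5.0 + 0.1 * i, 0.0, 12.5) for i in range(51)],
]

as_f32 = pack_vertices(streamlines, "f")
as_f16 = pack_vertices(streamlines, "e")

# Round-trip through float16 to measure the rounding error it introduces
original = [c for s in streamlines for v in s for c in v]
restored = struct.unpack(f"<{len(original)}e", as_f16)
max_err = max(abs(a - b) for a, b in zip(original, restored))

# Keep every 5th vertex, then store the result as float16
reduced = [downsample(s, 5) for s in streamlines]
as_f16_reduced = pack_vertices(reduced, "e")

print(len(as_f32), len(as_f16), len(as_f16_reduced), max_err)
```

One thing this makes visible is that the float16 saving is not free: half precision has an 11-bit significand, so the spacing between representable values is 1/64 mm for coordinates between 16 and 32 mm and 1/16 mm for coordinates between 64 and 128 mm. Whether that is acceptable presumably depends on the coordinate range and the application.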
Point 5 may be interesting to explore: a number of manuscripts have been published explicitly on storage reduction of streamline vertex information, so it would theoretically be possible to have the option of using one such format for vertex position information within TRX. It would make the technical specification considerably more complex, though one could defer to external specifications for datasets that use such an encoding. I'm not a fan of this personally; I'm simply stating that it's a prospect.
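Purely to illustrate what an "in-place alternative" encoding of vertex positions could look like, below is a generic quantise-and-delta sketch. It is not the method of any particular manuscript and not anything in the TRX proposal; the function names, the 0.01 mm grid, and the byte layout are all assumptions chosen for the example.

```python
import math
import struct
import zlib

def encode(streamline, step=0.01):
    """Hypothetical encoding: snap vertices to a grid of `step` mm, store the
    first vertex as int32 and every later vertex as int16 deltas, then deflate."""
    q = [tuple(round(c / step) for c in v) for v in streamline]
    deltas = [b[k] - a[k] for a, b in zip(q, q[1:]) for k in range(3)]
    raw = struct.pack("<3i", *q[0]) + struct.pack(f"<{len(deltas)}h", *deltas)
    return zlib.compress(raw)

def decode(blob, step=0.01):
    """Inverse of encode(); recovers positions to within about step / 2 per axis."""
    raw = zlib.decompress(blob)
    first = struct.unpack_from("<3i", raw, 0)
    n = (len(raw) - 12) // 2          # number of int16 deltas after the 12-byte header
    deltas = struct.unpack_from(f"<{n}h", raw, 12)
    grid = [first]
    for i in range(0, n, 3):
        prev = grid[-1]
        grid.append(tuple(prev[k] + deltas[i + k] for k in range(3)))
    return [tuple(c * step for c in v) for v in grid]

# Toy helical streamline (hypothetical data), 101 vertices
streamline = [
    (50.0 * math.cos(0.02 * i), 50.0 * math.sin(0.02 * i), 0.1 * i)
    for i in range(101)
]

blob = encode(streamline)
decoded = decode(blob)
worst = max(abs(a - b) for v, w in zip(streamline, decoded) for a, b in zip(v, w))
print(len(blob), "bytes vs", len(streamline) * 12, "bytes as float32; max error", worst)
```

Even this toy version shows where the specification complexity comes from: a reader needs to know the grid step, the integer widths, the delta convention and the entropy coder before a single vertex can be recovered, and the data can no longer be memory-mapped directly as an array.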
In the converse case, i.e. a format designed from the ground up to prioritise compression: a couple of these already exist, but they don't have the extensibility of TRX. Extending the capabilities of compressed streamline data storage could look something like bootstrapping compression into the .trk format, or it could be something entirely novel. It probably falls upon someone who prioritises compression to propose something that still has a chance of satisfying those who prioritise extensibility. But this is likely one of multiple dimensions of difference that caused the original discussion in nipy/nibabel#942 to go slightly astray, so it warrants greater precision here.