UnixFS Reboot #28
TLDR;
I’ve listed every feature I can find that has been considered for UnixFSv2 below. We discussed this in a short meeting (notes at the end of the document, recording posted soon) and the following action items surfaced:
- @mikeal will kick off an issue in ipfs/spec to add file metadata to UnixFSv1
- @mikeal will kick off an issue in this repo to define and scope a UnixFSv2 we can ship on a reasonable timeline.
UnixFS vNext Reboot
For some time we’ve been directing issues, feature requests, and the general future of UnixFS at “UnixFSv2.” Because the size and scope of this future version were never locked down, this has delayed improvements to UnixFSv1 and has failed to tie UnixFSv2 to a clear deadline and set of functionality.
The goal of this document is to describe the various issues and features we’d like to see in UnixFS and link to the historical discussions about those features. We can then use this document to discuss and prioritize each feature and find the best path to development whether it be improvements to UnixFSv1, an incremental UnixFSv2 on dag-cbor, or a bigger future version built on features that are still being researched.
General Links
- Requirements 2017
- UnixFSv1 -> v2 upgrade path
- Prioritizing UnixFSv2
- UnixFSv2 Draft Implementation in JS
Development Targets
This section briefly describes the difficulties and limitations of the different development strategies, which should help inform how best to approach each issue.
Improvements to UnixFSv1
One problem with improving UnixFSv1 is that none of the generic improvements we make can be leveraged by other applications outside of IPFS. For instance, the work we’ve done for directory sharding lives in UnixFSv1 and can’t be used for other generic sharding problems. This means that solving fairly generic problems via UnixFSv1 is less valuable and will eventually result in duplicated effort.
The other problem is dag-pb, best summarized by @Stebalien. In short, it’s very rigid, and adding fields and other features is more cumbersome than in dag-cbor.
UnixFSv2 on dag-cbor soonish
This development route solves the dag-pb related issues and makes some of the generic improvements leveragable outside of IPFS.
However, there is one major problem remaining: upgradability. All new features and improvements must exist and be relatively consistent between two versions of IPFS manipulating the same data. There is no good way to ensure this without future IPLD features that are still in the research phase.
This route of development is most problematic when tackling the “Reproducible Hashes” issue.
It should also be noted that, because we know there is undeveloped IPLD work that we want to leverage for UnixFS, we can be fairly certain that releasing this version of UnixFSv2 would still be followed by another major version migration at some point in the future.
The actual development time for this would not be very long. @mikeal has already written draft implementations of several iterations of the UnixFSv2 spec in JS. A much more important factor to consider is the upgrade cost to IPFS users.
UnixFSv2 on “IPLD Future”
Most of the big problems facing UnixFS are problems facing IPLD generally. These problems are all being actively worked on in the form of engineering and research and at some future date can be leveraged for an ideal, future-proof (upgradable), version of UnixFS. However, when this will be available can’t be predicted with a high level of certainty.
Issues
Standard File/Directory metadata
- Permissions
- Executable bit
- Ownership (user and group)
- Filename in file object
- Number of files in directory (HAMT)
- Cumulative size of files in directory
- mtime
- mtime as BigInt
- content-type
Links
Arbitrary file metadata
The ability for users to add their own optional metadata to files could be very useful. However, doing arbitrary anything in dag-pb is problematic.
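As a rough illustration of how the standard fields listed above and arbitrary user metadata might sit side by side in one record, here is a minimal Python sketch. The `FileMetadata` class, its field names, and the `metadata_from_path` helper are all hypothetical and are not taken from any UnixFS spec or implementation. Every field is optional, and the timestamp is kept as an integer count of nanoseconds, which Python handles at arbitrary precision.

```python
import os
import stat
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class FileMetadata:
    """Hypothetical metadata record; names are illustrative, not from a spec."""
    mode: Optional[int] = None          # permission bits, including the executable bit
    uid: Optional[int] = None           # owning user
    gid: Optional[int] = None           # owning group
    mtime_ns: Optional[int] = None      # nanoseconds since the epoch, as a big integer
    content_type: Optional[str] = None  # e.g. "text/html"
    extra: Dict[str, str] = field(default_factory=dict)  # arbitrary user metadata

    def is_executable(self) -> bool:
        """True when the owner-executable bit is set in `mode`."""
        return self.mode is not None and bool(self.mode & stat.S_IXUSR)


def metadata_from_path(path: str) -> FileMetadata:
    """Populate the standard fields from a file on a traditional file system."""
    st = os.stat(path)
    return FileMetadata(
        mode=stat.S_IMODE(st.st_mode),
        uid=st.st_uid,
        gid=st.st_gid,
        mtime_ns=st.st_mtime_ns,
    )
```

Keeping user-supplied keys in a separate `extra` map is only one possible design; it avoids collisions with the standard fields at the cost of an extra level of nesting.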
Reproducible Hashing
Put simply, this is the ability for a given UnixFS implementation to look at an existing UnixFS encoded file and a file on a traditional file system and to reproduce the UnixFS encode identically.
This feature is relatively simple if there is no optionality and every version of IPFS is in perfect alignment. However, this is almost never the case.
IPFS has several options that can be used when encoding a file that alter the encode.
One path is to encode all options into the encoded version of the file. This works only as long as both versions of IPFS are in alignment, so it can often fail to produce identical hashes in upgrade scenarios. The only way to completely guarantee reproducible hashing is to guarantee that the applications are also identical, but this is very difficult without “IPLD Future.”
- Reproducible file imports, Sept 2018
- Deep Dive IPFSCamp 2019: Deterministic CIDs! Reproducible File Imports! Verifiable HTTP Gateways!
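To make the effect of encoding options concrete, the following Python sketch shows the same bytes producing different root hashes under different chunk sizes, and the same root when the recorded options are reused. It is a toy model under stated assumptions: `toy_root_hash` is a hypothetical stand-in that hashes each chunk with SHA-256 and then hashes the list of chunk hashes. It is not how UnixFS or CIDs are actually computed.

```python
import hashlib
from typing import List


def chunk_fixed(data: bytes, chunk_size: int) -> List[bytes]:
    """Split data into fixed-size chunks (the last chunk may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def toy_root_hash(data: bytes, chunk_size: int) -> str:
    """Toy import: hash each chunk, then hash the concatenated chunk hashes."""
    leaf_hashes = [hashlib.sha256(c).digest() for c in chunk_fixed(data, chunk_size)]
    return hashlib.sha256(b"".join(leaf_hashes)).hexdigest()


data = bytes(range(256)) * 64  # 16 KiB of sample content

root_a = toy_root_hash(data, chunk_size=1024)
root_b = toy_root_hash(data, chunk_size=4096)

# Options stored alongside the encoded file (hypothetical layout).
recorded_options = {"chunk_size": 1024}
root_a_again = toy_root_hash(data, **recorded_options)

# Same bytes with different options give different roots;
# same bytes with the recorded options reproduce the root.
print(root_a == root_b, root_a == root_a_again)  # prints: False True
```

Recording the options only helps while both sides interpret them identically, which is the alignment problem described above.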
“Inline” files and directories
For small files and directories the benefits of de-duplication are often outweighed by the cost of retrieving additional blocks.
There are also use cases, like websites, where it may be highly beneficial to inline certain data into the root block of the directory tree for faster early rendering.
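The trade-off can be sketched in a few lines of Python. In this toy model, files at or below a size threshold are embedded directly in the directory node, and larger files are referenced by hash. The `build_directory` function, the `inline` and `link` keys, and the 64-byte threshold are all assumptions made for illustration and do not reflect any real UnixFS layout.

```python
import hashlib
from typing import Dict, Union

INLINE_THRESHOLD = 64  # bytes; an arbitrary value chosen for this sketch

Entry = Dict[str, Union[bytes, str]]


def build_directory(files: Dict[str, bytes], threshold: int = INLINE_THRESHOLD) -> Dict[str, Entry]:
    """Build a toy directory node: small files are embedded, large ones linked by hash."""
    entries: Dict[str, Entry] = {}
    for name, content in sorted(files.items()):
        if len(content) <= threshold:
            entries[name] = {"inline": content}
        else:
            entries[name] = {"link": hashlib.sha256(content).hexdigest()}
    return entries


site = {
    "index.html": b"<h1>hello</h1>",
    "video.bin": b"\x00" * 10_000,
}
directory = build_directory(site)

# One fetch for the directory block itself, plus one per linked file.
blocks_to_fetch = 1 + sum("link" in entry for entry in directory.values())
print(blocks_to_fetch)  # prints: 2
```

Here `index.html` arrives with the root block and can be rendered immediately, while `video.bin` still deduplicates as a separate block.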
Support for non-utf8 Filenames
Seeking in large directories
It’s often necessary to paginate through large directories and the current implementations do not easily support this.
Question: Given that you can only paginate through a randomized ordering using the current sharding data structure, how useful would this be without ordered collections?
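One way to picture the question: cursor-based pagination works over any stable ordering, including an ordering by name hash, but the pages it returns are not in an order a user would recognize. The Python sketch below is a toy under that assumption. `hash_key` and `list_page` are hypothetical helpers that sort by SHA-256 of the entry name to loosely mimic hash-based sharding; a real implementation would walk the sharded structure rather than re-sort the whole listing on every call.

```python
import hashlib
from typing import List, Optional, Tuple


def hash_key(name: str) -> str:
    """Order entries by the hash of their name, loosely mimicking hash-based sharding."""
    return hashlib.sha256(name.encode("utf-8")).hexdigest()


def list_page(names: List[str], limit: int, cursor: Optional[str] = None) -> Tuple[List[str], Optional[str]]:
    """Return up to `limit` names whose hash key follows `cursor`, plus the next cursor."""
    ordered = sorted(names, key=hash_key)
    if cursor is not None:
        ordered = [n for n in ordered if hash_key(n) > cursor]
    page = ordered[:limit]
    # A next cursor exists only when entries remain beyond this page.
    next_cursor = hash_key(page[-1]) if len(ordered) > limit else None
    return page, next_cursor


names = [f"file-{i:03d}.txt" for i in range(10)]
seen: List[str] = []
cursor = None
while True:
    page, cursor = list_page(names, limit=4, cursor=cursor)
    seen.extend(page)
    if cursor is None:
        break
```

Every entry is visited exactly once, in hash order. Whether that is useful without an ordered collection depends on whether callers need a meaningful order or only need to bound the size of each response.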
Symlinks
Protobuf Performance
While I’ve heard people say on numerous occasions that dag-pb performance is an issue (compared to dag-cbor), I can’t find any good links or resources on what the real impact of this is.
Miscellaneous
- Size fields to keep or potentially remove
- Support for other hash linked filesystems
- Slicing chunks
- Comment: “Our plan is to switch to rabin (or similar), CIDv1, raw leaves, UnixFSv2 etc. all in one go.”
- UnixFSv2 spike in IPLD Schema
Meeting Notes: August 8th 2019
- performance things
  - issues with old unixfs hamt
    - batching issues
    - fans out at the bottom way too fast
    - really deep tree even in cases where that's unnecessary
- questions about external information we can feed into priorities
  - some other major user stories about high level apis have also come up...
    - it's hard to add directories to ipfs currently without re-scanning all files... incremental adds wanted
      - this is very much edge tooling and not unixfsv2 asks
    - "we took everything that was blocked on unixfsv2 off our q3 list"
    - doesn't mean we don't still want it, just choose to route elsewhere in other teams :)
    - ... additional comments about "these workaround are terrible"
- generation style versioning?
- more worried about changes to things like rabin chunking than anything else
  - moves (cancels dedup of) vast amounts of the data
  - changing metadata much lighter comparatively (still not free)
- some kinds of data might be easier to maintain read of and maybe that's useful?
  - e.g. concatenating all the bytes in a [][]byte is easy, even if chunker to write it changed
- worth mentioning that dir list order in most existing filesystems isn't... really specified.
  - you can't seek it -- there are no syscalls for that.
- anyone wanna talk about attribs?
  - https://gist.github.com/warpfork/3948bd951e93c0f0b4e355d78b736f83
  - we should ping djdv on this as well