| 1 | +--- |
| 2 | +feature: git-hashing |
| 3 | +start-date: 2022-08-27 |
| 4 | +author: John Ericson (@Ericson2314) on behalf of [Obsidian Systems](https://obsidian.systems) |
| 5 | +co-authors: (find a buddy later to help out with the RFC) |
| 6 | +shepherd-team: edolstra, kevincox, gador, @amjoseph-nixpkgs |
| 7 | +shepherd-leader: amjoseph-nixpkgs |
| 8 | +related-issues: (will contain links to implementation PRs) |
| 9 | +--- |
| 10 | + |
| 11 | +# Summary |
| 12 | +[summary]: #summary |
| 13 | + |
| 14 | +Integrate Git hashing with Nix. |
| 15 | + |
| 16 | +Nix should support content-addressed store objects using Git blob + tree hashing, and Nix-unaware remote stores that serve Git objects. |
| 17 | + |
| 18 | +This follows the work done and described in https://github.com/obsidiansystems/ipfs-nix-guide/ . |
| 19 | + |
| 20 | +# Motivation |
| 21 | +[motivation]: #motivation |
| 22 | + |
| 23 | +## Binary distribution |
| 24 | + |
| 25 | +Currently distributing Nix binaries takes a lot of bandwidth and storage. |
| 26 | +This is a barrier to being a Nix user in areas of slower internet --- which includes the vast majority of the world's population at this time. |
| 27 | +This is also a barrier to users running their own caches. |
| 28 | + |
| 29 | +Content-addressing opens up a *huge* design space of solutions to get around such problems. |
| 30 | + |
| 31 | +The first steps proposed below do *not* tackle this problem directly, but they lay the groundwork for future experiments in this direction. |
| 32 | + |
| 33 | +## Source distribution and archival |
| 34 | + |
| 35 | +Source code used by Nix expressions frequently goes off-line. It would be beneficial if there were some resistance to this form of bitrot. |
| 36 | +The Software Heritage archive stores much of the source code that Nix expressions use. They would be a natural partner in this effort. |
| 37 | + |
| 38 | +Unfortunately, as https://www.tweag.io/blog/2020-06-18-software-heritage/ describes at the end, a major challenge is the way Nix content-addresses software. |
| 39 | +First of all, Nix hashes sources in bespoke ways that no other project will adopt. |
| 40 | +Second of all, hashing tarballs instead of the underlying files makes the hash depend on non-normative details (compression, odd permissions, etc.). |
| 41 | + |
| 42 | +We should natively support Git file hashing, which is supported both by Git repos and Software Heritage. |
| 43 | +This removes both issues: the hash is one that other tools already compute, and it covers the files themselves rather than an archive of them. |
| 44 | + |
| 45 | +Overall, we are building out a uniform way to work with source code, regardless of its origins or the exact tools involved. |
| 46 | + |
| 47 | +# Detailed design |
| 48 | +[design]: #detailed-design |
| 49 | + |
| 50 | +Each item can be done separately provided its dependent items are also done. |
| 51 | +These are the items we wish to commit to at this time. |
| 52 | +(The goals mentioned under [future work](#future-work) are, in a separate document, also broken down into a dependency graph of smaller steps.) |
| 53 | + |
| 54 | +## Git file hashing |
| 55 | + |
| 56 | +- **Purpose**: Source distribution and archival |
| 57 | + |
| 58 | +In addition to the various forms of content-addressing Nix supports today ("text", "fixed" with either "flat" or "nar" serialization of file system objects), Nix should support Git hashing. |
| 59 | +This support entails two basic things: |
| 60 | + |
| 61 | + - Content addresses are used to compute store paths. |
| 62 | + - Content addresses are used to verify store object integrity. |
| 63 | + |
| 64 | +Git hashing would not (in this first proposed version) support references, since references in Nix's sense are not part of Git's data model. |
| 65 | +This is OK for now; encoding references is not needed for the intended initial use-case of exchanging source code. |
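
As a rough illustration of what "Git hashing" means for blobs and trees, here is a minimal Python sketch. It assumes Git's SHA-1 object format; the function names are hypothetical and are not part of Nix or of this RFC.

```python
import hashlib


def git_object_hash(kind: bytes, payload: bytes) -> bytes:
    """SHA-1 over the header '<kind> <size>' + NUL byte, followed by the payload."""
    header = kind + b" " + str(len(payload)).encode() + b"\0"
    return hashlib.sha1(header + payload).digest()


def git_blob_hash(content: bytes) -> bytes:
    """Hash of a file's contents as a Git blob."""
    return git_object_hash(b"blob", content)


def git_tree_hash(entries: dict[str, tuple[str, bytes]]) -> bytes:
    """Hash of a directory; `entries` maps name -> (mode, raw 20-byte child hash)."""

    def sort_key(item):
        name, (mode, _) = item
        # Git orders tree entries by name, comparing directories
        # as if their name ended in '/'.
        return name.encode() + (b"/" if mode == "40000" else b"")

    payload = b"".join(
        mode.encode() + b" " + name.encode() + b"\0" + child
        for name, (mode, child) in sorted(entries.items(), key=sort_key)
    )
    return git_object_hash(b"tree", payload)


# The empty blob and the empty tree have well-known hashes.
print(git_blob_hash(b"").hex())  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(git_tree_hash({}).hex())   # 4b825dc642cb6eb9a060e54bf8d69288fbee4904
```

Note that the executable bit appears only in the `mode` of a tree entry, never in the blob itself.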
| 66 | + |
| 67 | +## Git file hashing for `builtins.fetch*` |
| 68 | + |
| 69 | +- **Purpose**: Source distribution and archival |
| 70 | +- **Depends on**: Git file hashing |
| 71 | + |
| 72 | +The built-in fetchers can also be made to work with Git file hashing just as they support the other types. |
| 73 | +In addition, Git repo fetching can leverage this better than the other formats, since the data in Git repos is already content-addressed in this way. |
| 74 | + |
| 75 | +## Nix-agnostic content-addressing "stores" |
| 76 | + |
| 77 | +- **Purpose**: All distribution |
| 78 | + |
| 79 | +We want to be able to substitute from an arbitrary store (in the general, non-Nix sense) of content-addressed objects. |
| 80 | +For the purpose of this RFC, that means querying objects by Git hash, and being able to trust the results because we can verify them against the Git hash. |
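
The trust argument can be sketched in a few lines of Python. This is only an illustration: the "remote store" is modelled as a plain dictionary, `substitute_blob` is a hypothetical name, and Git's SHA-1 object format is assumed.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Git blob hash (SHA-1 object format assumed), as a hex string."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def substitute_blob(remote: dict[str, bytes], wanted: str) -> bytes:
    """Fetch a blob by Git hash from an untrusted store and verify it."""
    content = remote[wanted]
    actual = git_blob_hash(content)
    if actual != wanted:
        # The store lied or the data was corrupted; reject the result.
        raise ValueError(f"hash mismatch: wanted {wanted}, got {actual}")
    return content
```

Because the key is the hash of the content, the store itself never needs to be trusted or to know anything about Nix.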
| 81 | + |
| 82 | +In the implementation, we could accomplish this in a variety of ways. |
| 83 | + |
| 84 | +- On one extreme, we could have a `ContentAddressedSubstitutor` abstract interface completely separate from Nix's `Store` interface. |
| 85 | + |
| 86 | +- On the other extreme, we can generalize `Store` itself to allow taking content addresses or store paths as references. |
| 87 | + |
| 88 | +Exactly how this shakes out is to be determined post-RFC, but it would be nice to use Nix-agnostic persistent methods with `--store` and `--substituters`. |
| 89 | + |
| 90 | +If we do go the route of modifying the `Store` class, note that these things will need to happen: |
| 91 | + |
| 92 | + - Many store interface methods that today take store paths will need to also accept name and content address pairs. |
| 93 | + |
| 94 | + For stores that are purpose-built for Nix, like the ones we support today, all addressing can be done with store paths, so the current interface is fine. |
| 95 | + But for Nix-agnostic stores, store paths are rather useless as a key type because Nix-agnostic tools don't know about them. |
| 96 | + Those stores can, however, understand content addresses. |
| 97 | + And from such a name + content address, we can always produce a store path again, so there is no loss of functionality with existing stores. |
| 98 | + |
| 99 | +- Relax `ValidPathInfo` to merely require that *either* the pair of `NarHash` and `NarSize` or just `CA` alone be defined. |
| 100 | + |
| 101 | + As described in the first step, currently `NarHash` and `NarSize` are the *normative* fields which are used to verify a store object. |
| 102 | + But if the store object is content-addressed, we don't need these, because the content address (`CA` field) will also suffice, all by itself. |
| 103 | + |
| 104 | + Existing Nix store types are still required to contain a `NarHash` and `NarSize`, which is good for backwards compatibility and doesn't come with a cost. |
| 105 | + Only new Nix-agnostic store types would take advantage of these new, relaxed rules. |
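
The relaxed `ValidPathInfo` rule amounts to a simple invariant. The Python sketch below is a hypothetical stand-in, not Nix's actual C++ class; the field names and the example content-address string are illustrative only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PathInfo:
    """Hypothetical stand-in for `ValidPathInfo`; field names are illustrative."""

    nar_hash: Optional[str] = None
    nar_size: Optional[int] = None
    ca: Optional[str] = None

    def __post_init__(self):
        # Relaxed rule: either the (nar_hash, nar_size) pair or ca must be set.
        has_nar = self.nar_hash is not None and self.nar_size is not None
        if not (has_nar or self.ca is not None):
            raise ValueError("need either (nar_hash, nar_size) or ca")


# A Nix-agnostic store could describe an object by content address alone.
PathInfo(ca="example-content-address")
# An existing store type keeps supplying the NAR fields as before.
PathInfo(nar_hash="example-nar-hash", nar_size=120)
```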
| 106 | + |
| 107 | +# Examples and Interactions |
| 108 | +[examples-and-interactions]: #examples-and-interactions |
| 109 | + |
| 110 | +We encourage anyone interested to check our tutorial in https://github.com/obsidiansystems/ipfs-nix-guide/ which demonstrates the above functionality. |
| 111 | +Note that, at the time of writing, this guide uses our original 2020 fork of Nix. |
| 112 | + |
| 113 | +# Drawbacks |
| 114 | +[drawbacks]: #drawbacks |
| 115 | + |
| 116 | +## Complexity |
| 117 | + |
| 118 | +The main cost is added complexity in the store layer. |
| 119 | +For a few reasons we think this is not so bad. |
| 120 | + |
| 121 | +Most important is the division of the work into a dependency graph of steps. |
| 122 | +This allows us to slowly try out things like IPFS that leverage Git hashing, and not commit to more change than we want to up front. |
| 123 | + |
| 124 | +Even if we do end up adopting everything though, we think for the following two reasons the complexity can still be kept manageable: |
| 125 | + |
| 126 | +1. Per the abstract vs concrete model of the Nix store in https://github.com/NixOS/nix/pull/6877, everything we are doing is simply fleshing out alternative interpretations of the abstract model. |
| 127 | + This is the sense in which we are, per the Scheme mantra, "removing the weaknesses and restrictions that make additional features appear necessary": |
| 128 | + Instead of extending the model with new features, we are relaxing concrete model assumptions (e.g. references are always opaque store paths) while keeping the abstract model the same. |
| 129 | + |
| 130 | +2. We also support plans to decouple the layers of Nix further, and update our educational and marketing material to reflect it. |
| 131 | + Layering will "divide and conquer" the project so that the interfaces between layers are still rigorously enforced, preventing a combinatorial explosion in complexity. |
| 132 | + That frees up "complexity budget" for projects like this. |
| 133 | + |
| 134 | +## Git and Nix's file system data models do not entirely coincide |
| 135 | + |
| 136 | +Nix puts the permission info of a file (executable bit for now) with that file, whereas Git puts it with the name and hash in the directory. |
| 137 | +The practical effect of this discrepancy is that a root file (as opposed to directory) in Nix has permission info, but does not in Git. |
| 138 | + |
| 139 | +If we are trying to convert existing Nix data into Git, this is a problem. |
| 140 | +Assuming we treat "no permission bits" as meaning "non-executable", we will have a partial conversion that will fail on executable files without a parent directory. |
| 141 | +Tricks like always wrapping everything in a directory get around this, but then we have to be careful the directory is exactly as expected when "unwrapping" in the other direction. |
| 142 | + |
| 143 | +For now, we only focus on ingesting data *from* Git *to* Nix, and this side-steps the issue. |
| 144 | +That mapping is total, i.e. all Git data can be mapped, and injective, i.e. distinct pieces of Git data map to distinct pieces of Nix data (though not surjective, i.e. not all Nix data can be represented as Git data), and so there is no problem for now. |
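
The asymmetry for root files can be made concrete with a small Python sketch. The types and function names are hypothetical, and only regular files are modelled (symlinks, directories, and other Git entry modes are left out).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NixFile:
    contents: bytes
    executable: bool  # Nix keeps the permission with the file itself


def git_entry_to_nix(mode: str, contents: bytes) -> NixFile:
    """Git -> Nix for a file inside a tree: the mode comes from the tree entry."""
    if mode == "100644":
        return NixFile(contents, executable=False)
    if mode == "100755":
        return NixFile(contents, executable=True)
    raise ValueError(f"mode {mode} is not a regular file")


def git_root_blob_to_nix(contents: bytes) -> NixFile:
    """Git -> Nix for a root blob: no tree entry, so treat it as non-executable."""
    return NixFile(contents, executable=False)


def nix_root_file_to_git(f: NixFile) -> Optional[bytes]:
    """Nix -> Git for a root file: partial, there is nowhere to record the bit."""
    return None if f.executable else f.contents
```

Every root blob maps to some `NixFile`, but an executable root `NixFile` has no Git counterpart, which is why the Nix-to-Git direction is deferred.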
| 145 | + |
| 146 | +# Alternatives |
| 147 | +[alternatives]: #alternatives |
| 148 | + |
| 149 | +The dependency graph of steps can be sliced to save some for future work. |
| 150 | +For now they are all written together, but during the RFC meetings we will decide which steps (if any) to ratify now, and which steps to save for later. |
| 151 | + |
| 152 | +# Unresolved questions |
| 153 | +[unresolved]: #unresolved-questions |
| 154 | + |
| 155 | +None at this time. |
| 156 | + |
| 157 | +# Future work |
| 158 | +[future]: #future-work |
| 159 | + |
| 160 | +- Integrate with outside content-addressing storage/transmission like |
| 161 | + |
| 162 | + - The Software Heritage archive |
| 163 | + |
| 164 | + - IPFS |