Commit f470655

Ericson2314, kevincox, Adam Joseph, edolstra, and lheckemann authored
[RFC 0133] Git hashing and Git-hashing-based remote stores (#133)
* ipfs: Copy Template
* ipfs: Start drafting
* ipfs: Finish draft
* ipfs: Expand discussion of managing complexity
* ipfs: Fix typos
  Thanks!
* ipfs: Fix more typos
  Thanks!
* ipfs: FInish motivation on source distribution and archival
* ipfs: Rename now that we have number
* Apply suggestions from code review
  Thanks!
  Co-authored-by: Kevin Cox
* Fix typos
  Thanks!
  Co-authored-by: Adam Joseph
* 133: Add shepherd team!
  Co-authored-by: Eelco Dolstra
* 133: Fix shepherds list mjoseph -> amjoseph
* 133: Move non-`git` steps to future work
* 133: Move one more section out of future work
* 133: Move IPFS-specific motivation to future work too
* 133: Rename feature in light of changes
* 133: Rename RFC in light of changes
* 133: Discuss the downside of git's file system model being different
* Split future work, clean up Nix-agnostic stores section
* Fix numerious typos
  Thanks, all of you!
  Co-authored-by: Kevin Cox
  Co-authored-by: Adam Joseph
  Co-authored-by: Linus Heckemann
* Add RFC open PR date
* Be clearer about not supporting references to start
* Update rfcs/0133-git-hashing.md
  Co-authored-by: Kevin Cox
* Rip out both RFC-scal Future Work sections
  They are now in an `ipfs-2` branch in this repo.
* Remove "Build adoption through seamless interop"
  That can go in a separate blog post.
* Apply suggestions from code review
  Thank you both!!
  Co-authored-by: Valentin Gagarin
  Co-authored-by: Ryan Lahfa
* Slim down the layering section
  The other stuff is already in flight, we don't need to talk about it so much here.
  Co-authored-by: Valentin Gagarin

---------

Co-authored-by: Kevin Cox
Co-authored-by: Adam Joseph
Co-authored-by: Eelco Dolstra
Co-authored-by: Linus Heckemann
Co-authored-by: Valentin Gagarin
Co-authored-by: Ryan Lahfa
1 parent 8c86187 commit f470655

1 file changed

+164
-0
lines changed

rfcs/0133-git-hashing.md

Lines changed: 164 additions & 0 deletions
@@ -0,0 +1,164 @@
1+
---
2+
feature: git-hashing
3+
start-date: 2022-08-27
4+
author: John Ericson (@Ericson2314) on behalf of [Obsidian Systems](https://obsidian.systems)
5+
co-authors: (find a buddy later to help out with the RFC)
6+
shepherd-team: edolstra, kevincox, gador, @amjoseph-nixpkgs
7+
shepherd-leader: amjoseph-nixpkgs
8+
related-issues: (will contain links to implementation PRs)
9+
---
10+
11+
# Summary
12+
[summary]: #summary
13+
14+
Integrate Git hashing with Nix.
15+
16+
Nix should support content-addressed store objects using git blob + tree hashing, and Nix-unaware remote stores that serve git objects.
17+
18+
This follows the work done and described in https://github.com/obsidiansystems/ipfs-nix-guide/ .
19+
20+
# Motivation
21+
[motivation]: #motivation
22+
23+
## Binary distribution
24+
25+
Currently distributing Nix binaries takes a lot of bandwidth and storage.
26+
This is a barrier to being a Nix user in areas of slower internet --- which includes the vast majority of the world's population at this time.
27+
This is also a barrier to users running their own caches.
28+
29+
Content-addressing opens up a *huge* design space of solutions to get around such problems.
30+
31+
The first steps proposed below do *not* tackle this problem directly, but they lay the groundwork for future experiments in this direction.
32+
33+
## Source distribution and archival
34+
35+
Source code used by Nix expressions frequently goes off-line. It would be beneficial if there was some resistance to this form of bitrot.
36+
The Software Heritage archive stores much of the source code that Nix expressions use. They would be a natural partner in this effort.
37+
38+
Unfortunately, as https://www.tweag.io/blog/2020-06-18-software-heritage/ describes at the end, a major challenge is the way Nix content-addresses software.
39+
First of all, Nix hashes sources in bespoke ways that no other project will adopt.
40+
Second of all, hashing tarballs instead of the underlying files makes the hash depend on non-normative details (compression, odd permissions, etc.).
41+
42+
We should natively support Git file hashing, which is supported both by Git repos and Software Heritage.
43+
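This sidesteps both issues: the hash is one that other projects already compute, and it covers the underlying files rather than a tarball encoding of them.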
44+
45+
Overall, we are building out a uniform way to work with source code, regardless of its origins or the exact tools involved.
46+
47+
# Detailed design
48+
[design]: #detailed-design
49+
50+
Each item can be done separately, provided the items it depends on are also done.
51+
These are the items we wish to commit to at this time.
52+
(The goals mentioned under [future work](#future-work) are, in a separate document, also broken down into a dependency graph of smaller steps.)
53+
54+
## Git file hashing
55+
56+
- **Purpose**: Source distribution and archival
57+
58+
In addition to the various forms of content-addressing Nix supports today ("text", "fixed" with either "flat" or "nar" serialization of file system objects), Nix should support Git hashing.
59+
This support entails two basic things:
60+
61+
- Content addresses are used to compute store paths.
62+
- Content addresses are used to verify store object integrity.
63+
64+
Git hashing would not (in this first proposed version) support references, since references in Nix's sense are not part of Git's data model.
65+
This is OK for now; encoding references is not needed for the intended initial use-case of exchanging source code.
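
As a rough illustration of what "Git hashing" means here, the following Python sketch computes Git blob and tree object IDs by hand. It assumes Git's SHA-1 object format; the function names are ours and are not part of Nix or of this proposal.

```python
import hashlib


def git_object_id(kind: str, payload: bytes) -> str:
    """Hash '<kind> <length>\\0<payload>' with SHA-1, as Git does for loose objects."""
    header = f"{kind} {len(payload)}".encode() + b"\0"
    return hashlib.sha1(header + payload).hexdigest()


def blob_id(content: bytes) -> str:
    """Object ID of a file's contents (a Git blob)."""
    return git_object_id("blob", content)


def tree_id(entries: list[tuple[str, str, str]]) -> str:
    """Object ID of a directory (a Git tree).

    Each entry is (mode, name, hex object id), e.g. ("100644", "README", "...").
    Modes are written as Git stores them in trees: "100644" for a regular file,
    "100755" for an executable file, "40000" for a subdirectory.
    """

    def sort_key(entry: tuple[str, str, str]) -> bytes:
        mode, name, _ = entry
        # Git orders directory names as if they ended in '/'.
        return (name + "/" if mode == "40000" else name).encode()

    payload = b"".join(
        f"{mode} {name}".encode() + b"\0" + bytes.fromhex(oid)
        for mode, name, oid in sorted(entries, key=sort_key)
    )
    return git_object_id("tree", payload)


if __name__ == "__main__":
    print(blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 (Git's empty blob)
    print(tree_id([("100644", "hello.txt", blob_id(b"hello world\n"))]))
```

Note that the executable bit appears only in the tree entry, not in the blob, and that nothing in either format has room for Nix-style references.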
66+
67+
## Git file hashing for `builtins.fetch*`
68+
69+
- **Purpose**: Source distribution and archival
70+
- **Depends on**: Git file hashing
71+
72+
The built-in fetchers can also be made to work with Git file hashing just as they support the other types.
73+
In addition, Git repo fetching can leverage this better than the other formats, since the data in Git repos is already content-addressed in this way.
74+
75+
## Nix-agnostic content-addressing "stores"
76+
77+
- **Purpose**: All distribution
78+
79+
We want to be able to substitute from an arbitrary store (in the general, non-Nix sense) of content-addressed objects.
80+
For the purpose of this RFC, that means querying objects by Git hash, and being able to trust the results because we can verify them against the Git hash.
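
The trust argument can be sketched in a few lines of Python. Everything here is hypothetical: `UntrustedBlobStore` stands in for any Nix-unaware remote keyed by Git blob hash, and `substitute` is an illustrative name, not a Nix API. The sketch assumes Git's SHA-1 object format and handles blobs only.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Git blob hash: SHA-1 over 'blob <length>\\0<content>'."""
    header = f"blob {len(content)}".encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


class UntrustedBlobStore:
    """Hypothetical remote that knows nothing about Nix, only Git blob hashes."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, content: bytes) -> str:
        oid = git_blob_id(content)
        self._objects[oid] = content
        return oid

    def get(self, oid: str) -> bytes:
        return self._objects[oid]


def substitute(store: UntrustedBlobStore, oid: str) -> bytes:
    """Fetch an object by Git hash and accept it only if it hashes back to that key."""
    content = store.get(oid)
    if git_blob_id(content) != oid:
        raise ValueError(f"store returned data that does not match {oid}")
    return content
```

Because the key is the hash of the data, the client never needs to trust the store itself: a wrong or tampered answer is detected locally.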
81+
82+
In the implementation, we could accomplish this in a variety of ways.
83+
84+
- On one extreme, we could have a `ContentAddressedSubstitutor` abstract interface completely separate from Nix's `Store` interface.
85+
86+
- On the other extreme, we can generalize `Store` itself to allow taking content addresses or store paths as references.
87+
88+
Exactly how this shakes out is to be determined post-RFC, but it would be nice to use Nix-agnostic persistent methods with `--store` and `--substituters`.
89+
90+
If we do go the route of modifying the `Store` class, note that these things will need to happen:
91+
92+
- Many store interface methods that today take store paths will need to also accept name and content address pairs.
93+
94+
For stores that are purpose-built for Nix, like the ones we support today, all addressing can be done with store paths, so the current interface is fine.
95+
But for Nix-agnostic stores, store paths are rather useless as a key type because Nix-agnostic tools don't know about them.
96+
Those stores can, however, understand content addresses.
97+
And from such a name + content address, we can always produce a store path again, so there is no loss of functionality with existing stores.
98+
99+
- Relax `ValidPathInfo` to merely require that *either* the pair of `NarHash` and `NarSize` or just `CA` alone be defined.
100+
101+
As described in the first step, currently `NarHash` and `NarSize` are the *normative* fields which are used to verify a store object.
102+
But if the store object is content-addressed, we don't need these, because the content address (`CA` field) will also suffice, all by itself.
103+
104+
Existing Nix store types are still required to contain a `NarHash` and `NarSize`, which is good for backwards compatibility and does not come with a cost.
105+
Only new Nix-agnostic store types would take advantage of these new, relaxed rules.
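
The relaxed rule can be pictured with a small Python sketch. This is not Nix's actual `ValidPathInfo` (which is a C++ class with more fields); `PathInfoSketch` and its field names are invented here purely to show the "either the NAR pair or `CA` alone" invariant.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PathInfoSketch:
    """Hypothetical stand-in for the verification-related fields of a path info record."""

    nar_hash: Optional[str] = None  # placeholder for the `NarHash` field
    nar_size: Optional[int] = None  # placeholder for the `NarSize` field
    ca: Optional[str] = None        # placeholder for the `CA` (content address) field

    def __post_init__(self) -> None:
        has_nar_pair = self.nar_hash is not None and self.nar_size is not None
        # Relaxed rule: the NAR pair is sufficient, and so is a content address alone.
        if not has_nar_pair and self.ca is None:
            raise ValueError("need either (nar_hash, nar_size) or ca")


# An existing Nix-specific store would keep filling in the NAR pair.
classic = PathInfoSketch(nar_hash="placeholder-nar-hash", nar_size=1234)

# A Nix-agnostic store could supply only the content address.
agnostic = PathInfoSketch(ca="placeholder-content-address")
```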
106+
107+
# Examples and Interactions
108+
[examples-and-interactions]: #examples-and-interactions
109+
110+
We encourage anyone interested to check our tutorial in https://github.com/obsidiansystems/ipfs-nix-guide/ which demonstrates the above functionality.
111+
Note that, at the time of writing, this guide uses our original 2020 fork of Nix.
112+
113+
# Drawbacks
114+
[drawbacks]: #drawbacks
115+
116+
## Complexity
117+
118+
The main cost is added complexity in the store layer.
119+
For a few reasons we think this is not so bad.
120+
121+
Most important is the division of the work into a dependency graph of steps.
122+
This allows us to slowly try out things like IPFS that leverage Git hashing, and not commit to more change than we want to up front.
123+
124+
Even if we do end up adopting everything though, we think for the following two reasons the complexity can still be kept manageable:
125+
126+
1. Per the abstract vs concrete model of the Nix store in https://github.com/NixOS/nix/pull/6877, everything we are doing is simply fleshing out alternative interpretations of the abstract model.
127+
This is the sense in which we are, per the Scheme mantra, "removing the weaknesses and restrictions that make additional features appear necessary":
128+
Instead of extending the model with new features, we are relaxing concrete model assumptions (e.g. references are always opaque store paths) while keeping the abstract model the same.
129+
130+
2. We also support plans to decouple the layers of Nix further, and update our educational and marketing material to reflect it.
131+
Layering will "divide and conquer" the project so that the interfaces between layers remain rigorously enforced, preventing a combinatorial explosion in complexity.
132+
That frees up "complexity budget" for projects like this.
133+
134+
## Git and Nix's file system data models do not entirely coincide
135+
136+
Nix puts the permission info of a file (executable bit for now) with that file, whereas Git puts it with the name and hash in the directory.
137+
The practical effect of this discrepancy is that a root file (as opposed to directory) in Nix has permission info, but does not in Git.
138+
139+
If we are trying to convert existing Nix data into Git, this is a problem.
140+
Assuming we treat "no permission bits" as meaning "non-executable", we will have a partial conversion that will fail on executable files without a parent directory.
141+
Tricks like always wrapping everything in a directory get around this, but then we have to be careful the directory is exactly as expected when "unwrapping" in the other direction.
142+
143+
For now, we only focus on ingesting data *from* Git *to* Nix, and this side-steps the issue.
144+
That mapping is total, i.e. all Git data can be mapped, and injective, i.e. distinct pieces of Git data map to distinct pieces of Nix data (though not surjective, i.e. not all Nix data can be represented as Git data), so there is no problem for now.
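
The mismatch can be made concrete with a simplified Python model of the two data models. All type and function names below are illustrative, symlinks are left out, and "no permission bits" on a root blob is assumed to mean "non-executable", as discussed above.

```python
from dataclasses import dataclass
from typing import Dict, Union


# Nix-style model: the executable bit lives on the file itself.
@dataclass
class NixFile:
    contents: bytes
    executable: bool


@dataclass
class NixDir:
    entries: Dict[str, "NixNode"]


NixNode = Union[NixFile, NixDir]


# Git-style model: the executable bit lives in the parent's directory entry.
@dataclass
class GitBlob:
    contents: bytes


@dataclass
class GitTreeEntry:
    executable: bool  # only meaningful when `node` is a blob
    node: "GitNode"


@dataclass
class GitTree:
    entries: Dict[str, GitTreeEntry]


GitNode = Union[GitBlob, GitTree]


def git_to_nix(node: GitNode, executable: bool = False) -> NixNode:
    """Total: every Git object maps to a Nix object (a root blob becomes non-executable)."""
    if isinstance(node, GitBlob):
        return NixFile(node.contents, executable)
    return NixDir(
        {name: git_to_nix(e.node, e.executable) for name, e in node.entries.items()}
    )


def nix_to_git(node: NixNode, is_root: bool = True) -> GitNode:
    """Partial: an executable root file has nowhere to record its permission bit."""
    if isinstance(node, NixFile):
        if is_root and node.executable:
            raise ValueError("executable root file has no Git representation")
        return GitBlob(node.contents)
    return GitTree(
        {
            name: GitTreeEntry(
                isinstance(child, NixFile) and child.executable,
                nix_to_git(child, is_root=False),
            )
            for name, child in node.entries.items()
        }
    )
```

In this model `nix_to_git(git_to_nix(x)) == x` for every Git object `x`, while `nix_to_git(NixFile(b"...", executable=True))` raises, which is the partiality described above.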
145+
146+
# Alternatives
147+
[alternatives]: #alternatives
148+
149+
The dependency graph of steps can be sliced to save some for future work.
150+
For now they are all written together, but during the RFC meetings we will decide which steps (if any) to ratify now, and which steps to save for later.
151+
152+
# Unresolved questions
153+
[unresolved]: #unresolved-questions
154+
155+
None at this time.
156+
157+
# Future work
158+
[future]: #future-work
159+
160+
- Integrate with outside content-addressing storage/transmission like
161+
162+
- The Software Heritage archive
163+
164+
- IPFS
