@@ -127,6 +127,197 @@ Design Details
127
127
helpful for these clones, anyway. The commit-graph will not be read or
128
128
written when shallow commits are present.
129
129
130
+ Commit Graph Chains
131
+ -------------------
132
+
133
+ Typically, repos grow with near-constant velocity (commits per day). Over time,
134
+ the number of commits added by a fetch operation is much smaller than the
135
+ number of commits in the full history. By creating a "chain" of commit-graphs,
136
+ we enable fast writes of new commit data without rewriting the entire commit
137
+ history -- at least, most of the time.
138
+
139
+ ## File Layout
140
+
141
+ A commit-graph chain uses multiple files, and we use a fixed naming convention
142
+ to organize these files. Each commit-graph file has a name
143
+ `$OBJDIR/info/commit-graphs/graph-{hash}.graph` where `{hash}` is the hex-
144
+ valued hash stored in the footer of that file (which is a hash of the file's
145
+ contents before that hash). For a chain of commit-graph files, a plain-text
146
+ file at `$OBJDIR/info/commit-graphs/commit-graph-chain` contains the
147
+ hashes for the files in order from "lowest" to "highest".
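The naming convention above can be illustrated with a minimal Python sketch. The helper name `graph_paths` and its arguments are hypothetical, chosen only for illustration; it assumes the chain file holds one hex hash per line, lowest layer first.

```python
from pathlib import PurePosixPath


def graph_paths(objdir, chain_text):
    """Turn the text of a commit-graph-chain file into graph file paths.

    The returned list is ordered like the chain file: lowest layer first.
    """
    graphs_dir = PurePosixPath(objdir) / "info" / "commit-graphs"
    hashes = [line.strip() for line in chain_text.splitlines() if line.strip()]
    return [graphs_dir / f"graph-{h}.graph" for h in hashes]
```

The sketch only computes paths; it does not check that the files exist or that each footer hash matches the file name.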
148
+
149
+ For example, if the `commit-graph-chain` file contains the lines
150
+
151
+ ```
152
+ {hash0}
153
+ {hash1}
154
+ {hash2}
155
+ ```
156
+
157
+ then the commit-graph chain looks like the following diagram:
158
+
159
+ +-----------------------+
160
+ | graph-{hash2}.graph |
161
+ +-----------------------+
162
+ |
163
+ +-----------------------+
164
+ | |
165
+ | graph-{hash1}.graph |
166
+ | |
167
+ +-----------------------+
168
+ |
169
+ +-----------------------+
170
+ | |
171
+ | |
172
+ | |
173
+ | graph-{hash0}.graph |
174
+ | |
175
+ | |
176
+ | |
177
+ +-----------------------+
178
+
179
+ Let X0 be the number of commits in `graph-{hash0}.graph`, X1 be the number of
180
+ commits in `graph-{hash1}.graph`, and X2 be the number of commits in
181
+ `graph-{hash2}.graph`. If a commit appears in position i in `graph-{hash2}.graph`,
182
+ then we interpret this as being the commit in position (X0 + X1 + i), and that
183
+ will be used as its "graph position". The commits in `graph-{hash2}.graph` use these
184
+ positions to refer to their parents, which may be in `graph-{hash1}.graph` or
185
+ `graph-{hash0}.graph`. We can navigate to an arbitrary commit in position j by checking
186
+ its containment in the intervals [0, X0), [X0, X0 + X1), [X0 + X1, X0 + X1 +
187
+ X2).
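As an illustration of the interval check, here is a small Python sketch that maps a global graph position to a layer and a position within that layer. The function name `locate` and the `layer_sizes` list (commit counts from the base layer upward, such as `[X0, X1, X2]`) are assumptions made for this example.

```python
from bisect import bisect_right
from itertools import accumulate


def locate(position, layer_sizes):
    """Map a global graph position to (layer index, position in that layer)."""
    # Cumulative upper bounds of each interval: [X0, X0 + X1, X0 + X1 + X2].
    ends = list(accumulate(layer_sizes))
    total = ends[-1] if ends else 0
    if not 0 <= position < total:
        raise IndexError(f"graph position {position} is outside [0, {total})")
    # The first interval whose upper bound exceeds the position contains it.
    layer = bisect_right(ends, position)
    start = ends[layer - 1] if layer else 0
    return layer, position - start
```

With layer sizes `[5, 3, 2]`, position 7 falls in the interval [5, 8), so it resolves to layer 1 at local position 2.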
188
+
189
+ Each commit-graph file (except the base, `graph-{hash0}.graph`) contains data
190
+ specifying the hashes of all files in the lower layers. In the above example,
191
+ `graph-{hash1}.graph` contains `{hash0}` while `graph-{hash2}.graph` contains
192
+ `{hash0}` and `{hash1}`.
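A reader could use this data to check that a chain is self-consistent. The sketch below only computes which hashes each layer is expected to record, given the chain order; the name `expected_base_hashes` is hypothetical, and how the hashes are actually stored inside each file is not modeled.

```python
def expected_base_hashes(chain):
    """Map each layer's hash to the hashes of all layers below it.

    `chain` lists the layer hashes from lowest to highest.
    """
    return {layer: chain[:index] for index, layer in enumerate(chain)}
```

For the chain `{hash0}`, `{hash1}`, `{hash2}`, the base layer maps to an empty list and the top layer maps to `{hash0}` and `{hash1}`.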
193
+
194
+ ## Merging commit-graph files
195
+
196
+ If we only added a new commit-graph file on every write, we would run into a
197
+ linear search problem through many commit-graph files. Instead, we use a merge
198
+ strategy to decide when the stack should collapse some number of levels.
199
+
200
+ The diagram below shows such a collapse. As a set of new commits is added, the
201
+ merge strategy determines that the files should collapse to
202
+ `graph-{hash1}`. Thus, the new commits, the commits in `graph-{hash2}` and
203
+ the commits in `graph-{hash1}` should be combined into a new `graph-{hash3}`
204
+ file.
205
+
206
+ +---------------------+
207
+ | |
208
+ | (new commits) |
209
+ | |
210
+ +---------------------+
211
+ | |
212
+ +-----------------------+ +---------------------+
213
+ | graph-{hash2} |->| |
214
+ +-----------------------+ +---------------------+
215
+ | | |
216
+ +-----------------------+ +---------------------+
217
+ | | | |
218
+ | graph-{hash1} |->| |
219
+ | | | |
220
+ +-----------------------+ +---------------------+
221
+ | tmp_graphXXX
222
+ +-----------------------+
223
+ | |
224
+ | |
225
+ | |
226
+ | graph-{hash0} |
227
+ | |
228
+ | |
229
+ | |
230
+ +-----------------------+
231
+
232
+ During this process, the commits to write are combined and sorted, and the
233
+ result is written to a temporary file, all while holding a `commit-graph-chain.lock`
234
+ lock-file. When the file is flushed, we rename it to `graph-{hash3}`
235
+ according to the computed `{hash3}`. Finally, we write the new chain data to
236
+ `commit-graph-chain.lock`:
237
+
238
+ ```
239
+ {hash0}
240
+ {hash3}
241
+ ```
242
+
243
+ We then close the lock-file.
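The lock-file step can be sketched in Python as follows. This is an illustration under assumptions, not the actual implementation: it assumes that closing the lock-file publishes the new chain by renaming `commit-graph-chain.lock` over `commit-graph-chain`, and the helper name `write_chain` is hypothetical.

```python
import os


def write_chain(graphs_dir, hashes):
    """Write the chain hashes (lowest layer first) under a lock-file."""
    chain = os.path.join(graphs_dir, "commit-graph-chain")
    lock = chain + ".lock"
    # O_EXCL makes creation fail if another writer already holds the lock.
    fd = os.open(lock, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    try:
        with os.fdopen(fd, "w") as f:
            f.write("".join(h + "\n" for h in hashes))
        # Assumption: the new chain becomes visible via an atomic rename.
        os.replace(lock, chain)
    except BaseException:
        os.unlink(lock)  # release the lock without touching the old chain
        raise
```

Because the rename is atomic, a concurrent reader sees either the old chain or the new one, never a partially written file.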
244
+
245
+ ## Merge Strategy
246
+
247
+ When writing a set of commits that do not exist in the commit-graph stack of
248
+ height N, we default to creating a new file at level N + 1. We then decide to
249
+ merge with the Nth level if one of two conditions holds:
250
+
251
+ 1. `--size-multiple=<X>` is specified or X = 2, and the number of commits in
252
+ level N is less than X times the number of commits in level N + 1.
253
+
254
+ 2. `--max-commits=<C>` is specified with non-zero C and the number of commits
255
+ in level N + 1 is more than C commits.
256
+
257
+ This decision cascades down the levels: when we merge a level, we create a new
258
+ set of commits that is then compared to the next level.
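The cascading decision can be sketched in Python as shown below. The function name `plan_merge` is hypothetical, and the sketch makes a simplifying assumption: the merged commit count is the plain sum of the layer sizes, ignoring any commits that might appear in more than one set. The defaults follow the values given in this document (2 and 64,000); passing 0 for `max_commits` disables the second condition.

```python
def plan_merge(layer_sizes, new_commits, size_multiple=2, max_commits=64000):
    """Decide how many top layers to merge with a set of new commits.

    `layer_sizes` lists commit counts from the base layer upward.
    Returns (number of existing layers merged, commits in the new tip).
    """
    tip = new_commits
    remaining = list(layer_sizes)
    while remaining:
        below = remaining[-1]
        # Condition 1: the level below is less than X times the new level.
        too_close = below < size_multiple * tip
        # Condition 2: the new level holds more than C commits (C non-zero).
        too_big = max_commits > 0 and tip > max_commits
        if not (too_close or too_big):
            break
        # Merge, then compare the combined set against the next level down.
        tip += remaining.pop()
    return len(layer_sizes) - len(remaining), tip
```

For layers of 1000, 100 and 10 commits, adding 6 new commits merges only the top layer (10 < 2 * 6), giving a tip of 16 commits; the cascade stops there because 100 is not less than 2 * 16.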
259
+
260
+ The first condition bounds the number of levels to be logarithmic in the total
261
+ number of commits. The second condition bounds the total number of commits in
262
+ a `graph-{hashN}` file and not in the `commit-graph` file, preventing
263
+ significant performance issues when the stack merges and another process only
264
+ partially reads the previous stack.
265
+
266
+ The merge strategy values (2 for the size multiple, 64,000 for the maximum
267
+ number of commits) could be extracted into config settings for full
268
+ flexibility.
269
+
270
+ ## Deleting graph-{hash} files
271
+
272
+ After a new tip file is written, some `graph-{hash}` files may no longer
273
+ be part of a chain. It is important to remove these files from disk, eventually.
274
+ The main reason to delay removal is that another process could read the
275
+ `commit-graph-chain` file before it is rewritten, but then look for the
276
+ `graph-{hash}` files after they are deleted.
277
+
278
+ To allow holding old split commit-graphs for a while after they are unreferenced,
279
+ we update the modified times of the files when they become unreferenced. Then,
280
+ we scan the `$OBJDIR/info/commit-graphs/` directory for `graph-{hash}`
281
+ files whose modified times are older than a given expiry window. This window
282
+ defaults to zero, but can be changed using command-line arguments or a config
283
+ setting.
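A minimal Python sketch of the expiry scan follows. The helper name `expired_graph_files` and its parameters are assumptions for illustration; it treats a file as expired when its modified time is at or before the current time minus the window, so a window of zero expires every unreferenced file.

```python
import os
import time


def expired_graph_files(graphs_dir, chain_hashes, expiry_seconds=0, now=None):
    """List unreferenced graph-{hash}.graph files older than the window."""
    now = time.time() if now is None else now
    referenced = {f"graph-{h}.graph" for h in chain_hashes}
    expired = []
    for name in sorted(os.listdir(graphs_dir)):
        if not (name.startswith("graph-") and name.endswith(".graph")):
            continue  # not a split commit-graph file
        if name in referenced:
            continue  # still part of the current chain
        path = os.path.join(graphs_dir, name)
        if os.path.getmtime(path) <= now - expiry_seconds:
            expired.append(path)
    return expired
```

The sketch only reports candidates; deleting them, and touching files when they become unreferenced, is left to the caller.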
284
+
285
+ ## Chains across multiple object directories
286
+
287
+ In a repo with alternates, we look for the `commit-graph-chain` file starting
288
+ in the local object directory and then in each alternate. The first file that
289
+ exists defines our chain. As we look for the `graph-{hash}` files for
290
+ each `{hash}` in the chain file, we follow the same pattern for the host
291
+ directories.
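The lookup order can be sketched in Python as follows. The function name `locate_chain_files` is hypothetical; `object_dirs` is assumed to list the local object directory first, followed by each alternate.

```python
import os


def locate_chain_files(object_dirs):
    """Find the chain and the object directory holding each graph file.

    Returns a list of (hash, object directory or None) pairs, lowest layer
    first, or None when no object directory has a chain file.
    """
    hashes = None
    for objdir in object_dirs:
        chain = os.path.join(objdir, "info", "commit-graphs", "commit-graph-chain")
        if os.path.exists(chain):
            # The first chain file that exists defines the chain.
            with open(chain) as f:
                hashes = f.read().split()
            break
    if hashes is None:
        return None
    located = []
    for h in hashes:
        rel = os.path.join("info", "commit-graphs", f"graph-{h}.graph")
        # Same pattern: local object directory first, then each alternate.
        home = next(
            (d for d in object_dirs if os.path.exists(os.path.join(d, rel))),
            None,
        )
        located.append((h, home))
    return located
```

A `None` entry marks a graph file that could not be found in any object directory, which a caller would treat as a broken chain.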
292
+
293
+ This allows commit-graphs to be split across multiple forks in a fork network.
294
+ The typical case is a large "base" repo with many smaller forks.
295
+
296
+ As the base repo advances, it will likely update and merge its commit-graph
297
+ chain more frequently than the forks. If a fork updates its commit-graph after
298
+ the base repo, then it should "reparent" the commit-graph chain onto the new
299
+ chain in the base repo. When reading each `graph-{hash}` file, we track
300
+ the object directory containing it. During a write of a new commit-graph file,
301
+ we check for any changes in the source object directory and read the
302
+ `commit-graph-chain` file for that source and create a new file based on those
303
+ files. During this "reparent" operation, we must collapse all
304
+ levels in the fork, as all of the files are invalid against the new base file.
305
+
306
+ It is crucial to be careful when cleaning up "unreferenced" `graph-{hash}.graph`
307
+ files in this scenario. It falls to the user to define the proper settings for
308
+ their custom environment:
309
+
310
+ 1. When merging levels in the base repo, the unreferenced files may still be
311
+ referenced by chains from fork repos.
312
+
313
+ 2. The expiry time should be set to a length of time such that every fork has
314
+ time to recompute their commit-graph chain to "reparent" onto the new base
315
+ file(s).
316
+
317
+ 3. If the commit-graph chain is updated in the base, the fork will not have
318
+ access to the new chain until its chain is updated to reference those files.
319
+ (This may change in the future [5].)
320
+
130
321
Related Links
131
322
-------------
132
323
[0] https://bugs.chromium.org/p/git/issues/detail?id=8
@@ -153,3 +344,7 @@ Related Links
153
344
154
345
[4] https://public-inbox.org/git/
[email protected] /T/#u
155
346
A patch to remove the ahead-behind calculation from 'status'.
347
+
348
+ [5] https://public-inbox.org/git/
[email protected] /
349
+ A discussion of a "two-dimensional graph position" that can allow reading
350
+ multiple commit-graph chains at the same time.