Write commit-graph on fetch #328

gitgitgadget · 2019-09-03T19:08:14Z

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Derrick Stolee via GitGitGadget" <[email protected]> writes: > diff --git a/builtin/fetch.c b/builtin/fetch.c > index 53ce99d2bb..d36a403859 100644 > --- a/builtin/fetch.c > +++ b/builtin/fetch.c > @@ -23,6 +23,7 @@ > #include "packfile.h" > #include "list-objects-filter-options.h" > #include "commit-reach.h" > +#include "commit-graph.h" > > #define FORCED_UPDATES_DELAY_WARNING_IN_MS (10 * 1000) > > @@ -1715,6 +1716,20 @@ int cmd_fetch(int argc, const char **argv, const char *prefix) > > string_list_clear(&list, 0); > > + prepare_repo_settings(the_repository); > + if (the_repository->settings.fetch_write_commit_graph) { > + int commit_graph_flags = COMMIT_GRAPH_SPLIT; > + struct split_commit_graph_opts split_opts; > + memset(&split_opts, 0, sizeof(struct split_commit_graph_opts)); > + > + if (progress) > + commit_graph_flags |= COMMIT_GRAPH_PROGRESS; > + > + write_commit_graph_reachable(get_object_directory(), > + commit_graph_flags, > + &split_opts); > + } As a low-impact change this is good. For longer term, it feels a bit unfortunate that this is still a separate phase of the program, though. We know what new refs we added, we know what new objects we received, and we even scanned each and every one of them while running the index-pack step to store the .pack and compute the .idx file, i.e. it feels that we have most of the information already in-core to extend the commit graph for new parts of the history we just received. Thanks.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 9/3/2019 3:05 PM, Junio C Hamano wrote: > "Derrick Stolee via GitGitGadget" <[email protected]> writes: > >> diff --git a/builtin/fetch.c b/builtin/fetch.c >> index 53ce99d2bb..d36a403859 100644 >> --- a/builtin/fetch.c >> +++ b/builtin/fetch.c >> @@ -23,6 +23,7 @@ >> #include "packfile.h" >> #include "list-objects-filter-options.h" >> #include "commit-reach.h" >> +#include "commit-graph.h" >> >> #define FORCED_UPDATES_DELAY_WARNING_IN_MS (10 * 1000) >> >> @@ -1715,6 +1716,20 @@ int cmd_fetch(int argc, const char **argv, const char *prefix) >> >> string_list_clear(&list, 0); >> >> + prepare_repo_settings(the_repository); >> + if (the_repository->settings.fetch_write_commit_graph) { >> + int commit_graph_flags = COMMIT_GRAPH_SPLIT; >> + struct split_commit_graph_opts split_opts; >> + memset(&split_opts, 0, sizeof(struct split_commit_graph_opts)); >> + >> + if (progress) >> + commit_graph_flags |= COMMIT_GRAPH_PROGRESS; >> + >> + write_commit_graph_reachable(get_object_directory(), >> + commit_graph_flags, >> + &split_opts); >> + } > > As a low-impact change this is good. > > For longer term, it feels a bit unfortunate that this is still a > separate phase of the program, though. We know what new refs we > added, we know what new objects we received, and we even scanned > each and every one of them while running the index-pack step to > store the .pack and compute the .idx file, i.e. it feels that we > have most of the information already in-core to extend the commit > graph for new parts of the history we just received. You're right that we could isolate the new write to the refs we just received. We could use the more cumbersome write_commit_graph() method with a list of commit oids as starting points. I'm happy to make that change if we see a lot of value there. However, the current patch also gives us a way to add local refs to the commit graph and "catch up" to the work the user had done. With this in mind, I do think the simpler code has another functional advantage. Thanks, -Stolee

On the Git mailing list, Junio C Hamano wrote (reply to this):

Derrick Stolee <[email protected]> writes: >>> + write_commit_graph_reachable(get_object_directory(), >>> + commit_graph_flags, >>> + &split_opts); >>> + } >> >> As a low-impact change this is good. >> >> For longer term, it feels a bit unfortunate that this is still a >> separate phase of the program, though. We know what new refs we >> added, we know what new objects we received, and we even scanned >> each and every one of them while running the index-pack step to >> store the .pack and compute the .idx file, i.e. it feels that we >> have most of the information already in-core to extend the commit >> graph for new parts of the history we just received. > > You're right that we could isolate the new write to the refs we > just received. We could use the more cumbersome write_commit_graph() > method with a list of commit oids as starting points. I'm happy to > make that change if we see a lot of value there. Well, that is not the kind of information reuse I am talking about. I was wondering if "index-pack" has enough information in-core after it receives and processes the incoming pack data, scanned each and every object in it, in order to write out the commit graph _without_ having to do a lot of duplicate computation and enumeration of the objects done in the current commit-graph.c::write_commit_graph(), so that it can learn a "--write-commit-graph" option that performs much better than running "git fetch && git commit-graph write". Perhaps that would require too much refactoring of both index-pack and commit-graph code and infeasible in short to medium term and that is why I said "for longer term, it feels a bit unfortunate". Thanks.

On the Git mailing list, Jeff King wrote (reply to this):

On Fri, Sep 06, 2019 at 02:46:22PM -0700, Junio C Hamano wrote: > > You're right that we could isolate the new write to the refs we > > just received. We could use the more cumbersome write_commit_graph() > > method with a list of commit oids as starting points. I'm happy to > > make that change if we see a lot of value there. > > Well, that is not the kind of information reuse I am talking about. > > I was wondering if "index-pack" has enough information in-core after > it receives and processes the incoming pack data, scanned each and > every object in it, in order to write out the commit graph _without_ > having to do a lot of duplicate computation and enumeration of the > objects done in the current commit-graph.c::write_commit_graph(), so > that it can learn a "--write-commit-graph" option that performs much > better than running "git fetch && git commit-graph write". > > Perhaps that would require too much refactoring of both index-pack > and commit-graph code and infeasible in short to medium term and > that is why I said "for longer term, it feels a bit unfortunate". I think the basic metadata should be easy. We have each commit expanded in memory at some point, so parsing it and filing away its parents, trees, etc isn't too hard. Generation numbers are a little trickier, though, because they imply an actual topological traversal. It might actually be easier to couple this with the connectivity check we do after index-pack finishes (though I've often wondered if we could drop that check in favor of making index-pack smarter about finding the boundaries). -Peff

On the Git mailing list, Junio C Hamano wrote (reply to this):

Jeff King <[email protected]> writes: > Generation numbers are a little trickier, though, because they imply an > actual topological traversal. It might actually be easier to couple this > with the connectivity check we do after index-pack finishes (though I've > often wondered if we could drop that check in favor of making index-pack > smarter about finding the boundaries). Currently after scanning the incoming objects with the fsck machinery, we count the number of objects that are pointed at by these objects in the pack and are not in the pack in the builtin/index-pack.c::check_object() function; at that point, we no longer know who points at the object in question, which is needed if we want to compute the boundary, so we need a bit of work here. With boundary information, we could be smarter about lazy fetching, I presume?

gitgitgadget · 2019-09-04T03:11:33Z

On the Git mailing list, Jeff King wrote (reply to this):

On Mon, Sep 02, 2019 at 07:22:02PM -0700, Derrick Stolee via GitGitGadget wrote: > From: Derrick Stolee <[email protected]> > > The commit-graph feature is now on by default, and is being > written during 'git gc' by default. Typically, Git only writes > a commit-graph when a 'git gc --auto' command passes the gc.auto > setting to actualy do work. This means that a commit-graph will > typically fall behind the commits that are being used every day. > > To stay updated with the latest commits, add a step to 'git > fetch' to write a commit-graph after fetching new objects. The > fetch.writeCommitGraph config setting enables writing a split > commit-graph, so on average the cost of writing this file is > very small. Occasionally, the commit-graph chain will collapse > to a single level, and this could be slow for very large repos. > > For additional use, adjust the default to be true when > feature.experimental is enabled. Seems like a good time to do it. We'd eventually want a similar option for updating it on the receiving side of a push, too. I don't insist that come at the same time, but we should at least have a plan about how it will look to the user. Do we want to to have fetch.writeCommitGraph, receive.writeCommitGraph, and then a master transfer.writeCommitGraph? -Peff

On the Git mailing list, Junio C Hamano wrote (reply to this):

Jeff King <[email protected]> writes: > Do we want to to have fetch.writeCommitGraph, receive.writeCommitGraph, > and then a master transfer.writeCommitGraph? If anything, it may be good for consistency. I am not sure if it is a good idea to trigger writing the commit graph when accepting a push, though. It tends to be a lot finer grained than fetching, right?

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 9/5/2019 4:37 PM, Junio C Hamano wrote: > Jeff King <[email protected]> writes: > >> Do we want to to have fetch.writeCommitGraph, receive.writeCommitGraph, >> and then a master transfer.writeCommitGraph? > > If anything, it may be good for consistency. > > I am not sure if it is a good idea to trigger writing the commit > graph when accepting a push, though. It tends to be a lot finer > grained than fetching, right? And I expect a push to include many fewer commits than a fetch. In a server environment, I would expect to have a separate maintenance task responsible for updating the commit-graph after receiving new data, but not in an in-line fashion with the push. Think about the situation of many pushes that happen in a short burst: one write after all of the pushes would have close to the same performance benefits as writing with every push, but does a lot less work. Thanks, -Stolee

On the Git mailing list, Jeff King wrote (reply to this):

On Fri, Sep 06, 2019 at 01:00:40PM -0400, Derrick Stolee wrote: > On 9/5/2019 4:37 PM, Junio C Hamano wrote: > > Jeff King <[email protected]> writes: > > > >> Do we want to to have fetch.writeCommitGraph, receive.writeCommitGraph, > >> and then a master transfer.writeCommitGraph? > > > > If anything, it may be good for consistency. > > > > I am not sure if it is a good idea to trigger writing the commit > > graph when accepting a push, though. It tends to be a lot finer > > grained than fetching, right? > > And I expect a push to include many fewer commits than a fetch. > In a server environment, I would expect to have a separate > maintenance task responsible for updating the commit-graph after > receiving new data, but not in an in-line fashion with the push. That's probably how we'll end up doing it at GitHub, because we run a big server farm that has a job-queueing system for periodic maintenance, etc. But keep in mind the "little guy" who is hosting a few repositories for themselves or their company. They'd presumably want the benefits here, too, right? We already have options to trigger auto-gc and update-server-info for them, so why not this maintenance, too? > Think about the situation of many pushes that happen in a short > burst: one write after all of the pushes would have close to the > same performance benefits as writing with every push, but does > a lot less work. Sure, but wouldn't that similarly apply to fetching? What is it that makes bursts of pushes more likely than bursts of fetches? -Peff

On the Git mailing list, Junio C Hamano wrote (reply to this):

Jeff King <[email protected]> writes: > Sure, but wouldn't that similarly apply to fetching? What is it that > makes bursts of pushes more likely than bursts of fetches? Because people tend to use a repository as a gathering point? You may periodically fetch from and push to a repository, and you may even do so at the same interval, but simply because there are more "other" people than you alone as a single developer in the project, your fetch tends to grab work from more people than yoru push that publish your work alone?

On the Git mailing list, Junio C Hamano wrote (reply to this):

Jeff King <[email protected]> writes: > I suppose so. But I think the "stock git without any other job > infrastructure" case would still benefit. Oh, no question about it. I did question if an automatic writing would benefit the side that receives a push when you brought up the usual "fetch.* and receive.* for separate configuration, transfer.* when want to rule them both" convention, which is a good idea if only for consistency, but the question was if it helps the receiving end of a push to the same degree as it would help the repository that fetches.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 9/6/2019 4:42 PM, Junio C Hamano wrote: > Jeff King <[email protected]> writes: > >> I suppose so. But I think the "stock git without any other job >> infrastructure" case would still benefit. > > Oh, no question about it. > > I did question if an automatic writing would benefit the side that > receives a push when you brought up the usual "fetch.* and receive.* > for separate configuration, transfer.* when want to rule them both" > convention, which is a good idea if only for consistency, but the > question was if it helps the receiving end of a push to the same > degree as it would help the repository that fetches. I think the "stock git without any other job infrastructure" is a very important scenario. Putting the simplest version of "commit-graph writes in-line with every push" seems to be ripe for failure under load. I'd rather think deeply about what is best for this scenario. Spit-balling: what if we used the same mechanism as running GC in the background to launch 'git commit-graph write' commands? And we could have the command give up (without failure) if the commit-graph.lock file is already created, so concurrent pushes would not fight each other. Thanks, -Stolee

On the Git mailing list, Junio C Hamano wrote (reply to this):

Derrick Stolee <[email protected]> writes: > On 9/6/2019 4:42 PM, Junio C Hamano wrote: >> Jeff King <[email protected]> writes: >> >>> I suppose so. But I think the "stock git without any other job >>> infrastructure" case would still benefit. >> >> Oh, no question about it. >> >> I did question if an automatic writing would benefit the side that >> receives a push when you brought up the usual "fetch.* and receive.* >> for separate configuration, transfer.* when want to rule them both" >> convention, which is a good idea if only for consistency, but the >> question was if it helps the receiving end of a push to the same >> degree as it would help the repository that fetches. > > I think the "stock git without any other job infrastructure" is > a very important scenario. Putting the simplest version of > "commit-graph writes in-line with every push" seems to be ripe > for failure under load. I'd rather think deeply about what is > best for this scenario. As to what to do on the push side, I suppose we can afford to let it simmer in the back of our heads while moving this topic forward. Whether we'd later decide to have receive.writeCommitGraph (in which case we would add transfer.writeCommitGraph, too) or not, this change on the fetch side can independently be used, right? Thanks.

On the Git mailing list, Jeff King wrote (reply to this):

On Fri, Sep 06, 2019 at 05:04:17PM -0400, Derrick Stolee wrote: > On 9/6/2019 4:42 PM, Junio C Hamano wrote: > > Jeff King <[email protected]> writes: > > > >> I suppose so. But I think the "stock git without any other job > >> infrastructure" case would still benefit. > > > > Oh, no question about it. > > > > I did question if an automatic writing would benefit the side that > > receives a push when you brought up the usual "fetch.* and receive.* > > for separate configuration, transfer.* when want to rule them both" > > convention, which is a good idea if only for consistency, but the > > question was if it helps the receiving end of a push to the same > > degree as it would help the repository that fetches. > > I think the "stock git without any other job infrastructure" is > a very important scenario. Putting the simplest version of > "commit-graph writes in-line with every push" seems to be ripe > for failure under load. I'd rather think deeply about what is > best for this scenario. If it's going to be a problem under load, that makes me worry about the same thing for fetches. Whether you see a lot of either depends on your workflow. But as long as neither option is enabled by default (or at least if it becomes common knowledge to _disable_ them if you have a high rate), it may be OK. I do agree that the normal "busy" repos you and I have both seen in enterprise settings (where people are literally pushing multiple times per second at peak) would want some kind of throttling. But I think there's a long tail of "push once a week" repos. > Spit-balling: what if we used the same mechanism as running GC > in the background to launch 'git commit-graph write' commands? > And we could have the command give up (without failure) if the > commit-graph.lock file is already created, so concurrent pushes > would not fight each other. I have mixed feelings. It's nice not to stand in the critical path of the user, but background tasks have a way of finding funny corner cases (e.g., packfiles moving around, or the issues we've had with deciding when to shut down auto-gc for a grace period due to warnings/errors). I think since this is generally additive (adding new commit-graph files), it might be less finicky, though. -Peff

On the Git mailing list, Jeff King wrote (reply to this):

On Fri, Sep 06, 2019 at 02:57:55PM -0700, Junio C Hamano wrote: > > I think the "stock git without any other job infrastructure" is > > a very important scenario. Putting the simplest version of > > "commit-graph writes in-line with every push" seems to be ripe > > for failure under load. I'd rather think deeply about what is > > best for this scenario. > > As to what to do on the push side, I suppose we can afford to let it > simmer in the back of our heads while moving this topic forward. > Whether we'd later decide to have receive.writeCommitGraph (in which > case we would add transfer.writeCommitGraph, too) or not, this > change on the fetch side can independently be used, right? Yeah, I'm OK with proceeding without the receive-pack side for now. -Peff

-Original file line number
+Diff line change
@@ Expand Up @@
     +
     * `fetch.negotiationAlgorithm=skipping` may improve fetch negotiation times by
     skipping more commits at a time, reducing the number of round trips.
+    +
+    * `fetch.writeCommitGraph=true` writes a commit-graph after every `git fetch`
+    command that downloads a pack-file from a remote. Using the `--split` option,
+    most executions will create a very small commit-graph file on top of the
+    existing commit-graph file(s). Occasionally, these files will merge and the
+    write may take longer. Having an updated commit-graph file helps performance
+    of many Git commands, including `git merge-base`, `git push -f`, and
+    `git log --graph`.
     feature.manyFiles::
     	Enable config options that optimize for repos with many files in the
@@ Expand Down @@

-Original file line number
+Diff line change
@@ Expand Up / @@ -69,3 +69,13 @@ fetch.showForcedUpdates:: @@
     	Set to false to enable `--no-show-forced-updates` in
     	linkgit:git-fetch[1] and linkgit:git-pull[1] commands.
     	Defaults to true.
+    fetch.writeCommitGraph::
+    	Set to true to write a commit-graph after every `git fetch` command
+    	that downloads a pack-file from a remote. Using the `--split` option,
+    	most executions will create a very small commit-graph file on top of
+    	the existing commit-graph file(s). Occasionally, these files will
+    	merge and the write may take longer. Having an updated commit-graph
+    	file helps performance of many Git commands, including `git merge-base`,
+    	`git push -f`, and `git log --graph`. Defaults to false, unless
+    	`feature.experimental` is true.

-Original file line number
+Diff line change
@@ Expand Up / @@ -49,10 +49,14 @@ void prepare_repo_settings(struct repository *r) @@
     		UPDATE_DEFAULT_BOOL(r->settings.index_version, 4);
     		UPDATE_DEFAULT_BOOL(r->settings.core_untracked_cache, UNTRACKED_CACHE_WRITE);
     	}
+    	if (!repo_config_get_bool(r, "fetch.writecommitgraph", &value))
+    		r->settings.fetch_write_commit_graph = value;
     	if (!repo_config_get_bool(r, "feature.experimental", &value) && value) {
     		UPDATE_DEFAULT_BOOL(r->settings.pack_use_sparse, 1);
     		UPDATE_DEFAULT_BOOL(r->settings.fetch_negotiation_algorithm, FETCH_NEGOTIATION_SKIPPING);
+    		UPDATE_DEFAULT_BOOL(r->settings.fetch_write_commit_graph, 1);
     	}
+    	UPDATE_DEFAULT_BOOL(r->settings.fetch_write_commit_graph, 0);
     	/* Hack for test programs like test-dump-untracked-cache */
     	if (ignore_untracked_cache_config)
@@ Expand Down @@

-Original file line number
+Diff line change
@@ Expand Up / @@ -30,6 +30,7 @@ struct repo_settings { @@
     	int core_commit_graph;
     	int gc_write_commit_graph;
+    	int fetch_write_commit_graph;
     	int index_version;
     	enum untracked_cache_setting core_untracked_cache;
@@ Expand Down @@

-Original file line number
+Diff line change
@@ Expand Up @@
     	)
     '
+    test_expect_success 'fetch.writeCommitGraph' '
+    	git clone three write &&
+    	(
+    		cd three &&
+    		test_commit new
+    	) &&
+    	(
+    		cd write &&
+    		git -c fetch.writeCommitGraph fetch origin &&
+    		test_path_is_file .git/objects/info/commit-graphs/commit-graph-chain
+    	)
+    '
     # configured prune tests
     set_config_tristate () {
@@ Expand Down @@

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Write commit-graph on fetch #328

Uh oh!

Diff view

Diff view

There are no files selected for viewing

gitgitgadget bot Sep 3, 2019

Uh oh!

gitgitgadget bot Sep 3, 2019

Uh oh!

gitgitgadget bot Sep 6, 2019

Uh oh!

gitgitgadget bot Sep 7, 2019

Uh oh!

gitgitgadget bot Sep 9, 2019

Uh oh!

gitgitgadget bot Sep 4, 2019

Uh oh!

gitgitgadget bot Sep 5, 2019

Uh oh!

gitgitgadget bot Sep 6, 2019

Uh oh!

gitgitgadget bot Sep 6, 2019

Uh oh!

gitgitgadget bot Sep 6, 2019

Uh oh!

gitgitgadget bot Sep 6, 2019

Uh oh!

gitgitgadget bot Sep 6, 2019

Uh oh!

gitgitgadget bot Sep 6, 2019

Uh oh!

gitgitgadget bot Sep 7, 2019

Uh oh!

gitgitgadget bot Sep 7, 2019

Uh oh!

-Original file line number
+Diff line change
@@ Expand Up / @@ -23,6 +23,7 @@ @@
     #include "packfile.h"
     #include "list-objects-filter-options.h"
     #include "commit-reach.h"
+    #include "commit-graph.h"
     #define FORCED_UPDATES_DELAY_WARNING_IN_MS (10 * 1000)
@@ Expand Down Expand Up @@
     	string_list_clear(&list, 0);
+    	prepare_repo_settings(the_repository);
+    	if (the_repository->settings.fetch_write_commit_graph) {
+    		int commit_graph_flags = COMMIT_GRAPH_SPLIT;
+    		struct split_commit_graph_opts split_opts;
+    		memset(&split_opts, 0, sizeof(struct split_commit_graph_opts));
+    		if (progress)
+    			commit_graph_flags |= COMMIT_GRAPH_PROGRESS;
+    		write_commit_graph_reachable(get_object_directory(),
+    					     commit_graph_flags,
+    					     &split_opts);
+    	}
     	close_object_store(the_repository->objects);
     	if (enable_auto_gc) {
@@ Expand Down @@

Write commit-graph on fetch #328

Uh oh!

Write commit-graph on fetch #328

Uh oh!

Uh oh!

Diff view

Diff view

There are no files selected for viewing

gitgitgadget bot Sep 3, 2019

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot Sep 3, 2019

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot Sep 6, 2019

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot Sep 7, 2019

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot Sep 9, 2019

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot Sep 4, 2019

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot Sep 5, 2019

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot Sep 6, 2019

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot Sep 6, 2019

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot Sep 6, 2019

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot Sep 6, 2019

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot Sep 6, 2019

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot Sep 6, 2019

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot Sep 7, 2019

Choose a reason for hiding this comment

Uh oh!

gitgitgadget bot Sep 7, 2019

Choose a reason for hiding this comment

Uh oh!