Skip to content

Write commit-graph on fetch #328

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 8 additions & 0 deletions Documentation/config/feature.txt
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,14 @@ which can improve `git push` performance in repos with many files.
+
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Derrick Stolee via GitGitGadget" <[email protected]> writes:

> diff --git a/builtin/fetch.c b/builtin/fetch.c
> index 53ce99d2bb..d36a403859 100644
> --- a/builtin/fetch.c
> +++ b/builtin/fetch.c
> @@ -23,6 +23,7 @@
>  #include "packfile.h"
>  #include "list-objects-filter-options.h"
>  #include "commit-reach.h"
> +#include "commit-graph.h"
>  
>  #define FORCED_UPDATES_DELAY_WARNING_IN_MS (10 * 1000)
>  
> @@ -1715,6 +1716,20 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
>  
>  	string_list_clear(&list, 0);
>  
> +	prepare_repo_settings(the_repository);
> +	if (the_repository->settings.fetch_write_commit_graph) {
> +		int commit_graph_flags = COMMIT_GRAPH_SPLIT;
> +		struct split_commit_graph_opts split_opts;
> +		memset(&split_opts, 0, sizeof(struct split_commit_graph_opts));
> +
> +		if (progress)
> +			commit_graph_flags |= COMMIT_GRAPH_PROGRESS;
> +
> +		write_commit_graph_reachable(get_object_directory(),
> +					     commit_graph_flags,
> +					     &split_opts);
> +	}

As a low-impact change this is good.  

For longer term, it feels a bit unfortunate that this is still a
separate phase of the program, though.  We know what new refs we
added, we know what new objects we received, and we even scanned
each and every one of them while running the index-pack step to
store the .pack and compute the .idx file, i.e. it feels that we
have most of the information already in-core to extend the commit
graph for new parts of the history we just received.

Thanks.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 9/3/2019 3:05 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <[email protected]> writes:
> 
>> diff --git a/builtin/fetch.c b/builtin/fetch.c
>> index 53ce99d2bb..d36a403859 100644
>> --- a/builtin/fetch.c
>> +++ b/builtin/fetch.c
>> @@ -23,6 +23,7 @@
>>  #include "packfile.h"
>>  #include "list-objects-filter-options.h"
>>  #include "commit-reach.h"
>> +#include "commit-graph.h"
>>  
>>  #define FORCED_UPDATES_DELAY_WARNING_IN_MS (10 * 1000)
>>  
>> @@ -1715,6 +1716,20 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
>>  
>>  	string_list_clear(&list, 0);
>>  
>> +	prepare_repo_settings(the_repository);
>> +	if (the_repository->settings.fetch_write_commit_graph) {
>> +		int commit_graph_flags = COMMIT_GRAPH_SPLIT;
>> +		struct split_commit_graph_opts split_opts;
>> +		memset(&split_opts, 0, sizeof(struct split_commit_graph_opts));
>> +
>> +		if (progress)
>> +			commit_graph_flags |= COMMIT_GRAPH_PROGRESS;
>> +
>> +		write_commit_graph_reachable(get_object_directory(),
>> +					     commit_graph_flags,
>> +					     &split_opts);
>> +	}
> 
> As a low-impact change this is good.  
> 
> For longer term, it feels a bit unfortunate that this is still a
> separate phase of the program, though.  We know what new refs we
> added, we know what new objects we received, and we even scanned
> each and every one of them while running the index-pack step to
> store the .pack and compute the .idx file, i.e. it feels that we
> have most of the information already in-core to extend the commit
> graph for new parts of the history we just received.

You're right that we could isolate the new write to the refs we
just received. We could use the more cumbersome write_commit_graph()
method with a list of commit oids as starting points. I'm happy to
make that change if we see a lot of value there.

However, the current patch also gives us a way to add local refs
to the commit graph and "catch up" to the work the user had done.
With this in mind, I do think the simpler code has another functional
advantage.

Thanks,
-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

Derrick Stolee <[email protected]> writes:

>>> +		write_commit_graph_reachable(get_object_directory(),
>>> +					     commit_graph_flags,
>>> +					     &split_opts);
>>> +	}
>> 
>> As a low-impact change this is good.  
>> 
>> For longer term, it feels a bit unfortunate that this is still a
>> separate phase of the program, though.  We know what new refs we
>> added, we know what new objects we received, and we even scanned
>> each and every one of them while running the index-pack step to
>> store the .pack and compute the .idx file, i.e. it feels that we
>> have most of the information already in-core to extend the commit
>> graph for new parts of the history we just received.
>
> You're right that we could isolate the new write to the refs we
> just received. We could use the more cumbersome write_commit_graph()
> method with a list of commit oids as starting points. I'm happy to
> make that change if we see a lot of value there.

Well, that is not the kind of information reuse I am talking about.

I was wondering if "index-pack" has enough information in-core after
it receives and processes the incoming pack data, scanned each and
every object in it, in order to write out the commit graph _without_
having to do a lot of duplicate computation and enumeration of the
objects done in the current commit-graph.c::write_commit_graph(), so
that it can learn a "--write-commit-graph" option that performs much
better than running "git fetch && git commit-graph write".

Perhaps that would require too much refactoring of both index-pack
and commit-graph code and infeasible in short to medium term and
that is why I said "for longer term, it feels a bit unfortunate".

Thanks.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jeff King wrote (reply to this):

On Fri, Sep 06, 2019 at 02:46:22PM -0700, Junio C Hamano wrote:

> > You're right that we could isolate the new write to the refs we
> > just received. We could use the more cumbersome write_commit_graph()
> > method with a list of commit oids as starting points. I'm happy to
> > make that change if we see a lot of value there.
> 
> Well, that is not the kind of information reuse I am talking about.
> 
> I was wondering if "index-pack" has enough information in-core after
> it receives and processes the incoming pack data, scanned each and
> every object in it, in order to write out the commit graph _without_
> having to do a lot of duplicate computation and enumeration of the
> objects done in the current commit-graph.c::write_commit_graph(), so
> that it can learn a "--write-commit-graph" option that performs much
> better than running "git fetch && git commit-graph write".
> 
> Perhaps that would require too much refactoring of both index-pack
> and commit-graph code and infeasible in short to medium term and
> that is why I said "for longer term, it feels a bit unfortunate".

I think the basic metadata should be easy. We have each commit expanded
in memory at some point, so parsing it and filing away its parents,
trees, etc isn't too hard.

Generation numbers are a little trickier, though, because they imply an
actual topological traversal. It might actually be easier to couple this
with the connectivity check we do after index-pack finishes (though I've
often wondered if we could drop that check in favor of making index-pack
smarter about finding the boundaries).

-Peff

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

Jeff King <[email protected]> writes:

> Generation numbers are a little trickier, though, because they imply an
> actual topological traversal. It might actually be easier to couple this
> with the connectivity check we do after index-pack finishes (though I've
> often wondered if we could drop that check in favor of making index-pack
> smarter about finding the boundaries).

Currently after scanning the incoming objects with the fsck
machinery, we count the number of objects that are pointed at by
these objects in the pack and are not in the pack in the
builtin/index-pack.c::check_object() function; at that point, we no
longer know who points at the object in question, which is needed if
we want to compute the boundary, so we need a bit of work here.

With boundary information, we could be smarter about lazy fetching,
I presume?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jeff King wrote (reply to this):

On Mon, Sep 02, 2019 at 07:22:02PM -0700, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <[email protected]>
> 
> The commit-graph feature is now on by default, and is being
> written during 'git gc' by default. Typically, Git only writes
> a commit-graph when a 'git gc --auto' command passes the gc.auto
> setting to actualy do work. This means that a commit-graph will
> typically fall behind the commits that are being used every day.
> 
> To stay updated with the latest commits, add a step to 'git
> fetch' to write a commit-graph after fetching new objects. The
> fetch.writeCommitGraph config setting enables writing a split
> commit-graph, so on average the cost of writing this file is
> very small. Occasionally, the commit-graph chain will collapse
> to a single level, and this could be slow for very large repos.
> 
> For additional use, adjust the default to be true when
> feature.experimental is enabled.

Seems like a good time to do it. We'd eventually want a similar option
for updating it on the receiving side of a push, too. I don't insist
that come at the same time, but we should at least have a plan about how
it will look to the user.

Do we want to to have fetch.writeCommitGraph, receive.writeCommitGraph,
and then a master transfer.writeCommitGraph?

-Peff

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

Jeff King <[email protected]> writes:

> Do we want to to have fetch.writeCommitGraph, receive.writeCommitGraph,
> and then a master transfer.writeCommitGraph?

If anything, it may be good for consistency.

I am not sure if it is a good idea to trigger writing the commit
graph when accepting a push, though.  It tends to be a lot finer
grained than fetching, right?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 9/5/2019 4:37 PM, Junio C Hamano wrote:
> Jeff King <[email protected]> writes:
> 
>> Do we want to to have fetch.writeCommitGraph, receive.writeCommitGraph,
>> and then a master transfer.writeCommitGraph?
> 
> If anything, it may be good for consistency.
> 
> I am not sure if it is a good idea to trigger writing the commit
> graph when accepting a push, though.  It tends to be a lot finer
> grained than fetching, right?

And I expect a push to include many fewer commits than a fetch.
In a server environment, I would expect to have a separate
maintenance task responsible for updating the commit-graph after
receiving new data, but not in an in-line fashion with the push.

Think about the situation of many pushes that happen in a short
burst: one write after all of the pushes would have close to the
same performance benefits as writing with every push, but does
a lot less work.

Thanks,
-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jeff King wrote (reply to this):

On Fri, Sep 06, 2019 at 01:00:40PM -0400, Derrick Stolee wrote:

> On 9/5/2019 4:37 PM, Junio C Hamano wrote:
> > Jeff King <[email protected]> writes:
> > 
> >> Do we want to to have fetch.writeCommitGraph, receive.writeCommitGraph,
> >> and then a master transfer.writeCommitGraph?
> > 
> > If anything, it may be good for consistency.
> > 
> > I am not sure if it is a good idea to trigger writing the commit
> > graph when accepting a push, though.  It tends to be a lot finer
> > grained than fetching, right?
> 
> And I expect a push to include many fewer commits than a fetch.
> In a server environment, I would expect to have a separate
> maintenance task responsible for updating the commit-graph after
> receiving new data, but not in an in-line fashion with the push.

That's probably how we'll end up doing it at GitHub, because we run a
big server farm that has a job-queueing system for periodic maintenance,
etc.

But keep in mind the "little guy" who is hosting a few repositories for
themselves or their company. They'd presumably want the benefits here,
too, right? We already have options to trigger auto-gc and
update-server-info for them, so why not this maintenance, too?

> Think about the situation of many pushes that happen in a short
> burst: one write after all of the pushes would have close to the
> same performance benefits as writing with every push, but does
> a lot less work.

Sure, but wouldn't that similarly apply to fetching? What is it that
makes bursts of pushes more likely than bursts of fetches?

-Peff

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

Jeff King <[email protected]> writes:

> Sure, but wouldn't that similarly apply to fetching? What is it that
> makes bursts of pushes more likely than bursts of fetches?

Because people tend to use a repository as a gathering point?  You
may periodically fetch from and push to a repository, and you may
even do so at the same interval, but simply because there are more
"other" people than you alone as a single developer in the project,
your fetch tends to grab work from more people than yoru push that
publish your work alone?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

Jeff King <[email protected]> writes:

> I suppose so. But I think the "stock git without any other job
> infrastructure" case would still benefit.

Oh, no question about it.

I did question if an automatic writing would benefit the side that
receives a push when you brought up the usual "fetch.* and receive.*
for separate configuration, transfer.* when want to rule them both"
convention, which is a good idea if only for consistency, but the
question was if it helps the receiving end of a push to the same
degree as it would help the repository that fetches.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 9/6/2019 4:42 PM, Junio C Hamano wrote:
> Jeff King <[email protected]> writes:
> 
>> I suppose so. But I think the "stock git without any other job
>> infrastructure" case would still benefit.
> 
> Oh, no question about it.
> 
> I did question if an automatic writing would benefit the side that
> receives a push when you brought up the usual "fetch.* and receive.*
> for separate configuration, transfer.* when want to rule them both"
> convention, which is a good idea if only for consistency, but the
> question was if it helps the receiving end of a push to the same
> degree as it would help the repository that fetches.

I think the "stock git without any other job infrastructure" is
a very important scenario. Putting the simplest version of
"commit-graph writes in-line with every push" seems to be ripe
for failure under load. I'd rather think deeply about what is
best for this scenario.

Spit-balling: what if we used the same mechanism as running GC
in the background to launch 'git commit-graph write' commands?
And we could have the command give up (without failure) if the
commit-graph.lock file is already created, so concurrent pushes
would not fight each other.

Thanks,
-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

Derrick Stolee <[email protected]> writes:

> On 9/6/2019 4:42 PM, Junio C Hamano wrote:
>> Jeff King <[email protected]> writes:
>> 
>>> I suppose so. But I think the "stock git without any other job
>>> infrastructure" case would still benefit.
>> 
>> Oh, no question about it.
>> 
>> I did question if an automatic writing would benefit the side that
>> receives a push when you brought up the usual "fetch.* and receive.*
>> for separate configuration, transfer.* when want to rule them both"
>> convention, which is a good idea if only for consistency, but the
>> question was if it helps the receiving end of a push to the same
>> degree as it would help the repository that fetches.
>
> I think the "stock git without any other job infrastructure" is
> a very important scenario. Putting the simplest version of
> "commit-graph writes in-line with every push" seems to be ripe
> for failure under load. I'd rather think deeply about what is
> best for this scenario.

As to what to do on the push side, I suppose we can afford to let it
simmer in the back of our heads while moving this topic forward.
Whether we'd later decide to have receive.writeCommitGraph (in which
case we would add transfer.writeCommitGraph, too) or not, this
change on the fetch side can independently be used, right?

Thanks.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jeff King wrote (reply to this):

On Fri, Sep 06, 2019 at 05:04:17PM -0400, Derrick Stolee wrote:

> On 9/6/2019 4:42 PM, Junio C Hamano wrote:
> > Jeff King <[email protected]> writes:
> > 
> >> I suppose so. But I think the "stock git without any other job
> >> infrastructure" case would still benefit.
> > 
> > Oh, no question about it.
> > 
> > I did question if an automatic writing would benefit the side that
> > receives a push when you brought up the usual "fetch.* and receive.*
> > for separate configuration, transfer.* when want to rule them both"
> > convention, which is a good idea if only for consistency, but the
> > question was if it helps the receiving end of a push to the same
> > degree as it would help the repository that fetches.
> 
> I think the "stock git without any other job infrastructure" is
> a very important scenario. Putting the simplest version of
> "commit-graph writes in-line with every push" seems to be ripe
> for failure under load. I'd rather think deeply about what is
> best for this scenario.

If it's going to be a problem under load, that makes me worry about the
same thing for fetches. Whether you see a lot of either depends on your
workflow. But as long as neither option is enabled by default (or at
least if it becomes common knowledge to _disable_ them if you have a
high rate), it may be OK.

I do agree that the normal "busy" repos you and I have both seen in
enterprise settings (where people are literally pushing multiple times
per second at peak) would want some kind of throttling. But I think
there's a long tail of "push once a week" repos.

> Spit-balling: what if we used the same mechanism as running GC
> in the background to launch 'git commit-graph write' commands?
> And we could have the command give up (without failure) if the
> commit-graph.lock file is already created, so concurrent pushes
> would not fight each other.

I have mixed feelings. It's nice not to stand in the critical path of
the user, but background tasks have a way of finding funny corner cases
(e.g., packfiles moving around, or the issues we've had with deciding
when to shut down auto-gc for a grace period due to warnings/errors).

I think since this is generally additive (adding new commit-graph
files), it might be less finicky, though.

-Peff

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jeff King wrote (reply to this):

On Fri, Sep 06, 2019 at 02:57:55PM -0700, Junio C Hamano wrote:

> > I think the "stock git without any other job infrastructure" is
> > a very important scenario. Putting the simplest version of
> > "commit-graph writes in-line with every push" seems to be ripe
> > for failure under load. I'd rather think deeply about what is
> > best for this scenario.
> 
> As to what to do on the push side, I suppose we can afford to let it
> simmer in the back of our heads while moving this topic forward.
> Whether we'd later decide to have receive.writeCommitGraph (in which
> case we would add transfer.writeCommitGraph, too) or not, this
> change on the fetch side can independently be used, right?

Yeah, I'm OK with proceeding without the receive-pack side for now.

-Peff

* `fetch.negotiationAlgorithm=skipping` may improve fetch negotiation times by
skipping more commits at a time, reducing the number of round trips.
+
* `fetch.writeCommitGraph=true` writes a commit-graph after every `git fetch`
command that downloads a pack-file from a remote. Using the `--split` option,
most executions will create a very small commit-graph file on top of the
existing commit-graph file(s). Occasionally, these files will merge and the
write may take longer. Having an updated commit-graph file helps performance
of many Git commands, including `git merge-base`, `git push -f`, and
`git log --graph`.

feature.manyFiles::
Enable config options that optimize for repos with many files in the
Expand Down
10 changes: 10 additions & 0 deletions Documentation/config/fetch.txt
Original file line number Diff line number Diff line change
Expand Up @@ -69,3 +69,13 @@ fetch.showForcedUpdates::
Set to false to enable `--no-show-forced-updates` in
linkgit:git-fetch[1] and linkgit:git-pull[1] commands.
Defaults to true.

fetch.writeCommitGraph::
Set to true to write a commit-graph after every `git fetch` command
that downloads a pack-file from a remote. Using the `--split` option,
most executions will create a very small commit-graph file on top of
the existing commit-graph file(s). Occasionally, these files will
merge and the write may take longer. Having an updated commit-graph
file helps performance of many Git commands, including `git merge-base`,
`git push -f`, and `git log --graph`. Defaults to false, unless
`feature.experimental` is true.
15 changes: 15 additions & 0 deletions builtin/fetch.c
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,7 @@
#include "packfile.h"
#include "list-objects-filter-options.h"
#include "commit-reach.h"
#include "commit-graph.h"

#define FORCED_UPDATES_DELAY_WARNING_IN_MS (10 * 1000)

Expand Down Expand Up @@ -1715,6 +1716,20 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)

string_list_clear(&list, 0);

prepare_repo_settings(the_repository);
if (the_repository->settings.fetch_write_commit_graph) {
int commit_graph_flags = COMMIT_GRAPH_SPLIT;
struct split_commit_graph_opts split_opts;
memset(&split_opts, 0, sizeof(struct split_commit_graph_opts));

if (progress)
commit_graph_flags |= COMMIT_GRAPH_PROGRESS;

write_commit_graph_reachable(get_object_directory(),
commit_graph_flags,
&split_opts);
}

close_object_store(the_repository->objects);

if (enable_auto_gc) {
Expand Down
4 changes: 4 additions & 0 deletions repo-settings.c
Original file line number Diff line number Diff line change
Expand Up @@ -49,10 +49,14 @@ void prepare_repo_settings(struct repository *r)
UPDATE_DEFAULT_BOOL(r->settings.index_version, 4);
UPDATE_DEFAULT_BOOL(r->settings.core_untracked_cache, UNTRACKED_CACHE_WRITE);
}
if (!repo_config_get_bool(r, "fetch.writecommitgraph", &value))
r->settings.fetch_write_commit_graph = value;
if (!repo_config_get_bool(r, "feature.experimental", &value) && value) {
UPDATE_DEFAULT_BOOL(r->settings.pack_use_sparse, 1);
UPDATE_DEFAULT_BOOL(r->settings.fetch_negotiation_algorithm, FETCH_NEGOTIATION_SKIPPING);
UPDATE_DEFAULT_BOOL(r->settings.fetch_write_commit_graph, 1);
}
UPDATE_DEFAULT_BOOL(r->settings.fetch_write_commit_graph, 0);

/* Hack for test programs like test-dump-untracked-cache */
if (ignore_untracked_cache_config)
Expand Down
1 change: 1 addition & 0 deletions repository.h
Original file line number Diff line number Diff line change
Expand Up @@ -30,6 +30,7 @@ struct repo_settings {

int core_commit_graph;
int gc_write_commit_graph;
int fetch_write_commit_graph;

int index_version;
enum untracked_cache_setting core_untracked_cache;
Expand Down
13 changes: 13 additions & 0 deletions t/t5510-fetch.sh
Original file line number Diff line number Diff line change
Expand Up @@ -570,6 +570,19 @@ test_expect_success 'LHS of refspec follows ref disambiguation rules' '
)
'

test_expect_success 'fetch.writeCommitGraph' '
git clone three write &&
(
cd three &&
test_commit new
) &&
(
cd write &&
git -c fetch.writeCommitGraph fetch origin &&
test_path_is_file .git/objects/info/commit-graphs/commit-graph-chain
)
'

# configured prune tests

set_config_tristate () {
Expand Down