Skip to content

Changed Paths Bloom Filters #497

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions Documentation/git-commit-graph.txt
Original file line number Diff line number Diff line change
Expand Up @@ -57,6 +57,11 @@ or `--stdin-packs`.)
With the `--append` option, include all commits that are present in the
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jakub Narebski wrote (reply to this):

"Garima Singh via GitGitGadget" <[email protected]> writes:

> From: Garima Singh <[email protected]>
>
> Add --changed-paths option to git commit-graph write. This option will
> allow users to compute information about the paths that have changed
> between a commit and its first parent, and write it into the commit graph
> file. If the option is passed to the write subcommand we set the
> COMMIT_GRAPH_WRITE_BLOOM_FILTERS flag and pass it down to the
> commit-graph logic.

In the manpage you write that this operation (computing Bloom filters)
can take a while on large repositories.  Could you perhaps provide some
numbers: how much longer does it take to write commit-graph file with
and without '--changed-paths' for example for Linux kernel, or some
other large repository?  Thanks in advance.

>
> Helped-by: Derrick Stolee <[email protected]>
> Signed-off-by: Garima Singh <[email protected]>
> ---
>  Documentation/git-commit-graph.txt | 5 +++++
>  builtin/commit-graph.c             | 9 +++++++--
>  2 files changed, 12 insertions(+), 2 deletions(-)

What is missing is some sanity tests: that bloom index and bloom data
chunks are not present without '--changed-paths', and that they are
added with '--changed-paths'.

If possible, maybe also check in a separate test that the size of
bloom_index chunk agrees with the number of commits in the commit graph.


Also, we can now add those tests I have wrote about in my review of
previous patch, that is:

1. If you write commit-graph with --changed-paths, and either add some
   commits later or exclude some commits from the commit graph, then:

   a.) commit(s) in commit-graph have Bloom filter
   b.) commit(s) not in commit-graph do not have Bloom filter

2. If you write commit-graph without --changed-paths as base layer,
   and then write next layer with --changed-paths and --split, then:

   a.) commit(s) in top layer have Bloom filter(s)
   b.) commit(s) in bottom layer don't have Bloom filter(s)

>
> diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
> index bcd85c1976..907d703b30 100644
> --- a/Documentation/git-commit-graph.txt
> +++ b/Documentation/git-commit-graph.txt
> @@ -54,6 +54,11 @@ or `--stdin-packs`.)
>  With the `--append` option, include all commits that are present in the
>  existing commit-graph file.
>  +
> +With the `--changed-paths` option, compute and write information about the
> +paths changed between a commit and it's first parent. This operation can
> +take a while on large repositories. It provides significant performance gains
> +for getting history of a directory or a file with `git log -- <path>`.
> ++

Should we write about limitation that the topmost layer in the split
commit graph needs to be written with '--changed-paths' for Git to use
this information?  Or perhaps we should try (in the future) to remove
this limitation??

>  With the `--split` option, write the commit-graph as a chain of multiple
>  commit-graph files stored in `<dir>/info/commit-graphs`. The new commits
>  not already in the commit-graph are added in a new "tip" file. This file
> diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
> index e0c6fc4bbf..261dcce091 100644
> --- a/builtin/commit-graph.c
> +++ b/builtin/commit-graph.c
> @@ -9,7 +9,7 @@
>  
>  static char const * const builtin_commit_graph_usage[] = {
>  	N_("git commit-graph verify [--object-dir <objdir>] [--shallow] [--[no-]progress]"),
> -	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--[no-]progress] <split options>"),
> +	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--changed-paths] [--[no-]progress] <split options>"),
>  	NULL
>  };
>  
> @@ -19,7 +19,7 @@ static const char * const builtin_commit_graph_verify_usage[] = {
>  };
>  
>  static const char * const builtin_commit_graph_write_usage[] = {
> -	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--[no-]progress] <split options>"),
> +	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--changed-paths] [--[no-]progress] <split options>"),
>  	NULL
>  };
>

All right.

> @@ -32,6 +32,7 @@ static struct opts_commit_graph {
>  	int split;
>  	int shallow;
>  	int progress;
> +	int enable_changed_paths;

Bikeshed painting: should this field be called enable_changed_paths or
simply changed_paths?

>  } opts;
>  
>  static int graph_verify(int argc, const char **argv)
> @@ -110,6 +111,8 @@ static int graph_write(int argc, const char **argv)
>  			N_("start walk at commits listed by stdin")),
>  		OPT_BOOL(0, "append", &opts.append,
>  			N_("include all commits already in the commit-graph file")),
> +		OPT_BOOL(0, "changed-paths", &opts.enable_changed_paths,
> +			N_("enable computation for changed paths")),
>  		OPT_BOOL(0, "progress", &opts.progress, N_("force progress reporting")),
>  		OPT_BOOL(0, "split", &opts.split,
>  			N_("allow writing an incremental commit-graph file")),

All right.

> @@ -143,6 +146,8 @@ static int graph_write(int argc, const char **argv)
>  		flags |= COMMIT_GRAPH_WRITE_SPLIT;
>  	if (opts.progress)
>  		flags |= COMMIT_GRAPH_WRITE_PROGRESS;
> +	if (opts.enable_changed_paths)
> +		flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS;
>  
>  	read_replace_refs = 0;

All right.  This actually turns on calculation Bloom filters for changed
paths, thanks to

 	ctx->changed_paths = flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS ? 1 : 0;

that was added by the "[PATCH v2 04/11] commit-graph: compute Bloom
filters for changed paths" patch.

Though... should this enabling be split into two separate patches like
this?


Best,
-- 
Jakub Nar�bski

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Garima Singh wrote (reply to this):



On 2/20/2020 3:28 PM, Jakub Narebski wrote:
> "Garima Singh via GitGitGadget" <[email protected]> writes:
> 
>> From: Garima Singh <[email protected]>
>>
>> Add --changed-paths option to git commit-graph write. This option will
>> allow users to compute information about the paths that have changed
>> between a commit and its first parent, and write it into the commit graph
>> file. If the option is passed to the write subcommand we set the
>> COMMIT_GRAPH_WRITE_BLOOM_FILTERS flag and pass it down to the
>> commit-graph logic.
> 
> In the manpage you write that this operation (computing Bloom filters)
> can take a while on large repositories.  Could you perhaps provide some
> numbers: how much longer does it take to write commit-graph file with
> and without '--changed-paths' for example for Linux kernel, or some
> other large repository?  Thanks in advance.
> 

Yes. Will include numbers as appropriate in v3. 

>>
>> Helped-by: Derrick Stolee <[email protected]>
>> Signed-off-by: Garima Singh <[email protected]>
>> ---
>>  Documentation/git-commit-graph.txt | 5 +++++
>>  builtin/commit-graph.c             | 9 +++++++--
>>  2 files changed, 12 insertions(+), 2 deletions(-)
> 
> What is missing is some sanity tests: that bloom index and bloom data
> chunks are not present without '--changed-paths', and that they are
> added with '--changed-paths'.
> 
> If possible, maybe also check in a separate test that the size of
> bloom_index chunk agrees with the number of commits in the commit graph.
> 
> 
> Also, we can now add those tests I have wrote about in my review of
> previous patch, that is:
> 
> 1. If you write commit-graph with --changed-paths, and either add some
>    commits later or exclude some commits from the commit graph, then:
> 
>    a.) commit(s) in commit-graph have Bloom filter
>    b.) commit(s) not in commit-graph do not have Bloom filter
> 
> 2. If you write commit-graph without --changed-paths as base layer,
>    and then write next layer with --changed-paths and --split, then:
> 
>    a.) commit(s) in top layer have Bloom filter(s)
>    b.) commit(s) in bottom layer don't have Bloom filter(s)
> 

I will see what more can be done here. 

>>
>> diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
>> index bcd85c1976..907d703b30 100644
>> --- a/Documentation/git-commit-graph.txt
>> +++ b/Documentation/git-commit-graph.txt
>> @@ -54,6 +54,11 @@ or `--stdin-packs`.)
>>  With the `--append` option, include all commits that are present in the
>>  existing commit-graph file.
>>  +
>> +With the `--changed-paths` option, compute and write information about the
>> +paths changed between a commit and it's first parent. This operation can
>> +take a while on large repositories. It provides significant performance gains
>> +for getting history of a directory or a file with `git log -- <path>`.
>> ++
> 
> Should we write about limitation that the topmost layer in the split
> commit graph needs to be written with '--changed-paths' for Git to use
> this information?  Or perhaps we should try (in the future) to remove
> this limitation??
> 

Given that this information is going to be used best effort, it would be 
superfluous to describe every case and conditional that decides whether 
this information is being used.
>> @@ -143,6 +146,8 @@ static int graph_write(int argc, const char **argv)
>>  		flags |= COMMIT_GRAPH_WRITE_SPLIT;
>>  	if (opts.progress)
>>  		flags |= COMMIT_GRAPH_WRITE_PROGRESS;
>> +	if (opts.enable_changed_paths)
>> +		flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS;
>>  
>>  	read_replace_refs = 0;
> 
> All right.  This actually turns on calculation Bloom filters for changed
> paths, thanks to
> 
>  	ctx->changed_paths = flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS ? 1 : 0;
> 
> that was added by the "[PATCH v2 04/11] commit-graph: compute Bloom
> filters for changed paths" patch.
> 
> Though... should this enabling be split into two separate patches like
> this?
> 

The idea is that in 4/11 We compute only if the flag is set. 
And between that patch and this one: we prepare the foundational code 
that is now ready for that flag to be set via an opt-in by the user. 

> 
> Best,
> 

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jakub Narebski wrote (reply to this):

Garima Singh <[email protected]> writes:
> On 2/20/2020 3:28 PM, Jakub Narebski wrote:
>> "Garima Singh via GitGitGadget" <[email protected]> writes:

[...]
>>> --- a/Documentation/git-commit-graph.txt
>>> +++ b/Documentation/git-commit-graph.txt
>>> @@ -54,6 +54,11 @@ or `--stdin-packs`.)
>>>  With the `--append` option, include all commits that are present in the
>>>  existing commit-graph file.
>>>  +
>>> +With the `--changed-paths` option, compute and write information about the
>>> +paths changed between a commit and it's first parent. This operation can
>>> +take a while on large repositories. It provides significant performance gains
>>> +for getting history of a directory or a file with `git log -- <path>`.
>>> ++
>> 
>> Should we write about limitation that the topmost layer in the split
>> commit graph needs to be written with '--changed-paths' for Git to use
>> this information?  Or perhaps we should try (in the future) to remove
>> this limitation?
>
> Given that this information is going to be used best effort, it would be 
> superfluous to describe every case and conditional that decides whether 
> this information is being used.

I can somewhat agree with this reasoning.

However what I would like to avoid is surprising users.  If one creates
base commit-graph with Bloom filters data, but then when creating
new layer of commit-graph (updating it incrementally), it may be
surprising that `git log -- <path>` is now much slower.

On the other hand if one would update commit-graph in a non-incremental
way (rewriting the commit-graph file), loosing the Bloom filter
information and performance of `git log -- <path>` because one forgot to
include `--changed-paths` is not that unexpected.

Anyway, in the future when this mechanism will be controlled by
appropriate config variable, this whole discussion would become somewhat
moot.


Thought for the future: perhaps `git commit-graph verify` could detect
that split graph has Bloom filters only for some layers, and inform the
user?  But that is almost certainly out of scope of this patch series.

>>> @@ -143,6 +146,8 @@ static int graph_write(int argc, const char **argv)
>>>  		flags |= COMMIT_GRAPH_WRITE_SPLIT;
>>>  	if (opts.progress)
>>>  		flags |= COMMIT_GRAPH_WRITE_PROGRESS;
>>> +	if (opts.enable_changed_paths)
>>> +		flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS;
>>>  
>>>  	read_replace_refs = 0;
>> 
>> All right.  This actually turns on calculation Bloom filters for changed
>> paths, thanks to
>> 
>>  	ctx->changed_paths = flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS ? 1 : 0;
>> 
>> that was added by the "[PATCH v2 04/11] commit-graph: compute Bloom
>> filters for changed paths" patch.
>> 
>> Though... should this enabling be split into two separate patches like
>> this?
>
> The idea is that in 4/11 We compute only if the flag is set. 
> And between that patch and this one: we prepare the foundational code 
> that is now ready for that flag to be set via an opt-in by the user. 

All right.

Choosing how to split large change into series is not easy.  One one
hand one would want for each change to be small and self contained.  On
the other hand it would be good if each change was testable (test-tool
can help here).

Best,
-- 
Jakub Nar�bski

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Bryan Turner wrote (reply to this):

On Wed, Feb 5, 2020 at 2:56 PM Garima Singh via GitGitGadget
<[email protected]> wrote:
>
> From: Garima Singh <[email protected]>
>
> Add --changed-paths option to git commit-graph write. This option will
> allow users to compute information about the paths that have changed
> between a commit and its first parent, and write it into the commit graph
> file. If the option is passed to the write subcommand we set the
> COMMIT_GRAPH_WRITE_BLOOM_FILTERS flag and pass it down to the
> commit-graph logic.
>
> Helped-by: Derrick Stolee <[email protected]>
> Signed-off-by: Garima Singh <[email protected]>
> ---
>  Documentation/git-commit-graph.txt | 5 +++++
>  builtin/commit-graph.c             | 9 +++++++--
>  2 files changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
> index bcd85c1976..907d703b30 100644
> --- a/Documentation/git-commit-graph.txt
> +++ b/Documentation/git-commit-graph.txt
> @@ -54,6 +54,11 @@ or `--stdin-packs`.)
>  With the `--append` option, include all commits that are present in the
>  existing commit-graph file.
>  +
> +With the `--changed-paths` option, compute and write information about the
> +paths changed between a commit and it's first parent. This operation can

"its first parent"

(Pardon the grammar nit from the peanut gallery!)

> +take a while on large repositories. It provides significant performance gains
> +for getting history of a directory or a file with `git log -- <path>`.
> ++
>  With the `--split` option, write the commit-graph as a chain of multiple
>  commit-graph files stored in `<dir>/info/commit-graphs`. The new commits
>  not already in the commit-graph are added in a new "tip" file. This file
> diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
> index e0c6fc4bbf..261dcce091 100644
> --- a/builtin/commit-graph.c
> +++ b/builtin/commit-graph.c
> @@ -9,7 +9,7 @@
>
>  static char const * const builtin_commit_graph_usage[] = {
>         N_("git commit-graph verify [--object-dir <objdir>] [--shallow] [--[no-]progress]"),
> -       N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--[no-]progress] <split options>"),
> +       N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--changed-paths] [--[no-]progress] <split options>"),
>         NULL
>  };
>
> @@ -19,7 +19,7 @@ static const char * const builtin_commit_graph_verify_usage[] = {
>  };
>
>  static const char * const builtin_commit_graph_write_usage[] = {
> -       N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--[no-]progress] <split options>"),
> +       N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--changed-paths] [--[no-]progress] <split options>"),
>         NULL
>  };
>
> @@ -32,6 +32,7 @@ static struct opts_commit_graph {
>         int split;
>         int shallow;
>         int progress;
> +       int enable_changed_paths;
>  } opts;
>
>  static int graph_verify(int argc, const char **argv)
> @@ -110,6 +111,8 @@ static int graph_write(int argc, const char **argv)
>                         N_("start walk at commits listed by stdin")),
>                 OPT_BOOL(0, "append", &opts.append,
>                         N_("include all commits already in the commit-graph file")),
> +               OPT_BOOL(0, "changed-paths", &opts.enable_changed_paths,
> +                       N_("enable computation for changed paths")),
>                 OPT_BOOL(0, "progress", &opts.progress, N_("force progress reporting")),
>                 OPT_BOOL(0, "split", &opts.split,
>                         N_("allow writing an incremental commit-graph file")),
> @@ -143,6 +146,8 @@ static int graph_write(int argc, const char **argv)
>                 flags |= COMMIT_GRAPH_WRITE_SPLIT;
>         if (opts.progress)
>                 flags |= COMMIT_GRAPH_WRITE_PROGRESS;
> +       if (opts.enable_changed_paths)
> +               flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS;
>
>         read_replace_refs = 0;
>
> --
> gitgitgadget
>

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Garima Singh wrote (reply to this):


On 2/20/2020 5:10 PM, Bryan Turner wrote:
> On Wed, Feb 5, 2020 at 2:56 PM Garima Singh via GitGitGadget
> <[email protected]> wrote:
>> diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
>> index bcd85c1976..907d703b30 100644
>> --- a/Documentation/git-commit-graph.txt
>> +++ b/Documentation/git-commit-graph.txt
>> @@ -54,6 +54,11 @@ or `--stdin-packs`.)
>>  With the `--append` option, include all commits that are present in the
>>  existing commit-graph file.
>>  +
>> +With the `--changed-paths` option, compute and write information about the
>> +paths changed between a commit and it's first parent. This operation can
> 
> "its first parent"
> 
> (Pardon the grammar nit from the peanut gallery!)
> 

:)
Thank you! Fixed in v3. 

Cheers! 
Garima Singh

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, SZEDER Gábor wrote (reply to this):

On Mon, Apr 06, 2020 at 04:59:51PM +0000, Garima Singh via GitGitGadget wrote:
> From: Garima Singh <[email protected]>
> 
> Add --changed-paths option to git commit-graph write. This option will
> allow users to compute information about the paths that have changed
> between a commit and its first parent, and write it into the commit graph
> file. If the option is passed to the write subcommand we set the
> COMMIT_GRAPH_WRITE_BLOOM_FILTERS flag and pass it down to the
> commit-graph logic.
> 
> Helped-by: Derrick Stolee <[email protected]>
> Signed-off-by: Garima Singh <[email protected]>
> ---
>  Documentation/git-commit-graph.txt | 5 +++++
>  builtin/commit-graph.c             | 9 +++++++--
>  2 files changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
> index 28d1fee5053..f4b13c005b8 100644
> --- a/Documentation/git-commit-graph.txt
> +++ b/Documentation/git-commit-graph.txt
> @@ -57,6 +57,11 @@ or `--stdin-packs`.)
>  With the `--append` option, include all commits that are present in the
>  existing commit-graph file.
>  +
> +With the `--changed-paths` option, compute and write information about the
> +paths changed between a commit and it's first parent. This operation can
> +take a while on large repositories. It provides significant performance gains
> +for getting history of a directory or a file with `git log -- <path>`.

So 'git commit-graph write' only computes and writes changed path
Bloom filters if this option is specified.  Though not mentioned in
the documentation or in the commit message, the negated
'--no-changed-paths' is supported as well, and it removes Bloom
filters from the commit-graph file.  All this is quite reasonable.

However, the most important question is what happens when the
commit-graph file already contains Bloom filters and neither of these
options are specified on the command line.  This isn't mentioned in
the docs or in the commit message, either, but as it is implemented in
this patch (i.e. COMMIT_GRAPH_WRITE_BLOOM_FILTERS is not passed from
the builtin to the commit-graph logic) all those existing Bloom
filters are removed from the commit-graph.  Considering how expensive
it was to compute those Bloom filters this might not be the most
desirable behaviour.

This is important, because 'git commit-graph write' is not the only
command that writes the commit-graph file.  'git gc' does that by
default, too, and will wipe out any modified path Bloom filters while
doing so.  Worse, the user doesn't even have to invoke 'git gc'
manually, because a lot of git commands invoke 'git gc --auto'.

  $ git commit-graph write --reachable --changed-paths
  $ ~/src/git/t/helper/test-tool read-graph |grep ^chunks
  chunks: oid_fanout oid_lookup commit_metadata bloom_indexes bloom_data
  $ git gc --quiet 
  $ ~/src/git/t/helper/test-tool read-graph |grep ^chunks
  chunks: oid_fanout oid_lookup commit_metadata

Consequently, if users want to use modified path Bloom filters, then
they should avoid gc, both manual and auto, or they'll have to
re-generate the Bloom filters every once in a while.  That is
definitely not the desired behaviour.


Now compare this e.g. to the behaviour of 'git update-index
--split-index' and '--untracked-cache': both of these options turn on
features that improve performance and write extra stuff to the index,
and after they did so all subsequent git commands updating the index
will keep writing that extra stuff, including 'git update-index'
itself even without those options, until it's finally invoked with the
corresponding '--no-...' option.  I particularly like how
'--[no-]untracked-cache' and 'core.untrackedCache' work together and
warn when the given command line option goes against the configured
value, and I think the command line options and configuration
variables controlling modified path Bloom filters should behave
similarly.

>  With the `--split` option, write the commit-graph as a chain of multiple
>  commit-graph files stored in `<dir>/info/commit-graphs`. The new commits
>  not already in the commit-graph are added in a new "tip" file. This file
> diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
> index d1ab6625f63..cacb5d04a80 100644
> --- a/builtin/commit-graph.c
> +++ b/builtin/commit-graph.c
> @@ -9,7 +9,7 @@
>  
>  static char const * const builtin_commit_graph_usage[] = {
>  	N_("git commit-graph verify [--object-dir <objdir>] [--shallow] [--[no-]progress]"),
> -	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--[no-]progress] <split options>"),
> +	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--changed-paths] [--[no-]progress] <split options>"),
>  	NULL
>  };
>  
> @@ -19,7 +19,7 @@ static const char * const builtin_commit_graph_verify_usage[] = {
>  };
>  
>  static const char * const builtin_commit_graph_write_usage[] = {
> -	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--[no-]progress] <split options>"),
> +	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--changed-paths] [--[no-]progress] <split options>"),
>  	NULL
>  };
>  
> @@ -32,6 +32,7 @@ static struct opts_commit_graph {
>  	int split;
>  	int shallow;
>  	int progress;
> +	int enable_changed_paths;
>  } opts;
>  
>  static struct object_directory *find_odb(struct repository *r,
> @@ -135,6 +136,8 @@ static int graph_write(int argc, const char **argv)
>  			N_("start walk at commits listed by stdin")),
>  		OPT_BOOL(0, "append", &opts.append,
>  			N_("include all commits already in the commit-graph file")),
> +		OPT_BOOL(0, "changed-paths", &opts.enable_changed_paths,
> +			N_("enable computation for changed paths")),
>  		OPT_BOOL(0, "progress", &opts.progress, N_("force progress reporting")),
>  		OPT_BOOL(0, "split", &opts.split,
>  			N_("allow writing an incremental commit-graph file")),
> @@ -168,6 +171,8 @@ static int graph_write(int argc, const char **argv)
>  		flags |= COMMIT_GRAPH_WRITE_SPLIT;
>  	if (opts.progress)
>  		flags |= COMMIT_GRAPH_WRITE_PROGRESS;
> +	if (opts.enable_changed_paths)
> +		flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS;
>  
>  	read_replace_refs = 0;
>  	odb = find_odb(the_repository, opts.obj_dir);
> -- 
> gitgitgadget
> 

existing commit-graph file.
+
With the `--changed-paths` option, compute and write information about the
paths changed between a commit and it's first parent. This operation can
take a while on large repositories. It provides significant performance gains
for getting history of a directory or a file with `git log -- <path>`.
+
With the `--split` option, write the commit-graph as a chain of multiple
commit-graph files stored in `<dir>/info/commit-graphs`. The new commits
not already in the commit-graph are added in a new "tip" file. This file
Expand Down
30 changes: 30 additions & 0 deletions Documentation/technical/commit-graph-format.txt
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,9 @@ metadata, including:
- The parents of the commit, stored using positional references within
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jakub Narebski wrote (reply to this):

"Garima Singh via GitGitGadget" <[email protected]> writes:

> From: Garima Singh <[email protected]>
>
> Update the technical documentation for commit-graph-format with BIDX
> and BDAT chunk information.
>
> RFC Notes:
> 1. [Call for advice] We specifically mention that we are using Bloom
>    filters in this technical document. Should this document also be
>    made open to other data structures in the future, with versioning
>    information?

I'm not sure.  In theory we might want to switch to another
probabilistic set inclusion query structure, like xor filters or cuckoo
hashing.

On one hand side we could use separate chunks (e.g. XIDX, XDAT for xor
filters), on the other hand we need only one such structure.  On the
gripping hand this can be left for the future, if needed.

Sidenote: using Bloom filters is somewhat encoded in the name of chunk
(B from Bloom filter).  I don't have a better poposal for 4-char name
(XIDX / XDAT for cXange?  CHDX / CHDT for CHange?  FIDX / FDAT for
changed Files?... I don't know).

>
> 2. [Call for advice] We are also not describing the explicit nature
>    of how we store the bloom filter binary data. Would it be useful
>    to document details about the hash algorithm, the number of hashes
>    and the specific seed values we are using in a separate document,
>    or perhaps in a separate section in this document?

I think it would be best to keep description of the commit graph format
concise.  The details about Bloom filter implementation would be better
put in Documentation/technical/commit-graph.txt in my opinion, together
with reasoning behind it (perhaps borrowing from Derrick Stolee blog
post).

This could be done as a separate patch.

>
> Helped-by: Derrick Stolee <[email protected]>
> Signed-off-by: Garima Singh <[email protected]>
> ---
>  Documentation/technical/commit-graph-format.txt | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
>
> diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
> index a4f17441ae..6497f19f08 100644
> --- a/Documentation/technical/commit-graph-format.txt
> +++ b/Documentation/technical/commit-graph-format.txt
> @@ -17,6 +17,9 @@ metadata, including:
>  - The parents of the commit, stored using positional references within
>    the graph file.
>  
> +- The bloom filter of the commit carrying the paths that were changed between
> +  the commit and it's first parent.

s/bloom/Bloom/ and s/it's/its/

I am not sure about exact wording, but I could at this time think of a
better but concise way of stating it.

> +
>  These positional references are stored as unsigned 32-bit integers
>  corresponding to the array position within the list of commit OIDs. Due
>  to some special constants we use to track parents, we can store at most
> @@ -93,6 +96,20 @@ CHUNK DATA:
>        positions for the parents until reaching a value with the most-significant
>        bit on. The other bits correspond to the position of the last parent.
>  
> +  Bloom Filter Index (ID: {'B', 'I', 'D', 'X'}) [Optional]
> +      For each commit we store the offset of its bloom filter in the BDAT chunk
> +      as follows:
> +      BIDX[i] = number of 8-byte words in all the bloom filters from commit 0 to
> +		commit i (inclusive)

I think it would be better for consistency and ease of reading to follow
the example of OID Fanout (OIDF) chunk description:

 +  Bloom Filter Index (ID: {'B', 'I', 'D', 'X'}) (N * 4 bytes) [Optional]
 +      The ith entry, BIDX[i], stores the number of 8-byte word blocks
 +      in all Bloom filters from commit 0 up to commit i (inclusive)
 +      in lexicographical order.

Maybe even add the following to make implementing it easier:

 +      Data for Bloom filter for i-th commit spans from BIDX[i-1] to
 +      BIDX[i] (plus header length), where we take BIDX[-1] to be 0.

Is it possible for (BIDX[i] - BIDX[i-1]) to be zero (no Bloom filter),
for example for commits with more than 512 changes?  Or is this case
handled by 1 8-byte word Bloom filter of all bits sets to '1', i.e.
0xffffffffffffffff?

How the case of too many changes is distingushed from the case of no
changes (`git commit --allow-empty`, or `git merge --ours`)?  Is the
case of no changes uninteresting, i.e. Bloom filter consisting of zero,
that is with all bits set to '0'?

> +
> +  Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional]
> +      * It starts with three 32 bit integers for the

I would say "It starts with the header consisting of three unsigned
32-bit integers:" (but current version is not bad).

I wonder if this metadata should perhaps be put in a separate chunk,
BMET... (Bloom filter METadata).

> +	    - version of the hash algorithm being used

This number not only encodes that the base hash algorithm being used is
32-bit Murmur3 hash, but that 'k' hashes used in the computation are
created out of Murmur3 hash using double hashing technique, and
specifies two specific seed values for this double hashing technique.

[Maybe we should store those two seed values here too?]

It might be important to say that the currently supported version is
'1', and if Git encounters unknown hashing algorithm version it should
not use Bloom filter data.

Unless we store encoded _name_ of the hash algorithm, e.g. bytes
'm','u','r','3' for MurmurHash3_32... though it is about more than
a base hash.

Do we need whole 4 bytes for hash version, or is it for ease of use and
alignment?

> +	    - the number of hashes used in the computation

All right.  Perhaps we should test in the future patches that the value
different from the default of 7 would also work.

Also 8-bits / 1 byte for number of hashes (hash functions) should be
enough: as I have written in prevous reply there is no need for k > 32.

> +	    - the number of bits per entry

This is important for construction of Bloom filter, but I think it is
not necessary to use it -- so it may not be necessary to store it.

Would also fit in a single byte: we don't need exceedingly low false
positive probability.

We could use it to estimate the false positive probability, and...

> +	  * The rest of the chunk is the concatenation of all the computed bloom 
> +	  filters for the commits in lexicographic order.

  +	 * The rest of the chunk is the concatenation of all the computed Bloom 
  +	   filters for the commits in lexicographic order.

It would be, I think, a good idea to make it explicit that BDAT is
present iff BIDX is present (iff == if and only if), i.e. that either
both or neither of those chunks should be present.

> +
>    Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional]
>        This list of H-byte hashes describe a set of B commit-graph files that
>        form a commit-graph chain. The graph position for the ith commit in this

Best,
-- 
Jakub Nar�bski

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Garima Singh wrote (reply to this):



On 2/19/2020 10:13 AM, Jakub Narebski wrote:
> "Garima Singh via GitGitGadget" <[email protected]> writes:
> 
>> From: Garima Singh <[email protected]>
>>
>> Update the technical documentation for commit-graph-format with the formats for
>> the Bloom filter index (BIDX) and Bloom filter data (BDAT) chunks. Write the
>> computed Bloom filters information to the commit graph file using this format.
> 
> Nice description.
> 
> The only minor nitpick is with the formating: it is 80-character wide,
> which is a bit wide.
> 

Fixed in v3. Thanks! 

>>
>> Helped-by: Derrick Stolee <[email protected]>
>> Signed-off-by: Garima Singh <[email protected]>
>> ---
>>  .../technical/commit-graph-format.txt         |  24 ++++
>>  commit-graph.c                                | 118 +++++++++++++++++-
>>  commit-graph.h                                |   7 +-
>>  3 files changed, 145 insertions(+), 4 deletions(-)
>>
>> diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
>> index a4f17441ae..22e511643d 100644
>> --- a/Documentation/technical/commit-graph-format.txt
>> +++ b/Documentation/technical/commit-graph-format.txt
>> @@ -17,6 +17,9 @@ metadata, including:
>>  - The parents of the commit, stored using positional references within
>>    the graph file.
>>  
>> +- The Bloom filter of the commit carrying the paths that were changed between
>> +  the commit and its first parent.
>> +
> 
> All right.
> 
> Should we also state that it is optional (meta)data?  This would be
> first optional piece of data stored in commit-graph, I think.
> 

However the entire commit graph file is non critical metadata since git commands
work just fine without it, just slower. The same applies to the changed path
bloom filters. 

Based on the definition of optional you are suggesting, edge data is optional
because not every commit-graph has octopus merges. 

>> +
>> +  Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional]
>> +    * It starts with header consisting of three unsigned 32-bit integers:
>> +      - Version of the hash algorithm being used. We currently only support
>> +	value 1 which implies the murmur3 hash implemented exactly as described
>> +	in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
> 
> First a minor issue: shouldn't this nested unordered list be indented
> with a hanging indent formatted with spaces?  That is be formatted like
> the following:
> 
>   +  Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional]
>   +    * It starts with header consisting of three unsigned 32-bit integers:
>   +      - Version of the hash algorithm being used. We currently only support
>   +        value 1 which implies the murmur3 hash implemented exactly as
>   +        described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
> 
> But the existing formatting with spaces and tabs might be fine as it is,
> that is it renders as nested list with Asciidoc; it only looks a bit
> weird as patch, not so as text.
> 
> Second, and more important: it is in my opinion not enough information,
> at least if we are assuming that the information in this document should
> be enough for clean-room reimplementation of Bloom filter functionality
> (for example by JGit).  To generate compatible Bloom filters, one needs
> also the information on how to create $k$ functionally-independent hash
> functions out of murmur3 hash.  We do it currently using double hashing
> technique; if that changes then the exact set of bits in the Bloom
> filter would also change.
> 
> The additional description could look something like the following:
> 
>   +    * It starts with header consisting of three unsigned 32-bit integers:
>   +      - Version of the hash algorithm being used. We currently only support
>   +        value 1 which implies the murmur3_32 hash implemented exactly as
>   +        described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
>   +        and double hashing technique with 0x293ae76f and 0x7e646e2c seeds
>   +        as described in https://doi.org/10.1007/978-3-540-30494-4_26
>   +        "Bloom Filters in Probabilistic Verification"
> 
> Also, it should be explicitly noted that we use murmur3_32, because
> there is also 128-bit version of murmur3 hash.
> 

I will incorporate this in. Thanks! 


>> +    * The BDAT chunk is present iff BIDX is present.
> 
> Perhaps we should spell 'iff' in full, that is 'if and only if'?
> 

Sure. 

>> +
>>    Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional]
>>        This list of H-byte hashes describe a set of B commit-graph files that
>>        form a commit-graph chain. The graph position for the ith commit in this
>> diff --git a/commit-graph.c b/commit-graph.c
>> index 32a315058f..4585b3b702 100644
>> --- a/commit-graph.c
>> +++ b/commit-graph.c
>> @@ -24,8 +24,10 @@
>>  #define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
>>  #define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
>>  #define GRAPH_CHUNKID_EXTRAEDGES 0x45444745 /* "EDGE" */
>> +#define GRAPH_CHUNKID_BLOOMINDEXES 0x42494458 /* "BIDX" */
>> +#define GRAPH_CHUNKID_BLOOMDATA 0x42444154 /* "BDAT" */
>>  #define GRAPH_CHUNKID_BASE 0x42415345 /* "BASE" */
>> -#define MAX_NUM_CHUNKS 5
>> +#define MAX_NUM_CHUNKS 7
>>  
>>  #define GRAPH_DATA_WIDTH (the_hash_algo->rawsz + 16)
>>  
>> @@ -325,6 +327,32 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd,
>>  				chunk_repeated = 1;
>>  			else
>>  				graph->chunk_base_graphs = data + chunk_offset;
>> +			break;
>> +
>> +		case GRAPH_CHUNKID_BLOOMINDEXES:
>> +			if (graph->chunk_bloom_indexes)
>> +				chunk_repeated = 1;
>> +			else
>> +				graph->chunk_bloom_indexes = data + chunk_offset;
>> +			break;
>> +
>> +		case GRAPH_CHUNKID_BLOOMDATA:
>> +			if (graph->chunk_bloom_data)
>> +				chunk_repeated = 1;
>> +			else {
>> +				uint32_t hash_version;
>> +				graph->chunk_bloom_data = data + chunk_offset;
>> +				hash_version = get_be32(data + chunk_offset);
>> +
>> +				if (hash_version != 1)
>> +					break;
> 
> Shouldn't we mark Bloom filter as not to be used?  Or is it left for
> later commit?
> 

We take care of this in line 375. 

> In the future it might be good idea to notify the user (perhaps
> protected with some advice.* option) that there is problem with Bloom
> filter data, namely that we have encountered unsupported hash version.
> 
>> +
>> +				graph->bloom_filter_settings = xmalloc(sizeof(struct bloom_filter_settings));
> 
> Why is this structure allocated dynamically?  We are leaking admittedly
> a small amount of memory because we never free this xmalloc() result.
> 
> If we need this field being a pointer to struct to have NULL mean no
> supported Bloom filter data, we could have instead use chunk_bloom_*
> fields instead - we can set at least one of them to NULL.
> 

I am freeing this up in free_commit_graph but I messed up putting it in the right commit. 
Sorry about that. Fixed in v3. 

Also as discussed in https://lore.kernel.org/git/[email protected]/
there is a bug in commit-graph.c where we should be calling free_commit_graph() instead of 
just free(graph). I will do this in a separate series. 

>> +			}
>> +			break;
>>  		}
>>  
>>  		if (chunk_repeated) {
>> @@ -343,6 +371,17 @@ struct commit_graph *parse_commit_graph(void *graph_map, int fd,
>>  		last_chunk_offset = chunk_offset;
>>  	}
>>  
>> +	/* We need both the bloom chunks to exist together. Else ignore the data */
>> +	if ((graph->chunk_bloom_indexes && !graph->chunk_bloom_data)
>> +		 || (!graph->chunk_bloom_indexes && graph->chunk_bloom_data)) {
>> +		graph->chunk_bloom_indexes = NULL;
>> +		graph->chunk_bloom_data = NULL;
>> +		graph->bloom_filter_settings = NULL;
>> +	}
>> +
>> +	if (graph->chunk_bloom_indexes && graph->chunk_bloom_data)
>> +		load_bloom_filters();
> 
> Wouldn't it be simpler to rely on the fact that both Bloom chunks must
> exists for it to matter, and write it like this:
> 
>   +	if (graph->chunk_bloom_indexes && graph->chunk_bloom_data) {
>   +		load_bloom_filters();
>   +	} else {
>   +		graph->chunk_bloom_indexes = NULL;
>   +		graph->chunk_bloom_data = NULL;
>   +		graph->bloom_filter_settings = NULL;
>   +	}
> 

:) Yes. Fixed in v3. 

>> +
>>  static int oid_compare(const void *_a, const void *_b)
>>  {
>>  	const struct object_id *a = (const struct object_id *)_a;
>> @@ -1198,8 +1290,8 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
>>  	load_bloom_filters();
>>  
>>  	if (ctx->report_progress)
>> -		progress = start_progress(
>> -			_("Computing commit diff Bloom filters"),
>> +		progress = start_delayed_progress(
>> +			_("Computing changed paths Bloom filters"),
>>  			ctx->commits.nr);
>>
> 
> Ooops.  This look like a fixup which should be made to the original
> earlier commit instead, isn't it?


Yes. Should have been in a previous commit. Fixed in v3. 


>>  };
>>  
>>  struct commit_graph *load_commit_graph_one_fd_st(int fd, struct stat *st);
>> @@ -77,7 +82,7 @@ enum commit_graph_write_flags {
>>  	COMMIT_GRAPH_WRITE_SPLIT      = (1 << 2),
>>  	/* Make sure that each OID in the input is a valid commit OID. */
>>  	COMMIT_GRAPH_WRITE_CHECK_OIDS = (1 << 3),
>> -	COMMIT_GRAPH_WRITE_BLOOM_FILTERS = (1 << 4)
>> +	COMMIT_GRAPH_WRITE_BLOOM_FILTERS = (1 << 4),
> 
> This looks like accidental change; if we want to use trailing comma in
> enum, this change should be in my opinion done in the commit that added
> COMMIT_GRAPH_WRITE_BLOOM_FILTERS (as I have written in a comment there).
> 

Yes, I noticed the lack of the comma later and forgot to move it to the right
commit. Fixed in v3. 

> 
> Thank you for your work on this series.
> 
> Best,
> 

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jakub Narebski wrote (reply to this):

Garima Singh <[email protected]> writes:
> On 2/19/2020 10:13 AM, Jakub Narebski wrote:
>> "Garima Singh via GitGitGadget" <[email protected]> writes:
[...]
>>> diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
>>> index a4f17441ae..22e511643d 100644
>>> --- a/Documentation/technical/commit-graph-format.txt
>>> +++ b/Documentation/technical/commit-graph-format.txt
>>> @@ -17,6 +17,9 @@ metadata, including:
>>>  - The parents of the commit, stored using positional references within
>>>    the graph file.
>>>  
>>> +- The Bloom filter of the commit carrying the paths that were changed between
>>> +  the commit and its first parent.
>>> +
>> 
>> All right.
>> 
>> Should we also state that it is optional (meta)data?  This would be
>> first optional piece of data stored in commit-graph, I think.
>> 
>
> However the entire commit graph file is non critical metadata since git commands
> work just fine without it, just slower. The same applies to the changed path
> bloom filters. 
>
> Based on the definition of optional you are suggesting, edge data is optional
> because not every commit-graph has octopus merges. 

Well, edge data (EDGE chunk) is optional in different way from Bloom
filter data.  The former depends on the repository (whether there are
octopus merges used), the latter is opt-in user choice (whether to run
`git commit-graph write` with the `--changed-paths` option, or in the
future equivalent config option).

To provide some advise that can be acted upon: perhaps it would be
better to start with "It can store", or end with "if requested" or
"optionally".  For example the change could look like the following
suggestion:


 The Git commit graph stores a list of commit OIDs and some associated
 metadata, including:
[...]
+- The Bloom filter of the commit carrying the paths that were changed between
+  the commit and its first parent, if requested.
+

Best,
-- 
Jakub Nar�bski

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Garima Singh wrote (reply to this):


On 2/25/2020 6:40 AM, Jakub Narebski wrote:
> Garima Singh <[email protected]> writes:
>> On 2/19/2020 10:13 AM, Jakub Narebski wrote:
>>> "Garima Singh via GitGitGadget" <[email protected]> writes:
> [...]
>>>> diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
>>>> index a4f17441ae..22e511643d 100644
>>>> --- a/Documentation/technical/commit-graph-format.txt
>>>> +++ b/Documentation/technical/commit-graph-format.txt
>>>> @@ -17,6 +17,9 @@ metadata, including:
>>>>  - The parents of the commit, stored using positional references within
>>>>    the graph file.
>>>>  
>>>> +- The Bloom filter of the commit carrying the paths that were changed between
>>>> +  the commit and its first parent.
>>>> +
>>>
>>> All right.
>>>
>>> Should we also state that it is optional (meta)data?  This would be
>>> first optional piece of data stored in commit-graph, I think.
>>>
>>
>> However the entire commit graph file is non critical metadata since git commands
>> work just fine without it, just slower. The same applies to the changed path
>> bloom filters. 
>>
>> Based on the definition of optional you are suggesting, edge data is optional
>> because not every commit-graph has octopus merges. 
> 
> Well, edge data (EDGE chunk) is optional in different way from Bloom
> filter data.  The former depends on the repository (whether there are
> octopus merges used), the latter is opt-in user choice (whether to run
> `git commit-graph write` with the `--changed-paths` option, or in the
> future equivalent config option).
> 
> To provide some advise that can be acted upon: perhaps it would be
> better to start with "It can store", or end with "if requested" or
> "optionally".  For example the change could look like the following
> suggestion:
> 
> 
>  The Git commit graph stores a list of commit OIDs and some associated
>  metadata, including:
> [...]
> +- The Bloom filter of the commit carrying the paths that were changed between
> +  the commit and its first parent, if requested.
> +
> 
> Best,
> 

Sure. That makes sense. Will incorporate in v3. 

Cheers!
Garima Singh

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, SZEDER Gábor wrote (reply to this):

On Mon, Apr 06, 2020 at 04:59:49PM +0000, Garima Singh via GitGitGadget wrote:
> From: Garima Singh <[email protected]>
> 
> Update the technical documentation for commit-graph-format with
> the formats for the Bloom filter index (BIDX) and Bloom filter
> data (BDAT) chunks. Write the computed Bloom filters information
> to the commit graph file using this format.
> 
> Helped-by: Derrick Stolee <[email protected]>
> Signed-off-by: Garima Singh <[email protected]>
> ---
>  .../technical/commit-graph-format.txt         |  30 +++++
>  commit-graph.c                                | 113 +++++++++++++++++-
>  commit-graph.h                                |   5 +
>  3 files changed, 147 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
> index a4f17441aed..de56f9f1efd 100644
> --- a/Documentation/technical/commit-graph-format.txt
> +++ b/Documentation/technical/commit-graph-format.txt
> @@ -17,6 +17,9 @@ metadata, including:
>  - The parents of the commit, stored using positional references within
>    the graph file.
>  
> +- The Bloom filter of the commit carrying the paths that were changed between
> +  the commit and its first parent, if requested.
> +
>  These positional references are stored as unsigned 32-bit integers
>  corresponding to the array position within the list of commit OIDs. Due
>  to some special constants we use to track parents, we can store at most
> @@ -93,6 +96,33 @@ CHUNK DATA:
>        positions for the parents until reaching a value with the most-significant
>        bit on. The other bits correspond to the position of the last parent.
>  
> +  Bloom Filter Index (ID: {'B', 'I', 'D', 'X'}) (N * 4 bytes) [Optional]
> +    * The ith entry, BIDX[i], stores the number of 8-byte word blocks in all

This is inconsistent with the implementation: according to the code in
one of the previous patches these entries are simple byte offsets, not
8-byte word offsets, i.e. the combined size of all modified path
Bloom filters can be at most 2^32 bytes.

The commit-graph file can contain information about at most 2^31-1
commits.  This means that with that many commits each commit can have
a merely 2 byte Bloom filter on average.  When using 7 hashes we'd
need 10 bits per path, so in two bytes we could store only a single
path.

Clearly, using 4 byte index entries significantly lowers the max
number of commits that can be stored with modified path Bloom filters.
IMO every new chunk must support at least 2^31-1 commits.

> +      Bloom filters from commit 0 to commit i (inclusive) in lexicographic
> +      order. The Bloom filter for the i-th commit spans from BIDX[i-1] to
> +      BIDX[i] (plus header length), where BIDX[-1] is 0.
> +    * The BIDX chunk is ignored if the BDAT chunk is not present.
> +
> +  Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional]
> +    * It starts with header consisting of three unsigned 32-bit integers:
> +      - Version of the hash algorithm being used. We currently only support
> +	value 1 which corresponds to the 32-bit version of the murmur3 hash
> +	implemented exactly as described in
> +	https://en.wikipedia.org/wiki/MurmurHash#Algorithm and the double
> +	hashing technique using seed values 0x293ae76f and 0x7e646e2 as
> +	described in https://doi.org/10.1007/978-3-540-30494-4_26 "Bloom Filters
> +	in Probabilistic Verification"

How should double hashing compute the k hashes, i.e. using 64 bit or
32 bit unsigned integer arithmetic?

I'm puzzled that you link to this paper and still use double hashing.

Two of the contributions of that paper are that it points out some
shortcomings of the double hashing scheme and provides a better
alternative in the form of enhanced double hashing, which can cut the
false positive rate in half.

However, that paper considers the hashing scheme only in the context
of one big Bloom filter.  I've found that when it comes to many small
Bloom filters then the k hashes produced by any double hashing variant
are not independent enough, and "standard" double hashing fares the
worst among them.  There are real repositories out there where double
hashing has over an order of magnitude higher average false positive
rate than enhanced double hashing.  Though that's not to say that
enhanced double hashing is good...

For details on these issues see

  https://public-inbox.org/git/[email protected]

> +      - The number of times a path is hashed and hence the number of bit positions
> +	      that cumulatively determine whether a file is present in the commit.
> +      - The minimum number of bits 'b' per entry in the Bloom filter. If the filter
> +	      contains 'n' entries, then the filter size is the minimum number of 64-bit
> +	      words that contain n*b bits.

Since the ideal number of bits per element depends only on the number
of hashes per path (k / ln(2) ≈ k * 10 / 7), why is this value stored
in the commit-graph?

> +    * The rest of the chunk is the concatenation of all the computed Bloom
> +      filters for the commits in lexicographic order.
> +    * Note: Commits with no changes or more than 512 changes have Bloom filters
> +      of length zero.

What does this "Note:" prefix mean in the file format specification?

Can an implementation use a one byte Bloom filter with no bits set for
a commit with no changes?  Can an implementation still store a Bloom
filter for commits that modify more than 512 paths?

> +    * The BDAT chunk is present if and only if BIDX is present.
> +
>    Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional]
>        This list of H-byte hashes describe a set of B commit-graph files that
>        form a commit-graph chain. The graph position for the ith commit in this
> diff --git a/commit-graph.c b/commit-graph.c
> index 732c81fa1b2..a8b6b5cca5d 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c

> @@ -1034,6 +1071,59 @@ static void write_graph_chunk_extra_edges(struct hashfile *f,
>  	}
>  }
>  
> +static void write_graph_chunk_bloom_indexes(struct hashfile *f,
> +					    struct write_commit_graph_context *ctx)
> +{
> +	struct commit **list = ctx->commits.list;
> +	struct commit **last = ctx->commits.list + ctx->commits.nr;
> +	uint32_t cur_pos = 0;
> +	struct progress *progress = NULL;
> +	int i = 0;
> +
> +	if (ctx->report_progress)
> +		progress = start_delayed_progress(
> +			_("Writing changed paths Bloom filters index"),
> +			ctx->commits.nr);
> +
> +	while (list < last) {
> +		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
> +		cur_pos += filter->len;

Given a sufficiently large number of commits with large enough Bloom
filters this will silently overflow.

> +		display_progress(progress, ++i);
> +		hashwrite_be32(f, cur_pos);
> +		list++;
> +	}
> +
> +	stop_progress(&progress);
> +}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 5/29/2020 4:57 AM, SZEDER Gábor wrote:
> On Mon, Apr 06, 2020 at 04:59:49PM +0000, Garima Singh via GitGitGadget wrote:
>> From: Garima Singh <[email protected]>
>>
>> Update the technical documentation for commit-graph-format with
>> the formats for the Bloom filter index (BIDX) and Bloom filter
>> data (BDAT) chunks. Write the computed Bloom filters information
>> to the commit graph file using this format.
>>
>> Helped-by: Derrick Stolee <[email protected]>
>> Signed-off-by: Garima Singh <[email protected]>
>> ---
>>  .../technical/commit-graph-format.txt         |  30 +++++
>>  commit-graph.c                                | 113 +++++++++++++++++-
>>  commit-graph.h                                |   5 +
>>  3 files changed, 147 insertions(+), 1 deletion(-)
>>
>> diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
>> index a4f17441aed..de56f9f1efd 100644
>> --- a/Documentation/technical/commit-graph-format.txt
>> +++ b/Documentation/technical/commit-graph-format.txt
>> @@ -17,6 +17,9 @@ metadata, including:
>>  - The parents of the commit, stored using positional references within
>>    the graph file.
>>  
>> +- The Bloom filter of the commit carrying the paths that were changed between
>> +  the commit and its first parent, if requested.
>> +
>>  These positional references are stored as unsigned 32-bit integers
>>  corresponding to the array position within the list of commit OIDs. Due
>>  to some special constants we use to track parents, we can store at most
>> @@ -93,6 +96,33 @@ CHUNK DATA:
>>        positions for the parents until reaching a value with the most-significant
>>        bit on. The other bits correspond to the position of the last parent.
>>  
>> +  Bloom Filter Index (ID: {'B', 'I', 'D', 'X'}) (N * 4 bytes) [Optional]
>> +    * The ith entry, BIDX[i], stores the number of 8-byte word blocks in all
> 
> This is inconsistent with the implementation: according to the code in
> one of the previous patches these entries are simple byte offsets, not
> 8-byte word offsets, i.e. the combined size of all modified path
> Bloom filters can be at most 2^32 bytes.

The documentation was fixed in 88093289cdc (Documentation: changed-path Bloom
filters use byte words, 2020-05-11).

> The commit-graph file can contain information about at most 2^31-1
> commits.  This means that with that many commits each commit can have
> a merely 2 byte Bloom filter on average.  When using 7 hashes we'd
> need 10 bits per path, so in two bytes we could store only a single
> path.
> 
> Clearly, using 4 byte index entries significantly lowers the max
> number of commits that can be stored with modified path Bloom filters.

This is a good point, and certainly the reason for 8-byte multiples.

> IMO every new chunk must support at least 2^31-1 commits.

I'm not sure this is a valid requirement. Even extremely large repositories
(that are created by actual use, not synthetic) are on the scale of 2^24
commits.

You are right that we should make the commit-graph write process more robust
to reaching these limits. You point out that we have a new limit when these
filters are enabled.

For reference, the Windows OS repo has ~4.25 million commits and the
commit-graph file with changed-path Bloom filters is around 520mb. That's
the whole file size, and without the filters it's around 240mb, so the
filters are taking <300mb ~ 2^29 and we would need to grow the repo by 8x
to hit this limit. That's not an unreasonable amount of growth, but is
also far enough away that we can handle it in time.

The incremental commit-graph can actually save us here (and is similar to
how we solved a scale issue in Azure Repos around the multi-pack-index):
we can refuse to merge layers of an incremental commit-graph if the
changed-path filters would exceed the size limit. Of course, the _first_
write of such a commit-graph would need to be aware of this limit and
plan for it in advance, but that's also a theoretical issue.

I'm tracking some follow-up work [1] for the changed-path filters,
including a way to limit the number of filters computed in one
"git commit-graph write" process. I'll make note of your concerns here,
too.

[1] https://github.com/microsoft/git/issues/272

>> +      Bloom filters from commit 0 to commit i (inclusive) in lexicographic
>> +      order. The Bloom filter for the i-th commit spans from BIDX[i-1] to
>> +      BIDX[i] (plus header length), where BIDX[-1] is 0.
>> +    * The BIDX chunk is ignored if the BDAT chunk is not present.
>> +
>> +  Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional]
>> +    * It starts with header consisting of three unsigned 32-bit integers:
>> +      - Version of the hash algorithm being used. We currently only support
>> +	value 1 which corresponds to the 32-bit version of the murmur3 hash
>> +	implemented exactly as described in
>> +	https://en.wikipedia.org/wiki/MurmurHash#Algorithm and the double
>> +	hashing technique using seed values 0x293ae76f and 0x7e646e2 as
>> +	described in https://doi.org/10.1007/978-3-540-30494-4_26 "Bloom Filters
>> +	in Probabilistic Verification"
> 
> How should double hashing compute the k hashes, i.e. using 64 bit or
> 32 bit unsigned integer arithmetic?
> 
> I'm puzzled that you link to this paper and still use double hashing.
> 
> Two of the contributions of that paper are that it points out some
> shortcomings of the double hashing scheme and provides a better
> alternative in the form of enhanced double hashing, which can cut the
> false positive rate in half.
> 
> However, that paper considers the hashing scheme only in the context
> of one big Bloom filter.  I've found that when it comes to many small
> Bloom filters then the k hashes produced by any double hashing variant
> are not independent enough, and "standard" double hashing fares the
> worst among them.  There are real repositories out there where double
> hashing has over an order of magnitude higher average false positive
> rate than enhanced double hashing.  Though that's not to say that
> enhanced double hashing is good...
> 
> For details on these issues see
> 
>   https://public-inbox.org/git/[email protected]

That message includes very detailed experimental analysis, which is nice.
We will need to do some concrete side-by-side comparisons to see if there
actually is a meaningful difference. (You may have already done this.)

>> +      - The number of times a path is hashed and hence the number of bit positions
>> +	      that cumulatively determine whether a file is present in the commit.
>> +      - The minimum number of bits 'b' per entry in the Bloom filter. If the filter
>> +	      contains 'n' entries, then the filter size is the minimum number of 64-bit
>> +	      words that contain n*b bits.
> 
> Since the ideal number of bits per element depends only on the number
> of hashes per path (k / ln(2) ≈ k * 10 / 7), why is this value stored
> in the commit-graph?

The ideal number depends also on what false-positive rate you want. In a
hypothetical future where we want to allow customization here, we want
the filters to be consistently sized across all filters.

>> +    * The rest of the chunk is the concatenation of all the computed Bloom
>> +      filters for the commits in lexicographic order.
>> +    * Note: Commits with no changes or more than 512 changes have Bloom filters
>> +      of length zero.
> 
> What does this "Note:" prefix mean in the file format specification?
> 
> Can an implementation use a one byte Bloom filter with no bits set for
> a commit with no changes?  Can an implementation still store a Bloom
> filter for commits that modify more than 512 paths?

This is currently due to a hard-coded value in the implementation. It's not a
requirement of the file format.

>> +    * The BDAT chunk is present if and only if BIDX is present.
>> +
>>    Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional]
>>        This list of H-byte hashes describe a set of B commit-graph files that
>>        form a commit-graph chain. The graph position for the ith commit in this
>> diff --git a/commit-graph.c b/commit-graph.c
>> index 732c81fa1b2..a8b6b5cca5d 100644
>> --- a/commit-graph.c
>> +++ b/commit-graph.c
> 
>> @@ -1034,6 +1071,59 @@ static void write_graph_chunk_extra_edges(struct hashfile *f,
>>  	}
>>  }
>>  
>> +static void write_graph_chunk_bloom_indexes(struct hashfile *f,
>> +					    struct write_commit_graph_context *ctx)
>> +{
>> +	struct commit **list = ctx->commits.list;
>> +	struct commit **last = ctx->commits.list + ctx->commits.nr;
>> +	uint32_t cur_pos = 0;
>> +	struct progress *progress = NULL;
>> +	int i = 0;
>> +
>> +	if (ctx->report_progress)
>> +		progress = start_delayed_progress(
>> +			_("Writing changed paths Bloom filters index"),
>> +			ctx->commits.nr);
>> +
>> +	while (list < last) {
>> +		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
>> +		cur_pos += filter->len;
> 
> Given a sufficiently large number of commits with large enough Bloom
> filters this will silently overflow.

Worth fixing, but we are not in a rush. I noted it in my GitHub issue.

Thanks,
-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, SZEDER Gábor wrote (reply to this):

On Fri, May 29, 2020 at 09:35:17AM -0400, Derrick Stolee wrote:
> >> +  Bloom Filter Index (ID: {'B', 'I', 'D', 'X'}) (N * 4 bytes) [Optional]
> >> +    * The ith entry, BIDX[i], stores the number of 8-byte word blocks in all
> > 
> > This is inconsistent with the implementation: according to the code in
> > one of the previous patches these entries are simple byte offsets, not
> > 8-byte word offsets, i.e. the combined size of all modified path
> > Bloom filters can be at most 2^32 bytes.
> 
> The documentation was fixed in 88093289cdc (Documentation: changed-path Bloom
> filters use byte words, 2020-05-11).

Oh, good.  I'm waaay behind the curve and haven't seen this fix.  Even
better, now I also noticed that two bugs I was about to report have
been fixed already (though both fixes have minor flaws).

Ok, so at least the specs are consistent with the implementation.  I'm
not sure this was done in the right direction, though, because too
small Bloom filters do hurt performance.

> > Clearly, using 4 byte index entries significantly lowers the max
> > number of commits that can be stored with modified path Bloom filters.
> 
> This is a good point, and certainly the reason for 8-byte multiples.

Note that Bloom filters with power-of-two number of bits have higher
false positive probabilities when using some form of double hashing.
When going for 8 byte blocks all commits modifying <= 12 paths
(assuming 7 hashes per path) will have power-of-2 sized Bloom filters
(64 or 128 bits), and that is a lot of commits.

> The incremental commit-graph can actually save us here

Oh, I haven't thought of that

> >> +      Bloom filters from commit 0 to commit i (inclusive) in lexicographic
> >> +      order. The Bloom filter for the i-th commit spans from BIDX[i-1] to
> >> +      BIDX[i] (plus header length), where BIDX[-1] is 0.
> >> +    * The BIDX chunk is ignored if the BDAT chunk is not present.
> >> +
> >> +  Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional]
> >> +    * It starts with header consisting of three unsigned 32-bit integers:
> >> +      - Version of the hash algorithm being used. We currently only support
> >> +	value 1 which corresponds to the 32-bit version of the murmur3 hash
> >> +	implemented exactly as described in
> >> +	https://en.wikipedia.org/wiki/MurmurHash#Algorithm and the double
> >> +	hashing technique using seed values 0x293ae76f and 0x7e646e2 as
> >> +	described in https://doi.org/10.1007/978-3-540-30494-4_26 "Bloom Filters
> >> +	in Probabilistic Verification"
> > 
> > How should double hashing compute the k hashes, i.e. using 64 bit or
> > 32 bit unsigned integer arithmetic?

Note that this should be clarified in the specs.

> >> +      - The number of times a path is hashed and hence the number of bit positions
> >> +	      that cumulatively determine whether a file is present in the commit.
> >> +      - The minimum number of bits 'b' per entry in the Bloom filter. If the filter
> >> +	      contains 'n' entries, then the filter size is the minimum number of 64-bit
> >> +	      words that contain n*b bits.
> > 
> > Since the ideal number of bits per element depends only on the number
> > of hashes per path (k / ln(2) ≈ k * 10 / 7), why is this value stored
> > in the commit-graph?
> 
> The ideal number depends also on what false-positive rate you want.

Well, yes, but indirectly:  according to Wikipedia :) the optimal
number of hashes per element depends only on the desired false
probability, and the optimal number of bits per element depends only
on the number of hashes per element.

So storing the min number of bits per entry seems to be redundant.

> In a
> hypothetical future where we want to allow customization here, we want
> the filters to be consistently sized across all filters.

Wouldn't customizing through the number of hashes be sufficient?

> >> +    * Note: Commits with no changes or more than 512 changes have Bloom filters
> >> +      of length zero.
> > 
> > What does this "Note:" prefix mean in the file format specification?
> > 
> > Can an implementation use a one byte Bloom filter with no bits set for
> > a commit with no changes?  Can an implementation still store a Bloom
> > filter for commits that modify more than 512 paths?
> 
> This is currently due to a hard-coded value in the implementation. It's not a
> requirement of the file format.

Should an implementation detail like that be part of the specs?  It
sure caused a bit of confusion here.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, SZEDER Gábor wrote (reply to this):

76ffbca71a (commit-graph: write Bloom filters to commit graph file,
2020-04-06) added two delayed progress lines to writing the Bloom
filter index and data chunk.  This is wrong, because a single common
progress is used while writing all chunks, which is not updated while
writing these two new chunks, resulting in incomplete-looking "done"
lines:

  Expanding reachable commits in commit graph: 888679, done.
  Computing commit changed paths Bloom filters: 100% (888678/888678), done.
  Writing out commit graph in 6 passes:  66% (3554712/5332068), done.

Use the common 'struct progress' instance while writing the Bloom
filter chunks as well.

Signed-off-by: SZEDER Gábor <[email protected]>
---
 commit-graph.c | 22 ++--------------------
 1 file changed, 2 insertions(+), 20 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index aaf3327ede..65cf32637c 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1086,23 +1086,14 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f,
 	struct commit **list = ctx->commits.list;
 	struct commit **last = ctx->commits.list + ctx->commits.nr;
 	uint32_t cur_pos = 0;
-	struct progress *progress = NULL;
-	int i = 0;
-
-	if (ctx->report_progress)
-		progress = start_delayed_progress(
-			_("Writing changed paths Bloom filters index"),
-			ctx->commits.nr);
 
 	while (list < last) {
 		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
 		cur_pos += filter->len;
-		display_progress(progress, ++i);
+		display_progress(ctx->progress, ++ctx->progress_cnt);
 		hashwrite_be32(f, cur_pos);
 		list++;
 	}
-
-	stop_progress(&progress);
 }
 
 static void write_graph_chunk_bloom_data(struct hashfile *f,
@@ -1111,13 +1102,6 @@ static void write_graph_chunk_bloom_data(struct hashfile *f,
 {
 	struct commit **list = ctx->commits.list;
 	struct commit **last = ctx->commits.list + ctx->commits.nr;
-	struct progress *progress = NULL;
-	int i = 0;
-
-	if (ctx->report_progress)
-		progress = start_delayed_progress(
-			_("Writing changed paths Bloom filters data"),
-			ctx->commits.nr);
 
 	hashwrite_be32(f, settings->hash_version);
 	hashwrite_be32(f, settings->num_hashes);
@@ -1125,12 +1109,10 @@ static void write_graph_chunk_bloom_data(struct hashfile *f,
 
 	while (list < last) {
 		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
-		display_progress(progress, ++i);
+		display_progress(ctx->progress, ++ctx->progress_cnt);
 		hashwrite(f, filter->data, filter->len * sizeof(unsigned char));
 		list++;
 	}
-
-	stop_progress(&progress);
 }
 
 static int oid_compare(const void *_a, const void *_b)
-- 
2.27.0.547.g4ba2d26563

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 7/9/2020 1:00 PM, SZEDER Gábor wrote:
> 76ffbca71a (commit-graph: write Bloom filters to commit graph file,
> 2020-04-06) added two delayed progress lines to writing the Bloom
> filter index and data chunk.  This is wrong, because a single common
> progress is used while writing all chunks, which is not updated while
> writing these two new chunks, resulting in incomplete-looking "done"
> lines:
> 
>   Expanding reachable commits in commit graph: 888679, done.
>   Computing commit changed paths Bloom filters: 100% (888678/888678), done.
>   Writing out commit graph in 6 passes:  66% (3554712/5332068), done.
> 
> Use the common 'struct progress' instance while writing the Bloom
> filter chunks as well.

Thanks for finding this. It's a clearly correct way to go,
and is one of the things that did not get updated properly
between the old prototype when applying it on the new code
that included this ctx->progress pattern.

Junio: head's up that this will conflict with the final patch
in ds/maintenance. I'll remove my edits to these methods in
my v2 to make that merge a bit easier.

Thanks,
-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 7/9/2020 2:01 PM, Derrick Stolee wrote:
> Junio: head's up that this will conflict with the final patch
> in ds/maintenance. I'll remove my edits to these methods in
> my v2 to make that merge a bit easier.
Or, I'm getting confused, because I changed start_progress()
calls in midx.c, not commit-graph.c. Please ignore my scattered
brain.

Thanks,
-Stolee

the graph file.

- The Bloom filter of the commit carrying the paths that were changed between
the commit and its first parent, if requested.

These positional references are stored as unsigned 32-bit integers
corresponding to the array position within the list of commit OIDs. Due
to some special constants we use to track parents, we can store at most
Expand Down Expand Up @@ -93,6 +96,33 @@ CHUNK DATA:
positions for the parents until reaching a value with the most-significant
bit on. The other bits correspond to the position of the last parent.

Bloom Filter Index (ID: {'B', 'I', 'D', 'X'}) (N * 4 bytes) [Optional]
* The ith entry, BIDX[i], stores the number of 8-byte word blocks in all
Bloom filters from commit 0 to commit i (inclusive) in lexicographic
order. The Bloom filter for the i-th commit spans from BIDX[i-1] to
BIDX[i] (plus header length), where BIDX[-1] is 0.
* The BIDX chunk is ignored if the BDAT chunk is not present.

Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional]
* It starts with header consisting of three unsigned 32-bit integers:
- Version of the hash algorithm being used. We currently only support
value 1 which corresponds to the 32-bit version of the murmur3 hash
implemented exactly as described in
https://en.wikipedia.org/wiki/MurmurHash#Algorithm and the double
hashing technique using seed values 0x293ae76f and 0x7e646e2 as
described in https://doi.org/10.1007/978-3-540-30494-4_26 "Bloom Filters
in Probabilistic Verification"
- The number of times a path is hashed and hence the number of bit positions
that cumulatively determine whether a file is present in the commit.
- The minimum number of bits 'b' per entry in the Bloom filter. If the filter
contains 'n' entries, then the filter size is the minimum number of 64-bit
words that contain n*b bits.
* The rest of the chunk is the concatenation of all the computed Bloom
filters for the commits in lexicographic order.
* Note: Commits with no changes or more than 512 changes have Bloom filters
of length zero.
* The BDAT chunk is present if and only if BIDX is present.

Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional]
This list of H-byte hashes describe a set of B commit-graph files that
form a commit-graph chain. The graph position for the ith commit in this
Expand Down
2 changes: 2 additions & 0 deletions Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -695,6 +695,7 @@ X =
PROGRAMS += $(patsubst %.o,git-%$X,$(PROGRAM_OBJS))

TEST_BUILTINS_OBJS += test-advise.o
TEST_BUILTINS_OBJS += test-bloom.o
TEST_BUILTINS_OBJS += test-chmtime.o
TEST_BUILTINS_OBJS += test-config.o
TEST_BUILTINS_OBJS += test-ctype.o
Expand Down Expand Up @@ -840,6 +841,7 @@ LIB_OBJS += base85.o
LIB_OBJS += bisect.o
LIB_OBJS += blame.o
LIB_OBJS += blob.o
LIB_OBJS += bloom.o
LIB_OBJS += branch.o
LIB_OBJS += bulk-checkin.o
LIB_OBJS += bundle.o
Expand Down
275 changes: 275 additions & 0 deletions bloom.c
Original file line number Diff line number Diff line change
@@ -0,0 +1,275 @@
#include "git-compat-util.h"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jakub Narebski wrote (reply to this):

"Derrick Stolee via GitGitGadget" <[email protected]> writes:

> From: Derrick Stolee <[email protected]>
>
> When computing the changed-paths bloom filters for the commit-graph,
> we limit the size of the filter by restricting the number of paths
> in the diff. Instead of computing a large diff and then ignoring the
> result, it is better to halt the diff computation early.

Good idea.

>
> Create a new "max_changes" option in struct diff_options. If non-zero,
> then halt the diff computation after discovering strictly more changed
> paths. This includes paths corresponding to trees that change.

All right; also, it doesn't need to be exact, though it would be good if
it was.

512 changed paths (changed files) usually generate more than 512
elements to be added to the Bloom filter (changed directories and
files), anyway.

>
> Use this max_changes option in the bloom filter calculations. This
> reduces the time taken to compute the filters for the Linux kernel
> repo from 2m50s to 2m35s. On a large internal repository with ~500
> commits that perform tree-wide changes, the time reduced from
> 6m15s to 3m48s.

I wonder if there is some large open-source project with many commits
performing tree-wide changes, that is with many commits with more than
512 changed files with respect to the first parent.

Maybe https://github.com/whosonfirst-data/whosonfirst-data-venue-us-ny
from "Top Ten Worst Repositories to host on GitHub - Git Merge 2017"
could be a good repository to test ;-)

>
> Signed-off-by: Derrick Stolee <[email protected]>
> Signed-off-by: Garima Singh <[email protected]>

Looks good to me, but that is from cursory examination.
Don't know the area to say anything more.

> ---
>  bloom.c     | 4 +++-
>  diff.h      | 5 +++++
>  tree-diff.c | 6 ++++++
>  3 files changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/bloom.c b/bloom.c
> index 6082193a75..818382c03b 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -134,6 +134,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>  	struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
>  	int i;
>  	struct diff_options diffopt;
> +	int max_changes = 512;
>  
>  	if (!bloom_filters.slab_size)
>  		return NULL;
> @@ -142,6 +143,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>  
>  	repo_diff_setup(r, &diffopt);
>  	diffopt.flags.recursive = 1;
> +	diffopt.max_changes = max_changes;
>  	diff_setup_done(&diffopt);
>  
>  	if (c->parents)
> @@ -150,7 +152,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>  		diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
>  	diffcore_std(&diffopt);
>  
> -	if (diff_queued_diff.nr <= 512) {
> +	if (diff_queued_diff.nr <= max_changes) {
>  		struct hashmap pathmap;
>  		struct pathmap_hash_entry* e;
>  		struct hashmap_iter iter;
> diff --git a/diff.h b/diff.h
> index 6febe7e365..9443dc1b00 100644
> --- a/diff.h
> +++ b/diff.h
> @@ -285,6 +285,11 @@ struct diff_options {
>  	/* Number of hexdigits to abbreviate raw format output to. */
>  	int abbrev;
>  
> +	/* If non-zero, then stop computing after this many changes. */
> +	int max_changes;
> +	/* For internal use only. */
> +	int num_changes;
> +
>  	int ita_invisible_in_index;
>  /* white-space error highlighting */
>  #define WSEH_NEW (1<<12)
> diff --git a/tree-diff.c b/tree-diff.c
> index 33ded7f8b3..f3d303c6e5 100644
> --- a/tree-diff.c
> +++ b/tree-diff.c
> @@ -434,6 +434,9 @@ static struct combine_diff_path *ll_diff_tree_paths(
>  		if (diff_can_quit_early(opt))
>  			break;
>  
> +		if (opt->max_changes && opt->num_changes > opt->max_changes)
> +			break;
> +
>  		if (opt->pathspec.nr) {
>  			skip_uninteresting(&t, base, opt);
>  			for (i = 0; i < nparent; i++)
> @@ -518,6 +521,7 @@ static struct combine_diff_path *ll_diff_tree_paths(
>  
>  			/* t� */
>  			update_tree_entry(&t);
> +			opt->num_changes++;
>  		}
>  
>  		/* t > p[imin] */
> @@ -535,6 +539,7 @@ static struct combine_diff_path *ll_diff_tree_paths(
>  		skip_emit_tp:
>  			/* � pi=p[imin]  pi� */
>  			update_tp_entries(tp, nparent);
> +			opt->num_changes++;
>  		}
>  	}
>  
> @@ -552,6 +557,7 @@ struct combine_diff_path *diff_tree_paths(
>  	const struct object_id **parents_oid, int nparent,
>  	struct strbuf *base, struct diff_options *opt)
>  {
> +	opt->num_changes = 0;
>  	p = ll_diff_tree_paths(p, oid, parents_oid, nparent, base, opt);
>  
>  	/*

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Garima Singh wrote (reply to this):


On 2/16/2020 7:00 PM, Jakub Narebski wrote:
> "Derrick Stolee via GitGitGadget" <[email protected]> writes:
> 
>> From: Derrick Stolee <[email protected]>
>>
>> Use this max_changes option in the bloom filter calculations. This
>> reduces the time taken to compute the filters for the Linux kernel
>> repo from 2m50s to 2m35s. On a large internal repository with ~500
>> commits that perform tree-wide changes, the time reduced from
>> 6m15s to 3m48s.
> 
> I wonder if there is some large open-source project with many commits
> performing tree-wide changes, that is with many commits with more than
> 512 changed files with respect to the first parent.
> 
> Maybe https://github.com/whosonfirst-data/whosonfirst-data-venue-us-ny
> from "Top Ten Worst Repositories to host on GitHub - Git Merge 2017"
> could be a good repository to test ;-)
> 

Thanks for the suggestion! I will see if any of these repos gives us a 
good test bed and add the perf improvement numbers in the appropriate
commit messages in v3. 

Cheers! 
Garima Singh

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jakub Narebski wrote (reply to this):

"Garima Singh via GitGitGadget" <[email protected]> writes:

> From: Garima Singh <[email protected]>
>
> Read previously computed Bloom filters from the commit-graph file if
> possible to avoid recomputing during commit-graph write.

All right, what is written makes sense for this point in patch series.

But it my opinion it is more important to state that this commit adds
"parsing" of the Bloom filter data from commit-graph file.  This means
that it needs to be calculated only once, then stored in commit-graph,
ready to be re-used.

>
> See Documentation/technical/commit-graph-format for the format in which
> the Bloom filter information is written to the commit graph file.
>
> To read Bloom filter for a given commit with lexicographic position
> 'i' we need to:
> 1. Read BIDX[i] which essentially gives us the starting index in BDAT for
>    filter of commit i+1. It is essentially the index past the end
>    of the filter of commit i. It is called end_index in the code.
>
> 2. For i>0, read BIDX[i-1] which will give us the starting index in BDAT
>    for filter of commit i. It is called the start_index in the code.
>    For the first commit, where i = 0, Bloom filter data starts at the
>    beginning, just past the header in the BDAT chunk. Hence, start_index
>    will be 0.
>
> 3. The length of the filter will be end_index - start_index, because
>    BIDX[i] gives the cumulative 8-byte words including the ith
>    commit's filter.
>
> We toggle whether Bloom filters should be recomputed based on the
> compute_if_null flag.

Nitpick: the flag (the parameter) is called compute_if_not_present, not
compute_if_null.

All right, this explanation is nice and clear.

>
> Helped-by: Derrick Stolee <[email protected]>
> Signed-off-by: Garima Singh <[email protected]>
> ---
>  bloom.c               | 49 ++++++++++++++++++++++++++++++++++++++++++-
>  bloom.h               |  4 +++-
>  commit-graph.c        |  7 ++++---
>  t/helper/test-bloom.c |  2 +-
>  4 files changed, 56 insertions(+), 6 deletions(-)
>
> diff --git a/bloom.c b/bloom.c
> index 818382c03b..90d84dc713 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -1,5 +1,7 @@
>  #include "git-compat-util.h"
>  #include "bloom.h"
> +#include "commit.h"
> +#include "commit-slab.h"
>  #include "commit-graph.h"
>  #include "object-store.h"
>  #include "diff.h"
> @@ -127,8 +129,39 @@ void add_key_to_filter(struct bloom_key *key,
>  	}
>  }
>  
> +static int load_bloom_filter_from_graph(struct commit_graph *g,
> +				   struct bloom_filter *filter,
> +				   struct commit *c)
> +{
> +	uint32_t lex_pos, start_index, end_index;
> +
> +	while (c->graph_pos < g->num_commits_in_base)
> +		g = g->base_graph;
> +
> +	/* The commit graph commit 'c' lives in doesn't carry bloom filters. */
> +	if (!g->chunk_bloom_indexes)
> +		return 0;
> +
> +	lex_pos = c->graph_pos - g->num_commits_in_base;

All right, this finds lexicographical position of the commit following
the chain of incremental commit-graph files, and also check if the
commit-graph fragment that contains the commit in question has Bloom
filter data included.

> +
> +	end_index = get_be32(g->chunk_bloom_indexes + 4 * lex_pos);
> +
> +	if (lex_pos)

Wouldn't it be better to be more explicit, and write

  +	if (lex_pos > 0)


> +		start_index = get_be32(g->chunk_bloom_indexes + 4 * (lex_pos - 1));
> +	else
> +		start_index = 0;

All right, here we find start_index and end_index.

It might be good idea to at least assert() that start_index <= end_index,
though that should not happen (that is why I propose for this check to
be compiled on only for debug builds).

> +
> +	filter->len = end_index - start_index;
> +	filter->data = (uint64_t *)(g->chunk_bloom_data +
> +					sizeof(uint64_t) * start_index +
> +					BLOOMDATA_CHUNK_HEADER_SIZE);

All right, nice use of constant.

> +
> +	return 1;
> +}
> +
>  struct bloom_filter *get_bloom_filter(struct repository *r,
> -				      struct commit *c)
> +				      struct commit *c,
> +				      int compute_if_not_present)
>  {
>  	struct bloom_filter *filter;
>  	struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
> @@ -141,6 +174,20 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>  
>  	filter = bloom_filter_slab_at(&bloom_filters, c);
>  
> +	if (!filter->data) {
> +		load_commit_graph_info(r, c);
> +		if (c->graph_pos != COMMIT_NOT_FROM_GRAPH &&
> +			r->objects->commit_graph->chunk_bloom_indexes) {

All right, the limitation that the top layer of incremental commit graph
needs to have Bloom filters enabled for it to be even considered is
reasonable tradeoff, in my opinion.

> +			if (load_bloom_filter_from_graph(r->objects->commit_graph, filter, c))
> +				return filter;
> +			else
> +				return NULL;

If it should have filter, return it, otherwise return NULL.

I wonder however when it can return NULL (and whether it should compute
Bloom filters if required instead).

> +		}
> +	}
> +
> +	if (filter->data || !compute_if_not_present)
> +		return filter;

If we have filter from slab, return it.  All right.

However, according to documentation contained in comments in
commit-slab.h, bloom_filter_slab_at() will allocate the location to
store the data, and return freshly allocated memory... fortunately it
uses xcalloc() so returned bloom_filter would have ->len == 0 and
->data == 0.

> +
>  	repo_diff_setup(r, &diffopt);
>  	diffopt.flags.recursive = 1;
>  	diffopt.max_changes = max_changes;
> diff --git a/bloom.h b/bloom.h
> index 7f40c751f7..76f8a9ad0c 100644
> --- a/bloom.h
> +++ b/bloom.h
> @@ -13,6 +13,7 @@ struct bloom_filter_settings {
>  
>  #define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10 }
>  #define BITS_PER_WORD 64
> +#define BLOOMDATA_CHUNK_HEADER_SIZE 3*sizeof(uint32_t)

All right.

>  
>  /*
>   * A bloom_filter struct represents a data segment to
> @@ -47,7 +48,8 @@ void add_key_to_filter(struct bloom_key *key,
>  					   struct bloom_filter_settings *settings);
>  
>  struct bloom_filter *get_bloom_filter(struct repository *r,
> -				      struct commit *c);
> +				      struct commit *c,
> +				      int compute_if_not_present);
>

All right, adding new parameter (changing function signature).

>  int bloom_filter_contains(struct bloom_filter *filter,
>  			  struct bloom_key *key,
> diff --git a/commit-graph.c b/commit-graph.c
> index 4585b3b702..c0e9834bf2 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -1094,7 +1094,7 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f,
>  			ctx->commits.nr);
>  
>  	while (list < last) {
> -		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
> +		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
>  		cur_pos += filter->len;
>  		display_progress(progress, ++i);
>  		hashwrite_be32(f, cur_pos);
> @@ -1123,7 +1123,7 @@ static void write_graph_chunk_bloom_data(struct hashfile *f,
>  	hashwrite_be32(f, settings->bits_per_entry);
>  
>  	while (list < last) {
> -		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
> +		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
>  		display_progress(progress, ++i);
>  		hashwrite(f, filter->data, filter->len * sizeof(uint64_t));
>  		list++;

All right, if needed (that is, if '--changed-path' option from the
future commit is provided to 'git commit-graph write'),
compute_bloom_filters() would be called befor write_commit_graph_file(),
which in turn runs write_graph_chunk_bloom_index() and *_data().


Actually, when writing Bloom data chunks (BIDX and BDAT) we could have
requested recomputing filters if necessary: slab storage works as
memoization, so you would calculate Bloom filter data for each commit in
the commit-graph only once.  And write_graph_chunk_bloom_indexes()
and write_graph_chunk_bloom_data() are called only if ctx->changed_paths
is true.

So it would work with

  +		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 1);

Only in the future we would really need to call with compute_if_not_present
parameter set to falsy value.

> @@ -1304,7 +1304,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
>  
>  	for (i = 0; i < ctx->commits.nr; i++) {
>  		struct commit *c = sorted_by_pos[i];
> -		struct bloom_filter *filter = get_bloom_filter(ctx->r, c);
> +		struct bloom_filter *filter = get_bloom_filter(ctx->r, c, 1);
>  		ctx->total_bloom_filter_data_size += sizeof(uint64_t) * filter->len;
>  		display_progress(progress, i + 1);
>  	}
> @@ -2314,6 +2314,7 @@ void free_commit_graph(struct commit_graph *g)
>  		g->data = NULL;
>  		close(g->graph_fd);
>  	}
> +	free(g->bloom_filter_settings);
>  	free(g->filename);
>  	free(g);

Shouldn't this fixup be added to earlier commit?

>  }
> diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
> index 331957011b..9b4be97f75 100644
> --- a/t/helper/test-bloom.c
> +++ b/t/helper/test-bloom.c
> @@ -47,7 +47,7 @@ static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
>  	struct bloom_filter *filter;
>  	setup_git_directory();
>  	c = lookup_commit(the_repository, commit_oid);
> -	filter = get_bloom_filter(the_repository, c);
> +	filter = get_bloom_filter(the_repository, c, 1);
>  	print_bloom_filter(filter);
>  }

I would like to see some tests, but that needs to wait for patch that
adds --changed-paths option to the 'write' subcommand.

Things to be tested:
1. That after reading commit-graph with Bloom filter:
   - that commit(s) in commit-graph have Bloom filter
   - that commits outside commit-graph do not have Bloom filter
2. That incremental commit-graph feature works:
   - for commits in deeper layer that have Bloom filter chunks
   - for commits in deeper layer that do not have Bloom filter chunks

Best,
-- 
Jakub Nar�bski

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Garima Singh wrote (reply to this):



On 2/20/2020 1:48 PM, Jakub Narebski wrote:
> "Garima Singh via GitGitGadget" <[email protected]> writes:
> 
>> From: Garima Singh <[email protected]>
>>
>> Read previously computed Bloom filters from the commit-graph file if
>> possible to avoid recomputing during commit-graph write.
> 
> All right, what is written makes sense for this point in patch series.
> 
> But it my opinion it is more important to state that this commit adds
> "parsing" of the Bloom filter data from commit-graph file.  This means
> that it needs to be calculated only once, then stored in commit-graph,
> ready to be re-used.
> 

Good point. Incorporated in v3.

>>
>> See Documentation/technical/commit-graph-format for the format in which
>> the Bloom filter information is written to the commit graph file.
>>
>> To read Bloom filter for a given commit with lexicographic position
>> 'i' we need to:
>> 1. Read BIDX[i] which essentially gives us the starting index in BDAT for
>>    filter of commit i+1. It is essentially the index past the end
>>    of the filter of commit i. It is called end_index in the code.
>>
>> 2. For i>0, read BIDX[i-1] which will give us the starting index in BDAT
>>    for filter of commit i. It is called the start_index in the code.
>>    For the first commit, where i = 0, Bloom filter data starts at the
>>    beginning, just past the header in the BDAT chunk. Hence, start_index
>>    will be 0.
>>
>> 3. The length of the filter will be end_index - start_index, because
>>    BIDX[i] gives the cumulative 8-byte words including the ith
>>    commit's filter.
>>
>> We toggle whether Bloom filters should be recomputed based on the
>> compute_if_null flag.
> 
> Nitpick: the flag (the parameter) is called compute_if_not_present, not
> compute_if_null.
> 
Oops. Fixed in v3. 

>> +
>> +	end_index = get_be32(g->chunk_bloom_indexes + 4 * lex_pos);
>> +
>> +	if (lex_pos)
> 
> Wouldn't it be better to be more explicit, and write
> 
>   +	if (lex_pos > 0)
> 
> 

Sure. 

>> +		start_index = get_be32(g->chunk_bloom_indexes + 4 * (lex_pos - 1));
>> +	else
>> +		start_index = 0;
> 
> All right, here we find start_index and end_index.
> 
> It might be good idea to at least assert() that start_index <= end_index,
> though that should not happen (that is why I propose for this check to
> be compiled on only for debug builds).
> 

I will look into this. Thanks! 


>> @@ -1304,7 +1304,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
>>  
>>  	for (i = 0; i < ctx->commits.nr; i++) {
>>  		struct commit *c = sorted_by_pos[i];
>> -		struct bloom_filter *filter = get_bloom_filter(ctx->r, c);
>> +		struct bloom_filter *filter = get_bloom_filter(ctx->r, c, 1);
>>  		ctx->total_bloom_filter_data_size += sizeof(uint64_t) * filter->len;
>>  		display_progress(progress, i + 1);
>>  	}
>> @@ -2314,6 +2314,7 @@ void free_commit_graph(struct commit_graph *g)
>>  		g->data = NULL;
>>  		close(g->graph_fd);
>>  	}
>> +	free(g->bloom_filter_settings);
>>  	free(g->filename);
>>  	free(g);
> 
> Shouldn't this fixup be added to earlier commit?
> 

Yes. 

>>  }
>> diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
>> index 331957011b..9b4be97f75 100644
>> --- a/t/helper/test-bloom.c
>> +++ b/t/helper/test-bloom.c
>> @@ -47,7 +47,7 @@ static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
>>  	struct bloom_filter *filter;
>>  	setup_git_directory();
>>  	c = lookup_commit(the_repository, commit_oid);
>> -	filter = get_bloom_filter(the_repository, c);
>> +	filter = get_bloom_filter(the_repository, c, 1);
>>  	print_bloom_filter(filter);
>>  }
> 
> I would like to see some tests, but that needs to wait for patch that
> adds --changed-paths option to the 'write' subcommand.
> 
> Things to be tested:
> 1. That after reading commit-graph with Bloom filter:
>    - that commit(s) in commit-graph have Bloom filter
>    - that commits outside commit-graph do not have Bloom filter
> 2. That incremental commit-graph feature works:
>    - for commits in deeper layer that have Bloom filter chunks
>    - for commits in deeper layer that do not have Bloom filter chunks
> 

Included in later commits. 

> Best,
> 

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, SZEDER Gábor wrote (reply to this):

On Mon, Apr 06, 2020 at 04:59:50PM +0000, Garima Singh via GitGitGadget wrote:
> From: Garima Singh <[email protected]>
> 
> Add logic to
> a) parse Bloom filter information from the commit graph file and,
> b) re-use existing Bloom filters.
> 
> See Documentation/technical/commit-graph-format for the format in which
> the Bloom filter information is written to the commit graph file.
> 
> To read Bloom filter for a given commit with lexicographic position
> 'i' we need to:
> 1. Read BIDX[i] which essentially gives us the starting index in BDAT for
>    filter of commit i+1. It is essentially the index past the end
>    of the filter of commit i. It is called end_index in the code.
> 
> 2. For i>0, read BIDX[i-1] which will give us the starting index in BDAT
>    for filter of commit i. It is called the start_index in the code.
>    For the first commit, where i = 0, Bloom filter data starts at the
>    beginning, just past the header in the BDAT chunk. Hence, start_index
>    will be 0.
> 
> 3. The length of the filter will be end_index - start_index, because
>    BIDX[i] gives the cumulative 8-byte words including the ith
>    commit's filter.
> 
> We toggle whether Bloom filters should be recomputed based on the
> compute_if_not_present flag.

A very important question is not discussed here: when should we
recompute Bloom filters?

> Helped-by: Derrick Stolee <[email protected]>
> Signed-off-by: Garima Singh <[email protected]>
> ---
>  bloom.c               | 49 ++++++++++++++++++++++++++++++++++++++++++-
>  bloom.h               |  4 +++-
>  commit-graph.c        |  6 +++---
>  t/helper/test-bloom.c |  2 +-
>  4 files changed, 55 insertions(+), 6 deletions(-)
> 
> diff --git a/bloom.c b/bloom.c
> index a16eee92331..0f714dd76ae 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -4,6 +4,8 @@
>  #include "diffcore.h"
>  #include "revision.h"
>  #include "hashmap.h"
> +#include "commit-graph.h"
> +#include "commit.h"
>  
>  define_commit_slab(bloom_filter_slab, struct bloom_filter);
>  
> @@ -26,6 +28,36 @@ static inline unsigned char get_bitmask(uint32_t pos)
>  	return ((unsigned char)1) << (pos & (BITS_PER_WORD - 1));
>  }
>  
> +static int load_bloom_filter_from_graph(struct commit_graph *g,
> +				   struct bloom_filter *filter,
> +				   struct commit *c)
> +{
> +	uint32_t lex_pos, start_index, end_index;
> +
> +	while (c->graph_pos < g->num_commits_in_base)
> +		g = g->base_graph;
> +
> +	/* The commit graph commit 'c' lives in doesn't carry bloom filters. */
> +	if (!g->chunk_bloom_indexes)
> +		return 0;
> +
> +	lex_pos = c->graph_pos - g->num_commits_in_base;
> +
> +	end_index = get_be32(g->chunk_bloom_indexes + 4 * lex_pos);
> +
> +	if (lex_pos > 0)
> +		start_index = get_be32(g->chunk_bloom_indexes + 4 * (lex_pos - 1));
> +	else
> +		start_index = 0;
> +
> +	filter->len = end_index - start_index;
> +	filter->data = (unsigned char *)(g->chunk_bloom_data +
> +					sizeof(unsigned char) * start_index +
> +					BLOOMDATA_CHUNK_HEADER_SIZE);
> +
> +	return 1;
> +}
> +
>  /*
>   * Calculate the murmur3 32-bit hash value for the given data
>   * using the given seed.
> @@ -127,7 +159,8 @@ void init_bloom_filters(void)
>  }
>  
>  struct bloom_filter *get_bloom_filter(struct repository *r,
> -				      struct commit *c)
> +				      struct commit *c,
> +					  int compute_if_not_present)
>  {
>  	struct bloom_filter *filter;
>  	struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;

This line in the hunk context sets the default parameters with which
this process will compute any new changed path Bloom filters.

Note that this is not the settings instance that eventually gets
written to the header of the Bloom filters chunk:
write_commit_graph_file() has its own 'struct bloom_filter_settings'
instance, and that's the one that goes into the chunk header.

> @@ -140,6 +173,20 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>  
>  	filter = bloom_filter_slab_at(&bloom_filters, c);
>  
> +	if (!filter->data) {
> +		load_commit_graph_info(r, c);
> +		if (c->graph_pos != COMMIT_NOT_FROM_GRAPH &&
> +			r->objects->commit_graph->chunk_bloom_indexes) {
> +			if (load_bloom_filter_from_graph(r->objects->commit_graph, filter, c))
> +				return filter;
> +			else
> +				return NULL;
> +		}
> +	}

And in the above conditions we try to load the existing Bloom filter
for the given commit and return it as-is for reuse if it already
exists, or go on to compute a new Bloom filter with the parameters set
at the beginning of the function.

Unfortunately, the parameters used to compute the now reused Bloom
filters are not checked anywhere.  In fact this writing process
entirely ignores all parameters in the header of the existing Bloom
filters chunk, and simply replaces them with the default parameters
hard-coded in write_commit_graph_file().  Consequently, we can end up
with Bloom filters computed with different parameters in the same
commit-graph file, which, in turn, can result in commits omitted from
the output of pathspec-limited revision walks.

The makeshift (there is no way to override those hard-coded defaults)
tests below demonstrate this issue.

This issue raises a good couple of questions:

  - What should we do when updating a commit-graph that was written
    with different Bloom filter parameters than our hardcoded
    defaults?

    Reusing the exising Bloom filters is clearly wrong.  Throwing away
    all existing Bloom filters and recomputing them with our defaults
    parameters doesn't seem to be good option, because that's a
    considerable amount of work, and the user might have a reason to
    chose those parameters.

  - What should we do when updating a commit-graph that was written
    with different Bloom filter parameters than specified by the user
    on the command line or in the config?

    Wipe out the old Bloom filters and recompute with new parameters,
    spending considerable time in bigger repositories?  Or stop with a
    warning about the different parameters (maybe it's just a typo),
    and require '--force'?
    
    Dunno, and we don't have such options and configuration yet
    anyway.

  - What about split commit-graphs?

    When split commit-graphs were introduced there was not a single
    chunk that had its own header.  Now the Bloom filters chunk does
    have a header, which leads to other questions:

    - Should that Bloom filters header be included in every split
      commit-graph?

      Not sure, but I suppose that having a header in each split
      commit-graph file would make loading and parsing that chunk a
      bit simpler, because all of them should be parsed the same way.
      Anyway, I think the specs should be explicit about it.  But...:

    - Should we allow different parameters in the Bloom filter chunks
      in each split commit-graph?

      The point of split commit-graphs is to avoid the overhead of
      re-writing the whole commit-graph file every time new commits
      are added, and it's crucial that both writing and merging split
      commit-graph files are cheap.  However, split commit-graph files
      using different Bloom filter parameters can't be merged without
      recomputing those Bloom filters, making merging quite expensive.

      So I don't think that it's a good idea to allow different Bloom
      filter parameters in split commit-graphs.  But then perhaps it
      would be better not to have a Bloom filter chunk header in all
      split commit-graph files after all.

    In any case, the last test below shows that the Bloom filter
    parameters are only read from the header of the most recent split
    commit-graph file.


  ---  >8  ---

#!/bin/sh

test_description='test'

. ./test-lib.sh

test_expect_success 'yuckiest setup ever!' '
	(
		cd "$GIT_BUILD_DIR" &&

		# The number of hashes per path cannot be configured
		# at runtime, so build a dedicated git binary that
		# writes Bloom filters using only 6 hashes per path.
		sed -i -e "/DEFAULT_BLOOM_FILTER_SETTINGS/ s/7/6/" bloom.h &&
		make -j4 git &&
		cp git git6 &&

		# Revert, rebuild.
		sed -i -e "/DEFAULT_BLOOM_FILTER_SETTINGS/ s/6/7/" bloom.h &&
		make -j4 git
	) &&
	git6="$GIT_BUILD_DIR"/git6
'

test_expect_success 'setup' '
	# We need a filename whose 7th hash maps to a different bit
	# position than any of its first 6 hashes in a 2-byte Bloom
	# filter.
	file=File &&

	test_tick &&
	git commit --allow-empty -m initial &&
	echo 1 >$file &&
	git add $file &&
	git commit -m one $file &&
	echo 2 >$file &&
	git commit -m two $file &&

	git log --oneline -- $file >expect
'

test_expect_success 'can read Bloom filters with different parameters' '
	test_when_finished "rm -rfv .git/objects/info/commit-graph*" &&

	# Write a commit-graph with Bloom filters using only 6 hashes
	# per path.
	"$git6" commit-graph write --reachable --changed-paths &&

	# Try pathspec-limited revision walk with the git binary writing
	# Bloom filters using 7 hashes: it still works, because no matter
	# how many hashes it would use when writing the commit-graph, the
	# reader part respects the nr of hashes stored in the
	# commit-graph file.  So far so good.
	git log --oneline $file >actual &&
	test_cmp expect actual
'

test_expect_failure 'commit-graph write does not reuse Bloom filters with different parameters' '
	test_when_finished "rm -rfv .git/objects/info/commit-graph*" &&

	# Write a commit-graph with Bloom filters using only 6 hashes
	# per path for a subset of commits.
	git rev-parse HEAD^ |
	"$git6" commit-graph write --stdin-commits --changed-paths &&

	# Add the rest of the commits to the commit-graph containing Bloom
	# filters using 6 hashes with a git version that writes Bloom
	# filters using 7 hashes.
	# Does it reuse the existing Bloom filters with 6 hashes?
	git commit-graph write --reachable --changed-paths &&

	# Yes, it does, because these report different filter data,
	# even though both commits modified the same file.
	test-tool bloom get_filter_for_commit $(git rev-parse HEAD^) &&
	test-tool bloom get_filter_for_commit $(git rev-parse HEAD) &&

	# Furthermore, it updated the Bloom filter chunk header as well,
	# which now stores that all Bloom filters use 7 hashes.
	# Consequently, the first commit whose Bloom filter was written
	# with only 6 hashes falls victim of a false negative, and is
	# omitted from the output.
	git log --oneline $file >actual &&
	test_cmp expect actual
'

test_expect_failure 'split commit-graphs and Bloom filters with different parameters' '
	test_when_finished "rm -rfv .git/objects/info/commit-graph*" &&

	git rev-parse HEAD^ |
	"$git6" commit-graph write --stdin-commits --changed-paths --split &&

	git commit-graph write --reachable --changed-paths --split=no-merge &&

	# To make sure that I test what I want, i.e. two commit-graphs
	# with one commit in each.  (Though "test-tool read-graph" is
	# utterly oblivious to split commit graphs...)
	test_line_count = 2 .git/objects/info/commit-graphs/commit-graph-chain &&
	verbose test "$(test-tool read-graph |sed -n -e "s/^num_commits: //p")" = 1 &&

	test-tool bloom get_filter_for_commit $(git rev-parse HEAD^) &&
	test-tool bloom get_filter_for_commit $(git rev-parse HEAD) &&

	git log --oneline $file >actual &&
	test_cmp expect actual
'

test_done

  ---  8<  ---

> +	if (filter->data || !compute_if_not_present)
> +		return filter;
> +
>  	repo_diff_setup(r, &diffopt);
>  	diffopt.flags.recursive = 1;
>  	diffopt.max_changes = max_changes;
> diff --git a/bloom.h b/bloom.h
> index 85ab8e9423d..760d7122374 100644
> --- a/bloom.h
> +++ b/bloom.h
> @@ -32,6 +32,7 @@ struct bloom_filter_settings {
>  
>  #define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10 }
>  #define BITS_PER_WORD 8
> +#define BLOOMDATA_CHUNK_HEADER_SIZE 3 * sizeof(uint32_t)
>  
>  /*
>   * A bloom_filter struct represents a data segment to
> @@ -79,6 +80,7 @@ void add_key_to_filter(const struct bloom_key *key,
>  void init_bloom_filters(void);
>  
>  struct bloom_filter *get_bloom_filter(struct repository *r,
> -				      struct commit *c);
> +				      struct commit *c,
> +				      int compute_if_not_present);
>  
>  #endif
> \ No newline at end of file
> diff --git a/commit-graph.c b/commit-graph.c
> index a8b6b5cca5d..77668629e27 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -1086,7 +1086,7 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f,
>  			ctx->commits.nr);
>  
>  	while (list < last) {
> -		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
> +		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
>  		cur_pos += filter->len;
>  		display_progress(progress, ++i);
>  		hashwrite_be32(f, cur_pos);
> @@ -1115,7 +1115,7 @@ static void write_graph_chunk_bloom_data(struct hashfile *f,
>  	hashwrite_be32(f, settings->bits_per_entry);
>  
>  	while (list < last) {
> -		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
> +		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
>  		display_progress(progress, ++i);
>  		hashwrite(f, filter->data, filter->len * sizeof(unsigned char));
>  		list++;
> @@ -1296,7 +1296,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
>  
>  	for (i = 0; i < ctx->commits.nr; i++) {
>  		struct commit *c = sorted_commits[i];
> -		struct bloom_filter *filter = get_bloom_filter(ctx->r, c);
> +		struct bloom_filter *filter = get_bloom_filter(ctx->r, c, 1);
>  		ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
>  		display_progress(progress, i + 1);
>  	}
> diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
> index f18d1b722e1..ce412664ba9 100644
> --- a/t/helper/test-bloom.c
> +++ b/t/helper/test-bloom.c
> @@ -39,7 +39,7 @@ static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
>  	struct bloom_filter *filter;
>  	setup_git_directory();
>  	c = lookup_commit(the_repository, commit_oid);
> -	filter = get_bloom_filter(the_repository, c);
> +	filter = get_bloom_filter(the_repository, c, 1);
>  	print_bloom_filter(filter);
>  }
>  
> -- 
> gitgitgadget

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

SZEDER Gábor <[email protected]> writes:

> Note that this is not the settings instance that eventually gets
> written to the header of the Bloom filters chunk:
> write_commit_graph_file() has its own 'struct bloom_filter_settings'
> instance, and that's the one that goes into the chunk header.
> ...
> Unfortunately, the parameters used to compute the now reused Bloom
> filters are not checked anywhere.  In fact this writing process
> entirely ignores all parameters in the header of the existing Bloom
> filters chunk, and simply replaces them with the default parameters
> hard-coded in write_commit_graph_file().  Consequently, we can end up
> with Bloom filters computed with different parameters in the same
> commit-graph file, which, in turn, can result in commits omitted from
> the output of pathspec-limited revision walks.

Yeah, the whole design seems quite broken and as you said later,
mixing other ingredients like split file would only make things
worse X-<.

> The makeshift (there is no way to override those hard-coded defaults)
> tests below demonstrate this issue.
>
> This issue raises a good couple of questions:
>
>   - What should we do when updating a commit-graph that was written
>     with different Bloom filter parameters than our hardcoded
>     defaults?
>
>     Reusing the exising Bloom filters is clearly wrong.  Throwing away
>     all existing Bloom filters and recomputing them with our defaults
>     parameters doesn't seem to be good option, because that's a
>     considerable amount of work, and the user might have a reason to
>     chose those parameters.
>
>   - What should we do when updating a commit-graph that was written
>     with different Bloom filter parameters than specified by the user
>     on the command line or in the config?
>
>     Wipe out the old Bloom filters and recompute with new parameters,
>     spending considerable time in bigger repositories?  Or stop with a
>     warning about the different parameters (maybe it's just a typo),
>     and require '--force'?
>     
>     Dunno, and we don't have such options and configuration yet
>     anyway.
>
>   - What about split commit-graphs?
>
>     When split commit-graphs were introduced there was not a single
>     chunk that had its own header.  Now the Bloom filters chunk does
>     have a header, which leads to other questions:
>
>     - Should that Bloom filters header be included in every split
>       commit-graph?
>
>       Not sure, but I suppose that having a header in each split
>       commit-graph file would make loading and parsing that chunk a
>       bit simpler, because all of them should be parsed the same way.
>       Anyway, I think the specs should be explicit about it.  But...:
>
>     - Should we allow different parameters in the Bloom filter chunks
>       in each split commit-graph?
>
>       The point of split commit-graphs is to avoid the overhead of
>       re-writing the whole commit-graph file every time new commits
>       are added, and it's crucial that both writing and merging split
>       commit-graph files are cheap.  However, split commit-graph files
>       using different Bloom filter parameters can't be merged without
>       recomputing those Bloom filters, making merging quite expensive.
>
>       So I don't think that it's a good idea to allow different Bloom
>       filter parameters in split commit-graphs.  But then perhaps it
>       would be better not to have a Bloom filter chunk header in all
>       split commit-graph files after all.
>
>     In any case, the last test below shows that the Bloom filter
>     parameters are only read from the header of the most recent split
>     commit-graph file.
>
>
>   ---  >8  ---
>
> #!/bin/sh
>
> test_description='test'
>
> . ./test-lib.sh
>
> test_expect_success 'yuckiest setup ever!' '
> 	(
> 		cd "$GIT_BUILD_DIR" &&
>
> 		# The number of hashes per path cannot be configured
> 		# at runtime, so build a dedicated git binary that
> 		# writes Bloom filters using only 6 hashes per path.
> 		sed -i -e "/DEFAULT_BLOOM_FILTER_SETTINGS/ s/7/6/" bloom.h &&
> 		make -j4 git &&
> 		cp git git6 &&
>
> 		# Revert, rebuild.
> 		sed -i -e "/DEFAULT_BLOOM_FILTER_SETTINGS/ s/6/7/" bloom.h &&
> 		make -j4 git
> 	) &&
> 	git6="$GIT_BUILD_DIR"/git6
> '
>
> test_expect_success 'setup' '
> 	# We need a filename whose 7th hash maps to a different bit
> 	# position than any of its first 6 hashes in a 2-byte Bloom
> 	# filter.
> 	file=File &&
>
> 	test_tick &&
> 	git commit --allow-empty -m initial &&
> 	echo 1 >$file &&
> 	git add $file &&
> 	git commit -m one $file &&
> 	echo 2 >$file &&
> 	git commit -m two $file &&
>
> 	git log --oneline -- $file >expect
> '
>
> test_expect_success 'can read Bloom filters with different parameters' '
> 	test_when_finished "rm -rfv .git/objects/info/commit-graph*" &&
>
> 	# Write a commit-graph with Bloom filters using only 6 hashes
> 	# per path.
> 	"$git6" commit-graph write --reachable --changed-paths &&
>
> 	# Try pathspec-limited revision walk with the git binary writing
> 	# Bloom filters using 7 hashes: it still works, because no matter
> 	# how many hashes it would use when writing the commit-graph, the
> 	# reader part respects the nr of hashes stored in the
> 	# commit-graph file.  So far so good.
> 	git log --oneline $file >actual &&
> 	test_cmp expect actual
> '
>
> test_expect_failure 'commit-graph write does not reuse Bloom filters with different parameters' '
> 	test_when_finished "rm -rfv .git/objects/info/commit-graph*" &&
>
> 	# Write a commit-graph with Bloom filters using only 6 hashes
> 	# per path for a subset of commits.
> 	git rev-parse HEAD^ |
> 	"$git6" commit-graph write --stdin-commits --changed-paths &&
>
> 	# Add the rest of the commits to the commit-graph containing Bloom
> 	# filters using 6 hashes with a git version that writes Bloom
> 	# filters using 7 hashes.
> 	# Does it reuse the existing Bloom filters with 6 hashes?
> 	git commit-graph write --reachable --changed-paths &&
>
> 	# Yes, it does, because these report different filter data,
> 	# even though both commits modified the same file.
> 	test-tool bloom get_filter_for_commit $(git rev-parse HEAD^) &&
> 	test-tool bloom get_filter_for_commit $(git rev-parse HEAD) &&
>
> 	# Furthermore, it updated the Bloom filter chunk header as well,
> 	# which now stores that all Bloom filters use 7 hashes.
> 	# Consequently, the first commit whose Bloom filter was written
> 	# with only 6 hashes falls victim of a false negative, and is
> 	# omitted from the output.
> 	git log --oneline $file >actual &&
> 	test_cmp expect actual
> '
>
> test_expect_failure 'split commit-graphs and Bloom filters with different parameters' '
> 	test_when_finished "rm -rfv .git/objects/info/commit-graph*" &&
>
> 	git rev-parse HEAD^ |
> 	"$git6" commit-graph write --stdin-commits --changed-paths --split &&
>
> 	git commit-graph write --reachable --changed-paths --split=no-merge &&
>
> 	# To make sure that I test what I want, i.e. two commit-graphs
> 	# with one commit in each.  (Though "test-tool read-graph" is
> 	# utterly oblivious to split commit graphs...)
> 	test_line_count = 2 .git/objects/info/commit-graphs/commit-graph-chain &&
> 	verbose test "$(test-tool read-graph |sed -n -e "s/^num_commits: //p")" = 1 &&
>
> 	test-tool bloom get_filter_for_commit $(git rev-parse HEAD^) &&
> 	test-tool bloom get_filter_for_commit $(git rev-parse HEAD) &&
>
> 	git log --oneline $file >actual &&
> 	test_cmp expect actual
> '
>
> test_done
>
>   ---  8<  ---
>
>> +	if (filter->data || !compute_if_not_present)
>> +		return filter;
>> +
>>  	repo_diff_setup(r, &diffopt);
>>  	diffopt.flags.recursive = 1;
>>  	diffopt.max_changes = max_changes;
>> diff --git a/bloom.h b/bloom.h
>> index 85ab8e9423d..760d7122374 100644
>> --- a/bloom.h
>> +++ b/bloom.h
>> @@ -32,6 +32,7 @@ struct bloom_filter_settings {
>>  
>>  #define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10 }
>>  #define BITS_PER_WORD 8
>> +#define BLOOMDATA_CHUNK_HEADER_SIZE 3 * sizeof(uint32_t)
>>  
>>  /*
>>   * A bloom_filter struct represents a data segment to
>> @@ -79,6 +80,7 @@ void add_key_to_filter(const struct bloom_key *key,
>>  void init_bloom_filters(void);
>>  
>>  struct bloom_filter *get_bloom_filter(struct repository *r,
>> -				      struct commit *c);
>> +				      struct commit *c,
>> +				      int compute_if_not_present);
>>  
>>  #endif
>> \ No newline at end of file
>> diff --git a/commit-graph.c b/commit-graph.c
>> index a8b6b5cca5d..77668629e27 100644
>> --- a/commit-graph.c
>> +++ b/commit-graph.c
>> @@ -1086,7 +1086,7 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f,
>>  			ctx->commits.nr);
>>  
>>  	while (list < last) {
>> -		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
>> +		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
>>  		cur_pos += filter->len;
>>  		display_progress(progress, ++i);
>>  		hashwrite_be32(f, cur_pos);
>> @@ -1115,7 +1115,7 @@ static void write_graph_chunk_bloom_data(struct hashfile *f,
>>  	hashwrite_be32(f, settings->bits_per_entry);
>>  
>>  	while (list < last) {
>> -		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
>> +		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
>>  		display_progress(progress, ++i);
>>  		hashwrite(f, filter->data, filter->len * sizeof(unsigned char));
>>  		list++;
>> @@ -1296,7 +1296,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
>>  
>>  	for (i = 0; i < ctx->commits.nr; i++) {
>>  		struct commit *c = sorted_commits[i];
>> -		struct bloom_filter *filter = get_bloom_filter(ctx->r, c);
>> +		struct bloom_filter *filter = get_bloom_filter(ctx->r, c, 1);
>>  		ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
>>  		display_progress(progress, i + 1);
>>  	}
>> diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
>> index f18d1b722e1..ce412664ba9 100644
>> --- a/t/helper/test-bloom.c
>> +++ b/t/helper/test-bloom.c
>> @@ -39,7 +39,7 @@ static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
>>  	struct bloom_filter *filter;
>>  	setup_git_directory();
>>  	c = lookup_commit(the_repository, commit_oid);
>> -	filter = get_bloom_filter(the_repository, c);
>> +	filter = get_bloom_filter(the_repository, c, 1);
>>  	print_bloom_filter(filter);
>>  }
>>  
>> -- 
>> gitgitgadget

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, SZEDER Gábor wrote (reply to this):

On Mon, Apr 06, 2020 at 04:59:52PM +0000, Garima Singh via GitGitGadget wrote:
> +static void prepare_to_use_bloom_filter(struct rev_info *revs)
> +{
> +	struct pathspec_item *pi;
> +	char *path_alloc = NULL;
> +	const char *path;
> +	int last_index;
> +	int len;
> +
> +	if (!revs->commits)
> +	    return;
> +
> +	repo_parse_commit(revs->repo, revs->commits->item);
> +
> +	if (!revs->repo->objects->commit_graph)
> +		return;
> +
> +	revs->bloom_filter_settings = revs->repo->objects->commit_graph->bloom_filter_settings;
> +	if (!revs->bloom_filter_settings)
> +		return;
> +
> +	pi = &revs->pruning.pathspec.items[0];
> +	last_index = pi->len - 1;
> +
> +	/* remove single trailing slash from path, if needed */
> +	if (pi->match[last_index] == '/') {
> +	    path_alloc = xstrdup(pi->match);
> +	    path_alloc[last_index] = '\0';
> +	    path = path_alloc;

fill_bloom_key() takes a length parameter, so there is no need to
duplicate the path to be able to shorten it by one character to remove
that trailing '/'.

> +	} else
> +	    path = pi->match;
> +
> +	len = strlen(path);

'struct pathspec_item's 'len' field already contains the length of the
path, so there is no need for this strlen().

> +
> +	revs->bloom_key = xmalloc(sizeof(struct bloom_key));
> +	fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings);
> +
> +	free(path_alloc);
> +}

> @@ -3362,6 +3440,8 @@ int prepare_revision_walk(struct rev_info *revs)
>  				       FOR_EACH_OBJECT_PROMISOR_ONLY);
>  	}
>  
> +	if (revs->pruning.pathspec.nr == 1 && !revs->reflog_info)
> +		prepare_to_use_bloom_filter(revs);
>  	if (revs->no_walk != REVISION_WALK_NO_WALK_UNSORTED)
>  		commit_list_sort_by_date(&revs->commits);
>  	if (revs->no_walk)
                return 0;
        if (revs->limited) {
                if (limit_list(revs) < 0)
                        return -1;

I extended the hunk context a bit to show that
prepare_to_use_bloom_filter() is called before limit_list().  This is
important, because specifying exclude revs and pathspecs, i.e.  'git
log ^v1.2.3 -- dir/file' does perform a lot of diffs in limit_list(),
and this way we can take advantage of Bloom filters even in this case.

> @@ -3379,6 +3459,7 @@ int prepare_revision_walk(struct rev_info *revs)
>  		simplify_merges(revs);
>  	if (revs->children.name)
>  		set_children(revs);
> +
>  	return 0;
>  }
>  
> diff --git a/revision.h b/revision.h
> index 475f048fb61..7c026fe41fc 100644
> --- a/revision.h
> +++ b/revision.h
> @@ -56,6 +56,8 @@ struct repository;
>  struct rev_info;
>  struct string_list;
>  struct saved_parents;
> +struct bloom_key;
> +struct bloom_filter_settings;
>  define_shared_commit_slab(revision_sources, char *);
>  
>  struct rev_cmdline_info {
> @@ -291,6 +293,15 @@ struct rev_info {
>  	struct revision_sources *sources;
>  
>  	struct topo_walk_info *topo_walk_info;
> +
> +	/* Commit graph bloom filter fields */
> +	/* The bloom filter key for the pathspec */
> +	struct bloom_key *bloom_key;
> +	/*
> +	 * The bloom filter settings used to generate the key.
> +	 * This is loaded from the commit-graph being used.
> +	 */
> +	struct bloom_filter_settings *bloom_filter_settings;
>  };
>  
>  int ref_excluded(struct string_list *, const char *path);
> -- 
> gitgitgadget
> 

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, SZEDER Gábor wrote (reply to this):

On Mon, Apr 06, 2020 at 04:59:44PM +0000, Garima Singh via GitGitGadget wrote:
> From: Garima Singh <[email protected]>
> 
> Add the core implementation for computing Bloom filters for
> the paths changed between a commit and it's first parent.
> 
> We fill the Bloom filters as (const char *data, int len) pairs
> as `struct bloom_filters" within a commit slab.
> 
> Filters for commits with no changes and more than 512 changes,
> is represented with a filter of length zero. There is no gain
> in distinguishing between a computed filter of length zero for
> a commit with no changes, and an uncomputed filter for new commits
> or for commits with more than 512 changes. The effect on
> `git log -- path` is the same in both cases. We will fall back to
> the normal diffing algorithm when we can't benefit from the
> existence of Bloom filters.
> 
> Helped-by: Jeff King <[email protected]>
> Helped-by: Derrick Stolee <[email protected]>
> Reviewed-by: Jakub Narębski <[email protected]>
> Signed-off-by: Garima Singh <[email protected]>
> ---
>  bloom.c               | 97 +++++++++++++++++++++++++++++++++++++++++++
>  bloom.h               |  8 ++++
>  t/helper/test-bloom.c | 20 +++++++++
>  t/t0095-bloom.sh      | 47 +++++++++++++++++++++
>  4 files changed, 172 insertions(+)
> 
> diff --git a/bloom.c b/bloom.c
> index 888b67f1ea6..881a9841ede 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -1,5 +1,18 @@
>  #include "git-compat-util.h"
>  #include "bloom.h"
> +#include "diff.h"
> +#include "diffcore.h"
> +#include "revision.h"
> +#include "hashmap.h"
> +
> +define_commit_slab(bloom_filter_slab, struct bloom_filter);

So here we define a commit slab for modified path Bloom filters, ...

> +struct bloom_filter_slab bloom_filters;
> +
> +struct pathmap_hash_entry {
> +    struct hashmap_entry entry;
> +    const char path[FLEX_ARRAY];
> +};
>  
>  static uint32_t rotate_left(uint32_t value, int32_t count)
>  {
> @@ -107,3 +120,87 @@ void add_key_to_filter(const struct bloom_key *key,
>  		filter->data[block_pos] |= get_bitmask(hash_mod);
>  	}
>  }
> +
> +void init_bloom_filters(void)
> +{
> +	init_bloom_filter_slab(&bloom_filters);

... here initialize the slab ...

> +}
> +
> +struct bloom_filter *get_bloom_filter(struct repository *r,
> +				      struct commit *c)
> +{
> +	struct bloom_filter *filter;
> +	struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
> +	int i;
> +	struct diff_options diffopt;
> +
> +	if (bloom_filters.slab_size == 0)
> +		return NULL;
> +
> +	filter = bloom_filter_slab_at(&bloom_filters, c);

... allocate an entry in the slab ...

> +
> +	repo_diff_setup(r, &diffopt);
> +	diffopt.flags.recursive = 1;
> +	diff_setup_done(&diffopt);
> +
> +	if (c->parents)
> +		diff_tree_oid(&c->parents->item->object.oid, &c->object.oid, "", &diffopt);
> +	else
> +		diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
> +	diffcore_std(&diffopt);
> +
> +	if (diff_queued_diff.nr <= 512) {
> +		struct hashmap pathmap;
> +		struct pathmap_hash_entry *e;
> +		struct hashmap_iter iter;
> +		hashmap_init(&pathmap, NULL, NULL, 0);
> +
> +		for (i = 0; i < diff_queued_diff.nr; i++) {
> +			const char *path = diff_queued_diff.queue[i]->two->path;
> +
> +			/*
> +			* Add each leading directory of the changed file, i.e. for
> +			* 'dir/subdir/file' add 'dir' and 'dir/subdir' as well, so
> +			* the Bloom filter could be used to speed up commands like
> +			* 'git log dir/subdir', too.
> +			*
> +			* Note that directories are added without the trailing '/'.
> +			*/
> +			do {
> +				char *last_slash = strrchr(path, '/');
> +
> +				FLEX_ALLOC_STR(e, path, path);
> +				hashmap_entry_init(&e->entry, strhash(path));
> +				hashmap_add(&pathmap, &e->entry);
> +
> +				if (!last_slash)
> +					last_slash = (char*)path;
> +				*last_slash = '\0';
> +
> +			} while (*path);
> +
> +			diff_free_filepair(diff_queued_diff.queue[i]);
> +		}
> +
> +		filter->len = (hashmap_get_size(&pathmap) * settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
> +		filter->data = xcalloc(filter->len, sizeof(unsigned char));

... and here we fill the slab with data, including a memory allocation
for each slab entry.

What is missing in this patch or in any of the followup patches is a
place where we clear the slab and the additional memory attached to
it.

> +
> +		hashmap_for_each_entry(&pathmap, &iter, e, entry) {
> +			struct bloom_key key;
> +			fill_bloom_key(e->path, strlen(e->path), &key, &settings);
> +			add_key_to_filter(&key, filter, &settings);
> +		}
> +
> +		hashmap_free_entries(&pathmap, struct pathmap_hash_entry, entry);
> +	} else {
> +		for (i = 0; i < diff_queued_diff.nr; i++)
> +			diff_free_filepair(diff_queued_diff.queue[i]);
> +		filter->data = NULL;
> +		filter->len = 0;
> +	}
> +
> +	free(diff_queued_diff.queue);
> +	DIFF_QUEUE_CLEAR(&diff_queued_diff);
> +
> +	return filter;
> +}
> diff --git a/bloom.h b/bloom.h
> index b9ce422ca2d..85ab8e9423d 100644
> --- a/bloom.h
> +++ b/bloom.h
> @@ -1,6 +1,9 @@
>  #ifndef BLOOM_H
>  #define BLOOM_H
>  
> +struct commit;
> +struct repository;
> +
>  struct bloom_filter_settings {
>  	/*
>  	 * The version of the hashing technique being used.
> @@ -73,4 +76,9 @@ void add_key_to_filter(const struct bloom_key *key,
>  					   struct bloom_filter *filter,
>  					   const struct bloom_filter_settings *settings);
>  
> +void init_bloom_filters(void);
> +
> +struct bloom_filter *get_bloom_filter(struct repository *r,
> +				      struct commit *c);
> +
>  #endif
> \ No newline at end of file
> diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
> index 20460cde775..f18d1b722e1 100644
> --- a/t/helper/test-bloom.c
> +++ b/t/helper/test-bloom.c
> @@ -1,6 +1,7 @@
>  #include "git-compat-util.h"
>  #include "bloom.h"
>  #include "test-tool.h"
> +#include "commit.h"
>  
>  struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
>  
> @@ -32,6 +33,16 @@ static void print_bloom_filter(struct bloom_filter *filter) {
>  	printf("\n");
>  }
>  
> +static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
> +{
> +	struct commit *c;
> +	struct bloom_filter *filter;
> +	setup_git_directory();
> +	c = lookup_commit(the_repository, commit_oid);
> +	filter = get_bloom_filter(the_repository, c);
> +	print_bloom_filter(filter);
> +}
> +
>  int cmd__bloom(int argc, const char **argv)
>  {
>  	if (!strcmp(argv[1], "get_murmur3")) {
> @@ -57,5 +68,14 @@ int cmd__bloom(int argc, const char **argv)
>  		print_bloom_filter(&filter);
>  	}
>  
> +    if (!strcmp(argv[1], "get_filter_for_commit")) {
> +		struct object_id oid;
> +		const char *end;
> +		if (parse_oid_hex(argv[2], &oid, &end))
> +			die("cannot parse oid '%s'", argv[2]);
> +		init_bloom_filters();
> +		get_bloom_filter_for_commit(&oid);
> +	}
> +
>  	return 0;
>  }
> \ No newline at end of file
> diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
> index 36a086c7c60..8f9eef116dc 100755
> --- a/t/t0095-bloom.sh
> +++ b/t/t0095-bloom.sh
> @@ -67,4 +67,51 @@ test_expect_success 'compute bloom key for test string 2' '
>  	test_cmp expect actual
>  '
>  
> +test_expect_success 'get bloom filters for commit with no changes' '
> +	git init &&
> +	git commit --allow-empty -m "c0" &&
> +	cat >expect <<-\EOF &&
> +	Filter_Length:0
> +	Filter_Data:
> +	EOF
> +	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'get bloom filter for commit with 10 changes' '
> +	rm actual &&
> +	rm expect &&
> +	mkdir smallDir &&
> +	for i in $(test_seq 0 9)
> +	do
> +		echo $i >smallDir/$i
> +	done &&
> +	git add smallDir &&
> +	git commit -m "commit with 10 changes" &&
> +	cat >expect <<-\EOF &&
> +	Filter_Length:25
> +	Filter_Data:82|a0|65|47|0c|92|90|c0|a1|40|02|a0|e2|40|e0|04|0a|9a|66|cf|80|19|85|42|23|
> +	EOF
> +	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success EXPENSIVE 'get bloom filter for commit with 513 changes' '
> +	rm actual &&
> +	rm expect &&
> +	mkdir bigDir &&
> +	for i in $(test_seq 0 512)
> +	do
> +		echo $i >bigDir/$i
> +	done &&
> +	git add bigDir &&
> +	git commit -m "commit with 513 changes" &&
> +	cat >expect <<-\EOF &&
> +	Filter_Length:0
> +	Filter_Data:
> +	EOF
> +	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
> +	test_cmp expect actual
> +'
> +
>  test_done
> \ No newline at end of file
> -- 
> gitgitgadget
> 

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, SZEDER Gábor wrote (reply to this):

On Mon, Apr 06, 2020 at 04:59:50PM +0000, Garima Singh via GitGitGadget wrote:
> From: Garima Singh <[email protected]>
> 
> Add logic to
> a) parse Bloom filter information from the commit graph file and,
> b) re-use existing Bloom filters.
> 
> See Documentation/technical/commit-graph-format for the format in which
> the Bloom filter information is written to the commit graph file.
> 
> To read Bloom filter for a given commit with lexicographic position
> 'i' we need to:
> 1. Read BIDX[i] which essentially gives us the starting index in BDAT for
>    filter of commit i+1. It is essentially the index past the end
>    of the filter of commit i. It is called end_index in the code.
> 
> 2. For i>0, read BIDX[i-1] which will give us the starting index in BDAT
>    for filter of commit i. It is called the start_index in the code.
>    For the first commit, where i = 0, Bloom filter data starts at the
>    beginning, just past the header in the BDAT chunk. Hence, start_index
>    will be 0.
> 
> 3. The length of the filter will be end_index - start_index, because
>    BIDX[i] gives the cumulative 8-byte words including the ith
>    commit's filter.
> 
> We toggle whether Bloom filters should be recomputed based on the
> compute_if_not_present flag.
> 
> Helped-by: Derrick Stolee <[email protected]>
> Signed-off-by: Garima Singh <[email protected]>
> ---
>  bloom.c               | 49 ++++++++++++++++++++++++++++++++++++++++++-
>  bloom.h               |  4 +++-
>  commit-graph.c        |  6 +++---
>  t/helper/test-bloom.c |  2 +-
>  4 files changed, 55 insertions(+), 6 deletions(-)
> 
> diff --git a/bloom.c b/bloom.c
> index a16eee92331..0f714dd76ae 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -4,6 +4,8 @@
>  #include "diffcore.h"
>  #include "revision.h"
>  #include "hashmap.h"
> +#include "commit-graph.h"
> +#include "commit.h"
>  
>  define_commit_slab(bloom_filter_slab, struct bloom_filter);
>  
> @@ -26,6 +28,36 @@ static inline unsigned char get_bitmask(uint32_t pos)
>  	return ((unsigned char)1) << (pos & (BITS_PER_WORD - 1));
>  }
>  
> +static int load_bloom_filter_from_graph(struct commit_graph *g,
> +				   struct bloom_filter *filter,
> +				   struct commit *c)
> +{
> +	uint32_t lex_pos, start_index, end_index;
> +
> +	while (c->graph_pos < g->num_commits_in_base)
> +		g = g->base_graph;
> +
> +	/* The commit graph commit 'c' lives in doesn't carry bloom filters. */
> +	if (!g->chunk_bloom_indexes)
> +		return 0;
> +
> +	lex_pos = c->graph_pos - g->num_commits_in_base;
> +
> +	end_index = get_be32(g->chunk_bloom_indexes + 4 * lex_pos);

Let's suppose that we encounter a bogus commit-graph file.  This would
then segfault if 'lex_pos' were to point past the end of file, i.e.
past the mmap()-ed memory region.

> +
> +	if (lex_pos > 0)
> +		start_index = get_be32(g->chunk_bloom_indexes + 4 * (lex_pos - 1));
> +	else
> +		start_index = 0;
> +
> +	filter->len = end_index - start_index;
> +	filter->data = (unsigned char *)(g->chunk_bloom_data +
> +					sizeof(unsigned char) * start_index +
> +					BLOOMDATA_CHUNK_HEADER_SIZE);

And this could lead to segfault later when accessing the Bloom filter
data if 'start_index' or 'end_index' were to point past EOF or
end_index < start_index.

IMO all indices and offsets read from the commit-graph file must be
checked to ensure that they fit in the corresponding chunk, like I did
in my modified path Bloom filters implementation.  However, I'm not
sure how it's best to handle an out-of-bounds offset...  Simply
erroring out in case of a bogus commit-graph file is the
straightforward possibility, of course, but since the commit-graph is
only an optimization, it would be better user experience to warn and
ignore it and finish the operation without the commit-graph (albeit
slower).  But is it even possible to ignore the commit-graph, say, in
the middle of a 'git rev-list --topo-order HEAD'?

> +	return 1;
> +}
> +
>  /*
>   * Calculate the murmur3 32-bit hash value for the given data
>   * using the given seed.
> @@ -127,7 +159,8 @@ void init_bloom_filters(void)
>  }
>  
>  struct bloom_filter *get_bloom_filter(struct repository *r,
> -				      struct commit *c)
> +				      struct commit *c,
> +					  int compute_if_not_present)
>  {
>  	struct bloom_filter *filter;
>  	struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
> @@ -140,6 +173,20 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>  
>  	filter = bloom_filter_slab_at(&bloom_filters, c);
>  
> +	if (!filter->data) {
> +		load_commit_graph_info(r, c);
> +		if (c->graph_pos != COMMIT_NOT_FROM_GRAPH &&
> +			r->objects->commit_graph->chunk_bloom_indexes) {
> +			if (load_bloom_filter_from_graph(r->objects->commit_graph, filter, c))
> +				return filter;
> +			else
> +				return NULL;
> +		}
> +	}
> +
> +	if (filter->data || !compute_if_not_present)
> +		return filter;
> +
>  	repo_diff_setup(r, &diffopt);
>  	diffopt.flags.recursive = 1;
>  	diffopt.max_changes = max_changes;
> diff --git a/bloom.h b/bloom.h
> index 85ab8e9423d..760d7122374 100644
> --- a/bloom.h
> +++ b/bloom.h
> @@ -32,6 +32,7 @@ struct bloom_filter_settings {
>  
>  #define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10 }
>  #define BITS_PER_WORD 8
> +#define BLOOMDATA_CHUNK_HEADER_SIZE 3 * sizeof(uint32_t)
>  
>  /*
>   * A bloom_filter struct represents a data segment to
> @@ -79,6 +80,7 @@ void add_key_to_filter(const struct bloom_key *key,
>  void init_bloom_filters(void);
>  
>  struct bloom_filter *get_bloom_filter(struct repository *r,
> -				      struct commit *c);
> +				      struct commit *c,
> +				      int compute_if_not_present);
>  
>  #endif
> \ No newline at end of file
> diff --git a/commit-graph.c b/commit-graph.c
> index a8b6b5cca5d..77668629e27 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -1086,7 +1086,7 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f,
>  			ctx->commits.nr);
>  
>  	while (list < last) {
> -		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
> +		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
>  		cur_pos += filter->len;
>  		display_progress(progress, ++i);
>  		hashwrite_be32(f, cur_pos);
> @@ -1115,7 +1115,7 @@ static void write_graph_chunk_bloom_data(struct hashfile *f,
>  	hashwrite_be32(f, settings->bits_per_entry);
>  
>  	while (list < last) {
> -		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
> +		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
>  		display_progress(progress, ++i);
>  		hashwrite(f, filter->data, filter->len * sizeof(unsigned char));
>  		list++;
> @@ -1296,7 +1296,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
>  
>  	for (i = 0; i < ctx->commits.nr; i++) {
>  		struct commit *c = sorted_commits[i];
> -		struct bloom_filter *filter = get_bloom_filter(ctx->r, c);
> +		struct bloom_filter *filter = get_bloom_filter(ctx->r, c, 1);
>  		ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
>  		display_progress(progress, i + 1);
>  	}
> diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
> index f18d1b722e1..ce412664ba9 100644
> --- a/t/helper/test-bloom.c
> +++ b/t/helper/test-bloom.c
> @@ -39,7 +39,7 @@ static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
>  	struct bloom_filter *filter;
>  	setup_git_directory();
>  	c = lookup_commit(the_repository, commit_oid);
> -	filter = get_bloom_filter(the_repository, c);
> +	filter = get_bloom_filter(the_repository, c, 1);
>  	print_bloom_filter(filter);
>  }
>  
> -- 
> gitgitgadget
> 

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, SZEDER Gábor wrote (reply to this):

On Mon, Apr 06, 2020 at 04:59:45PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <[email protected]>
> 
> When computing the changed-paths bloom filters for the commit-graph,
> we limit the size of the filter by restricting the number of paths
> in the diff. Instead of computing a large diff and then ignoring the
> result, it is better to halt the diff computation early.
> 
> Create a new "max_changes" option in struct diff_options. If non-zero,
> then halt the diff computation after discovering strictly more changed
> paths. This includes paths corresponding to trees that change.
> 
> Use this max_changes option in the bloom filter calculations. This
> reduces the time taken to compute the filters for the Linux kernel
> repo from 2m50s to 2m35s. On a large internal repository with ~500
> commits that perform tree-wide changes, the time reduced from
> 6m15s to 3m48s.
> 
> Signed-off-by: Derrick Stolee <[email protected]>
> Signed-off-by: Garima Singh <[email protected]>
> ---
>  bloom.c     | 4 +++-
>  diff.h      | 5 +++++
>  tree-diff.c | 6 ++++++
>  3 files changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/bloom.c b/bloom.c
> index 881a9841ede..a16eee92331 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -133,6 +133,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>  	struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
>  	int i;
>  	struct diff_options diffopt;
> +	int max_changes = 512;
>  
>  	if (bloom_filters.slab_size == 0)
>  		return NULL;
> @@ -141,6 +142,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>  
>  	repo_diff_setup(r, &diffopt);
>  	diffopt.flags.recursive = 1;
> +	diffopt.max_changes = max_changes;
>  	diff_setup_done(&diffopt);
>  
>  	if (c->parents)
> @@ -149,7 +151,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>  		diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
>  	diffcore_std(&diffopt);
>  
> -	if (diff_queued_diff.nr <= 512) {
> +	if (diff_queued_diff.nr <= max_changes) {
>  		struct hashmap pathmap;
>  		struct pathmap_hash_entry *e;
>  		struct hashmap_iter iter;
> diff --git a/diff.h b/diff.h
> index 6febe7e3656..9443dc1b003 100644
> --- a/diff.h
> +++ b/diff.h
> @@ -285,6 +285,11 @@ struct diff_options {
>  	/* Number of hexdigits to abbreviate raw format output to. */
>  	int abbrev;
>  
> +	/* If non-zero, then stop computing after this many changes. */
> +	int max_changes;
> +	/* For internal use only. */
> +	int num_changes;

"For internal use only", understood.

> +
>  	int ita_invisible_in_index;
>  /* white-space error highlighting */
>  #define WSEH_NEW (1<<12)
> diff --git a/tree-diff.c b/tree-diff.c
> index 33ded7f8b3e..f3d303c6e54 100644
> --- a/tree-diff.c
> +++ b/tree-diff.c
> @@ -434,6 +434,9 @@ static struct combine_diff_path *ll_diff_tree_paths(
>  		if (diff_can_quit_early(opt))
>  			break;
>  
> +		if (opt->max_changes && opt->num_changes > opt->max_changes)
> +			break;
> +
>  		if (opt->pathspec.nr) {
>  			skip_uninteresting(&t, base, opt);
>  			for (i = 0; i < nparent; i++)
> @@ -518,6 +521,7 @@ static struct combine_diff_path *ll_diff_tree_paths(
>  
>  			/* t↓ */
>  			update_tree_entry(&t);
> +			opt->num_changes++;
>  		}
>  
>  		/* t > p[imin] */
> @@ -535,6 +539,7 @@ static struct combine_diff_path *ll_diff_tree_paths(
>  		skip_emit_tp:
>  			/* ∀ pi=p[imin]  pi↓ */
>  			update_tp_entries(tp, nparent);
> +			opt->num_changes++;
>  		}
>  	}

This counter is basically broken, its value is wrong for over 98% of
commits, and, worse, its value remains 0 for over 85% of commits in
the repositories I usually use to test modified path Bloom filters.
Consequently, a relatively large number of commits modifying more than
512 paths get Bloom filters.

The makeshift tests in the patch below demonstrate these issues as
most of them fail, most notably those two tests that demonstrate that
modifying existing paths are not counted at all.


  ---  >8  ---

diff --git a/bloom.c b/bloom.c
index 9b86aa3f59..3db0fde734 100644
--- a/bloom.c
+++ b/bloom.c
@@ -203,7 +203,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 	repo_diff_setup(r, &diffopt);
 	diffopt.flags.recursive = 1;
 	diffopt.detect_rename = 0;
-	diffopt.max_changes = max_changes;
+	diffopt.max_changes = 0;
 	diff_setup_done(&diffopt);
 
 	/* ensure commit is parsed so we have parent information */
@@ -214,6 +214,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 	else
 		diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
 	diffcore_std(&diffopt);
+	printf("%s  %d\n", oid_to_hex(&c->object.oid), diffopt.num_changes);
 
 	if (diffopt.num_changes <= max_changes) {
 		struct hashmap pathmap;
diff --git a/t/t9999-test.sh b/t/t9999-test.sh
new file mode 100755
index 0000000000..8d2bd9f03f
--- /dev/null
+++ b/t/t9999-test.sh
@@ -0,0 +1,142 @@
+#!/bin/sh
+
+test_description='test'
+
+. ./test-lib.sh
+
+test_expect_success 'setup' '
+	test_tick &&
+
+	echo 1 >file &&
+	mkdir -p dir/subdir &&
+	echo 1 >dir/subdir/file1 &&
+	echo 1 >dir/subdir/file2 &&
+	git add file dir &&
+	git commit -m setup &&
+
+	echo 2 >file &&
+	git commit -a -m "modify one path in root" &&
+	mod_one_path=$(git rev-parse HEAD) &&
+
+	echo 2 >dir/subdir/file1 &&
+	echo 2 >dir/subdir/file2 &&
+	git commit -a -m "modify two file two dirs deep" &&
+	mod_four_paths=$(git rev-parse HEAD) &&
+
+	>new-file &&
+	git add new-file &&
+	git commit -m "add new file in root" &&
+	new_file_in_root=$(git rev-parse HEAD) &&
+
+	git rm new-file &&
+	git commit -m "delete file in root" &&
+	delete_file_in_root=$(git rev-parse HEAD) &&
+
+	>dir/new-file &&
+	git add dir/new-file &&
+	git commit -m "add new file in dir" &&
+	new_file_in_dir=$(git rev-parse HEAD) &&
+
+	git rm dir/new-file &&
+	git commit -m "delete file in dir" &&
+	delete_file_in_dir=$(git rev-parse HEAD) &&
+
+	echo 1 >d-f &&
+	git add d-f &&
+	git commit -m foo &&
+	git rm d-f &&
+	mkdir d-f &&
+	echo 2 >d-f/file &&
+	git add d-f &&
+	git commit -m "replace file with dir" &&
+	file_to_dir=$(git rev-parse HEAD) &&
+
+	>d-f.c &&
+	git add d-f.c &&
+	git commit -m "add a file that sorts between d-f and d-f/" &&
+	git rm -r d-f &&
+	echo 3 >d-f &&
+	git add d-f &&
+	git commit -m "replace dir with file" &&
+	dir_to_file=$(git rev-parse HEAD) &&
+
+	bin_sha1=$(git rev-parse HEAD:dir/subdir | hex2oct) &&
+	# leading zero in mode: the content of the tree remains the same,
+	# but its oid does change!
+	printf "040000 subdir\0$bin_sha1" >rawtree &&
+	tree1=$(git hash-object -t tree -w rawtree) &&
+	git cat-file -p HEAD^{tree} >out &&
+	tree2=$(sed -e "s/$(git rev-parse HEAD:dir/)/$tree1/" out |git mktree) &&
+	different_but_same_tree=$(git commit-tree \
+		-m "leading zeros in mode" \
+		-p $(git rev-parse HEAD) $tree2) &&
+	git update-ref HEAD $different_but_same_tree &&
+
+	git commit-graph write --reachable --changed-paths >out &&
+	cat out  # debug
+'
+
+test_expect_success 'modify one path in root' '
+	git diff --name-status $mod_one_path^ $mod_one_path &&
+	echo "$mod_one_path  1" >expect &&
+	grep "$mod_one_path" out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'modify two file two dirs deep' '
+	git diff --name-status $mod_four_paths^ $mod_four_paths &&
+	echo "$mod_four_paths  4" >expect &&
+	grep "$mod_four_paths" out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'add new file in root' '
+	git diff --name-status $new_file_in_root^ $new_file_in_root &&
+	echo "$new_file_in_root  1" >expect &&
+	grep "$new_file_in_root" out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'delete file in root' '
+	git diff --name-status $delete_file_in_root^ $delete_file_in_root &&
+	echo "$delete_file_in_root  1" >expect &&
+	grep "$delete_file_in_root" out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'add new file in dir' '
+	git diff --name-status $new_file_in_dir^ $new_file_in_dir &&
+	echo "$new_file_in_dir  2" >expect &&
+	grep "$new_file_in_dir" out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'delete file in dir' '
+	git diff --name-status $delete_file_in_dir^ $delete_file_in_dir &&
+	echo "$delete_file_in_dir  2" >expect &&
+	grep "$delete_file_in_dir" out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'replace file with dir' '
+	git diff --name-status $file_to_dir^ $file_to_dir &&
+	echo "$file_to_dir  2" >expect &&
+	grep "$file_to_dir" out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'replace dir with file' '
+	git diff --name-status $dir_to_file^ $dir_to_file &&
+	echo "$dir_to_file  2" >expect &&
+	grep "$dir_to_file" out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'leading zeros in mode' '
+	git diff --name-status $different_but_same_tree^ $different_but_same_tree &&
+	echo "$different_but_same_tree  0" >expect &&
+	grep "$different_but_same_tree" out >actual &&
+	test_cmp expect actual
+'
+
+test_done

  ---  >8  ---


> @@ -552,6 +557,7 @@ struct combine_diff_path *diff_tree_paths(
>  	const struct object_id **parents_oid, int nparent,
>  	struct strbuf *base, struct diff_options *opt)
>  {
> +	opt->num_changes = 0;
>  	p = ll_diff_tree_paths(p, oid, parents_oid, nparent, base, opt);
>  
>  	/*
> -- 
> gitgitgadget
> 

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 8/4/2020 10:47 AM, SZEDER Gábor wrote:
> On Mon, Apr 06, 2020 at 04:59:45PM +0000, Derrick Stolee via GitGitGadget wrote:
> This counter is basically broken, its value is wrong for over 98% of
> commits, and, worse, its value remains 0 for over 85% of commits in
> the repositories I usually use to test modified path Bloom filters.
> Consequently, a relatively large number of commits modifying more than
> 512 paths get Bloom filters.

Thanks for finding this! The counter is only really tested in one
place, and that test only considers _file adds_, which is a problem.

If I understand this correctly, the bug is a performance-only bug
(since this is a performance-only feature), but it is an important
one to fix.

There is certainly some dark magic happening in this tree-diff logic,
so instead of trying to get an accurate count we should just use the
magic global diff_queued_diff to track the current list of file changes.

Note: diff_queued_diff does not track the directory changes, so it
is an under-count for the total changes to track in the Bloom filter.
This is later corrected by the block that adds these leading directory
changes.

> The makeshift tests in the patch below demonstrate these issues as
> most of them fail, most notably those two tests that demonstrate that
> modifying existing paths are not counted at all.

I adapted your diff along with ripping out 'num_changes' in favor
of diff_queued_diff.nr. This required modifying some of your expected
values in the test script (losing the leading directories in the
count).

I'll work with Taylor to create a fix, and include proper testing
of the logic here. We'll stick it in the v2 of his max-changed-paths
series [1]. He already has some helpful logging that can help create
tests that ensure this logic is performing as expected.

We plan to have that fix available by later today or early tomorrow.
Will you be available to help validate it?

[1] https://lore.kernel.org/git/[email protected]/

Thanks,
-Stolee

  --- >8 ---

diff --git a/bloom.c b/bloom.c
index 1a573226e7..b8d6cb9240 100644
--- a/bloom.c
+++ b/bloom.c
@@ -218,8 +218,9 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 	else
 		diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
 	diffcore_std(&diffopt);
+	printf("%s  %d\n", oid_to_hex(&c->object.oid), diff_queued_diff.nr);
 
-	if (diffopt.num_changes <= max_changes) {
+	if (diff_queued_diff.nr <= max_changes) {
 		struct hashmap pathmap;
 		struct pathmap_hash_entry *e;
 		struct hashmap_iter iter;
diff --git a/diff.h b/diff.h
index e0c0af6286b..1d32b718857 100644
--- a/diff.h
+++ b/diff.h
@@ -287,8 +287,6 @@ struct diff_options {
 
 	/* If non-zero, then stop computing after this many changes. */
 	int max_changes;
-	/* For internal use only. */
-	int num_changes;
 
 	int ita_invisible_in_index;
 /* white-space error highlighting */
diff --git a/t/t9999-test.sh b/t/t9999-test.sh
new file mode 100755
index 00000000000..1f35aa8e2c5
--- /dev/null
+++ b/t/t9999-test.sh
@@ -0,0 +1,142 @@
+#!/bin/sh
+
+test_description='test'
+
+. ./test-lib.sh
+
+test_expect_success 'setup' '
+	test_tick &&
+
+	echo 1 >file &&
+	mkdir -p dir/subdir &&
+	echo 1 >dir/subdir/file1 &&
+	echo 1 >dir/subdir/file2 &&
+	git add file dir &&
+	git commit -m setup &&
+
+	echo 2 >file &&
+	git commit -a -m "modify one path in root" &&
+	mod_one_path=$(git rev-parse HEAD) &&
+
+	echo 2 >dir/subdir/file1 &&
+	echo 2 >dir/subdir/file2 &&
+	git commit -a -m "modify two file two dirs deep" &&
+	mod_four_paths=$(git rev-parse HEAD) &&
+
+	>new-file &&
+	git add new-file &&
+	git commit -m "add new file in root" &&
+	new_file_in_root=$(git rev-parse HEAD) &&
+
+	git rm new-file &&
+	git commit -m "delete file in root" &&
+	delete_file_in_root=$(git rev-parse HEAD) &&
+
+	>dir/new-file &&
+	git add dir/new-file &&
+	git commit -m "add new file in dir" &&
+	new_file_in_dir=$(git rev-parse HEAD) &&
+
+	git rm dir/new-file &&
+	git commit -m "delete file in dir" &&
+	delete_file_in_dir=$(git rev-parse HEAD) &&
+
+	echo 1 >d-f &&
+	git add d-f &&
+	git commit -m foo &&
+	git rm d-f &&
+	mkdir d-f &&
+	echo 2 >d-f/file &&
+	git add d-f &&
+	git commit -m "replace file with dir" &&
+	file_to_dir=$(git rev-parse HEAD) &&
+
+	>d-f.c &&
+	git add d-f.c &&
+	git commit -m "add a file that sorts between d-f and d-f/" &&
+	git rm -r d-f &&
+	echo 3 >d-f &&
+	git add d-f &&
+	git commit -m "replace dir with file" &&
+	dir_to_file=$(git rev-parse HEAD) &&
+
+	bin_sha1=$(git rev-parse HEAD:dir/subdir | hex2oct) &&
+	# leading zero in mode: the content of the tree remains the same,
+	# but its oid does change!
+	printf "040000 subdir\0$bin_sha1" >rawtree &&
+	tree1=$(git hash-object -t tree -w rawtree) &&
+	git cat-file -p HEAD^{tree} >out &&
+	tree2=$(sed -e "s/$(git rev-parse HEAD:dir/)/$tree1/" out |git mktree) &&
+	different_but_same_tree=$(git commit-tree \
+		-m "leading zeros in mode" \
+		-p $(git rev-parse HEAD) $tree2) &&
+	git update-ref HEAD $different_but_same_tree &&
+
+	git commit-graph write --reachable --changed-paths >out &&
+	cat out  # debug
+'
+
+test_expect_success 'modify one path in root' '
+	git diff --name-status $mod_one_path^ $mod_one_path &&
+	echo "$mod_one_path  1" >expect &&
+	grep "$mod_one_path" out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'modify two file two dirs deep' '
+	git diff --name-status $mod_four_paths^ $mod_four_paths &&
+	echo "$mod_four_paths  2" >expect &&
+	grep "$mod_four_paths" out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'add new file in root' '
+	git diff --name-status $new_file_in_root^ $new_file_in_root &&
+	echo "$new_file_in_root  1" >expect &&
+	grep "$new_file_in_root" out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'delete file in root' '
+	git diff --name-status $delete_file_in_root^ $delete_file_in_root &&
+	echo "$delete_file_in_root  1" >expect &&
+	grep "$delete_file_in_root" out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'add new file in dir' '
+	git diff --name-status $new_file_in_dir^ $new_file_in_dir &&
+	echo "$new_file_in_dir  1" >expect &&
+	grep "$new_file_in_dir" out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'delete file in dir' '
+	git diff --name-status $delete_file_in_dir^ $delete_file_in_dir &&
+	echo "$delete_file_in_dir  1" >expect &&
+	grep "$delete_file_in_dir" out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'replace file with dir' '
+	git diff --name-status $file_to_dir^ $file_to_dir &&
+	echo "$file_to_dir  2" >expect &&
+	grep "$file_to_dir" out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'replace dir with file' '
+	git diff --name-status $dir_to_file^ $dir_to_file &&
+	echo "$dir_to_file  2" >expect &&
+	grep "$dir_to_file" out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'leading zeros in mode' '
+	git diff --name-status $different_but_same_tree^ $different_but_same_tree &&
+	echo "$different_but_same_tree  0" >expect &&
+	grep "$different_but_same_tree" out >actual &&
+	test_cmp expect actual
+'
+
+test_done
diff --git a/tree-diff.c b/tree-diff.c
index 6ebad1a46f3..7cebbb327e2 100644
--- a/tree-diff.c
+++ b/tree-diff.c
@@ -434,7 +434,7 @@ static struct combine_diff_path *ll_diff_tree_paths(
 		if (diff_can_quit_early(opt))
 			break;
 
-		if (opt->max_changes && opt->num_changes > opt->max_changes)
+		if (opt->max_changes && diff_queued_diff.nr > opt->max_changes)
 			break;
 
 		if (opt->pathspec.nr) {
@@ -521,7 +521,6 @@ static struct combine_diff_path *ll_diff_tree_paths(
 
 			/* t↓ */
 			update_tree_entry(&t);
-			opt->num_changes++;
 		}
 
 		/* t > p[imin] */
@@ -539,7 +538,6 @@ static struct combine_diff_path *ll_diff_tree_paths(
 		skip_emit_tp:
 			/* ∀ pi=p[imin]  pi↓ */
 			update_tp_entries(tp, nparent);
-			opt->num_changes++;
 		}
 	}
 
@@ -557,7 +555,6 @@ struct combine_diff_path *diff_tree_paths(
 	const struct object_id **parents_oid, int nparent,
 	struct strbuf *base, struct diff_options *opt)
 {
-	opt->num_changes = 0;
 	p = ll_diff_tree_paths(p, oid, parents_oid, nparent, base, opt);
 
 	/*

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, SZEDER Gábor wrote (reply to this):

On Tue, Aug 04, 2020 at 12:25:45PM -0400, Derrick Stolee wrote:
> On 8/4/2020 10:47 AM, SZEDER Gábor wrote:
> > On Mon, Apr 06, 2020 at 04:59:45PM +0000, Derrick Stolee via GitGitGadget wrote:
> > This counter is basically broken, its value is wrong for over 98% of
> > commits, and, worse, its value remains 0 for over 85% of commits in
> > the repositories I usually use to test modified path Bloom filters.
> > Consequently, a relatively large number of commits modifying more than
> > 512 paths get Bloom filters.
> 
> Thanks for finding this! The counter is only really tested in one
> place, and that test only considers _file adds_, which is a problem.
> 
> If I understand this correctly, the bug is a performance-only bug
> (since this is a performance-only feature), but it is an important
> one to fix.

Or a performance-only feature in a performance-only feature, because
those additional modified path Bloom filters can improve the runtime
of pathspec-limited revision walks (assuming that the false positive
rate is low enough).

> There is certainly some dark magic happening in this tree-diff logic,
> so instead of trying to get an accurate count we should just use the
> magic global diff_queued_diff to track the current list of file changes.
> 
> Note: diff_queued_diff does not track the directory changes, so it
> is an under-count for the total changes to track in the Bloom filter.
> This is later corrected by the block that adds these leading directory
> changes.
> 
> > The makeshift tests in the patch below demonstrate these issues as
> > most of them fail, most notably those two tests that demonstrate that
> > modifying existing paths are not counted at all.
> 
> I adapted your diff along with ripping out 'num_changes' in favor
> of diff_queued_diff.nr. This required modifying some of your expected
> values in the test script (losing the leading directories in the
> count).
> 
> I'll work with Taylor to create a fix, and include proper testing
> of the logic here. We'll stick it in the v2 of his max-changed-paths
> series [1]. He already has some helpful logging that can help create
> tests that ensure this logic is performing as expected.

Don't forget to include a check of the hashmap's size, to make sure.

FWIW, the patch below does result in the correct count (read: the same
as in my implemenation) for all but 4 commits in those repositories I
use for testing, without adding any memory allocations and extra
strcmp() calls.

  ---  >8  ---

diff --git a/cache.h b/cache.h
index 0f0485ecfe..3fc7e1b427 100644
--- a/cache.h
+++ b/cache.h
@@ -1574,6 +1574,7 @@ int repo_interpret_branch_name(struct repository *r,
 int validate_headref(const char *ref);
 
 int base_name_compare(const char *name1, int len1, int mode1, const char *name2, int len2, int mode2);
+int base_name_compare_df(const char *name1, int len1, int mode1, const char *name2, int len2, int mode2, int *df);
 int df_name_compare(const char *name1, int len1, int mode1, const char *name2, int len2, int mode2);
 int name_compare(const char *name1, size_t len1, const char *name2, size_t len2);
 int cache_name_stage_compare(const char *name1, int len1, int stage1, const char *name2, int len2, int stage2);
diff --git a/read-cache.c b/read-cache.c
index aa427c5c17..041af19e60 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -460,13 +460,16 @@ int ie_modified(struct index_state *istate,
 	return 0;
 }
 
-int base_name_compare(const char *name1, int len1, int mode1,
-		      const char *name2, int len2, int mode2)
+int base_name_compare_df(const char *name1, int len1, int mode1,
+			 const char *name2, int len2, int mode2,
+			 int *df)
 {
 	unsigned char c1, c2;
 	int len = len1 < len2 ? len1 : len2;
 	int cmp;
 
+	*df = 0;
+
 	cmp = memcmp(name1, name2, len);
 	if (cmp)
 		return cmp;
@@ -476,7 +479,21 @@ int base_name_compare(const char *name1, int len1, int mode1,
 		c1 = '/';
 	if (!c2 && S_ISDIR(mode2))
 		c2 = '/';
-	return (c1 < c2) ? -1 : (c1 > c2) ? 1 : 0;
+	if (c1 == c2)
+		return 0;	/* TODO: is this even possible? */
+	if ((c1 == '/' && !c2) ||
+	    (!c1 && c2 == '/'))
+		*df = 1;
+	return (c1 < c2) ? -1 : 1;
+}
+
+int base_name_compare(const char *name1, int len1, int mode1,
+		      const char *name2, int len2, int mode2)
+{
+	int unused;
+	return base_name_compare_df(name1, len1, mode1,
+				    name2, len2, mode2,
+				    &unused);
 }
 
 /*
diff --git a/t/t9999-test.sh b/t/t9999-test.sh
index 8d2bd9f03f..4f08590b45 100755
--- a/t/t9999-test.sh
+++ b/t/t9999-test.sh
@@ -125,7 +125,7 @@ test_expect_success 'replace file with dir' '
 	test_cmp expect actual
 '
 
-test_expect_success 'replace dir with file' '
+test_expect_failure 'replace dir with file' '
 	git diff --name-status $dir_to_file^ $dir_to_file &&
 	echo "$dir_to_file  2" >expect &&
 	grep "$dir_to_file" out >actual &&
diff --git a/tree-diff.c b/tree-diff.c
index f3d303c6e5..e27f9c805e 100644
--- a/tree-diff.c
+++ b/tree-diff.c
@@ -46,11 +46,14 @@ static int ll_diff_tree_oid(const struct object_id *old_oid,
  *      Due to this convention, if trees are scanned in sorted order, all
  *      non-empty descriptors will be processed first.
  */
-static int tree_entry_pathcmp(struct tree_desc *t1, struct tree_desc *t2)
+static int tree_entry_pathcmp(struct tree_desc *t1, struct tree_desc *t2,
+			      int *df)
 {
 	struct name_entry *e1, *e2;
 	int cmp;
 
+	*df = 0;
+
 	/* empty descriptors sort after valid tree entries */
 	if (!t1->size)
 		return t2->size ? 1 : 0;
@@ -59,8 +62,9 @@ static int tree_entry_pathcmp(struct tree_desc *t1, struct tree_desc *t2)
 
 	e1 = &t1->entry;
 	e2 = &t2->entry;
-	cmp = base_name_compare(e1->path, tree_entry_len(e1), e1->mode,
-				e2->path, tree_entry_len(e2), e2->mode);
+	cmp = base_name_compare_df(e1->path, tree_entry_len(e1), e1->mode,
+				   e2->path, tree_entry_len(e2), e2->mode,
+				   df);
 	return cmp;
 }
 
@@ -410,7 +414,7 @@ static struct combine_diff_path *ll_diff_tree_paths(
 {
 	struct tree_desc t, *tp;
 	void *ttree, **tptree;
-	int i;
+	int i, df;
 
 	FAST_ARRAY_ALLOC(tp, nparent);
 	FAST_ARRAY_ALLOC(tptree, nparent);
@@ -463,7 +467,7 @@ static struct combine_diff_path *ll_diff_tree_paths(
 		tp[0].entry.mode &= ~S_IFXMIN_NEQ;
 
 		for (i = 1; i < nparent; ++i) {
-			cmp = tree_entry_pathcmp(&tp[i], &tp[imin]);
+			cmp = tree_entry_pathcmp(&tp[i], &tp[imin], &df);
 			if (cmp < 0) {
 				imin = i;
 				tp[i].entry.mode &= ~S_IFXMIN_NEQ;
@@ -483,10 +487,12 @@ static struct combine_diff_path *ll_diff_tree_paths(
 
 
 		/* compare t vs p[imin] */
-		cmp = tree_entry_pathcmp(&t, &tp[imin]);
+		cmp = tree_entry_pathcmp(&t, &tp[imin], &df);
 
 		/* t = p[imin] */
 		if (cmp == 0) {
+			int prev_num_changes = opt->num_changes;
+
 			/* are either pi > p[imin] or diff(t,pi) != ø ? */
 			if (!opt->flags.find_copies_harder) {
 				for (i = 0; i < nparent; ++i) {
@@ -506,6 +512,9 @@ static struct combine_diff_path *ll_diff_tree_paths(
 			/* D += {δ(t,pi) if pi=p[imin];  "+a" if pi > p[imin]} */
 			p = emit_path(p, base, opt, nparent,
 					&t, tp, imin);
+			if (!(opt->num_changes == prev_num_changes &&
+			      S_ISDIR(t.entry.mode)))
+				opt->num_changes++;
 
 		skip_emit_t_tp:
 			/* t↓,  ∀ pi=p[imin]  pi↓ */
@@ -518,10 +527,11 @@ static struct combine_diff_path *ll_diff_tree_paths(
 			/* D += "+t" */
 			p = emit_path(p, base, opt, nparent,
 					&t, /*tp=*/NULL, -1);
+			if (!df)
+				opt->num_changes++;
 
 			/* t↓ */
 			update_tree_entry(&t);
-			opt->num_changes++;
 		}
 
 		/* t > p[imin] */
@@ -535,11 +545,12 @@ static struct combine_diff_path *ll_diff_tree_paths(
 
 			p = emit_path(p, base, opt, nparent,
 					/*t=*/NULL, tp, imin);
+			if (!df)
+				opt->num_changes++;
 
 		skip_emit_tp:
 			/* ∀ pi=p[imin]  pi↓ */
 			update_tp_entries(tp, nparent);
-			opt->num_changes++;
 		}
 	}
 
  ---  >8  ---


Having said that, the best (i.e faster and accurate) solution to this
issue is probably:

  - Update the callchain between diff_tree_oid() and the diff callback
    functions to allow the callbacks to break diffing with a non-zero
    error code.

  - Fill Bloom filters using the approach presented in:

      https://public-inbox.org/git/[email protected]/

    but modify the callbacks to return non-zero when too many paths
    have been processed.

  - Drop this counter entirely, as there are no other users.

> We plan to have that fix available by later today or early tomorrow.
> Will you be available to help validate it?
> 
> [1] https://lore.kernel.org/git/[email protected]/
> 
> Thanks,
> -Stolee
> 
>   --- >8 ---
> 
> diff --git a/bloom.c b/bloom.c
> index 1a573226e7..b8d6cb9240 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -218,8 +218,9 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>  	else
>  		diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
>  	diffcore_std(&diffopt);
> +	printf("%s  %d\n", oid_to_hex(&c->object.oid), diff_queued_diff.nr);
>  
> -	if (diffopt.num_changes <= max_changes) {
> +	if (diff_queued_diff.nr <= max_changes) {
>  		struct hashmap pathmap;
>  		struct pathmap_hash_entry *e;
>  		struct hashmap_iter iter;
> diff --git a/diff.h b/diff.h
> index e0c0af6286b..1d32b718857 100644
> --- a/diff.h
> +++ b/diff.h
> @@ -287,8 +287,6 @@ struct diff_options {
>  
>  	/* If non-zero, then stop computing after this many changes. */
>  	int max_changes;
> -	/* For internal use only. */
> -	int num_changes;
>  
>  	int ita_invisible_in_index;
>  /* white-space error highlighting */
> diff --git a/t/t9999-test.sh b/t/t9999-test.sh
> new file mode 100755
> index 00000000000..1f35aa8e2c5
> --- /dev/null
> +++ b/t/t9999-test.sh
> @@ -0,0 +1,142 @@
> +#!/bin/sh
> +
> +test_description='test'
> +
> +. ./test-lib.sh
> +
> +test_expect_success 'setup' '
> +	test_tick &&
> +
> +	echo 1 >file &&
> +	mkdir -p dir/subdir &&
> +	echo 1 >dir/subdir/file1 &&
> +	echo 1 >dir/subdir/file2 &&
> +	git add file dir &&
> +	git commit -m setup &&
> +
> +	echo 2 >file &&
> +	git commit -a -m "modify one path in root" &&
> +	mod_one_path=$(git rev-parse HEAD) &&
> +
> +	echo 2 >dir/subdir/file1 &&
> +	echo 2 >dir/subdir/file2 &&
> +	git commit -a -m "modify two file two dirs deep" &&
> +	mod_four_paths=$(git rev-parse HEAD) &&
> +
> +	>new-file &&
> +	git add new-file &&
> +	git commit -m "add new file in root" &&
> +	new_file_in_root=$(git rev-parse HEAD) &&
> +
> +	git rm new-file &&
> +	git commit -m "delete file in root" &&
> +	delete_file_in_root=$(git rev-parse HEAD) &&
> +
> +	>dir/new-file &&
> +	git add dir/new-file &&
> +	git commit -m "add new file in dir" &&
> +	new_file_in_dir=$(git rev-parse HEAD) &&
> +
> +	git rm dir/new-file &&
> +	git commit -m "delete file in dir" &&
> +	delete_file_in_dir=$(git rev-parse HEAD) &&
> +
> +	echo 1 >d-f &&
> +	git add d-f &&
> +	git commit -m foo &&
> +	git rm d-f &&
> +	mkdir d-f &&
> +	echo 2 >d-f/file &&
> +	git add d-f &&
> +	git commit -m "replace file with dir" &&
> +	file_to_dir=$(git rev-parse HEAD) &&
> +
> +	>d-f.c &&
> +	git add d-f.c &&
> +	git commit -m "add a file that sorts between d-f and d-f/" &&
> +	git rm -r d-f &&
> +	echo 3 >d-f &&
> +	git add d-f &&
> +	git commit -m "replace dir with file" &&
> +	dir_to_file=$(git rev-parse HEAD) &&
> +
> +	bin_sha1=$(git rev-parse HEAD:dir/subdir | hex2oct) &&
> +	# leading zero in mode: the content of the tree remains the same,
> +	# but its oid does change!
> +	printf "040000 subdir\0$bin_sha1" >rawtree &&
> +	tree1=$(git hash-object -t tree -w rawtree) &&
> +	git cat-file -p HEAD^{tree} >out &&
> +	tree2=$(sed -e "s/$(git rev-parse HEAD:dir/)/$tree1/" out |git mktree) &&
> +	different_but_same_tree=$(git commit-tree \
> +		-m "leading zeros in mode" \
> +		-p $(git rev-parse HEAD) $tree2) &&
> +	git update-ref HEAD $different_but_same_tree &&
> +
> +	git commit-graph write --reachable --changed-paths >out &&
> +	cat out  # debug
> +'
> +
> +test_expect_success 'modify one path in root' '
> +	git diff --name-status $mod_one_path^ $mod_one_path &&
> +	echo "$mod_one_path  1" >expect &&
> +	grep "$mod_one_path" out >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'modify two file two dirs deep' '
> +	git diff --name-status $mod_four_paths^ $mod_four_paths &&
> +	echo "$mod_four_paths  2" >expect &&
> +	grep "$mod_four_paths" out >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'add new file in root' '
> +	git diff --name-status $new_file_in_root^ $new_file_in_root &&
> +	echo "$new_file_in_root  1" >expect &&
> +	grep "$new_file_in_root" out >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'delete file in root' '
> +	git diff --name-status $delete_file_in_root^ $delete_file_in_root &&
> +	echo "$delete_file_in_root  1" >expect &&
> +	grep "$delete_file_in_root" out >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'add new file in dir' '
> +	git diff --name-status $new_file_in_dir^ $new_file_in_dir &&
> +	echo "$new_file_in_dir  1" >expect &&
> +	grep "$new_file_in_dir" out >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'delete file in dir' '
> +	git diff --name-status $delete_file_in_dir^ $delete_file_in_dir &&
> +	echo "$delete_file_in_dir  1" >expect &&
> +	grep "$delete_file_in_dir" out >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'replace file with dir' '
> +	git diff --name-status $file_to_dir^ $file_to_dir &&
> +	echo "$file_to_dir  2" >expect &&
> +	grep "$file_to_dir" out >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'replace dir with file' '
> +	git diff --name-status $dir_to_file^ $dir_to_file &&
> +	echo "$dir_to_file  2" >expect &&
> +	grep "$dir_to_file" out >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'leading zeros in mode' '
> +	git diff --name-status $different_but_same_tree^ $different_but_same_tree &&
> +	echo "$different_but_same_tree  0" >expect &&
> +	grep "$different_but_same_tree" out >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_done
> diff --git a/tree-diff.c b/tree-diff.c
> index 6ebad1a46f3..7cebbb327e2 100644
> --- a/tree-diff.c
> +++ b/tree-diff.c
> @@ -434,7 +434,7 @@ static struct combine_diff_path *ll_diff_tree_paths(
>  		if (diff_can_quit_early(opt))
>  			break;
>  
> -		if (opt->max_changes && opt->num_changes > opt->max_changes)
> +		if (opt->max_changes && diff_queued_diff.nr > opt->max_changes)
>  			break;
>  
>  		if (opt->pathspec.nr) {
> @@ -521,7 +521,6 @@ static struct combine_diff_path *ll_diff_tree_paths(
>  
>  			/* t↓ */
>  			update_tree_entry(&t);
> -			opt->num_changes++;
>  		}
>  
>  		/* t > p[imin] */
> @@ -539,7 +538,6 @@ static struct combine_diff_path *ll_diff_tree_paths(
>  		skip_emit_tp:
>  			/* ∀ pi=p[imin]  pi↓ */
>  			update_tp_entries(tp, nparent);
> -			opt->num_changes++;
>  		}
>  	}
>  
> @@ -557,7 +555,6 @@ struct combine_diff_path *diff_tree_paths(
>  	const struct object_id **parents_oid, int nparent,
>  	struct strbuf *base, struct diff_options *opt)
>  {
> -	opt->num_changes = 0;
>  	p = ll_diff_tree_paths(p, oid, parents_oid, nparent, base, opt);
>  
>  	/*

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 8/4/2020 1:00 PM, SZEDER Gábor wrote:
> On Tue, Aug 04, 2020 at 12:25:45PM -0400, Derrick Stolee wrote:
>> On 8/4/2020 10:47 AM, SZEDER Gábor wrote:
>>> On Mon, Apr 06, 2020 at 04:59:45PM +0000, Derrick Stolee via GitGitGadget wrote:
>>> This counter is basically broken, its value is wrong for over 98% of
>>> commits, and, worse, its value remains 0 for over 85% of commits in
>>> the repositories I usually use to test modified path Bloom filters.
>>> Consequently, a relatively large number of commits modifying more than
>>> 512 paths get Bloom filters.
>>
>> Thanks for finding this! The counter is only really tested in one
>> place, and that test only considers _file adds_, which is a problem.
>>
>> If I understand this correctly, the bug is a performance-only bug
>> (since this is a performance-only feature), but it is an important
>> one to fix.
> 
> Or a performance-only feature in a performance-only feature, because
> those additional modified path Bloom filters can improve the runtime
> of pathspec-limited revision walks (assuming that the false positive
> rate is low enough).
> 
>> There is certainly some dark magic happening in this tree-diff logic,
>> so instead of trying to get an accurate count we should just use the
>> magic global diff_queued_diff to track the current list of file changes.
>>
>> Note: diff_queued_diff does not track the directory changes, so it
>> is an under-count for the total changes to track in the Bloom filter.
>> This is later corrected by the block that adds these leading directory
>> changes.
>>
>>> The makeshift tests in the patch below demonstrate these issues as
>>> most of them fail, most notably those two tests that demonstrate that
>>> modifying existing paths are not counted at all.
>>
>> I adapted your diff along with ripping out 'num_changes' in favor
>> of diff_queued_diff.nr. This required modifying some of your expected
>> values in the test script (losing the leading directories in the
>> count).
>>
>> I'll work with Taylor to create a fix, and include proper testing
>> of the logic here. We'll stick it in the v2 of his max-changed-paths
>> series [1]. He already has some helpful logging that can help create
>> tests that ensure this logic is performing as expected.
> 
> Don't forget to include a check of the hashmap's size, to make sure.

Yes, thanks for the pointer. That check is currently not in there,
since the code assumes the hashmap's size will match num_changes.
Hopefully, the tests I intend to write around this would have caught
such an omission.
 
> FWIW, the patch below does result in the correct count (read: the same
> as in my implemenation) for all but 4 commits in those repositories I
> use for testing, without adding any memory allocations and extra
> strcmp() calls.

...

> Having said that, the best (i.e faster and accurate) solution to this
> issue is probably:
> 
>   - Update the callchain between diff_tree_oid() and the diff callback
>     functions to allow the callbacks to break diffing with a non-zero
>     error code.

It looks like this part would not be too difficult. The pathchange
callback is called by emit_path() which returns a struct combine_diff_path
pointer. This could return NULL to signal an early termination, but
we need to update all callers of the following methods to handle NULL
responses:

 * emit_path()
 * ll_diff_tree_paths()
 * diff_tree_paths()

Of some interest: diff_tree_paths() returns a struct combine_diff_path
pointer, but no callers seem to consume it.

>   - Fill Bloom filters using the approach presented in:
> 
>       https://public-inbox.org/git/[email protected]/
> 
>     but modify the callbacks to return non-zero when too many paths
>     have been processed.

Thanks for the pointer to that specific patch. You do a good job of
describing your thought process, including why you used the callback
approach instead of the diff queue approach. The main reason seemed to
be memory overhead from populating the entire diff queue before
checking the limit.

However, if we are using the diff queue as the short-circuit, then
perhaps that memory overhead isn't as much of a problem?

You admit yourself, that

  This patch implements a more efficient, but more complex, approach:

The logic around matching prefixes definitely seems complex and
hard to test, especially around the file/directory changes with the
sort order problems that have plagued similar prefix checks recently.
I'm not doubting your implementation, just saying that the complexity
is worth considering before jumping to that solution too quickly.

To sum up, I intend to start with a fix that uses the diff queue
count as a limit, then try the callback approach to see if there are
measurable improvements in performance.

>   - Drop this counter entirely, as there are no other users.

With the callback approach, "this counter" is both num_changes and
max_changes, since the callback would perform all of the short-circuit
logic.

Thanks,
-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 8/4/2020 1:31 PM, Derrick Stolee wrote:
> On 8/4/2020 1:00 PM, SZEDER Gábor wrote:

>> Having said that, the best (i.e faster and accurate) solution to this
>> issue is probably:
>>
>>   - Update the callchain between diff_tree_oid() and the diff callback
>>     functions to allow the callbacks to break diffing with a non-zero
>>     error code.
> 
> It looks like this part would not be too difficult. 

Oh, my hubris! I gave this a shot for some time this morning. This
will definitely take some work to do right. Just changing the callbacks
to return 'int' is a wide-sweeping change, but the place where they are
called already has an 'int' return that means something different.

I'm not saying this is impossible. It just takes more attention and care
than I can currently devote, given my other works in progress right now.

>>   - Fill Bloom filters using the approach presented in:
>>
>>       https://public-inbox.org/git/[email protected]/
>>
>>     but modify the callbacks to return non-zero when too many paths
>>     have been processed.
> 
> Thanks for the pointer to that specific patch. You do a good job of
> describing your thought process, including why you used the callback
> approach instead of the diff queue approach. The main reason seemed to
> be memory overhead from populating the entire diff queue before
> checking the limit.
> 
> However, if we are using the diff queue as the short-circuit, then
> perhaps that memory overhead isn't as much of a problem?
> 
> You admit yourself, that
> 
>   This patch implements a more efficient, but more complex, approach:
> 
> The logic around matching prefixes definitely seems complex and
> hard to test, especially around the file/directory changes with the
> sort order problems that have plagued similar prefix checks recently.
> I'm not doubting your implementation, just saying that the complexity
> is worth considering before jumping to that solution too quickly.
> 
> To sum up, I intend to start with a fix that uses the diff queue
> count as a limit, then try the callback approach to see if there are
> measurable improvements in performance.

That fix is now available [1].

[1] https://lore.kernel.org/git/d1c4bbcaa9627068d5d9fbd0e4a2e8c8834a4bd3.1596646576.git.me@ttaylorr.com/

Again, the callback approach seems promising. The complexity is
stopping me from trying to apply it on top of the current
implementation, while I should be focusing on other things. I completely
believe that that approach is faster and more memory-efficient. I would
love to test and review a patch that takes that approach here.

Thanks,
-Stolee

#include "bloom.h"
#include "diff.h"
#include "diffcore.h"
#include "revision.h"
#include "hashmap.h"
#include "commit-graph.h"
#include "commit.h"

define_commit_slab(bloom_filter_slab, struct bloom_filter);

struct bloom_filter_slab bloom_filters;

struct pathmap_hash_entry {
struct hashmap_entry entry;
const char path[FLEX_ARRAY];
};

static uint32_t rotate_left(uint32_t value, int32_t count)
{
uint32_t mask = 8 * sizeof(uint32_t) - 1;
count &= mask;
return ((value << count) | (value >> ((-count) & mask)));
}

static inline unsigned char get_bitmask(uint32_t pos)
{
return ((unsigned char)1) << (pos & (BITS_PER_WORD - 1));
}

static int load_bloom_filter_from_graph(struct commit_graph *g,
struct bloom_filter *filter,
struct commit *c)
{
uint32_t lex_pos, start_index, end_index;

while (c->graph_pos < g->num_commits_in_base)
g = g->base_graph;

/* The commit graph commit 'c' lives in doesn't carry bloom filters. */
if (!g->chunk_bloom_indexes)
return 0;

lex_pos = c->graph_pos - g->num_commits_in_base;

end_index = get_be32(g->chunk_bloom_indexes + 4 * lex_pos);

if (lex_pos > 0)
start_index = get_be32(g->chunk_bloom_indexes + 4 * (lex_pos - 1));
else
start_index = 0;

filter->len = end_index - start_index;
filter->data = (unsigned char *)(g->chunk_bloom_data +
sizeof(unsigned char) * start_index +
BLOOMDATA_CHUNK_HEADER_SIZE);

return 1;
}

/*
* Calculate the murmur3 32-bit hash value for the given data
* using the given seed.
* Produces a uniformly distributed hash value.
* Not considered to be cryptographically secure.
* Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
*/
uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len)
{
const uint32_t c1 = 0xcc9e2d51;
const uint32_t c2 = 0x1b873593;
const uint32_t r1 = 15;
const uint32_t r2 = 13;
const uint32_t m = 5;
const uint32_t n = 0xe6546b64;
int i;
uint32_t k1 = 0;
const char *tail;

int len4 = len / sizeof(uint32_t);

uint32_t k;
for (i = 0; i < len4; i++) {
uint32_t byte1 = (uint32_t)data[4*i];
uint32_t byte2 = ((uint32_t)data[4*i + 1]) << 8;
uint32_t byte3 = ((uint32_t)data[4*i + 2]) << 16;
uint32_t byte4 = ((uint32_t)data[4*i + 3]) << 24;
k = byte1 | byte2 | byte3 | byte4;
k *= c1;
k = rotate_left(k, r1);
k *= c2;

seed ^= k;
seed = rotate_left(seed, r2) * m + n;
}

tail = (data + len4 * sizeof(uint32_t));

switch (len & (sizeof(uint32_t) - 1)) {
case 3:
k1 ^= ((uint32_t)tail[2]) << 16;
/*-fallthrough*/
case 2:
k1 ^= ((uint32_t)tail[1]) << 8;
/*-fallthrough*/
case 1:
k1 ^= ((uint32_t)tail[0]) << 0;
k1 *= c1;
k1 = rotate_left(k1, r1);
k1 *= c2;
seed ^= k1;
break;
}

seed ^= (uint32_t)len;
seed ^= (seed >> 16);
seed *= 0x85ebca6b;
seed ^= (seed >> 13);
seed *= 0xc2b2ae35;
seed ^= (seed >> 16);

return seed;
}

void fill_bloom_key(const char *data,
size_t len,
struct bloom_key *key,
const struct bloom_filter_settings *settings)
{
int i;
const uint32_t seed0 = 0x293ae76f;
const uint32_t seed1 = 0x7e646e2c;
const uint32_t hash0 = murmur3_seeded(seed0, data, len);
const uint32_t hash1 = murmur3_seeded(seed1, data, len);

key->hashes = (uint32_t *)xcalloc(settings->num_hashes, sizeof(uint32_t));
for (i = 0; i < settings->num_hashes; i++)
key->hashes[i] = hash0 + i * hash1;
}

void add_key_to_filter(const struct bloom_key *key,
struct bloom_filter *filter,
const struct bloom_filter_settings *settings)
{
int i;
uint64_t mod = filter->len * BITS_PER_WORD;

for (i = 0; i < settings->num_hashes; i++) {
uint64_t hash_mod = key->hashes[i] % mod;
uint64_t block_pos = hash_mod / BITS_PER_WORD;

filter->data[block_pos] |= get_bitmask(hash_mod);
}
}

void init_bloom_filters(void)
{
init_bloom_filter_slab(&bloom_filters);
}

struct bloom_filter *get_bloom_filter(struct repository *r,
struct commit *c,
int compute_if_not_present)
{
struct bloom_filter *filter;
struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
int i;
struct diff_options diffopt;
int max_changes = 512;

if (bloom_filters.slab_size == 0)
return NULL;

filter = bloom_filter_slab_at(&bloom_filters, c);

if (!filter->data) {
load_commit_graph_info(r, c);
if (c->graph_pos != COMMIT_NOT_FROM_GRAPH &&
r->objects->commit_graph->chunk_bloom_indexes) {
if (load_bloom_filter_from_graph(r->objects->commit_graph, filter, c))
return filter;
else
return NULL;
}
}

if (filter->data || !compute_if_not_present)
return filter;

repo_diff_setup(r, &diffopt);
diffopt.flags.recursive = 1;
diffopt.max_changes = max_changes;
diff_setup_done(&diffopt);

if (c->parents)
diff_tree_oid(&c->parents->item->object.oid, &c->object.oid, "", &diffopt);
else
diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
diffcore_std(&diffopt);

if (diff_queued_diff.nr <= max_changes) {
struct hashmap pathmap;
struct pathmap_hash_entry *e;
struct hashmap_iter iter;
hashmap_init(&pathmap, NULL, NULL, 0);

for (i = 0; i < diff_queued_diff.nr; i++) {
const char *path = diff_queued_diff.queue[i]->two->path;

/*
* Add each leading directory of the changed file, i.e. for
* 'dir/subdir/file' add 'dir' and 'dir/subdir' as well, so
* the Bloom filter could be used to speed up commands like
* 'git log dir/subdir', too.
*
* Note that directories are added without the trailing '/'.
*/
do {
char *last_slash = strrchr(path, '/');

FLEX_ALLOC_STR(e, path, path);
hashmap_entry_init(&e->entry, strhash(path));
hashmap_add(&pathmap, &e->entry);

if (!last_slash)
last_slash = (char*)path;
*last_slash = '\0';

} while (*path);

diff_free_filepair(diff_queued_diff.queue[i]);
}

filter->len = (hashmap_get_size(&pathmap) * settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
filter->data = xcalloc(filter->len, sizeof(unsigned char));

hashmap_for_each_entry(&pathmap, &iter, e, entry) {
struct bloom_key key;
fill_bloom_key(e->path, strlen(e->path), &key, &settings);
add_key_to_filter(&key, filter, &settings);
}

hashmap_free_entries(&pathmap, struct pathmap_hash_entry, entry);
} else {
for (i = 0; i < diff_queued_diff.nr; i++)
diff_free_filepair(diff_queued_diff.queue[i]);
filter->data = NULL;
filter->len = 0;
}

free(diff_queued_diff.queue);
DIFF_QUEUE_CLEAR(&diff_queued_diff);

return filter;
}

int bloom_filter_contains(const struct bloom_filter *filter,
const struct bloom_key *key,
const struct bloom_filter_settings *settings)
{
int i;
uint64_t mod = filter->len * BITS_PER_WORD;

if (!mod)
return -1;

for (i = 0; i < settings->num_hashes; i++) {
uint64_t hash_mod = key->hashes[i] % mod;
uint64_t block_pos = hash_mod / BITS_PER_WORD;
if (!(filter->data[block_pos] & get_bitmask(hash_mod)))
return 0;
}

return 1;
}
Loading