Skip to content

Harden the sparse-checkout builtin #513

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
19 changes: 15 additions & 4 deletions Documentation/git-sparse-checkout.txt
Original file line number Diff line number Diff line change
Expand Up @@ -50,6 +50,14 @@ To avoid interfering with other worktrees, it first enables the
+
When the `--stdin` option is provided, the patterns are read from
standard in as a newline-delimited list instead of from the arguments.
+
When `core.sparseCheckoutCone` is enabled, the input list is considered a
list of directories instead of sparse-checkout patterns. The command writes
patterns to the sparse-checkout file to include all files contained in those
directories (recursively) as well as files that are siblings of ancestor
directories. The input format matches the output of `git ls-tree --name-only`.
This includes interpreting pathnames that begin with a double quote (") as
C-style quoted strings.

'disable'::
Disable the `core.sparseCheckout` config setting, and restore the
Expand Down Expand Up @@ -106,7 +114,7 @@ The full pattern set allows for arbitrary pattern matches and complicated
inclusion/exclusion rules. These can result in O(N*M) pattern matches when
updating the index, where N is the number of patterns and M is the number
of paths in the index. To combat this performance issue, a more restricted
pattern set is allowed when `core.spareCheckoutCone` is enabled.
pattern set is allowed when `core.sparseCheckoutCone` is enabled.

The accepted patterns in the cone pattern set are:

Expand All @@ -128,9 +136,12 @@ the following patterns:
----------------

This says "include everything in root, but nothing two levels below root."
If we then add the folder `A/B/C` as a recursive pattern, the folders `A` and
`A/B` are added as parent patterns. The resulting sparse-checkout file is
now

When in cone mode, the `git sparse-checkout set` subcommand takes a list of
directories instead of a list of sparse-checkout patterns. In this mode,
the command `git sparse-checkout set A/B/C` sets the directory `A/B/C` as
a recursive pattern, the directories `A` and `A/B` are added as parent
patterns. The resulting sparse-checkout file is now

----------------
/*
Expand Down
2 changes: 1 addition & 1 deletion builtin/clone.c
Original file line number Diff line number Diff line change
Expand Up @@ -1130,7 +1130,7 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
if (option_required_reference.nr || option_optional_reference.nr)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Taylor Blau wrote (reply to this):

Hi Stolee,

On Tue, Jan 14, 2020 at 07:25:57PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <[email protected]>
>
> The --sparse option was added to the clone builtin in d89f09c (clone:
> add --sparse mode, 2019-11-21) and was tested with a local path clone
> in t1091-sparse-checkout-builtin.sh. However, due to a difference in
> how local paths are handled versus URLs, this mechanism does not work
> with URLs.

As we discussed off-list, both of us (as well as Peff) were able to
reproduce this issue. I think that this paragraph is a good description
of what's going on heee.

> Modify the test to use a "file://" URL, which would output this error
> before the code change:
>
>   Cloning into 'clone'...
>   fatal: cannot change to 'file://.../repo': No such file or directory
>   error: failed to initialize sparse-checkout

Nice, this should give us confidence that there won't be a regression
here in the future. I don't think that the explanation is complicated
enough for a single commit which introduced an expected failure, so
grouping it all together in this patch seems good to me.

> These errors are due to using a "-C <path>" option to call 'git -C
> <path> sparse-checkout init' but the URL is being given instead of
> the target directory.
>
> Update that target directory to evaluate this correctly. I have also
> manually tested that https:// URLs are handled correctly as well.
>
> Signed-off-by: Derrick Stolee <[email protected]>
> ---
>  builtin/clone.c                    | 2 +-
>  t/t1091-sparse-checkout-builtin.sh | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/builtin/clone.c b/builtin/clone.c
> index 4348d962c9..2caefc44fb 100644
> --- a/builtin/clone.c
> +++ b/builtin/clone.c
> @@ -1130,7 +1130,7 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
>  	if (option_required_reference.nr || option_optional_reference.nr)
>  		setup_reference();
>
> -	if (option_sparse_checkout && git_sparse_checkout_init(repo))
> +	if (option_sparse_checkout && git_sparse_checkout_init(dir))

I agree that 'dir' is the right thing to use here. It's the string we
read from to print "Cloning into ...", which always displays the
directory relative to the cwd. Looking at the implementation in
'git_sparse_checkout_init', this matches my understanding, too.

>  		return 1;
>
>  	remote = remote_get(option_origin);
> diff --git a/t/t1091-sparse-checkout-builtin.sh b/t/t1091-sparse-checkout-builtin.sh
> index 37365dc668..58d9c69163 100755
> --- a/t/t1091-sparse-checkout-builtin.sh
> +++ b/t/t1091-sparse-checkout-builtin.sh
> @@ -90,7 +90,7 @@ test_expect_success 'init with existing sparse-checkout' '
>  '
>
>  test_expect_success 'clone --sparse' '
> -	git clone --sparse repo clone &&
> +	git clone --sparse "file://$(pwd)/repo" clone &&
>  	git -C clone sparse-checkout list >actual &&
>  	cat >expect <<-\EOF &&
>  	/*
> --
> gitgitgadget

This all looks good to me.

  Acked-by: Taylor Blau <[email protected]>

Thanks,
Taylor

setup_reference();

if (option_sparse_checkout && git_sparse_checkout_init(repo))
if (option_sparse_checkout && git_sparse_checkout_init(dir))
return 1;

remote = remote_get(option_origin);
Expand Down
48 changes: 43 additions & 5 deletions builtin/sparse-checkout.c
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,7 @@
#include "resolve-undo.h"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jeff King wrote (reply to this):

On Tue, Jan 28, 2020 at 06:26:41PM +0000, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <[email protected]>
> 
> If a user somehow creates a directory with an asterisk (*) or backslash
> (\), then the "git sparse-checkout set" command will struggle to provide
> the correct pattern in the sparse-checkout file. When not in cone mode,
> the provided pattern is written directly into the sparse-checkout file.
> However, in cone mode we expect a list of paths to directories and then
> we convert those into patterns.

Is this really about cone mode? It seems more like it is about --stdin.
I.e., what are the rules for when the input is a filename and when it is
a pattern? In our earlier discussion, I had assumed that command-line
arguments to "sparse-checkout set" were actual filenames, and "--stdin"
just read them from stdin.

But looking at the documentation, they are always called "patterns" on
the command-line. Should the "--stdin" documentation make it clear that
we are no longer taking patterns, but instead actual filenames?

Or am I confused, and in non-cone-mode the "ls-tree | sparse-checkout"
pipeline is not supposed to work at all? (I.e., they really are always
patterns)?

> Even more specifically, the goal is to always allow the following from
> the root of a repo:
> 
>   git ls-tree --name-only -d HEAD | git sparse-checkout set --stdin
> 
> The ls-tree command provides directory names with an unescaped asterisk.
> It also quotes the directories that contain an escaped backslash. We
> must remove these quotes, then keep the escaped backslashes.
> 
> However, there is some care needed for the timing of these escapes. The
> in-memory pattern list is used to update the working directory before
> writing the patterns to disk. Thus, we need the command to have the
> unescaped names in the hashsets for the cone comparisons, then escape
> the patterns later.

OK, this part make sense.

> Use unquote_c_style() when parsing lines from stdin. Command-line
> arguments will be parsed as-is, assuming the user can do the correct
> level of escaping from their environment to match the exact directory
> names.

I think there's two issues here: escaping characters from the shell so
that they make it intact to Git, and the question of whether Git is
expecting patterns or raw filenames. I agree the user is responsible for
the shell half, but I think we need to clarify what we're expecting.
I.e., if I say:

 git sparse-checkout set 'f*'

am I trying to match "foo", or the literal file "f*"?

> +static char *escaped_pattern(char *pattern)
> +{
> +	char *p = pattern;
> +	struct strbuf final = STRBUF_INIT;
> +
> +	while (*p) {
> +		if (*p == '*' || *p == '\\')
> +			strbuf_addch(&final, '\\');
> +
> +		strbuf_addch(&final, *p);
> +		p++;
> +	}
> +
> +	return strbuf_detach(&final, NULL);
> +}

Do we need to catch other metacharacters here (using is_glob_special()
perhaps)?

> @@ -423,8 +442,21 @@ static int sparse_checkout_set(int argc, const char **argv, const char *prefix)
>  		pl.use_cone_patterns = 1;
>  
>  		if (set_opts.use_stdin) {
> -			while (!strbuf_getline(&line, stdin))
> +			struct strbuf unquoted = STRBUF_INIT;
> +			while (!strbuf_getline(&line, stdin)) {
> +				if (line.buf[0] == '"') {
> +					strbuf_setlen(&unquoted, 0);

A minor nit, but strbuf_reset(&unquoted) would be more idiomatic here.

> +					if (unquote_c_style(&unquoted, line.buf, NULL))
> +						die(_("unable to unquote C-style string '%s'"),
> +						line.buf);
> +
> +					strbuf_swap(&unquoted, &line);
> +				}
> +
>  				strbuf_to_cone_pattern(&line, &pl);
> +			}

OK, overall this input procedure makes sense to me.

-Peff

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jeff King wrote (reply to this):

On Wed, Jan 29, 2020 at 05:17:13AM -0500, Jeff King wrote:

> > From: Derrick Stolee <[email protected]>
> > 
> > If a user somehow creates a directory with an asterisk (*) or backslash
> > (\), then the "git sparse-checkout set" command will struggle to provide
> > the correct pattern in the sparse-checkout file. When not in cone mode,
> > the provided pattern is written directly into the sparse-checkout file.
> > However, in cone mode we expect a list of paths to directories and then
> > we convert those into patterns.
> 
> Is this really about cone mode? It seems more like it is about --stdin.
> I.e., what are the rules for when the input is a filename and when it is
> a pattern? In our earlier discussion, I had assumed that command-line
> arguments to "sparse-checkout set" were actual filenames, and "--stdin"
> just read them from stdin.
> 
> But looking at the documentation, they are always called "patterns" on
> the command-line. Should the "--stdin" documentation make it clear that
> we are no longer taking patterns, but instead actual filenames?
> 
> Or am I confused, and in non-cone-mode the "ls-tree | sparse-checkout"
> pipeline is not supposed to work at all? (I.e., they really are always
> patterns)?

Hmph, sorry, I _was_ just confused. I was reading a copy of the manpage
without your final patch, which made things much clearer.

So OK, I think the resulting documentation does make things clear. And
this is just about cone mode, not --stdin. Please ignore my ramblings in
the rest of the replied-to message. But...

> > Even more specifically, the goal is to always allow the following from
> > the root of a repo:
> > 
> >   git ls-tree --name-only -d HEAD | git sparse-checkout set --stdin
> > 
> > The ls-tree command provides directory names with an unescaped asterisk.
> > It also quotes the directories that contain an escaped backslash. We
> > must remove these quotes, then keep the escaped backslashes.
> > 
> > However, there is some care needed for the timing of these escapes. The
> > in-memory pattern list is used to update the working directory before
> > writing the patterns to disk. Thus, we need the command to have the
> > unescaped names in the hashsets for the cone comparisons, then escape
> > the patterns later.
> 
> OK, this part make sense.

You could also demonstrate this even without --stdin with something
like:

  git config core.sparsecheckoutcone true
  git sparse-checkout set 'foo*bar'

which should take that as a literal filename and put the pattern
'foo\*bar' in the sparse-checkout file. And your tests do cover that.

So really there are two separate bugs here, and it might be a little
easier to explain the "timing of these escapes" thing by doing them
separately. I.e., the case above needs escaping and we could demonstrate
the bug with a command-line "set".  And then follow up by fixing the
problem with correctly de-quoting --stdin.

> > +static char *escaped_pattern(char *pattern)
> [...]
> Do we need to catch other metacharacters here (using is_glob_special()
> perhaps)?

After de-confusing myself, I think the individual code comments I wrote
still apply though (especially this one).

-Peff

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 1/29/2020 5:33 AM, Jeff King wrote:
> On Wed, Jan 29, 2020 at 05:17:13AM -0500, Jeff King wrote:
> 
>>> From: Derrick Stolee <[email protected]>
>>>
>>> If a user somehow creates a directory with an asterisk (*) or backslash
>>> (\), then the "git sparse-checkout set" command will struggle to provide
>>> the correct pattern in the sparse-checkout file. When not in cone mode,
>>> the provided pattern is written directly into the sparse-checkout file.
>>> However, in cone mode we expect a list of paths to directories and then
>>> we convert those into patterns.
>>
>> Is this really about cone mode? It seems more like it is about --stdin.
>> I.e., what are the rules for when the input is a filename and when it is
>> a pattern? In our earlier discussion, I had assumed that command-line
>> arguments to "sparse-checkout set" were actual filenames, and "--stdin"
>> just read them from stdin.
>>
>> But looking at the documentation, they are always called "patterns" on
>> the command-line. Should the "--stdin" documentation make it clear that
>> we are no longer taking patterns, but instead actual filenames?
>>
>> Or am I confused, and in non-cone-mode the "ls-tree | sparse-checkout"
>> pipeline is not supposed to work at all? (I.e., they really are always
>> patterns)?
> 
> Hmph, sorry, I _was_ just confused. I was reading a copy of the manpage
> without your final patch, which made things much clearer.
> 
> So OK, I think the resulting documentation does make things clear. And
> this is just about cone mode, not --stdin. Please ignore my ramblings in
> the rest of the replied-to message. But...
> 
>>> Even more specifically, the goal is to always allow the following from
>>> the root of a repo:
>>>
>>>   git ls-tree --name-only -d HEAD | git sparse-checkout set --stdin
>>>
>>> The ls-tree command provides directory names with an unescaped asterisk.
>>> It also quotes the directories that contain an escaped backslash. We
>>> must remove these quotes, then keep the escaped backslashes.
>>>
>>> However, there is some care needed for the timing of these escapes. The
>>> in-memory pattern list is used to update the working directory before
>>> writing the patterns to disk. Thus, we need the command to have the
>>> unescaped names in the hashsets for the cone comparisons, then escape
>>> the patterns later.
>>
>> OK, this part make sense.
> 
> You could also demonstrate this even without --stdin with something
> like:
> 
>   git config core.sparsecheckoutcone true
>   git sparse-checkout set 'foo*bar'
> 
> which should take that as a literal filename and put the pattern
> 'foo\*bar' in the sparse-checkout file. And your tests do cover that.
> 
> So really there are two separate bugs here, and it might be a little
> easier to explain the "timing of these escapes" thing by doing them
> separately. I.e., the case above needs escaping and we could demonstrate
> the bug with a command-line "set".  And then follow up by fixing the
> problem with correctly de-quoting --stdin.

I've locally split the commit into two parts. That makes things much
simpler to read.

>>> +static char *escaped_pattern(char *pattern)
>> [...]
>> Do we need to catch other metacharacters here (using is_glob_special()
>> perhaps)?
> 
> After de-confusing myself, I think the individual code comments I wrote
> still apply though (especially this one).

I've applied the smaller comments and am now investigating the right
thing to do with other is_glob_special() characters. There is a small
chance that I can replace any "c == '*' || c == '\'" with is_glob_special(),
but we shall see. At the very least, I'll need to expand my tests.

Thanks,
-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 1/29/2020 9:16 AM, Derrick Stolee wrote:
> On 1/29/2020 5:33 AM, Jeff King wrote:
>> On Wed, Jan 29, 2020 at 05:17:13AM -0500, Jeff King wrote:
>>>> +static char *escaped_pattern(char *pattern)
>>> [...]
>>> Do we need to catch other metacharacters here (using is_glob_special()
>>> perhaps)?
>>
>> After de-confusing myself, I think the individual code comments I wrote
>> still apply though (especially this one).
> 
> I've applied the smaller comments and am now investigating the right
> thing to do with other is_glob_special() characters. There is a small
> chance that I can replace any "c == '*' || c == '\'" with is_glob_special(),
> but we shall see. At the very least, I'll need to expand my tests.

I think I have a handle on these cases, and I've pushed it to my GGG PR.
I'll let this version settle a bit for more review before updating it
with a v4.

Thanks,
-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jeff King wrote (reply to this):

On Wed, Jan 29, 2020 at 09:16:11AM -0500, Derrick Stolee wrote:

> I've applied the smaller comments and am now investigating the right
> thing to do with other is_glob_special() characters. There is a small
> chance that I can replace any "c == '*' || c == '\'" with is_glob_special(),
> but we shall see. At the very least, I'll need to expand my tests.

Yeah, that's all I'd expect to need. You mentioned earlier about how
ls-tree would output them, but I don't think it would matter. Now that
you're using unquote_c_style(), you'll get the literal filenames no
matter which way ls-tree decides to quote them (and I don't think it
would quote '?', just as it wouldn't '*', because those are not
syntactically significant in its output).

-Peff

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 1/30/2020 2:29 AM, Jeff King wrote:
> On Wed, Jan 29, 2020 at 09:16:11AM -0500, Derrick Stolee wrote:
> 
>> I've applied the smaller comments and am now investigating the right
>> thing to do with other is_glob_special() characters. There is a small
>> chance that I can replace any "c == '*' || c == '\'" with is_glob_special(),
>> but we shall see. At the very least, I'll need to expand my tests.
> 
> Yeah, that's all I'd expect to need. You mentioned earlier about how
> ls-tree would output them, but I don't think it would matter. Now that
> you're using unquote_c_style(), you'll get the literal filenames no
> matter which way ls-tree decides to quote them (and I don't think it
> would quote '?', just as it wouldn't '*', because those are not
> syntactically significant in its output).

Yes, even this case for 'git ls-tree' gets covered in the final
version of the test:

test_expect_success BSLASHPSPEC 'pattern-checks: escaped characters' '
	git clone repo escaped &&
	TREEOID=$(git -C escaped rev-parse HEAD:folder1) &&
	NEWTREE=$(git -C escaped mktree <<-EOF
	$(git -C escaped ls-tree HEAD)
	040000 tree $TREEOID	zbad\\dir
	040000 tree $TREEOID	zdoes*exist
	040000 tree $TREEOID	zglob[!a]?
	EOF
	) &&
	COMMIT=$(git -C escaped commit-tree $NEWTREE -p HEAD) &&
	git -C escaped reset --hard $COMMIT &&
	check_files escaped "a deep folder1 folder2 zbad\\dir zdoes*exist" zglob[!a]? &&
	git -C escaped sparse-checkout init --cone &&
	git -C escaped sparse-checkout set zbad\\dir/bogus "zdoes*not*exist" "zdoes*exist" "zglob[!a]?" &&
	cat >expect <<-\EOF &&
	/*
	!/*/
	/zbad\\dir/
	!/zbad\\dir/*/
	/zbad\\dir/bogus/
	/zdoes\*exist/
	/zdoes\*not\*exist/
	/zglob\[!a]\?/
	EOF
	test_cmp expect escaped/.git/info/sparse-checkout &&
	check_read_tree_errors escaped "a zbad\\dir zdoes*exist zglob[!a]?" &&
	git -C escaped ls-tree -d --name-only HEAD >list-expect &&
	git -C escaped sparse-checkout set --stdin <list-expect &&
	cat >expect <<-\EOF &&
	/*
	!/*/
	/deep/
	/folder1/
	/folder2/
	/zbad\\dir/
	/zdoes\*exist/
	/zglob\[!a]\?/
	EOF
	test_cmp expect escaped/.git/info/sparse-checkout &&
	check_files escaped "a deep folder1 folder2 zbad\\dir zdoes*exist" zglob[!a]? &&
	git -C escaped sparse-checkout list >list-actual &&
	test_cmp list-expect list-actual
'

Thanks,
-Stolee


Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jeff King wrote (reply to this):

On Tue, Jan 28, 2020 at 06:26:42PM +0000, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <[email protected]>
> 
> When in cone mode, the 'git sparse-checkout list' subcommand lists
> the directories included in the sparse cone. When these directories
> contain odd characters, such as a backslash, then we need to use
> C-style quotes similar to 'git ls-tree'.

Makes sense, and the code looks correct to me.

-Peff

#include "unpack-trees.h"
#include "wt-status.h"
#include "quote.h"

static const char *empty_base = "";

Expand Down Expand Up @@ -77,8 +78,10 @@ static int sparse_checkout_list(int argc, const char **argv)

string_list_sort(&sl);

for (i = 0; i < sl.nr; i++)
printf("%s\n", sl.items[i].string);
for (i = 0; i < sl.nr; i++) {
quote_c_style(sl.items[i].string, NULL, stdout, 0);
printf("\n");
}

return 0;
}
Expand Down Expand Up @@ -140,6 +143,22 @@ static int update_working_directory(struct pattern_list *pl)
return result;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jeff King wrote (reply to this):

On Tue, Jan 14, 2020 at 07:26:02PM +0000, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <[email protected]>
> 
> If a user somehow creates a directory with an asterisk (*) or backslash
> (\), then the "git sparse-checkout set" command will struggle to provide
> the correct pattern in the sparse-checkout file. When not in cone mode,
> the provided pattern is written directly into the sparse-checkout file.
> However, in cone mode we expect a list of paths to directories and then
> we convert those into patterns.
> 
> Even more specifically, the goal is to always allow the following from
> the root of a repo:
> 
>   git ls-tree --name-only -d HEAD | git sparse-checkout set --stdin
> 
> The ls-tree command provides directory names with an unescaped asterisk.
> It also quotes the directories that contain an escaped backslash. We
> must remove these quotes, then keep the escaped backslashes.

Do we need to document these rules somewhere? Naively I'd expect
"--stdin" to take in literal pathnames. But of course it can't represent
a path with a newline. So perhaps it makes sense to take quoted names by
default, and allow literal NUL-separated input with "-z" if anybody
wants it.

-Peff

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 1/14/2020 4:25 PM, Jeff King wrote:
> On Tue, Jan 14, 2020 at 07:26:02PM +0000, Derrick Stolee via GitGitGadget wrote:
> 
>> From: Derrick Stolee <[email protected]>
>>
>> If a user somehow creates a directory with an asterisk (*) or backslash
>> (\), then the "git sparse-checkout set" command will struggle to provide
>> the correct pattern in the sparse-checkout file. When not in cone mode,
>> the provided pattern is written directly into the sparse-checkout file.
>> However, in cone mode we expect a list of paths to directories and then
>> we convert those into patterns.
>>
>> Even more specifically, the goal is to always allow the following from
>> the root of a repo:
>>
>>   git ls-tree --name-only -d HEAD | git sparse-checkout set --stdin
>>
>> The ls-tree command provides directory names with an unescaped asterisk.
>> It also quotes the directories that contain an escaped backslash. We
>> must remove these quotes, then keep the escaped backslashes.
> 
> Do we need to document these rules somewhere? Naively I'd expect
> "--stdin" to take in literal pathnames. But of course it can't represent
> a path with a newline. So perhaps it makes sense to take quoted names by
> default, and allow literal NUL-separated input with "-z" if anybody
> wants it.

This is worth thinking about the right way to describe the rules:

1. You don't _need_ quotes. They happen to come along for the ride in
  'git ls-tree' so it doesn't mess up shell scripts that iterate on
  those entries. At least, that's why I think they are quoted.

2. If you use quotes, the first layer of quotes will be removed.

How much of this needs to be documented explicitly, or how much should
we say "The input format matches what we would expect from 'git ls-tree
--name-only'"?

Thanks,
-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jeff King wrote (reply to this):

On Tue, Jan 14, 2020 at 05:11:03PM -0500, Derrick Stolee wrote:

> > Do we need to document these rules somewhere? Naively I'd expect
> > "--stdin" to take in literal pathnames. But of course it can't represent
> > a path with a newline. So perhaps it makes sense to take quoted names by
> > default, and allow literal NUL-separated input with "-z" if anybody
> > wants it.
> 
> This is worth thinking about the right way to describe the rules:
> 
> 1. You don't _need_ quotes. They happen to come along for the ride in
>   'git ls-tree' so it doesn't mess up shell scripts that iterate on
>   those entries. At least, that's why I think they are quoted.

It's not just shell scripts. Without quoting, the syntax becomes
ambiguous (e.g., imagine a file with a newline in it). So most Git
output that shows a filename will quote it if necessary, unless
NUL separators are being used.

> 2. If you use quotes, the first layer of quotes will be removed.

I take this to mean that anything starting with a double-quote will have
the outer layer removed, and backslash escapes inside expanded. And
anything without a starting double quote (even if it has internal
backslash escapes!) will be taken literally.

That would match how things like "update-index --index-info" work.

As far as implementation, I know you're trying to keep some of the
escaping, but I think it might make more sense to do use
unquote_c_style() to parse the input (see update-index's use for some
prior art), and then re-quote as necessary to put things into the
sparse-checkout file (I guess quoting more than just quote_c_style()
would do, since you need to quote glob metacharacters like '*' and
probably "!"). But as much as possible, I think you'd want literal
strings inside the program, and just quoting/unquoting at the edges.

> How much of this needs to be documented explicitly, or how much should
> we say "The input format matches what we would expect from 'git ls-tree
> --name-only'"?

I think it's fine to say that, and maybe call attention to the quoting.
Like:

  The input format matches the output of `git ls-tree --name-only`. This
  includes interpreting pathnames that begin with a double quote (") as
  C-style quoted strings.

Disappointingly, update-index does not seem to explain the rules
anywhere. fast-import does cover it. Maybe it's something that ought to
be hoisted out into gitcli(7) or similar (or maybe it has been and I
just can't find it).

-Peff

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 1/14/2020 5:48 PM, Jeff King wrote:
> On Tue, Jan 14, 2020 at 05:11:03PM -0500, Derrick Stolee wrote:
> 
>>> Do we need to document these rules somewhere? Naively I'd expect
>>> "--stdin" to take in literal pathnames. But of course it can't represent
>>> a path with a newline. So perhaps it makes sense to take quoted names by
>>> default, and allow literal NUL-separated input with "-z" if anybody
>>> wants it.
>>
>> This is worth thinking about the right way to describe the rules:
>>
>> 1. You don't _need_ quotes. They happen to come along for the ride in
>>   'git ls-tree' so it doesn't mess up shell scripts that iterate on
>>   those entries. At least, that's why I think they are quoted.
> 
> It's not just shell scripts. Without quoting, the syntax becomes
> ambiguous (e.g., imagine a file with a newline in it). So most Git
> output that shows a filename will quote it if necessary, unless
> NUL separators are being used.

Good to know.

>> 2. If you use quotes, the first layer of quotes will be removed.
> 
> I take this to mean that anything starting with a double-quote will have
> the outer layer removed, and backslash escapes inside expanded. And
> anything without a starting double quote (even if it has internal
> backslash escapes!) will be taken literally.

Hm. Perhaps you are right! The ls-tree output for the test example
is:

	deep
	folder1
	folder2
	"zbad\\dir"
	zdoes*exist

so the "zdoes*exist" value is not escaped. I believe the current
logic does something extra: consider supplying this input to
'git sparse-checkout set --stdin':

	deep
	folder1
	folder2
	"zbad\\dir"
	zdoes\*exist

then should we un-escape "\*" to "*"? Or is this not a valid input
because a backslash should have been quoted into C-style quotes?

The behavior in the current series allows this output that would
never be written by "git ls-tree".

> That would match how things like "update-index --index-info" work.
> 
> As far as implementation, I know you're trying to keep some of the
> escaping, but I think it might make more sense to do use
> unquote_c_style() to parse the input (see update-index's use for some
> prior art), and then re-quote as necessary to put things into the
> sparse-checkout file (I guess quoting more than just quote_c_style()
> would do, since you need to quote glob metacharacters like '*' and
> probably "!"). But as much as possible, I think you'd want literal
> strings inside the program, and just quoting/unquoting at the edges.

I was playing around with this, and I think that quote_c_style() is
necessary for the output, but we have a strange in-memory situation
for the other escaping: we both fill the hashsets with the un-escaped
data and fill the pattern list with the escaped patterns.

I'll add a commit with the quote_c_style() calls during the 'list'
subcommand.

>> How much of this needs to be documented explicitly, or how much should
>> we say "The input format matches what we would expect from 'git ls-tree
>> --name-only'"?
> 
> I think it's fine to say that, and maybe call attention to the quoting.
> Like:
> 
>   The input format matches the output of `git ls-tree --name-only`. This
>   includes interpreting pathnames that begin with a double quote (") as
>   C-style quoted strings.
> 
> Disappointingly, update-index does not seem to explain the rules
> anywhere. fast-import does cover it. Maybe it's something that ought to
> be hoisted out into gitcli(7) or similar (or maybe it has been and I
> just can't find it).

I'll start the process by using your recommended language. I noticed
also that the 'set' command doesn't actually document what happens
when in cone mode, so I will correct that, too.

Thanks,
-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jeff King wrote (reply to this):

On Fri, Jan 24, 2020 at 04:10:21PM -0500, Derrick Stolee wrote:

> Hm. Perhaps you are right! The ls-tree output for the test example
> is:
> 
> 	deep
> 	folder1
> 	folder2
> 	"zbad\\dir"
> 	zdoes*exist
> 
> so the "zdoes*exist" value is not escaped. I believe the current
> logic does something extra: consider supplying this input to
> 'git sparse-checkout set --stdin':
> 
> 	deep
> 	folder1
> 	folder2
> 	"zbad\\dir"
> 	zdoes\*exist
> 
> then should we un-escape "\*" to "*"? Or is this not a valid input
> because a backslash should have been quoted into C-style quotes?

I'd think we should not un-escape anything, because we weren't told that
this was a C-style quoted string by the presence of an opening
double-quote. And that's how, say, update-index behaves:

  $ blob=$(echo foo | git hash-object -w --stdin)
  $ printf '100644 %s\t%s\n' \
      $blob 'just*asterisk' \
      $blob 'backslash\without\quotes' \
      $blob '"backslash\\with\\quotes"' |
    git update-index --index-info

which results in:

  $ git ls-files
  "backslash\\with\\quotes"
  "backslash\\without\\quotes"
  just*asterisk

  [same, but without quoting]
  $ git ls-files -z | tr '\0' '\n'
  backslash\with\quotes
  backslash\without\quotes
  just*asterisk

> The behavior in the current series allows this output that would
> never be written by "git ls-tree".

Yes, I think we'd never write that, because ls-tree would quote anything
with a backslash in it, even though it's not strictly necessary. But it
would be valid input to specify a file that has backslashes but not
double-quotes, and I think sparse-checkout should be changed to match
update-index here.

> I was playing around with this, and I think that quote_c_style() is
> necessary for the output, but we have a strange in-memory situation
> for the other escaping: we both fill the hashsets with the un-escaped
> data and fill the pattern list with the escaped patterns.

Yeah, but I think that the syntactic escaping on input might not have
identical rules to the escaping needed for the patterns.

So it makes sense to me to handle input as a separate mechanism, get a
pristine copy of what the user was trying to communicate to us, and then
re-escape whatever we need to put into the pattern list.

And ultimately the flow would be something like:

  - read input
    - if argument is from command-line, use it verbatim
    - else if reading stdin with "-z", use it verbatim
    - else if line starts with double-quote, unquote_c_style()
    - else use line verbatim
    - the result is a single pristine filename
  - fill hashset with pristine filenames
  - generate pattern list to write to sparse file, escaping filenames as
    necessary according to sparse-pattern rules

Obviously you don't have a "-z" yet, but I think it's something we'd
probably want in the long run. And anything coming from the command-line
shouldn't need quoting to get it to us either (and so we'd need to
escape before writing to the sparse file).

-Peff

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 1/24/2020 4:42 PM, Jeff King wrote:
> And ultimately the flow would be something like:
> 
>   - read input
>     - if argument is from command-line, use it verbatim
>     - else if reading stdin with "-z", use it verbatim
>     - else if line starts with double-quote, unquote_c_style()
>     - else use line verbatim
>     - the result is a single pristine filename
>   - fill hashset with pristine filenames
>   - generate pattern list to write to sparse file, escaping filenames as
>     necessary according to sparse-pattern rules
> 
> Obviously you don't have a "-z" yet, but I think it's something we'd
> probably want in the long run. And anything coming from the command-line
> shouldn't need quoting to get it to us either (and so we'd need to
> escape before writing to the sparse file).

This recommendation came async with my v2, so I'll follow shortly with
a v3 that uses this flow. I have something that I think works, after
slightly adapting my tests, but now I need to make sure that all the
patches still make sense and build cleanly.

Thanks,
-Stolee

}

static char *escaped_pattern(char *pattern)
{
char *p = pattern;
struct strbuf final = STRBUF_INIT;

while (*p) {
if (is_glob_special(*p))
strbuf_addch(&final, '\\');

strbuf_addch(&final, *p);
p++;
}

return strbuf_detach(&final, NULL);
}

static void write_cone_to_file(FILE *fp, struct pattern_list *pl)
{
int i;
Expand All @@ -164,10 +183,11 @@ static void write_cone_to_file(FILE *fp, struct pattern_list *pl)
fprintf(fp, "/*\n!/*/\n");

for (i = 0; i < sl.nr; i++) {
char *pattern = sl.items[i].string;
char *pattern = escaped_pattern(sl.items[i].string);

if (strlen(pattern))
fprintf(fp, "%s/\n!%s/*/\n", pattern, pattern);
free(pattern);
}

string_list_clear(&sl, 0);
Expand All @@ -185,8 +205,9 @@ static void write_cone_to_file(FILE *fp, struct pattern_list *pl)
string_list_remove_duplicates(&sl, 0);

for (i = 0; i < sl.nr; i++) {
char *pattern = sl.items[i].string;
char *pattern = escaped_pattern(sl.items[i].string);
fprintf(fp, "%s/\n", pattern);
free(pattern);
}
}

Expand All @@ -199,6 +220,10 @@ static int write_patterns_and_update(struct pattern_list *pl)
int result;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Derrick Stolee via GitGitGadget" <[email protected]> writes:

> From: Derrick Stolee <[email protected]>
>
> The 'git init' command creates the ".git/info" directory and fills it
> with some default files. However, 'git worktree add' does not create
> the info directory for that worktree. This causes a problem when running
> "git sparse-checkout init" inside a worktree. While care was taken to
> allow the sparse-checkout config to be specific to a worktree, this
> initialization was untested.
>
> Safely create the leading directories for the sparse-checkout file. This
> is the safest thing to do even without worktrees, as a user could delete
> their ".git/info" directory and expect Git to recover safely.
>
> Signed-off-by: Derrick Stolee <[email protected]>
> ---
>  builtin/sparse-checkout.c          |  4 ++++
>  t/t1091-sparse-checkout-builtin.sh | 10 ++++++++++
>  2 files changed, 14 insertions(+)
>
> diff --git a/builtin/sparse-checkout.c b/builtin/sparse-checkout.c
> index b3bed891cb..3cee8ab46e 100644
> --- a/builtin/sparse-checkout.c
> +++ b/builtin/sparse-checkout.c
> @@ -199,6 +199,10 @@ static int write_patterns_and_update(struct pattern_list *pl)
>  	int result;
>  
>  	sparse_filename = get_sparse_checkout_filename();
> +
> +	if (safe_create_leading_directories(sparse_filename))
> +		die(_("failed to create directory for sparse-checkout file"));
> +

The use of safe_create_leading_directories() here, which uses
adjust_shared_perm(), is the right thing to do.

Looks good.

> diff --git a/t/t1091-sparse-checkout-builtin.sh b/t/t1091-sparse-checkout-builtin.sh
> index 20caefe155..37365dc668 100755
> --- a/t/t1091-sparse-checkout-builtin.sh
> +++ b/t/t1091-sparse-checkout-builtin.sh
> @@ -295,4 +295,14 @@ test_expect_success 'interaction with submodules' '
>  	check_files super/modules/child "a deep folder1 folder2"
>  '
>  
> +test_expect_success 'different sparse-checkouts with worktrees' '
> +	git -C repo worktree add --detach ../worktree &&
> +	check_files worktree "a deep folder1 folder2" &&
> +	git -C worktree sparse-checkout init --cone &&
> +	git -C repo sparse-checkout set folder1 &&
> +	git -C worktree sparse-checkout set deep/deeper1 &&
> +	check_files repo "a folder1" &&
> +	check_files worktree "a deep"
> +'
> +
>  test_done


sparse_filename = get_sparse_checkout_filename();

if (safe_create_leading_directories(sparse_filename))
die(_("failed to create directory for sparse-checkout file"));

fd = hold_lock_file_for_update(&lk, sparse_filename,
LOCK_DIE_ON_ERROR);

Expand Down Expand Up @@ -419,8 +444,21 @@ static int sparse_checkout_set(int argc, const char **argv, const char *prefix)
pl.use_cone_patterns = 1;

if (set_opts.use_stdin) {
while (!strbuf_getline(&line, stdin))
struct strbuf unquoted = STRBUF_INIT;
while (!strbuf_getline(&line, stdin)) {
if (line.buf[0] == '"') {
strbuf_reset(&unquoted);
if (unquote_c_style(&unquoted, line.buf, NULL))
die(_("unable to unquote C-style string '%s'"),
line.buf);

strbuf_swap(&unquoted, &line);
}

strbuf_to_cone_pattern(&line, &pl);
}

strbuf_release(&unquoted);
} else {
for (i = 0; i < argc; i++) {
strbuf_setlen(&line, 0);
Expand Down
79 changes: 75 additions & 4 deletions dir.c
Original file line number Diff line number Diff line change
Expand Up @@ -630,11 +630,42 @@ int pl_hashmap_cmp(const void *unused_cmp_data,
return strncmp(ee1->pattern, ee2->pattern, min_len);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jeff King wrote (reply to this):

On Tue, Jan 14, 2020 at 07:26:01PM +0000, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <[email protected]>
> 
> In cone mode, the sparse-checkout feature uses hashset containment
> queries to match paths. Make this algorithm respect escaped asterisk
> (*) and backslash (\) characters.
> 
> Create dup_and_filter_pattern() method to convert a pattern by
> removing escape characters and dropping an optional "/*" at the end.
> This method is available in dir.h as we will use it in
> builtin/sparse-chekcout.c in a later change.

s/chekcout/checkout/

It took me a minute to understand the problem here, but I think it's: if
a path in the sparse-checkout file has "\*" in it, we'd try to match a
literal "\*" in the hash, not "*"?

But we wouldn't run into that yet because we don't properly _write_ the
escaped names until patch 8.

Is that right?

-Peff

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 1/14/2020 4:21 PM, Jeff King wrote:
> On Tue, Jan 14, 2020 at 07:26:01PM +0000, Derrick Stolee via GitGitGadget wrote:
> 
>> From: Derrick Stolee <[email protected]>
>>
>> In cone mode, the sparse-checkout feature uses hashset containment
>> queries to match paths. Make this algorithm respect escaped asterisk
>> (*) and backslash (\) characters.
>>
>> Create dup_and_filter_pattern() method to convert a pattern by
>> removing escape characters and dropping an optional "/*" at the end.
>> This method is available in dir.h as we will use it in
>> builtin/sparse-chekcout.c in a later change.
> 
> s/chekcout/checkout/

Thanks.

> It took me a minute to understand the problem here, but I think it's: if
> a path in the sparse-checkout file has "\*" in it, we'd try to match a
> literal "\*" in the hash, not "*"?

Yes, the hashset would have the string "\*" instead of the string "*". This
would lead to missing directories when cone mode is enabled compared to
cone mode not being enabled.
 
> But we wouldn't run into that yet because we don't properly _write_ the
> escaped names until patch 8.

We wouldn't run into it when using the builtin, but also a user could
edit their sparse-checkout file manually OR figure out how to get the
"right" pattern by running "git sparse-checkout set "my\\*dir" (where the
escaped backslash is collapsed by the shell and Git sees "my\*dir".

Thanks,
-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jeff King wrote (reply to this):

On Tue, Jan 28, 2020 at 06:26:40PM +0000, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <[email protected]>
> 
> In cone mode, the sparse-checkout feature uses hashset containment
> queries to match paths. Make this algorithm respect escaped asterisk
> (*) and backslash (\) characters.

Do we also need to worry about other glob metacharacters? E.g., "?" or
ranges like "[A-Z]"?

> +static char *dup_and_filter_pattern(const char *pattern)
> +{
> +	char *set, *read;
> +	char *result = xstrdup(pattern);
> +
> +	set = result;
> +	read = result;
> +
> +	while (*read) {
> +		/* skip escape characters (once) */
> +		if (*read == '\\')
> +			read++;
> +
> +		*set = *read;
> +
> +		set++;
> +		read++;
> +	}
> +	*set = 0;
> +
> +	if (*(read - 2) == '/' && *(read - 1) == '*')
> +		*(read - 2) = 0;
> +
> +	return result;
> +}

Do we need to check that the pattern is longer than 1 character here? If
it's a single character, it seems like this "*(read - 2)" will
dereference the byte before the string.

-Peff

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 1/29/2020 5:03 AM, Jeff King wrote:
> On Tue, Jan 28, 2020 at 06:26:40PM +0000, Derrick Stolee via GitGitGadget wrote:
> 
>> From: Derrick Stolee <[email protected]>
>>
>> In cone mode, the sparse-checkout feature uses hashset containment
>> queries to match paths. Make this algorithm respect escaped asterisk
>> (*) and backslash (\) characters.
> 
> Do we also need to worry about other glob metacharacters? E.g., "?" or
> ranges like "[A-Z]"?

These are not part of the .gitignore patterns [1].

[1] https://git-scm.com/docs/gitignore#_pattern_format

>> +static char *dup_and_filter_pattern(const char *pattern)
>> +{
>> +	char *set, *read;
>> +	char *result = xstrdup(pattern);
>> +
>> +	set = result;
>> +	read = result;
>> +
>> +	while (*read) {
>> +		/* skip escape characters (once) */
>> +		if (*read == '\\')
>> +			read++;
>> +
>> +		*set = *read;
>> +
>> +		set++;
>> +		read++;
>> +	}
>> +	*set = 0;
>> +
>> +	if (*(read - 2) == '/' && *(read - 1) == '*')
>> +		*(read - 2) = 0;
>> +
>> +	return result;
>> +}
> 
> Do we need to check that the pattern is longer than 1 character here? If
> it's a single character, it seems like this "*(read - 2)" will
> dereference the byte before the string.

This method is only called by add_pattern_to_hashsets(), which
has a guard against paths of length less than 2, but thats' no
excuse for dangerous pointer arithmetic here.

But you also point out an even more confusing thing: why are we
modifying based on the 'read' pointer, and not the 'set' pointer?
This seems to work _accidentally_ only when the pattern has "<something>/*"
and "<something>" has no escape characters.

I had to recall exactly why we are dropping this "/*", but it's because
the pattern _actually_ ends with "/*/" but the in-memory pattern has
already dropped that last slash and applied PATTERN_FLAG_MUSTBEDIR.

Here is a diff that I can apply to this patch to fix this problem
_and_ demonstrate it in the tests:

diff --git a/dir.c b/dir.c
index 579f274d13..277577c8bf 100644
--- a/dir.c
+++ b/dir.c
@@ -633,6 +633,7 @@ int pl_hashmap_cmp(const void *unused_cmp_data,
 static char *dup_and_filter_pattern(const char *pattern)
 {
        char *set, *read;
+       size_t count  = 0;
        char *result = xstrdup(pattern);
 
        set = result;
@@ -647,11 +648,14 @@ static char *dup_and_filter_pattern(const char *pattern)
 
                set++;
                read++;
+               count++;
        }
        *set = 0;
 
-       if (*(read - 2) == '/' && *(read - 1) == '*')
-               *(read - 2) = 0;
+       if (count > 2 &&
+           *(set - 1) == '*' &&
+           *(set - 2) == '/')
+               *(set - 2) = 0;
 
        return result;
 }
diff --git a/t/t1091-sparse-checkout-builtin.sh b/t/t1091-sparse-checkout-builtin.sh
index 0a21a5e15d..20b0465f77 100755
--- a/t/t1091-sparse-checkout-builtin.sh
+++ b/t/t1091-sparse-checkout-builtin.sh
@@ -383,6 +383,7 @@ test_expect_success BSLASHPSPEC 'pattern-checks: escaped "*"' '
        /*
        !/*/
        /zbad\\dir/
+       !/zbad\\dir/*/
        /zdoes\*not\*exist/
        /zdoes\*exist/
        EOF

With this extra line in the test, but compiling the old version of this patch,
the test fails with:

'err' is not empty, it contains:
+ cat err
warning: unrecognized negative pattern: '/zbad\\dir/*'
warning: disabling cone pattern matching

To ensure this negative pattern exists in the later patch where we set
the patterns using the builtin, I'll add "zbad\\dir/bogus" to the list
of directories to include, which will add another pattern to the set.

Thanks,
-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 1/29/2020 8:58 AM, Derrick Stolee wrote:
> On 1/29/2020 5:03 AM, Jeff King wrote:
>> On Tue, Jan 28, 2020 at 06:26:40PM +0000, Derrick Stolee via GitGitGadget wrote:
>>
>>> From: Derrick Stolee <[email protected]>
>>>
>>> In cone mode, the sparse-checkout feature uses hashset containment
>>> queries to match paths. Make this algorithm respect escaped asterisk
>>> (*) and backslash (\) characters.
>>
>> Do we also need to worry about other glob metacharacters? E.g., "?" or
>> ranges like "[A-Z]"?
> 
> These are not part of the .gitignore patterns [1].
> 
> [1] https://git-scm.com/docs/gitignore#_pattern_format

I should read things more carefully. There is also this information in
one of the bullets:

	An asterisk "*" matches anything except a slash. The character
	"?" matches any one character except "/". The range notation,
	e.g. [a-zA-Z], can be used to match one of the characters in a range.
	See fnmatch(3) and the FNM_PATHNAME flag for a more detailed
	description.

So this series does not attempt to properly work with globs, and I'll
need to test those a bit. Certainly they shouldn't work in cone mode,
so an extra patch to remove those would be simple. Input sanitizing
would be interesting, and I'll see what `git ls-tree` would output
with paths containing these characters.

-Stolee

}

static char *dup_and_filter_pattern(const char *pattern)
{
char *set, *read;
size_t count = 0;
char *result = xstrdup(pattern);

set = result;
read = result;

while (*read) {
/* skip escape characters (once) */
if (*read == '\\')
read++;

*set = *read;

set++;
read++;
count++;
}
*set = 0;

if (count > 2 &&
*(set - 1) == '*' &&
*(set - 2) == '/')
*(set - 2) = 0;

return result;
}

static void add_pattern_to_hashsets(struct pattern_list *pl, struct path_pattern *given)
{
struct pattern_entry *translated;
char *truncated;
char *data = NULL;
const char *prev, *cur, *next;

if (!pl->use_cone_patterns)
return;
Expand All @@ -651,17 +682,57 @@ static void add_pattern_to_hashsets(struct pattern_list *pl, struct path_pattern
return;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jeff King wrote (reply to this):

On Tue, Jan 14, 2020 at 07:25:58PM +0000, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <[email protected]>
> 
> When core.sparseCheckoutCone is enabled, the 'git sparse-checkout set'
> command creates a restricted set of possible patterns that are used
> by a custom algorithm to quickly match those patterns.
> 
> If a user manually edits the sparse-checkout file, then they could
> create patterns that do not match these expectations. The cone-mode
> matching algorithm can return incorrect results. The solution is to
> detect these incorrect patterns, warn that we do not recognize them,
> and revert to the standard algorithm.
> 
> Check each pattern for the "**" substring, and revert to the old
> logic if seen. While technically a "/<dir>/**" pattern matches
> the meaning of "/<dir>/", it is not one that would be written by
> the sparse-checkout builtin in cone mode. Attempting to accept that
> pattern change complicates the logic and instead we punt and do
> not accept any instance of "**".

That all makes sense.

> diff --git a/dir.c b/dir.c
> index 22d08e61c2..f8e350dda2 100644
> --- a/dir.c
> +++ b/dir.c
> @@ -651,6 +651,13 @@ static void add_pattern_to_hashsets(struct pattern_list *pl, struct path_pattern
>  		return;
>  	}
>  
> +	if (strstr(given->pattern, "**")) {
> +		/* Not a cone pattern. */
> +		pl->use_cone_patterns = 0;
> +		warning(_("unrecognized pattern: '%s'"), given->pattern);
> +		goto clear_hashmaps;
> +	}

The clear_hashmaps label already unsets pl->use_cone_patterns, so the
first line is redundant (the same is true of existing goto jumps, as
well, though).

I wondered whether this warning could be triggered accidentally by
somebody who just happened to add such a pattern. But we'd exit
immediately from add_pattern_to_hashsets() immediately unless the user
has set core.sparseCheckoutCone. And if that's set, then warning is
definitely the right thing to do.

-Peff

}

if (given->patternlen <= 2 ||
*given->pattern == '*' ||
strstr(given->pattern, "**")) {
/* Not a cone pattern. */
warning(_("unrecognized pattern: '%s'"), given->pattern);
goto clear_hashmaps;
}

prev = given->pattern;
cur = given->pattern + 1;
next = given->pattern + 2;

while (*cur) {
/* Watch for glob characters '*', '\', '[', '?' */
if (!is_glob_special(*cur))
goto increment;

/* But only if *prev != '\\' */
if (*prev == '\\')
goto increment;

/* But allow the initial '\' */
if (*cur == '\\' &&
is_glob_special(*next))
goto increment;

/* But a trailing '/' then '*' is fine */
if (*prev == '/' &&
*cur == '*' &&
*next == 0)
goto increment;

/* Not a cone pattern. */
warning(_("unrecognized pattern: '%s'"), given->pattern);
goto clear_hashmaps;

increment:
prev++;
cur++;
next++;
}

if (given->patternlen > 2 &&
!strcmp(given->pattern + given->patternlen - 2, "/*")) {
if (!(given->flags & PATTERN_FLAG_NEGATIVE)) {
/* Not a cone pattern. */
pl->use_cone_patterns = 0;
warning(_("unrecognized pattern: '%s'"), given->pattern);
goto clear_hashmaps;
}

truncated = xstrdup(given->pattern);
truncated[given->patternlen - 2] = 0;
truncated = dup_and_filter_pattern(given->pattern);

translated = xmalloc(sizeof(struct pattern_entry));
translated->pattern = truncated;
Expand Down Expand Up @@ -695,7 +766,7 @@ static void add_pattern_to_hashsets(struct pattern_list *pl, struct path_pattern

translated = xmalloc(sizeof(struct pattern_entry));

translated->pattern = xstrdup(given->pattern);
translated->pattern = dup_and_filter_pattern(given->pattern);
translated->patternlen = given->patternlen;
hashmap_entry_init(&translated->ent,
ignore_case ?
Expand Down
Loading