Skip to content

Create 'expire' and 'repack' verbs for git-multi-pack-index #261

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

derrickstolee
Copy link

The multi-pack-index provides a fast way to find an object among a large list of pack-files. It stores a single pack-reference for each object id, so duplicate objects are ignored. Among a list of pack-files storing the same object, the most-recently modified one is used.

Create new subcommands for the multi-pack-index builtin.

  • 'git multi-pack-index expire': If we have a pack-file indexed by the multi-pack-index, but all objects in that pack are duplicated in more-recently modified packs, then delete that pack (and any others like it). Delete the reference to that pack in the multi-pack-index.

  • 'git multi-pack-index repack --batch-size=': Starting from the oldest pack-files covered by the multi-pack-index, find those whose "expected size" is below the batch size until we have a collection of packs whose expected sizes add up to the batch size. We compute the expected size by multiplying the number of referenced objects by the pack-size and dividing by the total number of objects in the pack. If the batch-size is zero, then select all packs. Create a new pack containing all objects that the multi-pack-index references to those packs.

This allows us to create a new pattern for repacking objects: run 'repack'. After enough time has passed that all Git commands that started before the last 'repack' are finished, run 'expire' again. This approach has some advantages over the existing "repack everything" model:

  1. Incremental. We can repack a small batch of objects at a time, instead of repacking all reachable objects. We can also limit ourselves to the objects that do not appear in newer pack-files.

  2. Highly Available. By adding a new pack-file (and not deleting the old pack-files) we do not interrupt concurrent Git commands, and do not suffer performance degradation. By expiring only pack-files that have no referenced objects, we know that Git commands that are doing normal object lookups* will not be interrupted.

  • Note: if someone concurrently runs a Git command that uses get_all_packs(), then that command could try to read the pack-files and pack-indexes that we are deleting during an expire command. Such commands are usually related to object maintenance (i.e. fsck, gc, pack-objects) or are related to less-often-used features (i.e. fast-import, http-backend, server-info).

We are using this approach in VFS for Git to do background maintenance of the "shared object cache" which is a Git alternate directory filled with packfiles containing commits and trees. We currently download pack-files on an hourly basis to keep up-to-date with the central server. The cache servers supply packs on an hourly and daily basis, so most of the hourly packs become useless after a new daily pack is downloaded. The 'expire' command would clear out most of those packs, but many will still remain with fewer than 100 objects remaining. The 'repack' command (with a batch size of 1-3gb, probably) can condense the remaining packs in commands that run for 1-3 min at a time. Since the daily packs range from 100-250mb, we will also combine and condense those packs.

This series is the same as v6 of an earlier series [1].

Thanks,
-Stolee

[1] https://public-inbox.org/git/[email protected]/T/#u

Cc: [email protected], [email protected], [email protected], [email protected], [email protected]

The repack builtin deletes redundant pack-files and their
associated .idx, .promisor, .bitmap, and .keep files. We will want
to re-use this logic in the future for other types of repack, so
pull the logic into 'unlink_pack_path()' in packfile.c.

The 'ignore_keep' parameter is enabled for the use in repack, but
will be important for a future caller.

Signed-off-by: Derrick Stolee <[email protected]>
We will add new subcommands to the multi-pack-index, and that will
make the documentation a bit messier. Clean up the 'verb'
descriptions by renaming the concept to 'subcommand' and removing
the reference to the object directory.

Helped-by: Stefan Beller <[email protected]>
Helped-by: Szeder Gábor <[email protected]>
Signed-off-by: Derrick Stolee <[email protected]>
The multi-pack-index tracks objects in a collection of pack-files.
Only one copy of each object is indexed, using the modified time
of the pack-files to determine tie-breakers. It is possible to
have a pack-file with no referenced objects because all objects
have a duplicate in a newer pack-file.

Introduce a new 'expire' subcommand to the multi-pack-index builtin.
This subcommand will delete these unused pack-files and rewrite the
multi-pack-index to no longer refer to those files. More details
about the specifics will follow as the method is implemented.

Add a test that verifies the 'expire' subcommand is correctly wired,
but will still be valid when the verb is implemented. Specifically,
create a set of packs that should all have referenced objects and
should not be removed during an 'expire' operation. The packs are
created carefully to ensure they have a specific order when sorted
by size. This will be important in a later test.

Signed-off-by: Derrick Stolee <[email protected]>
Before writing the multi-pack-index, we compute the length of the
pack-index names concatenated together. This forms the data in the
pack name chunk, and we precompute it to compute chunk offsets.
The value is also modified to fit alignment needs.

Previously, this computation was coupled with adding packs from
the existing multi-pack-index and the remaining packs in the object
dir not already covered by the multi-pack-index.

In anticipation of this becoming more complicated with the 'expire'
subcommand, simplify the computation by centralizing it to a single
loop before writing the file.

Signed-off-by: Derrick Stolee <[email protected]>
In anticipation of the expire subcommand, refactor the way we sort
the packfiles by name. This will greatly simplify our approach to
dropping expired packs from the list.

First, create 'struct pack_info' to replace 'struct pack_pair'.
This struct contains the necessary information about a pack,
including its name, a pointer to its packfile struct (if not
already in the multi-pack-index), and the original pack-int-id.

Second, track the pack information using an array of pack_info
structs in the pack_list struct. This simplifies the logic around
the multiple arrays we were tracking in that struct.

Finally, update get_sorted_entries() to not permute the pack-int-id
and instead supply the permutation to write_midx_object_offsets().
This requires sorting the packs after get_sorted_entries().

Signed-off-by: Derrick Stolee <[email protected]>
The 'git multi-pack-index expire' subcommand looks at the existing
mult-pack-index, counts the number of objects referenced in each
pack-file, deletes the pack-fils with no referenced objects, and
rewrites the multi-pack-index to no longer reference those packs.

Refactor the write_midx_file() method to call write_midx_internal()
which now takes an existing 'struct multi_pack_index' and a list
of pack-files to drop (as specified by the names of their pack-
indexes). As we write the new multi-pack-index, we drop those
file names from the list of known pack-files.

The expire_midx_packs() method removes the unreferenced pack-files
after carefully closing the packs to avoid open handles.

Test that a new pack-file that covers the contents of two other
pack-files leads to those pack-files being deleted during the
expire subcommand. Be sure to read the multi-pack-index to ensure
it no longer references those packs.

Signed-off-by: Derrick Stolee <[email protected]>
In an environment where the multi-pack-index is useful, it is due
to many pack-files and an inability to repack the object store
into a single pack-file. However, it is likely that many of these
pack-files are rather small, and could be repacked into a slightly
larger pack-file without too much effort. It may also be important
to ensure the object store is highly available and the repack
operation does not interrupt concurrent git commands.

Introduce a 'repack' subcommand to 'git multi-pack-index' that
takes a '--batch-size' option. The subcommand will inspect the
multi-pack-index for referenced pack-files whose size is smaller
than the batch size, until collecting a list of pack-files whose
sizes sum to larger than the batch size. Then, a new pack-file
will be created containing the objects from those pack-files that
are referenced by the multi-pack-index. The resulting pack is
likely to actually be smaller than the batch size due to
compression and the fact that there may be objects in the pack-
files that have duplicate copies in other pack-files.

The current change introduces the command-line arguments, and we
add a test that ensures we parse these options properly. Since
we specify a small batch size, we will guarantee that future
implementations do not change the list of pack-files.

In addition, we hard-code the modified times of the packs in
the pack directory to ensure the list of packs sorted by modified
time matches the order if sorted by size (ascending). This will
be important in a future test.

Signed-off-by: Derrick Stolee <[email protected]>
To repack with a non-zero batch-size, first sort all pack-files by
their modified time. Second, walk those pack-files from oldest
to newest, compute their expected size, and add the packs to a list
if they are smaller than the given batch-size. Stop when the total
expected size is at least the batch size.

If the batch size is zero, select all packs in the multi-pack-index.

Finally, collect the objects from the multi-pack-index that are in
the selected packs and send them to 'git pack-objects'. Write a new
multi-pack-index that includes the new pack.

Using a batch size of zero is very similar to a standard 'git repack'
command, except that we do not delete the old packs and instead rely
on the new multi-pack-index to prevent new processes from reading the
old packs. This does not disrupt other Git processes that are currently
reading the old packs based on the old multi-pack-index.

While first designing a 'git multi-pack-index repack' operation, I
started by collecting the batches based on the actual size of the
objects instead of the size of the pack-files. This allows repacking
a large pack-file that has very few referencd objects. However, this
came at a significant cost of parsing pack-files instead of simply
reading the multi-pack-index and getting the file information for
the pack-files. The "expected size" version provides similar
behavior, but could skip a pack-file if the average object size is
much larger than the actual size of the referenced objects, or
can create a large pack if the actual size of the referenced objects
is larger than the expected size.

Signed-off-by: Derrick Stolee <[email protected]>
During development of the multi-pack-index expire subcommand, a
version went out that improperly computed the pack order if a new
pack was introduced while other packs were being removed. Part of
the subtlety of the bug involved the new pack being placed before
other packs that already existed in the multi-pack-index.

Add a test to t5319-multi-pack-index.sh that catches this issue.
The test adds new packs that cause another pack to be expired, and
creates new packs that are lexicographically sorted before and
after the existing packs.

Signed-off-by: Derrick Stolee <[email protected]>
The 'git multi-pack-index expire' subcommand may delete packs that
are not needed from the perspective of the multi-pack-index. If
a pack has a .keep file, then we should not delete that pack. Add
a test that ensures we preserve a pack that would otherwise be
expired. First, create a new pack that contains every object in
the repo, then add it to the multi-pack-index. Then create a .keep
file for a pack starting with "a-pack" that was added in the
previous test. Finally, expire and verify that the pack remains
and the other packs were expired.

Signed-off-by: Derrick Stolee <[email protected]>
The 'git multi-pack-index repack' command can take a batch size of
zero, which creates a new pack-file containing all objects in the
multi-pack-index. The first 'repack' command will create one new
pack-file, and an 'expire' command after that will delete the old
pack-files, as they no longer contain any referenced objects in the
multi-pack-index.

We must remove the .keep file that was added in the previous test
in order to expire that pack-file.

Also test that a 'repack' will do nothing if there is only one
pack-file.

Signed-off-by: Derrick Stolee <[email protected]>
@derrickstolee
Copy link
Author

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 10, 2019

Submitted as [email protected]

@@ -34,6 +34,8 @@
#define MIDX_CHUNK_LARGE_OFFSET_WIDTH (sizeof(uint64_t))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Derrick Stolee via GitGitGadget" <[email protected]> writes:

> From: Derrick Stolee <[email protected]>
>
> The 'git multi-pack-index expire' subcommand looks at the existing
> mult-pack-index, counts the number of objects referenced in each

s/mult/&i/

> pack-file, deletes the pack-fils with no referenced objects, and

s/fils/files/

> rewrites the multi-pack-index to no longer reference those packs.

As we do *not* want to expire a packfile that was created after the
last run of "git multi-pack-index write", "... in each pack-file" in
the above paragraph is a bit tricky.  It cannot literally be "in
each packfile", but a subset of all existing packfiles that are
known to the midx we are looking at.

An obvious alternative is to actually update midx file (so it no
longer "looks at the existing multi-pack-index", but "makes sure we
have up-to-date view of the multi-pack-index that covers all packs).

And I actually suspect that you implemented that alternative here.
IOW, "looks at the existing multi-pack-index" in the first paragraph
is misleading.

I wondered if the error behaviour of add_pack_to_midx() that ignores
a pack that cannot be added to midx for whatever reason can cause
issues, but such a pack would not become part of m->packs[] so it
may not have anything count[]ed from it, but it will not be eligible
for expiration, either.  So we'd be safe there, I think.

> Test that a new pack-file that covers the contents of two other
> pack-files leads to those pack-files being deleted during the
> expire subcommand. Be sure to read the multi-pack-index to ensure
> it no longer references those packs.

I am curious to learn how we guarantee that the midx logic chooses
to record the copy of an object in the new single packfile over the
other copy of the same object in one of the two older and presumably
smaller packfiles for all objects in the latter two packfiles,
because they cannot be expired if even one object were used from
there.

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 11, 2019

This branch is now known as ds/midx-expire-repack.

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 11, 2019

This patch series was integrated into pu via git@daf02b8.

@gitgitgadget gitgitgadget bot added the pu label Jun 11, 2019
@gitgitgadget
Copy link

gitgitgadget bot commented Jun 12, 2019

This patch series was integrated into pu via git@3186672.

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 12, 2019

This patch series was integrated into pu via git@d9a1aff.

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 13, 2019

This patch series was integrated into pu via git@8c516ab.

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 13, 2019

This patch series was integrated into pu via git@e36c259.

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 14, 2019

This patch series was integrated into pu via git@4a631e9.

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 18, 2019

This patch series was integrated into pu via git@3979788.

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 20, 2019

This patch series was integrated into pu via git@b6f9228.

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 21, 2019

This patch series was integrated into pu via git@4533731.

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 24, 2019

This patch series was integrated into pu via git@08a2611.

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 26, 2019

This patch series was integrated into pu via git@7ba53de.

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 27, 2019

This patch series was integrated into pu via git@4cbf31f.

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 30, 2019

On the Git mailing list, Johannes Sixt wrote (reply to this):

On my Windows system where the POSIX commands are provided by MSYS2,
I observe this output:

$ ls -l Makefile
-rw-r--r-- 1 Johannes Sixt 197121 101780 Jun 30 09:33 Makefile

Notice the blank in the user name. Obviously, extracting the size
of a file by counting columns won't work. But two tests in t5319
do that. Use the stat command to print just the file size.

Signed-off-by: Johannes Sixt <[email protected]>
---
Notice further that the group ID is presented numerically. The
actual output used in the tests is:

-r--r--r-- 1 Johannes Sixt 197121 2239 Jan  1 01:03 .git/objects/pack/pack-A-40289997979ee9f2d7198a5b4ac9fe14e27cbe4b.pack
-r--r--r-- 1 Johannes Sixt 197121 1385 Jan  1 01:02 .git/objects/pack/pack-B-926e20b0e641003b2f6bcddd6ccf53e7741c8642.pack
-r--r--r-- 1 Johannes Sixt 197121  838 Jan  1 01:01 .git/objects/pack/pack-C-b33c048d7202659249bb0cea7bbc997ce448c75d.pack
-r--r--r-- 1 Johannes Sixt 197121  560 Jan  1 01:00 .git/objects/pack/pack-D-49ca36507a52bbba6d1768e28666eb0f4e5f7d95.pack
-r--r--r-- 1 Johannes Sixt 197121 4587 Jun 30 16:17 .git/objects/pack/pack-E-31ea0966bb15206fbfd1b4c2f5be4b24a03050f5.pack

In both tests, this group ID was passed as --batch-size, but actually
the first of the two tests does not fail.

Now I have to wonder: Why that? The test carefully attempts to extract
the smallest pack size to pass as --batch-size argument. But when it
passes some large number, the test still passes. Is this the intent?

That is, both a small and a large --batch-size satisfy 'does not alter
existing packs', but some medium sized --batch-size satisfy the next
test 'repack creates a new pack'. That is very curious.

Disclaimer: I've no clue about this multi-pack-index thing.

 t/t5319-multi-pack-index.sh | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 79bfaeafa9..4b4d06a1c8 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -443,7 +443,7 @@ test_expect_success 'repack with minimum size does not alter existing packs' '
 		touch -m -t 201901010002 .git/objects/pack/pack-B* &&
 		touch -m -t 201901010003 .git/objects/pack/pack-A* &&
 		ls .git/objects/pack >expect &&
-		MINSIZE=$(ls -l .git/objects/pack/*pack | awk "{print \$5;}" | sort -n | head -n 1) &&
+		MINSIZE=$(stat -c %s .git/objects/pack/*pack | sort -n | head -n 1) &&
 		git multi-pack-index repack --batch-size=$MINSIZE &&
 		ls .git/objects/pack >actual &&
 		test_cmp expect actual
@@ -455,7 +455,7 @@ test_expect_success 'repack creates a new pack' '
 		cd dup &&
 		ls .git/objects/pack/*idx >idx-list &&
 		test_line_count = 5 idx-list &&
-		THIRD_SMALLEST_SIZE=$(ls -l .git/objects/pack/*pack | awk "{print \$5;}" | sort -n | head -n 3 | tail -n 1) &&
+		THIRD_SMALLEST_SIZE=$(stat -c %s .git/objects/pack/*pack | sort -n | head -n 3 | tail -n 1) &&
 		BATCH_SIZE=$(($THIRD_SMALLEST_SIZE + 1)) &&
 		git multi-pack-index repack --batch-size=$BATCH_SIZE &&
 		ls .git/objects/pack/*idx >idx-list &&
-- 
2.21.0.285.gc38d92e052

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 30, 2019

On the Git mailing list, Eric Sunshine wrote (reply to this):

On Sun, Jun 30, 2019 at 2:57 PM Johannes Sixt <[email protected]> wrote:
> On my Windows system where the POSIX commands are provided by MSYS2,
> I observe this output:
>
> $ ls -l Makefile
> -rw-r--r-- 1 Johannes Sixt 197121 101780 Jun 30 09:33 Makefile
>
> Notice the blank in the user name. Obviously, extracting the size
> of a file by counting columns won't work. But two tests in t5319
> do that. Use the stat command to print just the file size.
>
> Signed-off-by: Johannes Sixt <[email protected]>
> ---
> diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
> @@ -443,7 +443,7 @@ test_expect_success 'repack with minimum size does not alter existing packs' '
> -               MINSIZE=$(ls -l .git/objects/pack/*pack | awk "{print \$5;}" | sort -n | head -n 1) &&
> +               MINSIZE=$(stat -c %s .git/objects/pack/*pack | sort -n | head -n 1) &&

Unfortunately, this is not portable. While "stat -c %s" works on Linux
and MSYS2, neither that option nor the format directive are recognized
on BSD-like platforms (I tested Mac OS and FreeBSD), which instead
need "stat -f %z".

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 30, 2019

On the Git mailing list, Johannes Sixt wrote (reply to this):

Am 30.06.19 um 21:48 schrieb Eric Sunshine:
> On Sun, Jun 30, 2019 at 2:57 PM Johannes Sixt <[email protected]> wrote:
>> diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
>> @@ -443,7 +443,7 @@ test_expect_success 'repack with minimum size does not alter existing packs' '
>> -               MINSIZE=$(ls -l .git/objects/pack/*pack | awk "{print \$5;}" | sort -n | head -n 1) &&
>> +               MINSIZE=$(stat -c %s .git/objects/pack/*pack | sort -n | head -n 1) &&
> 
> Unfortunately, this is not portable. While "stat -c %s" works on Linux
> and MSYS2, neither that option nor the format directive are recognized
> on BSD-like platforms (I tested Mac OS and FreeBSD), which instead
> need "stat -f %z".

Ouch! I did notice that stat(1) is not in POSIX, but hoped that it was
sufficiently portable. I need a new idea...

-- Hannes

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 30, 2019

On the Git mailing list, Jeff King wrote (reply to this):

On Sun, Jun 30, 2019 at 10:59:34PM +0200, Johannes Sixt wrote:

> Am 30.06.19 um 21:48 schrieb Eric Sunshine:
> > On Sun, Jun 30, 2019 at 2:57 PM Johannes Sixt <[email protected]> wrote:
> >> diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
> >> @@ -443,7 +443,7 @@ test_expect_success 'repack with minimum size does not alter existing packs' '
> >> -               MINSIZE=$(ls -l .git/objects/pack/*pack | awk "{print \$5;}" | sort -n | head -n 1) &&
> >> +               MINSIZE=$(stat -c %s .git/objects/pack/*pack | sort -n | head -n 1) &&
> > 
> > Unfortunately, this is not portable. While "stat -c %s" works on Linux
> > and MSYS2, neither that option nor the format directive are recognized
> > on BSD-like platforms (I tested Mac OS and FreeBSD), which instead
> > need "stat -f %z".
> 
> Ouch! I did notice that stat(1) is not in POSIX, but hoped that it was
> sufficiently portable. I need a new idea...

If we are OK relying on rudimentary perl[1], then:

  perl -le "print((stat)[7]) for @ARGV"

works. If you want it more readable, then maybe:

  perl -MFile::stat -le "print stat(\$_)->size for @ARGV"

Both of those should work with even antique perl versions.

-Peff

[1] I'd also be fine with implementing a test-tool helper in C that
    behaves like stat(1). Or we could punt on that until somebody feels
    like trying to eradicate perl (because this is far from the first
    perl one-liner in the test suite).

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 1, 2019

On the Git mailing list, Johannes Sixt wrote (reply to this):

Am 01.07.19 um 00:25 schrieb Jeff King:
> On Sun, Jun 30, 2019 at 10:59:34PM +0200, Johannes Sixt wrote:
> 
>> Am 30.06.19 um 21:48 schrieb Eric Sunshine:
>>> On Sun, Jun 30, 2019 at 2:57 PM Johannes Sixt <[email protected]> wrote:
>>>> diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
>>>> @@ -443,7 +443,7 @@ test_expect_success 'repack with minimum size does not alter existing packs' '
>>>> -               MINSIZE=$(ls -l .git/objects/pack/*pack | awk "{print \$5;}" | sort -n | head -n 1) &&
>>>> +               MINSIZE=$(stat -c %s .git/objects/pack/*pack | sort -n | head -n 1) &&
>>>
>>> Unfortunately, this is not portable. While "stat -c %s" works on Linux
>>> and MSYS2, neither that option nor the format directive are recognized
>>> on BSD-like platforms (I tested Mac OS and FreeBSD), which instead
>>> need "stat -f %z".
>>
>> Ouch! I did notice that stat(1) is not in POSIX, but hoped that it was
>> sufficiently portable. I need a new idea...
> 
> If we are OK relying on rudimentary perl[1], then:
> 
>   perl -le "print((stat)[7]) for @ARGV"

I'm fine with that. But then we should do the sort + head -n 1 at the
same time. I can do that with a small script, but I'm sure it's possible
in a one-liner...

-- Hannes

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 1, 2019

On the Git mailing list, SZEDER Gábor wrote (reply to this):

On Sun, Jun 30, 2019 at 10:59:34PM +0200, Johannes Sixt wrote:
> Am 30.06.19 um 21:48 schrieb Eric Sunshine:
> > On Sun, Jun 30, 2019 at 2:57 PM Johannes Sixt <[email protected]> wrote:
> >> diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
> >> @@ -443,7 +443,7 @@ test_expect_success 'repack with minimum size does not alter existing packs' '
> >> -               MINSIZE=$(ls -l .git/objects/pack/*pack | awk "{print \$5;}" | sort -n | head -n 1) &&
> >> +               MINSIZE=$(stat -c %s .git/objects/pack/*pack | sort -n | head -n 1) &&
> > 
> > Unfortunately, this is not portable. While "stat -c %s" works on Linux
> > and MSYS2, neither that option nor the format directive are recognized
> > on BSD-like platforms (I tested Mac OS and FreeBSD), which instead
> > need "stat -f %z".
> 
> Ouch! I did notice that stat(1) is not in POSIX, but hoped that it was
> sufficiently portable. I need a new idea...

'wc -c' perhaps?  We already use it in a couple of places in the test
suite to get file size.

diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 79bfaeafa9..ddba862114 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -443,7 +443,7 @@ test_expect_success 'repack with minimum size does not alter existing packs' '
 		touch -m -t 201901010002 .git/objects/pack/pack-B* &&
 		touch -m -t 201901010003 .git/objects/pack/pack-A* &&
 		ls .git/objects/pack >expect &&
-		MINSIZE=$(ls -l .git/objects/pack/*pack | awk "{print \$5;}" | sort -n | head -n 1) &&
+		MINSIZE=$(wc -c .git/objects/pack/*pack | sort -n | sed -n -e "s/^ *//" -e "1 s/[^ ]*$//p") &&
 		git multi-pack-index repack --batch-size=$MINSIZE &&
 		ls .git/objects/pack >actual &&
 		test_cmp expect actual
@@ -455,7 +455,7 @@ test_expect_success 'repack creates a new pack' '
 		cd dup &&
 		ls .git/objects/pack/*idx >idx-list &&
 		test_line_count = 5 idx-list &&
-		THIRD_SMALLEST_SIZE=$(ls -l .git/objects/pack/*pack | awk "{print \$5;}" | sort -n | head -n 3 | tail -n 1) &&
+		THIRD_SMALLEST_SIZE=$(wc -c .git/objects/pack/*pack | sort -n | sed -n -e "s/^ *//" -e "3 s/[^ ]*$//p") &&
 		BATCH_SIZE=$(($THIRD_SMALLEST_SIZE + 1)) &&
 		git multi-pack-index repack --batch-size=$BATCH_SIZE &&
 		ls .git/objects/pack/*idx >idx-list &&

Note that 'wc -c' prints the numbers in its output left padded, hence
the 'sed "s/^ *//"' above, and once 'sed' is already up and running,
it might as well take over the role of that 'head -n 3 | tail -n 1'.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 1, 2019

On the Git mailing list, Jeff King wrote (reply to this):

On Mon, Jul 01, 2019 at 08:33:18AM +0200, Johannes Sixt wrote:

> > If we are OK relying on rudimentary perl[1], then:
> > 
> >   perl -le "print((stat)[7]) for @ARGV"
> 
> I'm fine with that. But then we should do the sort + head -n 1 at the
> same time. I can do that with a small script, but I'm sure it's possible
> in a one-liner...

Something like:

  perl -le 'print((sort { $a <=> $b } map { (stat)[7] } @ARGV)[0])'

but that is getting pretty unreadable. If we rely on List::Util, it's a
bit nicer:

  perl -MList::Util=min -e 'print min(map { (stat)[7] } @ARGV)'

but that implies perl v5.7.3. Which is from 2002, and older than our
usual minimum-perl version, but we've typically been very conservative
with these one-liners, since they do not respect NO_PERL.

Probably writing it out like:

  perl -le '
    my @sizes = map { (stat)[7] } @ARGV;
    @sizes = sort { $a <=> $b } @sizes;
    print $size[0];
  '

would be better (and makes "3rd-smallest" a trivial change).

I see Gábor suggested using "wc -c" elsewhere in the thread. That would
be fine with me, too, though I think the required sed there may be
getting pretty unreadable, too. :)

-Peff

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 1, 2019

On the Git mailing list, SZEDER Gábor wrote (reply to this):

On Mon, Jul 01, 2019 at 05:16:02AM -0400, Jeff King wrote:
> I see Gábor suggested using "wc -c" elsewhere in the thread. That would
> be fine with me, too, though I think the required sed there may be
> getting pretty unreadable, too. :)

It could be done even without 'sed', though at the expense of running
a coupe more 'wc -c's in a loop:

diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index 79bfaeafa9..bacec5e2e4 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -443,7 +443,12 @@ test_expect_success 'repack with minimum size does not alter existing packs' '
 		touch -m -t 201901010002 .git/objects/pack/pack-B* &&
 		touch -m -t 201901010003 .git/objects/pack/pack-A* &&
 		ls .git/objects/pack >expect &&
-		MINSIZE=$(ls -l .git/objects/pack/*pack | awk "{print \$5;}" | sort -n | head -n 1) &&
+		MINSIZE=$(
+			for pack in .git/objects/pack/*pack
+			do
+				wc -c <"$pack"
+			done | sort -n | head -n 1
+		) &&
 		git multi-pack-index repack --batch-size=$MINSIZE &&
 		ls .git/objects/pack >actual &&
 		test_cmp expect actual
@@ -455,7 +460,12 @@ test_expect_success 'repack creates a new pack' '
 		cd dup &&
 		ls .git/objects/pack/*idx >idx-list &&
 		test_line_count = 5 idx-list &&
-		THIRD_SMALLEST_SIZE=$(ls -l .git/objects/pack/*pack | awk "{print \$5;}" | sort -n | head -n 3 | tail -n 1) &&
+		THIRD_SMALLEST_SIZE=$(
+			for pack in .git/objects/pack/*pack
+			do
+				wc -c <"$pack"
+			done | sort -n | head -n 3 | tail -n 1
+		) &&
 		BATCH_SIZE=$(($THIRD_SMALLEST_SIZE + 1)) &&
 		git multi-pack-index repack --batch-size=$BATCH_SIZE &&
 		ls .git/objects/pack/*idx >idx-list &&

Is it really better?  Dunno, but at least there is no subtlety with
the leading padding spaces.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 1, 2019

On the Git mailing list, Derrick Stolee wrote (reply to this):

Sorry I'm late to the thread. Thanks for reporting this issue, Johannes.

On 7/1/2019 7:33 AM, SZEDER Gábor wrote:
> On Mon, Jul 01, 2019 at 05:16:02AM -0400, Jeff King wrote:
>> I see Gábor suggested using "wc -c" elsewhere in the thread. That would
>> be fine with me, too, though I think the required sed there may be
>> getting pretty unreadable, too. :)
> 
> It could be done even without 'sed', though at the expense of running
> a coupe more 'wc -c's in a loop:>
> diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
> index 79bfaeafa9..bacec5e2e4 100755
> --- a/t/t5319-multi-pack-index.sh
> +++ b/t/t5319-multi-pack-index.sh
> @@ -443,7 +443,12 @@ test_expect_success 'repack with minimum size does not alter existing packs' '
>  		touch -m -t 201901010002 .git/objects/pack/pack-B* &&
>  		touch -m -t 201901010003 .git/objects/pack/pack-A* &&
>  		ls .git/objects/pack >expect &&
> -		MINSIZE=$(ls -l .git/objects/pack/*pack | awk "{print \$5;}" | sort -n | head -n 1) &&
> +		MINSIZE=$(
> +			for pack in .git/objects/pack/*pack
> +			do
> +				wc -c <"$pack"
> +			done | sort -n | head -n 1
> +		) &&
>  		git multi-pack-index repack --batch-size=$MINSIZE &&
>  		ls .git/objects/pack >actual &&
>  		test_cmp expect actual
> @@ -455,7 +460,12 @@ test_expect_success 'repack creates a new pack' '
>  		cd dup &&
>  		ls .git/objects/pack/*idx >idx-list &&
>  		test_line_count = 5 idx-list &&
> -		THIRD_SMALLEST_SIZE=$(ls -l .git/objects/pack/*pack | awk "{print \$5;}" | sort -n | head -n 3 | tail -n 1) &&
> +		THIRD_SMALLEST_SIZE=$(
> +			for pack in .git/objects/pack/*pack
> +			do
> +				wc -c <"$pack"
> +			done | sort -n | head -n 3 | tail -n 1
> +		) &&
>  		BATCH_SIZE=$(($THIRD_SMALLEST_SIZE + 1)) &&
>  		git multi-pack-index repack --batch-size=$BATCH_SIZE &&
>  		ls .git/objects/pack/*idx >idx-list &&
> 
> Is it really better?  Dunno, but at least there is no subtlety with
> the leading padding spaces.

Your change is certainly more straight-forward, and avoids all issues
around Perl compatibility.

It does have the issue of more 'wc' processes, which is a downside. The
count here is not too bad, but if we need to duplicate this pattern elsewhere
we may be better off creating a stat(1) replacement in test-lib.sh.

Thanks,
-Stolee

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 1, 2019

On the Git mailing list, Johannes Schindelin wrote (reply to this):

Hi Peff,

On Sun, 30 Jun 2019, Jeff King wrote:

> On Sun, Jun 30, 2019 at 10:59:34PM +0200, Johannes Sixt wrote:
>
> > Am 30.06.19 um 21:48 schrieb Eric Sunshine:
> > > On Sun, Jun 30, 2019 at 2:57 PM Johannes Sixt <[email protected]> wrote:
> > >> diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index=
.sh
> > >> @@ -443,7 +443,7 @@ test_expect_success 'repack with minimum size d=
oes not alter existing packs' '
> > >> -               MINSIZE=3D$(ls -l .git/objects/pack/*pack | awk "{p=
rint \$5;}" | sort -n | head -n 1) &&
> > >> +               MINSIZE=3D$(stat -c %s .git/objects/pack/*pack | so=
rt -n | head -n 1) &&
> > >
> > > Unfortunately, this is not portable. While "stat -c %s" works on Lin=
ux
> > > and MSYS2, neither that option nor the format directive are recogniz=
ed
> > > on BSD-like platforms (I tested Mac OS and FreeBSD), which instead
> > > need "stat -f %z".
> >
> > Ouch! I did notice that stat(1) is not in POSIX, but hoped that it was
> > sufficiently portable. I need a new idea...
>
> If we are OK relying on rudimentary perl[1], then:
>
>   perl -le "print((stat)[7]) for @ARGV"
>
> works. If you want it more readable, then maybe:
>
>   perl -MFile::stat -le "print stat(\$_)->size for @ARGV"

Or we stop introducing new Perl calls, and use the perfectly fine
`test-tool path-utils file-size` command:

https://github.com/git/git/blob/v2.22.0/t/helper/test-path-utils.c#L302-L3=
12

This solves not only portability problems but also avoids yet another
obstacle into making a `NO_PERL` test suite run really work without Perl.

Ciao,
Dscho

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 1, 2019

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 7/1/2019 8:11 AM, Johannes Schindelin wrote:
> Hi Peff,
> 
> On Sun, 30 Jun 2019, Jeff King wrote:
> 
>> On Sun, Jun 30, 2019 at 10:59:34PM +0200, Johannes Sixt wrote:
>>
>>> Am 30.06.19 um 21:48 schrieb Eric Sunshine:
>>>> On Sun, Jun 30, 2019 at 2:57 PM Johannes Sixt <[email protected]> wrote:
>>>>> diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
>>>>> @@ -443,7 +443,7 @@ test_expect_success 'repack with minimum size does not alter existing packs' '
>>>>> -               MINSIZE=$(ls -l .git/objects/pack/*pack | awk "{print \$5;}" | sort -n | head -n 1) &&
>>>>> +               MINSIZE=$(stat -c %s .git/objects/pack/*pack | sort -n | head -n 1) &&
>>>>
>>>> Unfortunately, this is not portable. While "stat -c %s" works on Linux
>>>> and MSYS2, neither that option nor the format directive are recognized
>>>> on BSD-like platforms (I tested Mac OS and FreeBSD), which instead
>>>> need "stat -f %z".
>>>
>>> Ouch! I did notice that stat(1) is not in POSIX, but hoped that it was
>>> sufficiently portable. I need a new idea...
>>
>> If we are OK relying on rudimentary perl[1], then:
>>
>>   perl -le "print((stat)[7]) for @ARGV"
>>
>> works. If you want it more readable, then maybe:
>>
>>   perl -MFile::stat -le "print stat(\$_)->size for @ARGV"
> 
> Or we stop introducing new Perl calls, and use the perfectly fine
> `test-tool path-utils file-size` command:
> 
> https://github.com/git/git/blob/v2.22.0/t/helper/test-path-utils.c#L302-L312
> 
> This solves not only portability problems but also avoids yet another
> obstacle into making a `NO_PERL` test suite run really work without Perl.
Thanks! This does seem like the best option. Thanks for bringing this to our
attention. Here is a diff, and I'll prepare a full patch:

diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index dd6083e61a2..5379e59168a 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -447,7 +447,7 @@ test_expect_success 'repack with minimum size does not alter existing packs' '
                touch -m -t 201901010002 .git/objects/pack/pack-B* &&
                touch -m -t 201901010003 .git/objects/pack/pack-A* &&
                ls .git/objects/pack >expect &&
-               MINSIZE=$(ls -l .git/objects/pack/*pack | awk "{print \$5;}" | sort -n | head -n 1) &&
+               MINSIZE=$(test-tool path-utils file-size .git/objects/pack/*pack | sort -n | head -n 1) &&
                git multi-pack-index repack --batch-size=$MINSIZE &&
                ls .git/objects/pack >actual &&
                test_cmp expect actual
@@ -459,7 +459,7 @@ test_expect_success 'repack creates a new pack' '
                cd dup &&
                ls .git/objects/pack/*idx >idx-list &&
                test_line_count = 5 idx-list &&
-               THIRD_SMALLEST_SIZE=$(ls -l .git/objects/pack/*pack | awk "{print \$5;}" | sort -n | head -n 3 | tail -n 1) &&
+               THIRD_SMALLEST_SIZE=$(test-tool path-utils file-size .git/objects/pack/*pack | sort -n | head -n 3 | tail -n 1) &&
                BATCH_SIZE=$(($THIRD_SMALLEST_SIZE + 1)) &&
                git multi-pack-index repack --batch-size=$BATCH_SIZE &&
                ls .git/objects/pack/*idx >idx-list &&

Thanks,
-Stolee

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 1, 2019

On the Git mailing list, Jeff King wrote (reply to this):

On Mon, Jul 01, 2019 at 02:11:43PM +0200, Johannes Schindelin wrote:

> Or we stop introducing new Perl calls, and use the perfectly fine
> `test-tool path-utils file-size` command:
> 
> https://github.com/git/git/blob/v2.22.0/t/helper/test-path-utils.c#L302-L312

Ah, thanks, I missed that we had already added it. Certainly that seems
like the right solution then.

-Peff

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 1, 2019

On the Git mailing list, Andreas Schwab wrote (reply to this):

On Jun 30 2019, Johannes Sixt <[email protected]> wrote:

> diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
> index 79bfaeafa9..4b4d06a1c8 100755
> --- a/t/t5319-multi-pack-index.sh
> +++ b/t/t5319-multi-pack-index.sh
> @@ -443,7 +443,7 @@ test_expect_success 'repack with minimum size does not alter existing packs' '
>  		touch -m -t 201901010002 .git/objects/pack/pack-B* &&
>  		touch -m -t 201901010003 .git/objects/pack/pack-A* &&
>  		ls .git/objects/pack >expect &&
> -		MINSIZE=$(ls -l .git/objects/pack/*pack | awk "{print \$5;}" | sort -n | head -n 1) &&
> +		MINSIZE=$(stat -c %s .git/objects/pack/*pack | sort -n | head -n 1) &&

You could also use ls -n.

Andreas.

-- 
Andreas Schwab, [email protected]
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 1, 2019

On the Git mailing list, Johannes Sixt wrote (reply to this):

Am 01.07.19 um 14:30 schrieb Derrick Stolee:
> On 7/1/2019 8:11 AM, Johannes Schindelin wrote:
>> Or we stop introducing new Perl calls, and use the perfectly fine
>> `test-tool path-utils file-size` command:
>>
>> https://github.com/git/git/blob/v2.22.0/t/helper/test-path-utils.c#L302-L312
>>
>> This solves not only portability problems but also avoids yet another
>> obstacle into making a `NO_PERL` test suite run really work without Perl.
> Thanks! This does seem like the best option. Thanks for bringing this to our
> attention. Here is a diff, and I'll prepare a full patch:

Thanks. Please also explain why the first of the two tests does not fail
with a large --batch-size (unless it is obvious for people who know a
bit about multi-pack-index, of course).

-- Hannes

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 1, 2019

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 7/1/2019 2:22 PM, Johannes Sixt wrote:
> Am 01.07.19 um 14:30 schrieb Derrick Stolee:
>> On 7/1/2019 8:11 AM, Johannes Schindelin wrote:
>>> Or we stop introducing new Perl calls, and use the perfectly fine
>>> `test-tool path-utils file-size` command:
>>>
>>> https://github.com/git/git/blob/v2.22.0/t/helper/test-path-utils.c#L302-L312
>>>
>>> This solves not only portability problems but also avoids yet another
>>> obstacle into making a `NO_PERL` test suite run really work without Perl.
>> Thanks! This does seem like the best option. Thanks for bringing this to our
>> attention. Here is a diff, and I'll prepare a full patch:
> 
> Thanks. Please also explain why the first of the two tests does not fail
> with a large --batch-size (unless it is obvious for people who know a
> bit about multi-pack-index, of course).

Sorry about missing that. Here is an explanation:

The --batch-size=X option determines how much data we want to move in
a single operation. Based on this value, we select packs greedily as
follows:

  1. If a pack has size at most X, include it.
  2. If our total pack size is greater than X, stop.

At the end of this process, if we have zero or one packs we do nothing.
That is why a small size does nothing.

Also, if our total pack size is smaller than X, we do nothing. Our
batch is not "full" enough to merit the work. This is why a large size
does nothing.

Thanks,
-Stolee

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 1, 2019

On the Git mailing list, Junio C Hamano wrote (reply to this):

Andreas Schwab <[email protected]> writes:

> On Jun 30 2019, Johannes Sixt <[email protected]> wrote:
>
>> diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
>> index 79bfaeafa9..4b4d06a1c8 100755
>> --- a/t/t5319-multi-pack-index.sh
>> +++ b/t/t5319-multi-pack-index.sh
>> @@ -443,7 +443,7 @@ test_expect_success 'repack with minimum size does not alter existing packs' '
>>  		touch -m -t 201901010002 .git/objects/pack/pack-B* &&
>>  		touch -m -t 201901010003 .git/objects/pack/pack-A* &&
>>  		ls .git/objects/pack >expect &&
>> -		MINSIZE=$(ls -l .git/objects/pack/*pack | awk "{print \$5;}" | sort -n | head -n 1) &&
>> +		MINSIZE=$(stat -c %s .git/objects/pack/*pack | sort -n | head -n 1) &&
>
> You could also use ls -n.

Nice ;-)

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 1, 2019

This patch series was integrated into pu via git@e125aaa.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 2, 2019

This patch series was integrated into pu via git@414f9f7.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 3, 2019

This patch series was integrated into pu via git@5bc2835.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 3, 2019

This patch series was integrated into pu via git@76785e6.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 9, 2019

This patch series was integrated into pu via git@521e0a0.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 10, 2019

This patch series was integrated into pu via git@cbaf64d.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 12, 2019

This patch series was integrated into pu via git@7d55f29.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 19, 2019

This patch series was integrated into pu via git@4308d81.

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 19, 2019

This patch series was integrated into master via git@4308d81.

@gitgitgadget gitgitgadget bot added the master label Jul 19, 2019
@gitgitgadget gitgitgadget bot closed this Jul 19, 2019
@gitgitgadget
Copy link

gitgitgadget bot commented Jul 19, 2019

Closed via 4308d81.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant