Commit 91243b0

Merge branch 'en/filter-branch-deprecation'
Start discouraging the use of "git filter-branch".

* en/filter-branch-deprecation:
  t9902: use a non-deprecated command for testing
  Recommend git-filter-repo instead of git-filter-branch
  t6006: simplify, fix, and optimize empty message test
2 parents 9bc67b6 + 483e861 commit 91243b0

11 files changed: 296 additions and 68 deletions

Documentation/git-fast-export.txt

Lines changed: 3 additions & 3 deletions
Original file line number | Diff line number | Diff line change
@@ -17,9 +17,9 @@ This program dumps the given revisions in a form suitable to be piped
1717
into 'git fast-import'.
1818

1919
You can use it as a human-readable bundle replacement (see
20-
linkgit:git-bundle[1]), or as a kind of an interactive
21-
'git filter-branch'.
22-
20+
linkgit:git-bundle[1]), or as a format that can be edited before being
21+
fed to 'git fast-import' in order to do history rewrites (an ability
22+
relied on by tools like 'git filter-repo').
2323

2424
OPTIONS
2525
-------

Documentation/git-filter-branch.txt

Lines changed: 243 additions & 30 deletions
@@ -16,6 +16,19 @@ SYNOPSIS
1616
[--original <namespace>] [-d <directory>] [-f | --force]
1717
[--state-branch <branch>] [--] [<rev-list options>...]
1818

19+
WARNING
20+
-------
21+
'git filter-branch' has a plethora of pitfalls that can produce non-obvious
22+
manglings of the intended history rewrite (and can leave you with little
23+
time to investigate such problems since it has such abysmal performance).
24+
These safety and performance issues cannot be backward compatibly fixed and
25+
as such, its use is not recommended. Please use an alternative history
26+
filtering tool such as https://github.com/newren/git-filter-repo/[git
27+
filter-repo]. If you still need to use 'git filter-branch', please
28+
carefully read <<SAFETY>> (and <<PERFORMANCE>>) to learn about the land
29+
mines of filter-branch, and then vigilantly avoid as many of the hazards
30+
listed there as reasonably possible.
31+
1932
DESCRIPTION
2033
-----------
2134
Lets you rewrite Git revision history by rewriting the branches mentioned
@@ -445,36 +458,236 @@ warned.
445458
(or if your git-gc is not new enough to support arguments to
446459
`--prune`, use `git repack -ad; git prune` instead).
447460

448-
NOTES
449-
-----
450-
451-
git-filter-branch allows you to make complex shell-scripted rewrites
452-
of your Git history, but you probably don't need this flexibility if
453-
you're simply _removing unwanted data_ like large files or passwords.
454-
For those operations you may want to consider
455-
http://rtyley.github.io/bfg-repo-cleaner/[The BFG Repo-Cleaner],
456-
a JVM-based alternative to git-filter-branch, typically at least
457-
10-50x faster for those use-cases, and with quite different
458-
characteristics:
459-
460-
* Any particular version of a file is cleaned exactly _once_. The BFG,
461-
unlike git-filter-branch, does not give you the opportunity to
462-
handle a file differently based on where or when it was committed
463-
within your history. This constraint gives the core performance
464-
benefit of The BFG, and is well-suited to the task of cleansing bad
465-
data - you don't care _where_ the bad data is, you just want it
466-
_gone_.
467-
468-
* By default The BFG takes full advantage of multi-core machines,
469-
cleansing commit file-trees in parallel. git-filter-branch cleans
470-
commits sequentially (i.e. in a single-threaded manner), though it
471-
_is_ possible to write filters that include their own parallelism,
472-
in the scripts executed against each commit.
473-
474-
* The http://rtyley.github.io/bfg-repo-cleaner/#examples[command options]
475-
are much more restrictive than git-filter branch, and dedicated just
476-
to the tasks of removing unwanted data- e.g:
477-
`--strip-blobs-bigger-than 1M`.
461+
[[PERFORMANCE]]
462+
PERFORMANCE
463+
-----------
464+
465+
The performance of git-filter-branch is glacially slow; its design makes it
466+
impossible for a backward-compatible implementation to ever be fast:
467+
468+
* In editing files, git-filter-branch by design checks out each and
469+
every commit as it existed in the original repo. If your repo has 10\^5
470+
files and 10\^5 commits, but each commit only modifies 5 files, then
471+
git-filter-branch will make you do 10\^10 modifications, despite only
472+
having (at most) 5*10^5 unique blobs.
473+
474+
* If you try to cheat by making git-filter-branch only work on
475+
files modified in a commit, then two things happen:
476+
477+
** you run into problems with deletions whenever the user is simply
478+
trying to rename files (because attempting to delete files that
479+
don't exist looks like a no-op; it takes some chicanery to remap
480+
deletes across file renames when the renames happen via arbitrary
481+
user-provided shell)
482+
483+
** even if you succeed at the map-deletes-for-renames chicanery, you
484+
still technically violate backward compatibility because users are
485+
allowed to filter files in ways that depend upon topology of
486+
commits instead of filtering solely based on file contents or names
487+
(though this has not been observed in the wild).
488+
489+
* Even if you don't need to edit files but only want to e.g. rename or
490+
remove some and thus can avoid checking out each file (i.e. you can use
491+
--index-filter), you still are passing shell snippets for your filters.
492+
This means that for every commit, you have to have a prepared git repo
493+
where those filters can be run. That's a significant setup.
494+
495+
* Further, several additional files are created or updated per commit by
496+
git-filter-branch. Some of these are for supporting the convenience
497+
functions provided by git-filter-branch (such as map()), while others
498+
are for keeping track of internal state (but could have also been
499+
accessed by user filters; one of git-filter-branch's regression tests
500+
does so). This essentially amounts to using the filesystem as an IPC
501+
mechanism between git-filter-branch and the user-provided filters.
502+
Disks tend to be a slow IPC mechanism, and writing these files also
503+
effectively represents a forced synchronization point between separate
504+
processes that we hit with every commit.
505+
506+
* The user-provided shell commands will likely involve a pipeline of
507+
commands, resulting in the creation of many processes per commit.
508+
Creating and running another process takes a widely varying amount of
509+
time between operating systems, but on any platform it is very slow
510+
relative to invoking a function.
511+
512+
* git-filter-branch itself is written in shell, which is kind of slow.
513+
This is the one performance issue that could be backward-compatibly
514+
fixed, but compared to the above problems that are intrinsic to the
515+
design of git-filter-branch, the language of the tool itself is a
516+
relatively minor issue.
517+
518+
** Side note: Unfortunately, people tend to fixate on the
519+
written-in-shell aspect and periodically ask if git-filter-branch
520+
could be rewritten in another language to fix the performance
521+
issues. Not only does that ignore the bigger intrinsic problems
522+
with the design, it'd help less than you'd expect: if
523+
git-filter-branch itself were not shell, then the convenience
524+
functions (map(), skip_commit(), etc) and the `--setup` argument
525+
could no longer be executed once at the beginning of the program
526+
but would instead need to be prepended to every user filter (and
527+
thus re-executed with every commit).
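
The arithmetic in the first PERFORMANCE bullet can be checked with a short Python sketch. The figures are the ones quoted in the text; the variable names are illustrative only and are not part of git.

```python
# Figures quoted in the PERFORMANCE text; names are illustrative assumptions.
files_in_repo = 10**5
commits = 10**5
files_modified_per_commit = 5

# A filter that edits files checks out every file of every commit.
files_checked_out = files_in_repo * commits

# Upper bound on distinct blobs, as stated in the text.
unique_blobs_upper_bound = files_modified_per_commit * commits

# How many times, on average, the same blob content is written out again.
redundancy = files_checked_out // unique_blobs_upper_bound

print("files checked out: ", files_checked_out)
print("unique blobs (max):", unique_blobs_upper_bound)
print("redundancy factor: ", redundancy)
```

Under these assumptions, each unique blob is handled about 20,000 times, which is the waste the bullet describes.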
528+
529+
The https://github.com/newren/git-filter-repo/[git filter-repo] tool is
530+
an alternative to git-filter-branch which does not suffer from these
531+
performance problems or the safety problems (mentioned below). For those
532+
with existing tooling which relies upon git-filter-branch, 'git
533+
filter-repo' also provides
534+
https://github.com/newren/git-filter-repo/blob/master/contrib/filter-repo-demos/filter-lamely[filter-lamely],
535+
a drop-in git-filter-branch replacement (with a few caveats). While
536+
filter-lamely suffers from all the same safety issues as
537+
git-filter-branch, it at least ameliorates the performance issues a
538+
little.
539+
540+
[[SAFETY]]
541+
SAFETY
542+
------
543+
544+
git-filter-branch is riddled with gotchas resulting in various ways to
545+
easily corrupt repos or end up with a mess worse than what you started
546+
with:
547+
548+
* Someone can have a set of "working and tested filters" which they
549+
document or provide to a coworker, who then runs them on a different OS
550+
where the same commands are not working/tested (some examples in the
551+
git-filter-branch manpage are also affected by this). BSD vs. GNU
552+
userland differences can really bite. If lucky, error messages are
553+
spewed. But just as likely, the commands either don't do the filtering
554+
requested, or silently corrupt by making some unwanted change. The
555+
unwanted change may only affect a few commits, so it's not necessarily
556+
obvious either. (The fact that problems won't necessarily be obvious
557+
means they are likely to go unnoticed until the rewritten history is in
558+
use for quite a while, at which point it's really hard to justify
559+
another flag-day for another rewrite.)
560+
561+
* Filenames with spaces are often mishandled by shell snippets since
562+
they cause problems for shell pipelines. Not everyone is familiar with
563+
find -print0, xargs -0, git-ls-files -z, etc. Even people who are
564+
familiar with these may assume such flags are not relevant because
565+
someone else renamed any such files in their repo back before the person
566+
doing the filtering joined the project. And often, even those familiar
567+
with handling arguments with spaces may not do so just because they
568+
aren't in the mindset of thinking about everything that could possibly
569+
go wrong.
570+
571+
* Non-ascii filenames can be silently removed despite being in a desired
572+
directory. Keeping only wanted paths is often done using pipelines like
573+
`git ls-files | grep -v ^WANTED_DIR/ | xargs git rm`. ls-files will
574+
only quote filenames if needed, so folks may not notice that one of the
575+
files didn't match the regex (at least not until it's much too late).
576+
Yes, someone who knows about core.quotePath can avoid this (unless they
577+
have other special characters like \t, \n, or "), and people who use
578+
ls-files -z with something other than grep can avoid this, but that
579+
doesn't mean they will.
580+
581+
* Similarly, when moving files around, one can find that filenames with
582+
non-ascii or special characters end up in a different directory, one
583+
that includes a double quote character. (This is technically the same
584+
issue as above with quoting, but perhaps an interesting different way
585+
that it can and has manifested as a problem.)
586+
587+
* It's far too easy to accidentally mix up old and new history. It's
588+
still possible with any tool, but git-filter-branch almost invites it.
589+
If lucky, the only downside is users getting frustrated that they don't
590+
know how to shrink their repo and remove the old stuff. If unlucky,
591+
they merge old and new history and end up with multiple "copies" of each
592+
commit, some of which have unwanted or sensitive files and others which
593+
don't. This comes about in multiple different ways:
594+
595+
** the default of doing only a partial history rewrite ('--all' is not
596+
the default and few examples show it)
597+
598+
** the fact that there's no automatic post-run cleanup
599+
600+
** the fact that --tag-name-filter (when used to rename tags) doesn't
601+
remove the old tags but just adds new ones with the new name
602+
603+
** the fact that little educational information is provided to inform
604+
users of the ramifications of a rewrite and how to avoid mixing old
605+
and new history. For example, this man page discusses how users
606+
need to understand that they need to rebase their changes for all
607+
their branches on top of new history (or delete and reclone), but
608+
that's only one of multiple concerns to consider. See the
609+
"DISCUSSION" section of the git filter-repo manual page for more
610+
details.
611+
612+
* Annotated tags can be accidentally converted to lightweight tags, due
613+
to either of two issues:
614+
615+
** Someone can do a history rewrite, realize they messed up, restore
616+
from the backups in refs/original/, and then redo their
617+
git-filter-branch command. (The backup in refs/original/ is not a
618+
real backup; it dereferences tags first.)
619+
620+
** Running git-filter-branch with either --tags or --all in your
621+
<rev-list options>. In order to retain annotated tags as
622+
annotated, you must use --tag-name-filter (and must not have
623+
restored from refs/original/ in a previously botched rewrite).
624+
625+
* Any commit messages that specify an encoding will become corrupted
626+
by the rewrite; git-filter-branch ignores the encoding, takes the original
627+
bytes, and feeds them to commit-tree without telling it the proper
628+
encoding. (This happens whether or not --msg-filter is used.)
629+
630+
* Commit messages (even if they are all UTF-8) by default become
631+
corrupted due to not being updated -- any references to other commit
632+
hashes in commit messages will now refer to no-longer-extant commits.
633+
634+
* There are no facilities for helping users find what unwanted crud they
635+
should delete, which means they are much more likely to have incomplete
636+
or partial cleanups that sometimes result in confusion and people
637+
wasting time trying to understand. (For example, folks tend to just
638+
look for big files to delete instead of big directories or extensions,
639+
and once they do so, then sometime later folks using the new repository
640+
who are going through history will notice a build artifact directory
641+
that has some files but not others, or a cache of dependencies
642+
(node_modules or similar) which couldn't have ever been functional since
643+
it's missing some files.)
644+
645+
* If --prune-empty isn't specified, then the filtering process can
646+
create hordes of confusing empty commits.
647+
648+
* If --prune-empty is specified, then intentionally placed empty
649+
commits from before the filtering operation are also pruned instead of
650+
just pruning commits that became empty due to filtering rules.
651+
652+
* If --prune-empty is specified, sometimes empty commits are missed
653+
and left around anyway (a somewhat rare bug, but it happens...)
654+
655+
* A minor issue, but users who have a goal to update all names and
656+
emails in a repository may be led to --env-filter which will only update
657+
authors and committers, missing taggers.
658+
659+
* If the user provides a --tag-name-filter that maps multiple tags to
660+
the same name, no warning or error is provided; git-filter-branch simply
661+
overwrites each tag in some undocumented pre-defined order resulting in
662+
only one tag at the end. (A git-filter-branch regression test requires
663+
this surprising behavior.)
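
The non-ASCII filename hazard in the list above can be reproduced without touching a repository. The sketch below is a rough Python imitation, not git's actual implementation: `quote_like_ls_files` is a hypothetical helper that approximates the default path quoting (core.quotePath enabled), and the list comprehension stands in for `grep -v ^WANTED_DIR/`. The paths are made up for illustration, and real git uses short escapes such as `\t` for some control characters where this sketch uses octal.

```python
import re


def quote_like_ls_files(path: str) -> str:
    """Approximate git's default quoting of 'unusual' paths (an assumption,
    simplified: control bytes are shown as octal escapes here)."""
    raw = path.encode("utf-8")
    if not any(b >= 0x80 or b < 0x20 or b in b'"\\' for b in raw):
        return path  # ordinary paths are printed as-is
    out = []
    for b in raw:
        if b in b'"\\':
            out.append("\\" + chr(b))
        elif b >= 0x80 or b < 0x20:
            out.append("\\%03o" % b)  # e.g. bytes of e-acute -> \303\251
        else:
            out.append(chr(b))
    return '"' + "".join(out) + '"'


# Hypothetical repository contents; "caf\u00e9" is "cafe" with an e-acute.
paths = ["WANTED_DIR/plain.txt", "WANTED_DIR/caf\u00e9.txt", "other/junk.bin"]
listing = [quote_like_ls_files(p) for p in paths]

# Stand-in for `grep -v ^WANTED_DIR/`: keep lines that do NOT match.
to_remove = [line for line in listing if not re.match(r"^WANTED_DIR/", line)]

for line in to_remove:
    print("would be passed to git rm:", line)
```

The quoted line begins with a double quote rather than `WANTED_DIR/`, so it slips past the filter and the wanted file lands on the removal list.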
664+
665+
Also, the poor performance of git-filter-branch often leads to safety
666+
issues:
667+
668+
* Coming up with the correct shell snippet to do the filtering you want
669+
is sometimes difficult unless you're just doing a trivial modification
670+
such as deleting a couple files. Unfortunately, people often learn if
671+
the snippet is right or wrong by trying it out, but the rightness or
672+
wrongness can vary depending on special circumstances (spaces in
673+
filenames, non-ascii filenames, funny author names or emails, invalid
674+
timezones, presence of grafts or replace objects, etc.), meaning they
675+
may have to wait a long time, hit an error, then restart. The
676+
performance of git-filter-branch is so bad that this cycle is painful,
677+
reducing the time available to carefully re-check (to say nothing about
678+
what it does to the patience of the person doing the rewrite even if
679+
they do technically have more time available). This problem is further
680+
compounded because errors from broken filters may not be shown for a
681+
long time and/or get lost in a sea of output. Even worse, broken
682+
filters often just result in silent incorrect rewrites.
683+
684+
* To top it all off, even when users finally find working commands, they
685+
naturally want to share them. But they may be unaware that their repo
686+
didn't have some special cases that someone else's does. So, when
687+
someone else with a different repository runs the same commands, they
688+
get hit by the problems above. Or, the user just runs commands that
689+
really were vetted for special cases, but they run it on a different OS
690+
where it doesn't work, as noted above.
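
One of the special circumstances mentioned above, spaces in filenames, is easy to illustrate. The Python sketch below only models how a file list is split into arguments; it is a simplification that ignores the quote and backslash handling of a real `xargs`, and the file names are hypothetical.

```python
# Hypothetical file list; the names are made up for illustration.
paths = ["docs/release notes.txt", "src/main.c"]

# Roughly what `git ls-files` emits: one path per line.
newline_listing = "\n".join(paths) + "\n"
# Roughly what `git ls-files -z` emits: NUL-terminated paths.
nul_listing = "\0".join(paths) + "\0"

# A plain xargs splits on blanks and newlines (simplified model).
naive_args = newline_listing.split()

# xargs -0 splits on NUL bytes only.
safe_args = [p for p in nul_listing.split("\0") if p]

print("whitespace splitting:", naive_args)
print("NUL splitting:       ", safe_args)
```

With whitespace splitting, the first path turns into two arguments that name no real file; with NUL splitting, the list comes through unchanged.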
478691

479692
GIT
480693
---

Documentation/git-gc.txt

Lines changed: 8 additions & 9 deletions
@@ -115,15 +115,14 @@ NOTES
115115
-----
116116

117117
'git gc' tries very hard not to delete objects that are referenced
118-
anywhere in your repository. In
119-
particular, it will keep not only objects referenced by your current set
120-
of branches and tags, but also objects referenced by the index,
121-
remote-tracking branches, refs saved by 'git filter-branch' in
122-
refs/original/, reflogs (which may reference commits in branches
123-
that were later amended or rewound), and anything else in the refs/* namespace.
124-
If you are expecting some objects to be deleted and they aren't, check
125-
all of those locations and decide whether it makes sense in your case to
126-
remove those references.
118+
anywhere in your repository. In particular, it will keep not only
119+
objects referenced by your current set of branches and tags, but also
120+
objects referenced by the index, remote-tracking branches, notes saved
121+
by 'git notes' under refs/notes/, reflogs (which may reference commits
122+
in branches that were later amended or rewound), and anything else in
123+
the refs/* namespace. If you are expecting some objects to be deleted
124+
and they aren't, check all of those locations and decide whether it
125+
makes sense in your case to remove those references.
127126

128127
On the other hand, when 'git gc' runs concurrently with another process,
129128
there is a risk of it deleting an object that the other process is using

Documentation/git-rebase.txt

Lines changed: 2 additions & 1 deletion
@@ -830,7 +830,8 @@ Hard case: The changes are not the same.::
830830
This happens if the 'subsystem' rebase had conflicts, or used
831831
`--interactive` to omit, edit, squash, or fixup commits; or
832832
if the upstream used one of `commit --amend`, `reset`, or
833-
`filter-branch`.
833+
a full history rewriting command like
834+
https://github.com/newren/git-filter-repo[`filter-repo`].
834835

835836

836837
The easy case

Documentation/git-replace.txt

Lines changed: 5 additions & 5 deletions
@@ -123,10 +123,10 @@ The following format are available:
123123
CREATING REPLACEMENT OBJECTS
124124
----------------------------
125125

126-
linkgit:git-filter-branch[1], linkgit:git-hash-object[1] and
127-
linkgit:git-rebase[1], among other git commands, can be used to create
128-
replacement objects from existing objects. The `--edit` option can
129-
also be used with 'git replace' to create a replacement object by
126+
linkgit:git-hash-object[1], linkgit:git-rebase[1], and
127+
https://github.com/newren/git-filter-repo[git-filter-repo], among other git commands, can be used to
128+
create replacement objects from existing objects. The `--edit` option
129+
can also be used with 'git replace' to create a replacement object by
130130
editing an existing object.
131131

132132
If you want to replace many blobs, trees or commits that are part of a
@@ -148,13 +148,13 @@ pending objects.
148148
SEE ALSO
149149
--------
150150
linkgit:git-hash-object[1]
151-
linkgit:git-filter-branch[1]
152151
linkgit:git-rebase[1]
153152
linkgit:git-tag[1]
154153
linkgit:git-branch[1]
155154
linkgit:git-commit[1]
156155
linkgit:git-var[1]
157156
linkgit:git[1]
157+
https://github.com/newren/git-filter-repo[git-filter-repo]
158158

159159
GIT
160160
---

Documentation/git-svn.txt

Lines changed: 5 additions & 5 deletions
@@ -769,11 +769,11 @@ option for (hopefully) obvious reasons.
769769
+
770770
This option is NOT recommended as it makes it difficult to track down
771771
old references to SVN revision numbers in existing documentation, bug
772-
reports and archives. If you plan to eventually migrate from SVN to Git
773-
and are certain about dropping SVN history, consider
774-
linkgit:git-filter-branch[1] instead. filter-branch also allows
775-
reformatting of metadata for ease-of-reading and rewriting authorship
776-
info for non-"svn.authorsFile" users.
772+
reports, and archives. If you plan to eventually migrate from SVN to
773+
Git and are certain about dropping SVN history, consider
774+
https://github.com/newren/git-filter-repo[git-filter-repo] instead.
775+
filter-repo also allows reformatting of metadata for ease-of-reading
776+
and rewriting authorship info for non-"svn.authorsFile" users.
777777

778778
svn.useSvmProps::
779779
svn-remote.<name>.useSvmProps::

Documentation/githooks.txt

Lines changed: 6 additions & 4 deletions
@@ -447,10 +447,12 @@ post-rewrite
447447

448448
This hook is invoked by commands that rewrite commits
449449
(linkgit:git-commit[1] when called with `--amend` and
450-
linkgit:git-rebase[1]; currently `git filter-branch` does 'not' call
451-
it!). Its first argument denotes the command it was invoked by:
452-
currently one of `amend` or `rebase`. Further command-dependent
453-
arguments may be passed in the future.
450+
linkgit:git-rebase[1]; however, full-history (re)writing tools like
451+
linkgit:git-fast-import[1] or
452+
https://github.com/newren/git-filter-repo[git-filter-repo] typically
453+
do not call it!). Its first argument denotes the command it was
454+
invoked by: currently one of `amend` or `rebase`. Further
455+
command-dependent arguments may be passed in the future.
454456

455457
The hook receives a list of the rewritten commits on stdin, in the
456458
format
