- What is `hoardy`?
- What can `hoardy` do?
- On honesty in reporting of data loss issues
- Glossary
- Quickstart
- Quirks and Bugs
- Frequently Asked Questions
  - I'm using `fdupes`/`jdupes` now, how do I migrate to using `hoardy`?
  - I have two identical files, but `hoardy deduplicate` does not deduplicate them. Why?
  - What would happen if I run `hoardy deduplicate` with an outdated index? Would `hoardy` lose some of my files by wrongly "deduplicating" them?
  - I have two files with equal `SHA256` hash digests and `size`s, and yet they are unequal when compared as binary strings. Would `hoardy` "deduplicate" them wrongly?
  - What would happen if I run `hoardy deduplicate --delete` with the same directory given in two different arguments? Would it consider those files to be equivalent to themselves and delete them, losing all my data?
  - But what if I give the same directory to `hoardy deduplicate --delete` twice, not as equivalent paths, but by giving one of them as a symlink into an ancestor of the other, followed by their common suffix? Will it lose my data now?
  - Alright, but what if I `mount --bind` a directory to another directory, then `hoardy index` and run `hoardy deduplicate --delete` on both? The cloned directory will appear to be exactly the same as the original directory, but the paths will be different, and there will be no symlinks involved. So `hoardy deduplicate --delete` would then detect them as duplicates and would need to delete all files from one of them. But deleting a file from one will also delete it from the other! Ha! Finally! Surely, it would lose my data now?!
  - Hmm, but `hoardy deduplicate`'s implementation looks rather complex. What if a bug there causes it to "deduplicate" some files that are not actually duplicates and lose data?
- Why does `hoardy` exist?
- Development history
- Alternatives
- Meta
- Usage
- Development: `./test-hoardy.sh [--help] [--wine] [--fast] [default] [(NAME|PATH)]*`
`hoardy` is a tool for digital data hoarding, a Swiss-army-knife-like utility for managing otherwise unmanageable piles of files.
On GNU/Linux, `hoardy` is pretty well-tested on my files, and I find it to be an essentially irreplaceable tool for managing duplicated files in related source code trees, media files duplicated between my home directory, `git-annex`, and `hydrus` file object stores, as well as backup snapshots made with `rsync` and `rsnapshot`.
On Windows, however, `hoardy` is work-in-progress alpha software that is essentially unusable and completely untested.
Data formats and command-line syntax of hoardy are subject to change in future versions.
See below for why.
`hoardy` can:

- record hashes and metadata of separate files and/or whole filesystem trees/hierarchies/directories, recursively, in `SQLite` databases; both one big database and/or many small ones are supported;
- update those records incrementally by adding new filesystem trees and/or re-indexing previously added ones; it can also re-`index` filesystem hierarchies much faster if files in its input directories only ever get added or removed, but their contents never change, which is common with backup directories (see `hoardy index --no-update`);
- find duplicated files matching specified criteria, and then
  - display them,
  - replace some of the duplicated files with hardlinks to others, or
  - delete some of the duplicated files;

  similarly to what `fdupes` and `jdupes` do, but `hoardy` won't lose your files, won't lose extended file attributes, won't leave your filesystem in an inconsistent state in case of power failure, is much faster on large inputs, can be used even if you have more files than you have RAM to store their metadata, can be run incrementally without degrading the quality of results, ...;
- verify actual filesystem contents against file metadata and/or hashes previously recorded in its databases; which is similar to what `RHash` can do, but `hoardy` is faster on large databases of file records, can verify file metadata, and is slightly more convenient to use; on the other hand, at the moment, `hoardy` only computes and checks `SHA256` hash digests and nothing else.
See the "Alternatives" section for more info.
This document mentions data loss, and situations when it could occur, repeatedly. I realize that this may turn some people off. Unfortunately, the reality is that with modern computing it's quite easy to screw things up. If a tool can delete or overwrite data, it can lose data. Hence, make backups!
With that said, hoardy tries its very best to make situations where it causes data loss impossible by doing a ton of paranoid checks before doing anything destructive.
Unfortunately, the set of situations where it could lose some data even after doing all those checks is not empty.
Which is why "Quirks and Bugs" section documents all of those situations known to me.
(So... Make backups!)
Meanwhile, "Frequently Asked Questions", among other things, documents various cases that are handled safely.
Most of those are quite non-obvious and not recognized by other tools, which will lose your data where `hoardy` would not.
As far as I know, hoardy is actually the safest tool for doing what it does, but this document mentions data loss repeatedly, while other tools prefer to be quiet about it.
I've read the sources of hoardy's alternatives to make those comparisons there, and to figure out if I maybe should change how hoardy does some things, and I became much happier with hoardy's internals as a result.
Just saying.
Also, should I ever find an issue in `hoardy` that causes loss of data, I commit to fixing it and honestly documenting it all immediately, and then adding new tests to the test suite to prevent such issues in the future.
This promise can be corroborated by the fact that I have done exactly that before for the `hoardy-web` tool; see its `tool-v0.18.1` release.
- An `inode` is a physical, unnamed file. Directories reference inodes, giving them names. Different directories, or different names in the same directory, can refer to the same inode, making that file available under different names. Editing such a file under one name will change its content under all the other names too.
- `nlinks` is the number of times an inode is referenced by all the directories on a filesystem.

See `man 7 inode` for more info.
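To see these concepts in action from a shell, you can run something like the following (a sketch using GNU coreutils; the file names are arbitrary):

```
echo hello > a
ln a b                              # "b" is now another name for the same inode as "a"
stat -c 'ino=%i nlinks=%h %n' a b   # same inode number, nlinks == 2 for both names
echo world >> a
cat b                               # shows the appended line too
```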
- Install Python 3:
  - On a conventional POSIX system like most GNU/Linux distros and MacOS X: install `python3` via your package manager. Realistically, it probably is installed already.
- On a POSIX system: open a terminal, install this with
  ```
  pip install hoardy
  ```
  and run as
  ```
  hoardy --help
  ```
- Alternatively, for light development (without development tools; for those, see `nix-shell` below): open a terminal/`cmd.exe`, `cd` into this directory, then install with
  ```
  python -m pip install -e .
  # or
  pip install -e .
  ```
  and run as
  ```
  python -m hoardy --help
  # or
  hoardy --help
  ```
- Alternatively, on a system with the Nix package manager:
  ```
  nix-env -i -f ./default.nix
  hoardy --help
  ```
  Though, in this case, you'll probably want to run the first command from the parent directory, to install everything all at once.
- Alternatively, to replicate my development environment:
  ```
  nix-shell ./default.nix --arg developer true
  ```
So, as the simplest use case, deduplicate your ~/Downloads directory.
Index your ~/Downloads directory:
```
hoardy index ~/Downloads
```

Look at the list of duplicated files there:

```
hoardy find-dupes ~/Downloads
```

Deduplicate them by hardlinking each duplicate file to its oldest available duplicate version, i.e. make all paths pointing to duplicate files point to the oldest available inode among those duplicates:

```
hoardy deduplicate --hardlink ~/Downloads
# or, equivalently
hoardy deduplicate ~/Downloads
```

The following should now produce an empty output:

```
hoardy find-dupes ~/Downloads
```

If it does not (which is unlikely for `~/Downloads`), then some duplicates have different metadata (permissions, owner, group, extended attributes, etc), which will be discussed below.

By default, both `deduplicate --hardlink` and `find-dupes` run with an implied `--min-inodes 2` option. Thus, to see paths that point to the same inodes on disk you'll need to run the following instead:

```
hoardy find-dupes --min-inodes 1 ~/Downloads
```

To delete all but the oldest file among duplicates in a given directory, run

```
hoardy deduplicate --delete ~/Downloads
```

in which case `--min-inodes 1` is implied by default.

That result, of course, could have been achieved by running this last command directly, without doing any of the above except for `index`.

Personally, I have

```
hoardy index ~/Downloads && hoardy deduplicate --delete ~/Downloads
```

scheduled in my daily crontab, because I frequently re-download files from local servers while developing things (for testing).
Normally, you probably don't need to run it that often.
Assuming you have a bunch of directories that were produced by something like

```
rsync -aHAXivRyy --link-dest=/backup/yesterday /home /backup/today
```

you can deduplicate them by running

```
hoardy index /backup
hoardy deduplicate /backup
```
(Which will probably take a while.)
Doing this will deduplicate everything by hardlinking each duplicate file to an inode with the oldest mtime while respecting and preserving all file permissions, owners, groups, and user extended attributes.
If you run it as super-user it will also respect all other extended name-spaces, like ACLs, trusted extended attributes, etc.
See man 7 xattr for more info.
But, depending on your setup and wishes, the above might not be what you'd want to run. For instance, personally, I run

```
hoardy index /backup
hoardy deduplicate --reverse --ignore-meta /backup
```

instead.
Doing this hardlinks each duplicate file to an inode with the latest `mtime` (`--reverse`) and ignores all file metadata (but not extended attributes), so that the next

```
rsync -aHAXivRyy --link-dest=/backup/today /home /backup/tomorrow
```

could re-use those inodes via `--link-dest` as much as possible again.
Without those options the next rsync --link-dest would instead re-create many of those inodes again, which is not what I want, but your mileage may vary.
Also, even with `--reverse`, the original `mtime` of each path will be kept in `hoardy`'s database so that it can be restored later.
(Which is pretty cool, right?)
Also, if you have so many files under /backup that deduplicate does not fit into RAM, you can still run it incrementally (while producing the same deduplicated result) via sharding by SHA256 hash digest.
See examples for more info.
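As a hedged sketch of what such a sharded run could look like, using the `--shard NUM/SHARDS` syntax documented in the "Usage" section below (`/backup` and the number of shards are just examples):

```
hoardy index /backup
# process one quarter of all duplicate groups per run;
# each run only needs to fit about a quarter of the metadata into RAM
hoardy deduplicate --shard 1/4 /backup
hoardy deduplicate --shard 2/4 /backup
hoardy deduplicate --shard 3/4 /backup
hoardy deduplicate --shard 4/4 /backup
```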
Note however, that simply running hoardy deduplicate on your whole $HOME directory will probably break almost everything, as many programs depend on file timestamps not moving backwards, use zero-length or similarly short files for various things, overwrite files without copying them first, and expect them to stay as independent inodes.
Hardlinking different same-data files together on a non-backup filesystem will break all those assumptions.
(If you do screw it up, you can fix it by simply doing cp -a file file.copy ; mv file.copy file for each wrongly deduplicated file.)
However, sometimes deduplicating some files under $HOME can be quite useful, so hoardy implements a fairly safe way to do it semi-automatically.
Index your home directory and generate a list of all duplicated files, matched strictly, like `deduplicate` would do:

```
hoardy index ~
hoardy find-dupes --print0 --match-meta ~ > dupes.print0
```

`--print0` is needed here because otherwise file names with newlines and/or weird symbols in them could be parsed as multiple separate paths and/or get mangled.
By default, without `--print0`, `hoardy` solves this by escaping control characters in its outputs, and, in theory, it could then read back its own outputs in that format.
But normal UNIX tools won't be able to use them, hence `--print0`, which is almost universally supported.
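For example, any NUL-aware tool can consume that file directly; a small sketch (the `dupes.print0` name comes from the command above):

```
# count how many duplicated paths were found
tr -cd '\0' < dupes.print0 | wc -c
# or inspect them with any command that accepts file names as arguments
xargs -0 ls -l < dupes.print0 | less
```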
You can then easily view the resulting file from a terminal with

```
cat dupes.print0 | tr '\0' '\n' | less
```

which, if none of the paths have control symbols in them, will be equivalent to the output of

```
hoardy find-dupes --match-meta ~ | less
```

But you can now use `grep` or another similar tool to filter those outputs.
Say, for example, you want to deduplicate `git` objects across different repositories:

```
grep -zP '/\.git/objects/([0-9a-f]{2}|pack)/' dupes.print0 > git-objects.print0
cat git-objects.print0 | tr '\0' '\n' | less
```

These files are never modified, and so they can be hardlinked together.
In fact, `git` does this silently when it notices, so you might not get a lot of duplicates there, especially if you mostly clone local repositories from each other.
But if you have several related repositories cloned from external sources at `$HOME`, the above output, most likely, will not be empty.

So, you can now pretend to deduplicate all of those files:

```
hoardy deduplicate --dry-run --stdin0 < git-objects.print0
```

and then actually do it:

```
hoardy deduplicate --stdin0 < git-objects.print0
```

Ta-da! More disk space! For free!
Of course, the above probably won't have deduplicated much.
However, if you use `npm` a lot, then your filesystem is probably chock-full of `node_modules` directories full of files that can be deduplicated.
In fact, the `pnpm` tool does this automatically when installing new stuff, but it won't help with previously installed stuff.
Whereas `hoardy` can help:

```
grep -zF '/node_modules/' dupes.print0 > node_modules.print0
cat node_modules.print0 | tr '\0' '\n' | less
hoardy deduplicate --stdin0 < node_modules.print0
```

Doing this could save quite a bit of space, since `nodejs` packages tend to duplicate everything dozens of times.
You can also deduplicate files between related source code checkouts, and then duplicate them on-demand while editing.
Personally, I use git worktrees a lot.
That is, usually, I clone a repo, make a feature branch, check it out into a separate worktree, and work on it there:

```
git clone --origin upstream url/to/repo repo
cd repo
git branch feature-branch
git worktree add feature feature-branch
cd feature
# now working on feature-branch
# ....
```

Meanwhile, in another TTY, I check out successive testable revisions and test them in a separate `nix-shell` session:

```
cd ~/src/repo
hash=$(cd ~/src/repo/feature; git rev-parse HEAD)
git worktree add testing $hash
cd testing
nix-shell ./default.nix
# run long-running tests here

# when feature-branch has been updated a lot
hash=$(cd ~/src/repo/feature; git rev-parse HEAD)
git checkout $hash
# again, run long-running tests here
```

This allows me to continue working on feature-branch without interruptions while the tests are being run on a frozen worktree, which eliminates a whole class of testing errors.
With a bit of conscientiousness, it also allows me to compare feature-branch to the latest revision that passed all the tests very easily.
Now, this workflow costs almost nothing for small projects, but for Nixpkgs, Firefox, or the Linux Kernel each worktree checkout takes quite a bit of space. If you have dozens of feature-branches, then space usage can be quite horrifying.
But hoardy and Emacs can help!
Emacs with break-hardlink-on-save variable set to t (M-x customize-variable break-hardlink-on-save) will always re-create and then rename files when writing buffers to disk, always breaking hardlinks.
I.e., with it enabled, Emacs won't be overwriting any files in-place, ever.
This has safety advantages, so that, e.g., a power loss won't lose your data even if your Emacs happened to be writing out a huge org-mode file to disk at that moment.
Which is nice.
But enabling that option also allows you to simply hoardy deduplicate all source files on your filesystem without care.
That is, I have the above variable set in my Emacs config, I run

```
hoardy index ~/src/nixpkgs/* ~/src/firefox/* ~/src/linux/*
hoardy deduplicate ~/src/nixpkgs/* ~/src/firefox/* ~/src/linux/*
```

periodically, and let my Emacs duplicate the files I actually touch, on-demand.
For Vim, the docs say, the following setting in `.vimrc` should produce the same effect:

```
set backupcopy=no,breakhardlink
```

but I tried it, and it does not work.

(You can try it yourself:

```
cd /tmp
echo test > test-file
ln test-file test-file2
vim test-file2
# edit it
# :wq
ls -l test-file test-file2
```

The files should be different, but on my system they stay hardlinked.)
- `hoardy` databases take up quite a bit of space.

  This will be fixed with database format `v4`, which will store file trees instead of plain file tables indexed by paths.

- When a previously indexed file or directory can't be accessed due to file modes/permissions, `hoardy index` will remove it from the database.

  This is a design issue with the current scanning algorithm, which will be solved after database format `v4`. At the moment, it can be alleviated by running `hoardy index` with the `--no-remove` option.

- By default, `hoardy index` requires its input files to live on a filesystem which either has persistent inode numbers or reports all inode numbers as zeros.

  I.e., by default, indexing files from a filesystem like `unionfs` or `sshfs`, which use dynamic inode numbers, will produce broken index records. Filesystems like that can still be indexed with the `--no-ino` option set, but there's no auto-detection for this option at the moment. Though, brokenly indexed trees can be fixed by simply re-indexing with `--no-ino` set (see the sketch after this list).

- When `hoardy` is running, mounting a new filesystem into a directory given as one of its `INPUT`s could break some things in unpredictable ways, making `hoardy` report random files as having broken metadata.

  No data loss should occur in this case while `deduplicate` is running, but the outputs of `find-duplicates` could become useless.

- Files changing at inconvenient times while `hoardy` is running could make it lose either the old or the updated version of each such file.

  Consider this:

  - `hoardy deduplicate` (`--hardlink` or `--delete`) discovers `source` and `target` files to be potential duplicates,
  - checks `source` and `target` files to have equal contents,
  - checks their file metadata, which match its database state,
  - "Okay!", it thinks, "Let's deduplicate them!",
  - but the OS puts `hoardy` to sleep doing its multi-tasking thing,
  - another program sneaks in and sneakily updates `source` or `target`,
  - the OS wakes `hoardy` up,
  - `hoardy` proceeds to deduplicate them, losing one of them.

  `hoardy` calls `lstat` just before each file is `--hardlink`ed or `--delete`d, so this situation is quite unlikely and will be detected with very high probability, but it's not impossible. If it does happen, `hoardy` running with default settings will lose the updated version of the file, unless the `--reverse` option is set, in which case it will lose the oldest one instead.

  I know of no good solution to fix this. As far as I know, all alternatives suffer from the same issue. Technically, on Linux, there's a partial workaround for this via the `renameat2` syscall with the `RENAME_EXCHANGE` flag, which is unused by both `hoardy` and all similar tools at the moment, AFAICS. On Windows, AFAIK, there's no way around this issue at all.

  Thus, you should not `deduplicate` directories with files that change.
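As a concrete illustration of the `--no-ino` workaround mentioned above, a hedged sketch assuming an `sshfs` mount at an example path:

```
# index (or fix a previously broken index of) a filesystem with dynamic inode numbers,
# treating each path as its own inode
hoardy index --no-ino /mnt/remote-sshfs
```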
- `hoardy find-dupes` usually produces the same results as `jdupes --recurse --zeromatch --order time`.
- `hoardy deduplicate --hardlink` is a replacement for `jdupes --recurse --zeromatch --permissions --order time --linkhard --noprompt`.
- `hoardy deduplicate --delete` is a replacement for `jdupes --recurse --zeromatch --permissions --order time --hardlinks --delete --noprompt`.
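For instance, a hedged sketch of such a migration, assuming your files live under an example `~/data` directory:

```
# what you might have been running with jdupes:
#   jdupes --recurse --zeromatch --permissions --order time --linkhard --noprompt ~/data
# the hoardy equivalent: index once, then deduplicate
hoardy index ~/data
hoardy deduplicate --hardlink ~/data
```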
By default, files must match in everything but timestamps for hoardy deduplicate to consider them to be duplicates.
In comparison, hoardy find-duplicates considers everything with equal SHA256 hash digest and sizes to be duplicates instead.
It works this way because hoardy find-duplicates is designed to inform you of all the potential things you could deduplicate while hoardy deduplicate is designed to preserve all metadata by default (hoardy deduplicate --hardlink also preserves the original file mtime in the database, so it can be restored later).
If things like file permissions, owners, and groups are not relevant to you, you can run

```
hoardy deduplicate --ignore-meta path/to/file1 path/to/file2
```

to deduplicate files that mismatch in those metadata fields.
(If you want to control this more precisely, see deduplicate's options.)
If even that does not deduplicate your files, and they really are equal as binary strings, then their extended file attributes must differ. At the moment, if you are feeling paranoid, you will need to manually do something like

```
# dump them all
getfattr --match '.*' --dump path/to/file1 path/to/file2 > attrs.txt
# edit the result so that the records of both files match
$EDITOR attrs.txt
# write them back
setfattr --restore=attrs.txt
```
after which hoardy deduplicate --ignore-meta would deduplicate them (if they are indeed duplicates).
(Auto-merging of extended attributes, when possible, is on the "TODO" list.)
What would happen if I run `hoardy deduplicate` with an outdated index? Would `hoardy` lose some of my files by wrongly "deduplicating" them?
No, it would not.
hoardy checks that each soon-to-be deduplicated file from its index matches its filesystem counterpart, printing an error and skipping that file and all its apparent duplicates if not.
I have two files with equal SHA256 hash digests and sizes, and yet they are unequal when compared as binary strings. Would hoardy "deduplicate" them wrongly?
No, it would not.
hoardy checks that source and target inodes have equal data contents before hardlinking them.
What would happen if I run hoardy deduplicate --delete with the same directory given in two different arguments? Would it consider those files to be equivalent to themselves and delete them, losing all my data?
Nope, hoardy will notice the same path being processed twice and ignore the second occurrence, printing a warning.
But what if I give the same directory to `hoardy deduplicate --delete` twice, not as equivalent paths, but by giving one of them as a symlink into an ancestor of the other, followed by their common suffix? Will it lose my data now?
Nope, hoardy will detect this too by resolving all of its inputs first.
Alright, but what if I `mount --bind` a directory to another directory, then `hoardy index` and run `hoardy deduplicate --delete` on both? The cloned directory will appear to be exactly the same as the original directory, but the paths will be different, and there will be no symlinks involved. So `hoardy deduplicate --delete` would then detect them as duplicates and would need to delete all files from one of them. But deleting a file from one will also delete it from the other! Ha! Finally! Surely, it would lose my data now?!
Nope, hoardy will detect this and skip all such files too.
Before acting, `hoardy deduplicate` checks that, if `source` and `target` point to the same file on the same device, then its `nlinks` is not 1.
If both source and target point to the same last copy of a file, it will not be acted upon.
Note that hoardy does this check not only in --delete mode, but also in --hardlink mode, since re-linking them will simply produce useless link+rename churn and disk IO.
Actually, if you think about it, this check catches all other possible issues of the "removing the last copy of a file when we should not" kind, so all other similar "What if" questions can be answered by "in the worst case, it will be caught by that magic check and at least one copy of the file will persist". And that's the end of that.
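As a rough illustration of that check (a sketch using GNU coreutils; the paths are hypothetical):

```
# suppose /mnt/clone is a mount --bind of /data
stat -c 'dev=%d ino=%i nlinks=%h %n' /data/file /mnt/clone/file
# same device, same inode, and nlinks == 1:
# both paths name the last copy of this file, so hoardy refuses to touch the pair
```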
As far as I know, hoardy is the only tool in existence that handles this properly.
Probably because I'm rare in that I like using mount --binds at $HOME.
(They are useful in places where you'd normally want to hardlink directories, but can't because POSIX disallows it.
For instance, vendor/kisstdlib directory here is a mount --bind on my system, so that I could ensure all my projects work with its latest version without fiddling with git.)
And so I want hoardy to work even while they are all mounted.
Hmm, but `hoardy deduplicate`'s implementation looks rather complex. What if a bug there causes it to "deduplicate" some files that are not actually duplicates and lose data?
Firstly, a healthy habit to have is to simply not trust any one tool to not lose your data: make a backup (including of your backups) before running `hoardy deduplicate` for the first time.
(E.g., if you are feeling very paranoid, you can run rsync -aHAXiv --link-dest=source source copy to make a hardlink-copy or cp -a --reflink=always source copy to make a reflink-copy first.
On a modern filesystem these cost very little.
And you can later remove them to save the space used by inodes, e.g., after you've run `hoardy verify` and confirmed that nothing is broken.)
Secondly, I'm pretty sure it works fine as hoardy has quite a comprehensive test suite for this and is rather well-tested on my backups.
Thirdly, the actual body of hoardy deduplicate is written in a rather paranoid way re-verifying all assumptions before attempting to do anything.
Fourthly, by default, hoardy deduplicate runs with --paranoid option enabled, which checks that source and target have equal contents before doing anything to a pair of supposedly duplicate files, and emits errors if they are not.
This could be awfully inefficient, true, but in practice it usually does not matter as on a reasonably powerful machine with those files living on an HDD the resulting content re-checks get eaten by IO latency anyway.
Meanwhile, --paranoid prevents data loss even if the rest of the code is completely broken.
With `--no-paranoid` it still checks file content equality, but once per new inode, not for each pair of paths.
Eventually --no-paranoid will probably become the default (when I stop editing all that code and fearing I would accidentally break something).
Which, by the way, is the reason why hoardy deduplicate looks rather complex.
All those checks are not free.
So, since I'm using this tool extensively myself on my backups, which I very much don't want to later restore from their cold backups, I'm pretty paranoid about ensuring it does not lose any data. It should be fine.
That is, I've been using hoardy to deduplicate files inside my backup directories, which contain billions of files spanning decades, since at least 2020.
So far, for me, bugs in hoardy caused zero data loss.
Originally, I made hoardy as a replacement for its alternatives so that I could:
- Find files by hash, because I wanted to easily open content-addressed links in my org-mode files.
- Efficiently deduplicate files between different backups produced by `rsync`/`rsnapshot`:
  ```
  rsync -aHAXivRyy --link-dest=/backup/yesterday /home /backup/today
  ```
  since `rsync` does not handle file movements and renames very well, even with repeated `--fuzzy`/`-y` (see its `man` page for more info).
- Efficiently deduplicate per-app backups produced by `hoardy-adb`:
  ```
  hoardy-adb split backup.ab
  ```
- Efficiently deduplicate files between all of the above and `.git/objects` of related repositories, `.git/annex/objects` produced by `git-annex`, `.local/share/hydrus/files` produced by `hydrus`, and similar, in cases where they all live on the same filesystem.

  The issue here is that `git-annex`, `hydrus`, and similar tools copy files into their object stores, even when the files you feed them are read-only and can be hardlinked instead. Which, usually, is a good thing, preventing catastrophic consequences of user errors. But I never edit read-only files, I do backups of backups, and, in general, I know what I'm doing, thank you very much, so I'd like to save my disk space instead, please.
"But ZFS/BTRFS solves this!" I hear you say?
Well, sure, such filesystems can deduplicate data blocks between different files (though, usually, you have to make a special effort to achieve this, as, by default, they do not), but how much space gets wasted to store the inodes?
Let's be generous and say an average inode takes 256 bytes (on modern filesystems it's usually 512 bytes or more, which, by the way, is usually a good thing, since it allows small files to be stored much more efficiently by inlining them into the inode itself, but this is awful for efficient storage of backups).
My home directory has ~10M files in it (most of those are emails and files in source repositories, and this is the minimum I use all the time, I have a bunch more stuff on external drives, but it does not fit onto my SSD), thus a year of naively taken daily rsync-backups would waste (256 * 10**7 * 365) / (1024 ** 3) = 870.22 GiB in inodes alone.
Sure, rsync --link-dest will save a bunch of that space, but if you move a bunch of files, they'll get duplicated.
In practice, the last time I deduplicated a never-before touched pristine rsnapshot hierarchy containing backups of my $HOME it saved me 1.1 TiB of space.
Don't you think you would find a better use for 1.1TiB of additional space than storing useless inodes?
Well, I did.
"But fdupes and its forks solve this!" I hear you say?
Well, sure, but the experience of using them in the above use cases of deduplicating mostly-read-only files is quite miserable.
See the "Alternatives" section for discussion.
Also, I wanted to store the oldest known mtime for each individual path, even when deduplicate-hardlinking all the copies, so that the exact original filesystem tree could be re-created from the backup when needed.
AFAIK, hoardy is the only tool that does this.
Yes, this feature is somewhat less useful on modern filesystems which support reflinks (Copy-on-Write lightweight copies), but even there, a reflink takes a whole inode, while storing an mtime in a database takes <= 8 bytes.
Also, in general, indexing, search, duplicate discovery, set operations, send-receive from remote nodes, and application-defined storage APIs (like HTTP/WebDAV/FUSE/SFTP), can be combined to produce many useful functions.
It's annoying that there appears to be no tool that can do all of those things on top of a plain file hierarchy.
All such tools known to me first slurp all the files into their own object stores, and usually store those files rather less efficiently than I would prefer.
See the "Wishlist" for more info.
This version of hoardy is a minimal valuable version of my privately developed tool (referred to as "bootstrap version" in commit messages), taken at its version circa 2020, cleaned up, rebased on top of kisstdlib, slightly polished, and documented for public display and consumption.
The private version has more features and uses a much more space-efficient database format, but most of those cool new features are unfinished and kind of buggy, so I was actually mostly using the naive-database-formatted bootstrap version in production.
So, I decided to finish generalizing the infrastructure stuff to kisstdlib first, chop away everything related to v4 on-disk format and later, and then publish this part first.
(Which still took me two months of work. Ridiculous!)
The rest is currently a work in progress.
If you'd like all those planned features from the "TODO" list and the "Wishlist" to be implemented, sponsor them. I suck at multi-tasking and I need to eat; time spent procuring sustenance money takes away huge chunks of time I could be spending working on this and other related projects.
fdupes is the original file deduplication tool.
It walks given input directories, hashes all files, groups them into potential duplicate groups, then compares the files in each group as binary strings, and then deduplicates the ones that match.
jdupes is a fork of fdupes that does duplicate discovery more efficiently by hashing as little as possible, which works really well on an SSD or when your files contain very small number of duplicates.
But in other situations, like with a file hierarchy with tons of duplicated files living on an HDD, it works quite miserably, since it generates a lot of disk seeks by doing file comparisons incrementally.
Meanwhile, since the fork, fdupes added hashing into an SQLite database, similar to what hoardy does.
Comparing hoardy, fdupes, and jdupes I notice the following:
- `hoardy` will not lose your data.

  `hoardy` will refuse to delete the last known copy of a file; it always checks that at least one copy of the content data of each file it processes will still be available after it finishes doing whatever it's doing.

  `fdupes` and `jdupes` will happily delete everything if you ask, and it's quite easy to ask accidentally, literally a single key press. Also, they will happily delete your data in some of the situations discussed in "Frequently Asked Questions", even if you don't ask.

  Yes, usually, they work fine, but I recall restoring data from backups multiple times after using them.

- Unlike with `jdupes`, filesystem changes done by `hoardy deduplicate` are atomic with respect to power being lost.

  `hoardy` implements `--hardlink` by `link`ing `source` to a temp file near `target`, and then `rename`ing it to the `target`, which, on a journaled filesystem, is atomic. Thus, after a power loss, either the `source` or the `target` will be in place of `target`. (See the sketch after this list.)

  `jdupes` renames the `target` file to a temp name, does `link source target`, and then `rm`s the temp instead. This is not atomic. Also, it probably does this to improve safety, but it does not actually help, since if the `target` is open by another process, that process can still write into there after the `rename` anyway.

  `fdupes` does not implement `--hardlink` at all.

- `hoardy` is aware of extended file attributes and won't ignore or lose them, unless you specifically ask.

  Meanwhile, both `fdupes` and `jdupes` ignore and then usually lose them when deduplicating.

- `jdupes` re-starts from zero if it gets interrupted, while `fdupes` and `hoardy` keep most of the progress on interrupt.

  `jdupes` has a `--softabort` option which helps with this issue somewhat, but it won't help if your machine crashes or loses power in the middle.

  `fdupes` lacks hardlinking support, and `jdupes` takes literally months of wall time to finish on my backups, even with files smaller than 1 MiB excluded, so both tools are essentially unusable for my use case.

  But if you have a smallish bunch of files sitting on an SSD, like a million or less, and you want to deduplicate them once and then never again, like if you are a computer service technician or something, then `jdupes` is probably the best solution.

  Meanwhile, both `fdupes` and `hoardy index` index all files into a database once, which does take quite a bit of time, but for billion-file hierarchies it takes days, not months, since all those files get accessed linearly. And that process can be interrupted at any time, including with a power loss, without losing most of the progress.

- Both `fdupes` and `hoardy` can apply incremental updates to already-`index`ed hierarchies, which take little time to re-index, assuming file sizes and/or `mtime`s change as they should.

  Except, `hoardy` allows you to optionally tweak its `index` algorithm to save a bunch of disk accesses when run on file hierarchies where files only ever get added or removed, but their contents never change, which is common with backup directories; see `hoardy index --no-update`.

  Meanwhile, `fdupes` does not support this latter feature and `jdupes` does not support database indexes at all.

- `hoardy` can both dump the outputs of `find-dupes --print0` and load them back with `deduplicate --stdin0`, allowing you to easily filter the files it would deduplicate.

  With a small number of files you can run `xargs -0 fdupes`, `xargs -0 jdupes`, or some such, but for large numbers it won't work. The number of inputs you can feed into `hoardy` is limited by your RAM, not by the OS's command-line argument list size limit.

  Neither `fdupes` nor `jdupes` can do this.

- `hoardy deduplicate` can shard its inputs, allowing it to work with piles of files so large that even their metadata alone does not fit into RAM.

  Or, you can use that feature to run `deduplicate` on duplicate-disjoint, self-contained chunks of its database, i.e. "deduplicate about 1/5 of all duplicates, please, taking slightly more than 1/5 of the time of the whole thing", without degrading the quality of the results.

  I.e., with `fdupes` and `jdupes` you can shard by running them on subsets of your inputs. But then, files shared by different inputs won't be deduplicated between them. In contrast, `hoardy` can do sharding by `SHA256`, which will result in everything being properly deduplicated. See examples below.

  Neither `fdupes` nor `jdupes` can do this.

- Both `fdupes` and `hoardy` are faster than `jdupes` on large inputs, especially on HDDs.

  Both `fdupes` and `hoardy deduplicate` use indexed hashes to find pretty good approximate sets of potential duplicates very quickly on large inputs and walk the filesystem mostly linearly, which greatly improves performance on an HDD.

  In practice, I have not yet managed to become patient enough for `jdupes` to finish deduplicating my whole backup directory even once, and I once left it running for two months. Meanwhile, on my backups, `hoardy index` takes a couple of days, while `hoardy deduplicate` takes a couple of weeks of wall time, which can easily be done incrementally with sharding, see examples. `fdupes` does not support hardlinking, and I'm not motivated enough to copy my whole backup hierarchy and run it there, comparing its outputs to `hoardy deduplicate --delete`.

- Also, with both `fdupes` and `hoardy`, re-deduplication will skip re-doing most of the work.

- `hoardy deduplicate` is very good at RAM usage.

  It uses the database to allow a much larger working set to fit into RAM, since it can unload file metadata from RAM and re-load it later from the database again at any moment.

  Also, it pre-computes hash usage counts and then uses them to report progress and evict finished duplicate groups from memory as soon as possible. So, in practice, on very large inputs, it will first eat a ton of memory (which, if it's an issue, can be solved by sharding), but then it will rapidly process and discard duplicate candidate groups, making all that memory available to other programs again rather quickly.

  Meaning, you can feed it a ton of whole-system backups made with `rsync` spanning decades, and it will work, and it will deduplicate them using reasonable amounts of time and memory.

  `fdupes` has the `--immediate` option which performs somewhat similarly, but at the cost of losing all control over which files get deleted. `hoardy` is good by default, without compromises. `jdupes` can't do this at all.

- Unlike `fdupes` and `jdupes`, `hoardy find-dupes` reports same-hash+length files as duplicates even if they do not match as binary strings, which might not be what you want.

  Doing this allows `hoardy find-dupes` to compute potential duplicates without touching the indexed file hierarchies at all (when running with its default settings), improving performance greatly.

  On non-malicious files of sufficient size, the default `SHA256` hash function makes hash collisions highly improbable, so it's not really an issue, IMHO, but `fdupes` and `jdupes` are technically better at this. `hoardy deduplicate` does check file equality properly before doing anything destructive, similar to `fdupes`/`jdupes`, so hash collisions will not lose your data, but `hoardy find-dupes` will still list such files as duplicates.
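As an illustration of the atomicity difference described above, a sketch in shell pseudo-steps, not the literal syscalls either tool makes (`source`, `target`, and the temporary names are placeholders):

```
# hoardy-style --hardlink: target always names valid content
ln source target.hoardy-tmp    # make a new name for source's inode
mv -f target.hoardy-tmp target # atomically replace target on a journaled filesystem

# jdupes-style hardlinking: there is a window where target does not exist
mv target target.tmp           # target is now missing
ln source target               # if power is lost just before this, target is gone
rm target.tmp
```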
In short, `hoardy` implements almost a union of the features of both `fdupes` and `jdupes`, with some more useful features on top and some little bits missing here and there; it is also significantly safer to use than either of the other two.
RHash is "recursive hasher".
Basically, you give it a list of directories, it outputs `<hash digest> <path>` lines (or similar; it's configurable), and then, later, you can verify files against a file consisting of such lines.
It also has some nice features, like hashing with many hashes simultaneously, skipping of already-hashed files present in the output file, etc.
Practically speaking, its usage is very similar to `hoardy index` followed by `hoardy verify`, except:

- `RHash` can compute way more hash functions than `hoardy` (at the moment, `hoardy` only ever computes `SHA256`);
- for large indexed file hierarchies, `hoardy` is much faster at updating its indexes, since, unlike the plain-text files generated by `rhash`, `SQLite` databases can be modified easily and incrementally; also, all the similar `index`ing advantages from the previous subsection apply;
- `hoardy verify` can verify both hashes and file metadata;
- `hoardy`'s CLI is more convenient than `RHash`'s CLI, IMHO.
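For a rough feel of the correspondence, a hedged sketch (assuming `rhash`'s usual `--sha256`, `-r`, and `-c` flags; `/backup` is an example path):

```
# RHash-style workflow
rhash --sha256 -r /backup > backup.sha256
rhash -c backup.sha256

# roughly equivalent hoardy workflow
hoardy index /backup
hoardy verify /backup
```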
Many years before hoardy was born, I was using RHash quite extensively (and I remember the original forum it was discussed/developed at, yes).
See CHANGELOG.md.
See above, also the bottom of CHANGELOG.md.
LGPLv3+ (because it will become a library, eventually).
Contributions are accepted both via GitHub issues and PRs, and via pure email.
In the latter case I expect to see patches formatted with git-format-patch.
If you want to perform a major change and you want it to be accepted upstream here, you should probably write me an email or open an issue on GitHub first. In the cover letter, describe what you want to change and why. I might also have a bunch of code doing most of what you want in my stash of unpublished patches already.
A thingy for hoarding digital assets.
- options:
  - `--version`: show program's version number and exit
  - `-h, --help`: show this help message and exit
  - `--markdown`: show `--help` formatted in Markdown
  - `-d DATABASE, --database DATABASE`: database file to use; default: `~/.local/share/hoardy/index.db` on POSIX, `%LOCALAPPDATA%\hoardy\index.db` on Windows
  - `--dry-run`: perform a trial run without actually performing any changes
- output defaults:
  - `--color`: set defaults to `--color-stdout` and `--color-stderr`
  - `--no-color`: set defaults to `--no-color-stdout` and `--no-color-stderr`
- output:
  - `--color-stdout`: color `stdout` output using ANSI escape sequences; default when `stdout` is connected to a TTY and environment variables do not set `NO_COLOR=1`
  - `--no-color-stdout`: produce plain-text `stdout` output without any ANSI escape sequences
  - `--color-stderr`: color `stderr` output using ANSI escape sequences; default when `stderr` is connected to a TTY and environment variables do not set `NO_COLOR=1`
  - `--no-color-stderr`: produce plain-text `stderr` output without any ANSI escape sequences
  - `--progress`: report progress to `stderr`; default when `stderr` is connected to a TTY
  - `--no-progress`: do not report progress
- filters:
  - `--size-leq INT`: `size <= value`
  - `--size-geq INT`: `size >= value`
  - `--sha256-leq HEX`: `sha256 <= from_hex(value)`
  - `--sha256-geq HEX`: `sha256 >= from_hex(value)`
- subcommands `{index,find,find-duplicates,find-dupes,deduplicate,verify,fsck,upgrade}`:
  - `index`: index given filesystem trees and record results in a `DATABASE`
  - `find`: print paths of indexed files matching specified criteria
  - `find-duplicates (find-dupes)`: print groups of duplicated indexed files matching specified criteria
  - `deduplicate`: produce groups of duplicated indexed files matching specified criteria, and then deduplicate them
  - `verify (fsck)`: verify that the index matches the filesystem
  - `upgrade`: backup the `DATABASE` and then upgrade it to the latest format
Recursively walk given `INPUT`s and update the `DATABASE` to reflect them.

- For each `INPUT`, walk it recursively (both in the filesystem and in the `DATABASE`); for each walked `path`:
  - if it is present in the filesystem but not in the `DATABASE`:
    - if `--no-add` is set, do nothing,
    - otherwise, index it and add it to the `DATABASE`;
  - if it is not present in the filesystem but present in the `DATABASE`:
    - if `--no-remove` is set, do nothing,
    - otherwise, remove it from the `DATABASE`;
  - if it is present in both:
    - if `--no-update` is set, do nothing,
    - if `--verify` is set, verify it as if `hoardy verify $path` was run,
    - if `--checksum` is set or if file `type`, `size`, or `mtime` changed:
      - re-index the file and update the `DATABASE` record,
      - otherwise, do nothing.
- positional arguments:
  - `INPUT`: input files and/or directories to process
- options:
  - `-h, --help`: show this help message and exit
  - `--markdown`: show `--help` formatted in Markdown
  - `--stdin0`: read zero-terminated `INPUT`s from stdin; these will be processed after all `INPUT`s specified as command-line arguments
- output:
  - `-v, --verbose`: increase output verbosity; can be specified multiple times for progressively more verbose output
  - `-q, --quiet, --no-verbose`: decrease output verbosity; can be specified multiple times for progressively less verbose output
  - `-l, --lf-terminated`: print output lines terminated with `\n` (LF) newline characters; default
  - `-z, --zero-terminated, --print0`: print output lines terminated with `\0` (NUL) bytes; implies `--no-color` and zero verbosity
- content hashing:
  - `--checksum`: re-hash everything; i.e., assume that some files could have changed contents without changing `type`, `size`, or `mtime`
  - `--no-checksum`: skip hashing if file `type`, `size`, and `mtime` match the `DATABASE` record; default
- index how:
  - `--add`: for files present in the filesystem but not yet present in the `DATABASE`, index and add them to the `DATABASE`; note that new files will be hashed even if `--no-checksum` is set; default
  - `--no-add`: ignore previously unseen files
  - `--remove`: for files that vanished from the filesystem but are still present in the `DATABASE`, remove their records from the `DATABASE`; default
  - `--no-remove`: do not remove vanished files from the database
  - `--update`: for files present both on the filesystem and in the `DATABASE`, if a file appears to have changed on disk (changed `type`, `size`, or `mtime`), re-index it and write its updated record to the `DATABASE`; note that changed files will be re-hashed even if `--no-checksum` is set; default
  - `--no-update`: skip updates for all files that are present both on the filesystem and in the `DATABASE`
  - `--reindex`: an alias for `--update --checksum`: for all files present both on the filesystem and in the `DATABASE`, re-index them and then update the `DATABASE` records of files that actually changed; i.e. re-hash files even if they appear to be unchanged
  - `--verify`: proceed like `--update` does, but do not update any records in the `DATABASE`; instead, generate errors if newly generated records do not match those already in the `DATABASE`
  - `--reindex-verify`: an alias for `--verify --checksum`: proceed like `--reindex` does, but then `--verify` instead of updating the `DATABASE`
- record what:
  - `--ino`: record inode numbers reported by `stat` into the `DATABASE`; default
  - `--no-ino`: ignore inode numbers reported by `stat`, recording them all as `0`s; this will force `hoardy` to ignore inode numbers in metadata checks and process such files as if each path is its own inode when doing duplicate search; on most filesystems, the default `--ino` will do the right thing, but this option needs to be set explicitly when indexing files from a filesystem which uses dynamic inode numbers (`unionfs`, `sshfs`, etc); otherwise, files indexed from such filesystems will be updated on each re-`index`, and `find-duplicates`, `deduplicate`, and `verify` will always report them as having broken metadata
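To tie these options together, a hedged usage sketch (`/backup` is an example path):

```
# fast incremental re-scan of an append-only backup tree:
# pick up new files and drop vanished ones, but never re-read paths that are already indexed
hoardy index --no-update /backup
# paranoid pass: re-hash everything and report, without modifying the DATABASE,
# any records that no longer match the filesystem
hoardy index --verify --checksum /backup
```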
Print paths of files under `INPUT`s that match specified criteria.

- For each `INPUT`, walk it recursively (in the `DATABASE`); for each walked `path`:
  - if the `path` and/or the file associated with that path matches the specified filters, print the `path`;
  - otherwise, do nothing.
- positional arguments:
  - `INPUT`: input files and/or directories to process
- options:
  - `-h, --help`: show this help message and exit
  - `--markdown`: show `--help` formatted in Markdown
  - `--stdin0`: read zero-terminated `INPUT`s from stdin; these will be processed after all `INPUT`s specified as command-line arguments
  - `--porcelain`: print outputs in a machine-readable format
- output:
  - `-v, --verbose`: increase output verbosity; can be specified multiple times for progressively more verbose output
  - `-q, --quiet, --no-verbose`: decrease output verbosity; can be specified multiple times for progressively less verbose output
  - `-l, --lf-terminated`: print output lines terminated with `\n` (LF) newline characters; default
  - `-z, --zero-terminated, --print0`: print output lines terminated with `\0` (NUL) bytes; implies `--no-color` and zero verbosity
Print groups of paths of duplicated files under `INPUT`s that match specified criteria.

- For each `INPUT`, walk it recursively (in the `DATABASE`); for each walked `path`:
  - get its `group`, which is a concatenation of its `type`, `sha256` hash, and all metadata fields for which corresponding `--match-*` options are set; e.g., with `--match-perms --match-uid`, this produces a tuple of `type, sha256, mode, uid`;
  - get its `inode_id`, which is a tuple of `device_number, inode_number` for filesystems which report `inode_number`s, and a unique `int` otherwise;
  - record this `inode`'s metadata and `path` as belonging to this `inode_id`;
  - record this `inode_id` as belonging to this `group`.
- For each `group`, for each `inode_id` in `group`:
  - sort `path`s as `--order-paths` says,
  - sort `inode`s as `--order-inodes` says.
- For each `group`, for each `inode_id` in `group`, for each `path` associated to `inode_id`:
  - print the `path`.
Also, if you are reading the source code, note that the actual implementation of this command is a bit more complex than what is described above.
In reality, there's also a pre-computation step designed to filter out single-element groups very early, before most of the file metadata gets loaded into memory, thus allowing hoardy to process groups incrementally, report its progress more precisely, and fit more potential duplicates into RAM.
In particular, this allows hoardy to work on DATABASEs with hundreds of millions of indexed files on my 2013-era laptop.
With the default verbosity, this command simply prints all paths in the resulting sorted order.

With verbosity of 1 (a single `--verbose`), each `path` in a group gets prefixed by:

- `__`, if it is the first `path` associated to an `inode`, i.e. this means this `path` introduces a previously unseen `inode`,
- `=>`, otherwise, i.e. this means that this `path` is a hardlink to the path last marked with `__`.
With verbosity of 2, each group gets prefixed by a metadata line.

With verbosity of 3, each path gets prefixed by its associated `inode_id`.

With the default spacing of 1, a new line gets printed after each group.

With spacing of 2 (a single `--spaced`), a new line also gets printed after each `inode`.
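For instance, a single-`--verbose` output could look something like this (a made-up illustration; the paths are hypothetical):

```
__ /backup/today/home/user/file.bin
=> /backup/today/home/user/file-copy.bin
__ /backup/yesterday/home/user/file.bin
```

i.e. one duplicate group of two inodes, with the second path being a hardlink of the first.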
- positional arguments:
  - `INPUT`: input files and/or directories to process
- options:
  - `-h, --help`: show this help message and exit
  - `--markdown`: show `--help` formatted in Markdown
  - `--stdin0`: read zero-terminated `INPUT`s from stdin; these will be processed after all `INPUT`s specified as command-line arguments
- output:
  - `-v, --verbose`: increase output verbosity; can be specified multiple times for progressively more verbose output
  - `-q, --quiet, --no-verbose`: decrease output verbosity; can be specified multiple times for progressively less verbose output
  - `-l, --lf-terminated`: print output lines terminated with `\n` (LF) newline characters; default
  - `-z, --zero-terminated, --print0`: print output lines terminated with `\0` (NUL) bytes; implies `--no-color` and zero verbosity
  - `--spaced`: print more empty lines between different parts of the output; can be specified multiple times
  - `--no-spaced`: print fewer empty lines between different parts of the output; can be specified multiple times
- duplicate file grouping defaults:
  - `--match-meta`: set defaults to `--match-device --match-permissions --match-owner --match-group`
  - `--ignore-meta`: set defaults to `--ignore-device --ignore-permissions --ignore-owner --ignore-group`; default
  - `--match-extras`: set defaults to `--match-xattrs`
  - `--ignore-extras`: set defaults to `--ignore-xattrs`; default
  - `--match-times`: set defaults to `--match-last-modified`
  - `--ignore-times`: set defaults to `--ignore-last-modified`; default
- duplicate file grouping; consider same-content files to be duplicates when they...:
  - `--match-size`: ... have the same file size; default
  - `--ignore-size`: ... regardless of file size; only useful for debugging or discovering hash collisions
  - `--match-argno`: ... were produced by recursion from the same command-line argument (which is checked by comparing `INPUT` indexes in `argv`; if the path is produced by several different arguments, the smallest one is taken)
  - `--ignore-argno`: ... regardless of which `INPUT` they came from; default
  - `--match-device`: ... come from the same device/mountpoint/drive
  - `--ignore-device`: ... regardless of devices/mountpoints/drives; default
  - `--match-perms, --match-permissions`: ... have the same file modes/permissions
  - `--ignore-perms, --ignore-permissions`: ... regardless of file modes/permissions; default
  - `--match-owner, --match-uid`: ... have the same owner id
  - `--ignore-owner, --ignore-uid`: ... regardless of owner id; default
  - `--match-group, --match-gid`: ... have the same group id
  - `--ignore-group, --ignore-gid`: ... regardless of group id; default
  - `--match-last-modified, --match-mtime`: ... have the same `mtime`
  - `--ignore-last-modified, --ignore-mtime`: ... regardless of `mtime`; default
  - `--match-xattrs`: ... have the same extended file attributes
  - `--ignore-xattrs`: ... regardless of extended file attributes; default
- sharding:
  - `--shard FROM/TO/SHARDS|SHARDS|NUM/SHARDS`: split the database into a number of disjoint pieces (shards) and process a range of them:
    - with `FROM/TO/SHARDS` specified, split the database into `SHARDS` shards and then process those with numbers between `FROM` and `TO` (both inclusive, counting from `1`);
    - with the `SHARDS` syntax, interpret it as `1/SHARDS/SHARDS`, thus processing the whole database by splitting it into `SHARDS` pieces first;
    - with `NUM/SHARDS`, interpret it as `NUM/NUM/SHARDS`, thus processing a single shard `NUM` of `SHARDS`;
    - default: `1/1/1`, `1/1`, or just `1`, which processes the whole database as a single shard
- `--order-*` defaults:
  - `--order {mtime,argno,abspath,dirname,basename}`: set all `--order-*` option defaults to the given value, except that specifying `--order mtime` will set the default `--order-paths` to `argno` instead (since all of the paths belonging to the same `inode` have the same `mtime`); default: `mtime`
- order of elements in duplicate file groups:
  - `--order-paths {argno,abspath,dirname,basename}`: in each `inode` info record, order `path`s by:
    - `argno`: the corresponding `INPUT`'s index in `argv`; if a `path` is produced by several different arguments, the index of the first of them is used; default
    - `abspath`: absolute file path
    - `dirname`: absolute file path without its last component
    - `basename`: the last component of the absolute file path
  - `--order-inodes {mtime,argno,abspath,dirname,basename}`: in each duplicate file `group`, order `inode` info records by:
    - `argno`: same as `--order-paths argno`
    - `mtime`: file modification time; default
    - `abspath`: same as `--order-paths abspath`
    - `dirname`: same as `--order-paths dirname`
    - `basename`: same as `--order-paths basename`

    When an `inode` has several associated `path`s, sorting by `argno`, `abspath`, `dirname`, and `basename` is performed by taking the smallest of the respective values.

    For instance, a duplicate file `group` that looks like the following when ordered with `--order-inodes mtime --order-paths abspath`:

    ```
    __ 1/3
    => 1/4
    __ 2/5
    => 2/6
    __ 1/2
    => 2/1
    ```

    will look like this when ordered with `--order-inodes basename --order-paths abspath`:

    ```
    __ 1/2
    => 2/1
    __ 1/3
    => 1/4
    __ 2/5
    => 2/6
    ```
  - `--reverse`: when sorting, invert all comparisons
- duplicate file group filters:
  - `--min-paths MIN_PATHS`: only process duplicate file groups with at least this many `path`s; default: `2`
  - `--min-inodes MIN_INODES`: only process duplicate file groups with at least this many `inode`s; default: `2`
Produce groups of duplicated indexed files matching specified criteria, similar to how `find-duplicates` does, except with much stricter default `--match-*` settings, and then deduplicate the resulting files by hardlinking them to each other.

- Proceed exactly as `find-duplicates` does in its step 1.
- Proceed exactly as `find-duplicates` does in its step 2.
- For each `group`:
  - assign the first `path` of the first `inode_id` as `source`,
  - print `source`,
  - for each `inode_id` in `group`, for each `inode` and `path` associated to an `inode_id`:
    - check that `inode` metadata matches the filesystem metadata of `path`,
      - if it does not, print an error and skip this `inode_id`,
    - if it is `source`, continue with other `path`s;
    - if `--paranoid` is set or if this is the very first `path` of `inode_id`,
      - check whether the file data/contents of `path` matches the file data/contents of `source`,
        - if it does not, print an error and skip this `inode_id`,
    - if `--hardlink` is set, hardlink `source -> path`,
    - if `--delete` is set, `unlink` the `path`,
    - update the `DATABASE` accordingly.
The verbosity and spacing semantics are similar to the ones used by `find-duplicates`, except this command starts at verbosity of 1, i.e. as if a single `--verbose` were specified by default.

Each processed `path` gets prefixed by:

- `__`, if this is the very first `path` in a `group`, i.e. this is a `source`,
- when `--hardlink`ing:
  - `=>`, if this is a non-`source` `path` associated to the first `inode`, i.e. it's already hardlinked to `source` on disk, thus processing of this `path` was skipped,
  - `ln`, if this `path` was successfully hardlinked (to an equal `source`),
- when `--delete`ing:
  - `rm`, if this `path` was successfully deleted (while an equal `source` was kept),
- `fail`, if there was an error while processing this `path` (which will be reported to `stderr`).
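Put together, a `deduplicate --hardlink` run could print something like the following (a made-up illustration; the paths are hypothetical):

```
__ /backup/2024-01-01/home/user/photo.jpg
=> /backup/2024-01-01/home/user/photo-link.jpg
ln /backup/2024-01-02/home/user/photo.jpg
ln /backup/2024-01-03/home/user/photo.jpg
```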
- positional arguments:
  - `INPUT`: input files and/or directories to process
- options:
  - `-h, --help`: show this help message and exit
  - `--markdown`: show `--help` formatted in Markdown
  - `--stdin0`: read zero-terminated `INPUT`s from stdin; these will be processed after all `INPUT`s specified as command-line arguments
- output:
  - `-v, --verbose`: increase output verbosity; can be specified multiple times for progressively more verbose output
  - `-q, --quiet, --no-verbose`: decrease output verbosity; can be specified multiple times for progressively less verbose output
  - `-l, --lf-terminated`: print output lines terminated with `\n` (LF) newline characters; default
  - `-z, --zero-terminated, --print0`: print output lines terminated with `\0` (NUL) bytes; implies `--no-color` and zero verbosity
  - `--spaced`: print more empty lines between different parts of the output; can be specified multiple times
  - `--no-spaced`: print fewer empty lines between different parts of the output; can be specified multiple times
-   duplicate file grouping defaults:

    -   `--match-meta`: set defaults to `--match-device --match-permissions --match-owner --match-group`; default
    -   `--ignore-meta`: set defaults to `--ignore-device --ignore-permissions --ignore-owner --ignore-group`
    -   `--match-extras`: set defaults to `--match-xattrs`; default
    -   `--ignore-extras`: set defaults to `--ignore-xattrs`
    -   `--match-times`: set defaults to `--match-last-modified`
    -   `--ignore-times`: set defaults to `--ignore-last-modified`; default
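For example, to group files by content alone, ignoring metadata differences (the same combination also appears in the Examples section below):

```
# treat files with identical contents as duplicates even when their
# permissions, owners, groups, or devices differ
hoardy deduplicate --ignore-meta /backup
```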
-   duplicate file grouping; consider same-content files to be duplicates when they...:

    -   `--match-size`: ... have the same file size; default
    -   `--ignore-size`: ... regardless of file size; only useful for debugging or discovering hash collisions
    -   `--match-argno`: ... were produced by recursion from the same command-line argument (which is checked by comparing `INPUT` indexes in `argv`, if the path is produced by several different arguments, the smallest one is taken)
    -   `--ignore-argno`: ... regardless of which `INPUT` they came from; default
    -   `--match-device`: ... come from the same device/mountpoint/drive; default
    -   `--ignore-device`: ... regardless of devices/mountpoints/drives
    -   `--match-perms, --match-permissions`: ... have the same file modes/permissions; default
    -   `--ignore-perms, --ignore-permissions`: ... regardless of file modes/permissions
    -   `--match-owner, --match-uid`: ... have the same owner id; default
    -   `--ignore-owner, --ignore-uid`: ... regardless of owner id
    -   `--match-group, --match-gid`: ... have the same group id; default
    -   `--ignore-group, --ignore-gid`: ... regardless of group id
    -   `--match-last-modified, --match-mtime`: ... have the same `mtime`
    -   `--ignore-last-modified, --ignore-mtime`: ... regardless of `mtime`; default
    -   `--match-xattrs`: ... have the same extended file attributes; default
    -   `--ignore-xattrs`: ... regardless of extended file attributes
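Conversely, matching can be made stricter than the defaults; an illustrative combination of the options above:

```
# additionally require equal mtime for files to be considered duplicates
hoardy deduplicate --match-mtime /backup
```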
-   sharding:

    -   `--shard FROM/TO/SHARDS|SHARDS|NUM/SHARDS`: split database into a number of disjoint pieces (shards) and process a range of them:

        -   with `FROM/TO/SHARDS` specified, split database into `SHARDS` shards and then process those with numbers between `FROM` and `TO` (both inclusive, counting from `1`);
        -   with the `SHARDS` syntax, interpret it as `1/SHARDS/SHARDS`, thus processing the whole database by splitting it into `SHARDS` pieces first;
        -   with `NUM/SHARDS`, interpret it as `NUM/NUM/SHARDS`, thus processing a single shard `NUM` of `SHARDS`;
        -   default: `1/1/1`, `1/1`, or just `1`, which processes the whole database as a single shard;

        see the example below for how these shorthand forms relate to each other;

-   `--order-*` defaults:

    -   `--order {mtime,argno,abspath,dirname,basename}`: set all `--order-*` option defaults to the given value, except specifying `--order mtime` will set the default `--order-paths` to `argno` instead (since all of the paths belonging to the same `inode` have the same `mtime`); default: `mtime`
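To restate the `--shard` shorthand forms described above as concrete invocations (a sketch; `/backup` is a placeholder path):

```
# `--shard 4` is shorthand for `--shard 1/4/4`:
# split the database into 4 shards and process all of them, one at a time
hoardy deduplicate --shard 4 /backup

# `--shard 2/4` is shorthand for `--shard 2/2/4`:
# split the database into 4 shards and process only shard 2
hoardy deduplicate --shard 2/4 /backup
```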
-   order of elements in duplicate file groups; note that, unlike with `find-duplicates`, these settings influence not only the order in which files are printed, but also which files get kept and which get replaced with `--hardlink`s to kept files or `--delete`d:

    -   `--order-paths {argno,abspath,dirname,basename}`: in each `inode` info record, order `path`s by:

        -   `argno`: the corresponding `INPUT`'s index in `argv`, if a `path` is produced by several different arguments, the index of the first of them is used; default
        -   `abspath`: absolute file path
        -   `dirname`: absolute file path without its last component
        -   `basename`: the last component of absolute file path

    -   `--order-inodes {mtime,argno,abspath,dirname,basename}`: in each duplicate file `group`, order `inode` info records by:

        -   `argno`: same as `--order-paths argno`
        -   `mtime`: file modification time; default
        -   `abspath`: same as `--order-paths abspath`
        -   `dirname`: same as `--order-paths dirname`
        -   `basename`: same as `--order-paths basename`

        When an `inode` has several associated `path`s, sorting by `argno`, `abspath`, `dirname`, and `basename` is performed by taking the smallest of the respective values.

        For instance, a duplicate file `group` that looks like the following when ordered with `--order-inodes mtime --order-paths abspath`:

        ```
        __ 1/3
        => 1/4
        __ 2/5
        => 2/6
        __ 1/2
        => 2/1
        ```

        will look like this, when ordered with `--order-inodes basename --order-paths abspath`:

        ```
        __ 1/2
        => 2/1
        __ 1/3
        => 1/4
        __ 2/5
        => 2/6
        ```

    -   `--reverse`: when sorting, invert all comparisons

-   duplicate file group filters:

    -   `--min-paths MIN_PATHS`: only process duplicate file groups with at least this many `path`s; default: `2`
    -   `--min-inodes MIN_INODES`: only process duplicate file groups with at least this many `inode`s; default: `2` when `--hardlink` is set, `1` when `--delete` is set
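For instance, building on the ordering options above (this particular combination also appears, with extra filters, in the Examples section below):

```
# keep the file with the largest absolute path in each group and
# replace the other copies with hardlinks to it
hoardy deduplicate --reverse --order-inodes abspath /backup
```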
-   deduplicate how:

    -   `--hardlink, --link`: deduplicate duplicated file groups by replacing all but the very first file in each group with hardlinks to it (hardlinks go from destination file to source file); see the "Algorithm" section above for a longer explanation; default
    -   `--delete, --unlink`: deduplicate duplicated file groups by deleting all but the very first file in each group; see the `--order-*` options for how to influence which file would be the first
    -   `--sync`: batch changes, apply them right before commit, `fsync` all affected directories, and only then commit changes to the `DATABASE`; this way, after a power loss, the next `deduplicate` will at least notice those files being different from their records; default
    -   `--no-sync`: perform all changes eagerly without `fsync`ing anything, commit changes to the `DATABASE` asynchronously; not recommended unless your machine is powered by a battery/UPS; otherwise, after a power loss, the `DATABASE` will likely be missing records about files that still exist, i.e. you will need to re-`index` all `INPUT`s to make the database state consistent with the filesystems again
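For example, to delete redundant copies instead of hardlinking them (a sketch using only the options documented above):

```
# delete all but the first file in each duplicate group instead of hardlinking
hoardy deduplicate --delete /backup
```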
-   before `--hardlink`ing or `--delete`ing a target, check that source and target...:

    -   `--careful`: ... inodes have equal data contents, once for each new inode; i.e. check that source and target have the same data contents as efficiently as possible; assumes that no files change while `hoardy` is running
    -   `--paranoid`: ... paths have equal data contents, for each pair of them; this can be slow (though it usually is not), but it guarantees that `hoardy` won't lose data even if other internal functions are buggy; it will also usually, though not always, prevent data loss if files change while `hoardy` is running, see the "Quirks and Bugs" section of the `README.md` for discussion; default
Verify that indexed files from under `INPUT`s that match specified criteria exist on the filesystem, and that their metadata and hashes match the filesystem contents.
-   For each `INPUT`, walk it recursively (in the filesystem), for each walked `path`:

    -   fetch its `DATABASE` record,
    -   if `--checksum` is set or if file `type`, `size`, or `mtime` is different from the one in the `DATABASE` record,
        -   re-index the file,
    -   for each field:

        -   if its value matches the one in the `DATABASE` record, do nothing;
        -   otherwise, if the corresponding `--match-<field>` option is set, print an error;
        -   otherwise, print a warning.
This command runs with an implicit `--match-sha256` option which cannot be disabled, so hash mismatches always produce errors.
-   positional arguments:

    -   `INPUT`: input files and/or directories to process

-   options:

    -   `-h, --help`: show this help message and exit
    -   `--markdown`: show `--help` formatted in Markdown
    -   `--stdin0`: read zero-terminated `INPUT`s from stdin, these will be processed after all `INPUT`s specified as command-line arguments
-   output:

    -   `-v, --verbose`: increase output verbosity; can be specified multiple times for progressively more verbose output
    -   `-q, --quiet, --no-verbose`: decrease output verbosity; can be specified multiple times for progressively less verbose output
    -   `-l, --lf-terminated`: print output lines terminated with `\n` (LF) newline characters; default
    -   `-z, --zero-terminated, --print0`: print output lines terminated with `\0` (NUL) bytes, implies `--no-color` and zero verbosity
-   content verification:

    -   `--checksum`: verify all file hashes; i.e., assume that some files could have changed contents without changing `type`, `size`, or `mtime`; default
    -   `--no-checksum`: skip hashing if file `type`, `size`, and `mtime` match the `DATABASE` record
-   verification defaults:

    -   `--match-meta`: set defaults to `--match-permissions`; default
    -   `--ignore-meta`: set defaults to `--ignore-permissions`
    -   `--match-extras`: set defaults to `--match-xattrs`; default
    -   `--ignore-extras`: set defaults to `--ignore-xattrs`
    -   `--match-times`: set defaults to `--match-last-modified`
    -   `--ignore-times`: set defaults to `--ignore-last-modified`; default
-   verification; consider a file to be `ok` when it and its `DATABASE` record...:

    -   `--match-size`: ... have the same file size; default
    -   `--ignore-size`: ... regardless of file size; only useful for debugging or discovering hash collisions
    -   `--match-perms, --match-permissions`: ... have the same file modes/permissions; default
    -   `--ignore-perms, --ignore-permissions`: ... regardless of file modes/permissions
    -   `--match-last-modified, --match-mtime`: ... have the same `mtime`
    -   `--ignore-last-modified, --ignore-mtime`: ... regardless of `mtime`; default
Back up the `DATABASE` and then upgrade it to the latest format.
This exists for development purposes.
You don't need to call this explicitly as, normally, database upgrades are completely automatic.
-   options:

    -   `-h, --help`: show this help message and exit
    -   `--markdown`: show `--help` formatted in Markdown
-   Index all files in `/backup`:

    ```
    hoardy index /backup
    ```

-   Search paths of files present in `/backup`:

    ```
    hoardy find /backup | grep something
    ```
-   List all duplicated files in `/backup`, i.e. list all files in `/backup` that have multiple on-disk copies with the same contents but using different inodes:

    ```
    hoardy find-dupes /backup | tee dupes.txt
    ```

-   Same as above, but also include groups consisting solely of hardlinks to the same inode:

    ```
    hoardy find-dupes --min-inodes 1 /backup | tee dupes.txt
    ```
-   Produce exactly the same duplicate file groups as those the following `deduplicate` would use by default:

    ```
    hoardy find-dupes --match-meta /backup | tee dupes.txt
    ```

-   Deduplicate `/backup` by replacing files that have exactly the same metadata and contents (but with any `mtime`) with hardlinks to a file with the earliest known `mtime` in each such group:

    ```
    hoardy deduplicate /backup
    ```
-   Deduplicate `/backup` by replacing same-content files larger than 1 KiB with hardlinks to a file with the latest `mtime` in each such group:

    ```
    hoardy deduplicate --size-geq 1024 --reverse --ignore-meta /backup
    ```

    This plays well with directories produced by `rsync --link-dest` and `rsnapshot`.

-   Similarly, but for each duplicate file group use a file with the largest absolute path (in lexicographic order) as the source for all generated hardlinks:

    ```
    hoardy deduplicate --size-geq 1024 --ignore-meta --reverse --order-inodes abspath /backup
    ```
-   When you have enough indexed files that a run of `find-duplicates` or `deduplicate` stops fitting into RAM, you can process your database piecemeal by sharding by `SHA256` hash digests:

    ```
    # shard the database into 4 pieces and then process each piece separately
    hoardy find-dupes --shard 4 /backup
    hoardy deduplicate --shard 4 /backup

    # assuming the previous command was interrupted in the middle, continue from shard 2 of 4
    hoardy deduplicate --shard 2/4/4 /backup

    # shard the database into 4 pieces, but only process the first one of them
    hoardy deduplicate --shard 1/4 /backup
    # uncertain amounts of time later...
    # (possibly, after a reboot)
    # process piece 2
    hoardy deduplicate --shard 2/4 /backup
    # then piece 3
    hoardy deduplicate --shard 3/4 /backup
    # or, equivalently, process pieces 2 and 3 one after the other
    hoardy deduplicate --shard 2/3/4 /backup
    # uncertain amounts of time later...
    # process piece 4
    hoardy deduplicate --shard 4/4 /backup
    ```

    With `--shard SHARDS` set, `hoardy` needs only about `1/SHARDS` of the RAM, yet it produces exactly the same result as running with enough RAM and the default `--shard 1`, except that duplicate file groups are printed/deduplicated in a pseudo-randomly different order; in other words, it trades RAM usage for a longer total run time.
-   Alternatively, you can shard the database manually with filters:

    ```
    # deduplicate files larger than 100 MiB
    hoardy deduplicate --size-geq 104857600 /backup
    # deduplicate files between 1 and 100 MiB
    hoardy deduplicate --size-geq 1048576 --size-leq 104857600 /backup
    # deduplicate files between 16 bytes and 1 MiB
    hoardy deduplicate --size-geq 16 --size-leq 1048576 /backup

    # deduplicate about half of the files
    hoardy deduplicate --sha256-leq 7f /backup
    # deduplicate the other half
    hoardy deduplicate --sha256-geq 80 /backup
    ```

    The `--shard` option does something very similar to the latter example.
Sanity-check and test the `hoardy` command-line interface.

-   Run internal tests:

    ```
    ./test-hoardy.sh default
    ```

-   Run fixed-output tests on a given directory:

    ```
    ./test-hoardy.sh ~/rarely-changing-path
    ```

    This will copy the whole contents of that path to `/tmp` first.