Deduplication
Deduplication uses the clone mechanism of a copy-on-write (CoW) capable filesystem. This feature lets copied but identical files share their data, much like a hardlink, until one of the copies is written to and thereby changed. The copy is in effect delayed until the first write, hence the name copy-on-write. If implemented at the block level, only modified blocks are stored separately, which saves space by sharing identical blocks among multiple files.
Copy-on-write (CoW) can be implemented in-band or out-of-band.[1] The latter is called deduplication and requires a user application that compares files or blocks and sets the CoW status for identical blocks in the filesystem.
Filesystems
On Linux, only a handful of filesystems implement CoW, namely bcachefs, Btrfs, OCFS2 and XFS. The clone ioctl kernel functions were previously private to Btrfs, where CoW debuted on Linux, and were moved to the Virtual File System (VFS) layer in Linux kernel 4.5 so that any CoW-supporting filesystem can make use of them.[2] The first additional filesystem to implement CoW was XFS.[3]
Some filesystem tools themselves support deduplication, like the Btrfs subvolumes feature. There are also in-band filesystem options, such as XFS' always_cow sysfs switch.[4]
Applications with deduplication support
There are user applications that compare existing files and deduplicate them, which frees disk space. The most common tools are:
- sys-fs/duperemove
- app-misc/fdupes
- app-misc/jdupes
- sys-fs/bees (which works at the block level, but is limited to Btrfs)
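To illustrate the comparison step that such out-of-band tools perform, here is a minimal sketch that groups files with identical contents by checksum. It only reports candidates and does not deduplicate anything. The function name list_duplicates is hypothetical, and the sketch assumes GNU findutils and coreutils (sha256sum, uniq --all-repeated); it is not how duperemove, fdupes, jdupes or bees work internally.

```shell
#!/bin/sh
# list_duplicates DIR (hypothetical helper): print groups of files whose
# SHA-256 checksums match, one blank-line-separated group per checksum.
list_duplicates() {
    # Hash every regular file below DIR (default: current directory).
    find "${1:-.}" -type f -exec sha256sum {} + \
        | sort \
        | uniq -w 64 --all-repeated=separate
    # uniq -w 64 compares only the 64 hex digits of the checksum,
    # so lines for files with equal content are treated as repeats.
}
```

Calling list_duplicates /usr/src would list candidate groups below /usr/src. A real deduplication tool additionally verifies the data and asks the kernel to share the extents.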
Various tools also support CoW themselves when copying files by using the appropriate Linux syscalls if available:
- C++ code using filesystem::copy_file with libstdc++ from GCC 14 or newer (commit)
- kde-frameworks/kio
- dev-libs/glib as of 2.78, which will benefit e.g. file copies in Nautilus/Files
- dev-lang/ruby - Ruby's IO.copy_stream does, and e.g. FileUtils::copy_file benefits from it as a result
- dev-lang/go - See https://cs.opensource.google/go/go/+/refs/tags/go1.20.5:src/os/readfrom_linux.go.
- dev-lang/php - PHP 8.2 supports it for streams
- app-editors/emacs - copy-file uses copy_file_range
Applications missing support:
- dev-lang/python
- dev-java/openjdk
  - https://bugs.openjdk.org/browse/JDK-8282039 (WONTFIX'd, seemingly based on a misunderstanding of when the benefit appears - needs a CoW filesystem or using e.g. NFS)
- net-misc/rsync
- dev-lang/perl
GNU coreutils
GNU coreutils 9.0 and newer default to --reflink=auto for cp and install.
The most basic way to deduplicate a file is to clone it with cp --reflink; cp is part of the sys-apps/coreutils package. At first the result is almost identical to a hardlink, in that both files use the same blocks of data on the storage device. The major difference is that with hardlinks a change to one file changes every linked file as well. With clones (deduplicated files), the other files that use data from the same blocks are preserved and only the changed file, or the changed blocks of that file, are written to storage, hence the name copy-on-write.
user $ cp --reflink=always sourcefile destfile
Unlike hardlinks, changing either destfile or sourcefile will preserve the other. Copy-on-write essentially keeps the files separate while (at least initially) providing the same space advantage as hardlinks. Whether a change rewrites the whole file or only the changed block (chunk) of an initially deduplicated file is unclear, however, and depends heavily on how the application writes files to disk.
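The difference between a hardlink and a clone can be observed with a short experiment. The following is a minimal sketch, assuming GNU cp; the function name demo_clone_vs_hardlink and the file names are made up for illustration. It uses --reflink=auto so that it also runs on filesystems without CoW support, where cp makes a regular copy and the visible behaviour is the same.

```shell
#!/bin/sh
# demo_clone_vs_hardlink (hypothetical helper): modify a file and show
# what a hardlink and a clone of it contain afterwards.
demo_clone_vs_hardlink() {
    workdir=$(mktemp -d) || return 1
    printf 'original\n' > "$workdir/source"

    # A hardlink is a second name for the same inode.
    ln "$workdir/source" "$workdir/hardlink"
    # A clone is a separate file that may share data blocks on CoW filesystems.
    cp --reflink=auto "$workdir/source" "$workdir/clone"

    # Overwrite the source in place (same inode, new content).
    printf 'changed\n' > "$workdir/source"

    printf 'hardlink: %s\n' "$(cat "$workdir/hardlink")"
    printf 'clone: %s\n' "$(cat "$workdir/clone")"
    rm -rf "$workdir"
}

demo_clone_vs_hardlink
# prints:
# hardlink: changed
# clone: original
```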
If the filesystem doesn't support copy-on-write (CoW), cp --reflink (equivalent to --reflink=always) will abort with an error message. With the --reflink=auto parameter, cp will automatically make a regular copy instead when CoW is not available.
user $ cp --reflink sourcefile destfile
cp: failed to clone 'destfile' from 'sourcefile': Operation not supported
user $ cp --reflink=auto --verbose sourcefile destfile
'sourcefile' -> 'destfile'
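Because cp --reflink=always fails on filesystems without CoW support, it can double as a simple probe. The following is a minimal sketch, assuming GNU cp and mktemp; the function name supports_reflink and the probe file names are hypothetical. It creates and removes two small temporary files in the directory being tested.

```shell
#!/bin/sh
# supports_reflink DIR (hypothetical helper): return 0 if cloning works in
# DIR, 1 if cp could not clone there, 2 if no probe file could be created.
supports_reflink() {
    dir=${1:-.}
    probe=$(mktemp "$dir/.reflink-probe.XXXXXX") || return 2
    printf 'probe\n' > "$probe"

    # --reflink=always never falls back to a regular copy, so success
    # means the filesystem really cloned the file.
    if cp --reflink=always "$probe" "$probe.clone" 2>/dev/null; then
        result=0
    else
        result=1
    fi

    rm -f "$probe" "$probe.clone"
    return "$result"
}
```

A script could then branch on the result, for example running supports_reflink /usr/src before scheduling a deduplication run there.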
Portage
Portage uses copy_file_range or sendfile, if available, when merging packages from PORTAGE_TMPDIR to the live filesystem. This support is implemented as a C extension with native-extensions, which is enabled by default for Portage.[5]
Portage 3.0.48 and newer will also avoid overwriting files on the live filesystem if they're identical, as implemented for bug #722270.
Benefits
The obvious benefit of deduplication and copy-on-write is to regain valuable storage space. It might be argued that in-band copy-on-write may also reduce wear on SSD storage by reducing writes to the device, similar to Portage TMPDIR on tmpfs. However, any wear-reducing effect is uncertain when the write operation has already occurred, which is always the case with out-of-band deduplication tools.
Practical use scenarios
Portage hooks
Deduplication can be hooked into pkg_postinst for specified packages using the standard Portage facilities. For example, to deduplicate the Linux kernel sources from package sys-kernel/gentoo-sources after emerging each new version, a Portage environment can be added under /etc/portage/package.env. This will save space for unchanged files of each installed kernel source version under /usr/src/.
The following example uses duperemove:
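function post_pkg_postinst() {
    # Runs after the package's pkg_postinst phase has finished.
    echo ":: Running duperemove in /usr/src/"
    # -r: recurse, -d: deduplicate, -h: human-readable sizes, -q: quiet
    duperemove -r -d -h -q /usr/src/
}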
Genkernel hooks
Additionally, after running genkernel from sys-kernel/genkernel, deduplication can be configured in /etc/genkernel.conf:
CMD_CALLBACK="duperemove -r -d -h -q /usr/src/"
See also
- Duperemove — a btrfs and XFS tool for finding duplicated extents and submitting them to the kernel for deduplication
- fdupes — a tool for identifying duplicate files across a set of directories.
External resources
- Deduplication, Btrfs Wiki
- Which file systems support file cloning, Ctrl.blog by Daniel Aleksandersen
- XFS, Reflinks and Deduplication
References
- [1] Read the Docs: Deduplication
- [2] IOCTL_FICLONERANGE(2) from the Linux Programmer's Manual
- [3] xfs: add reflink and dedupe support on LWN.net, 29 Sep 2016
- [4] XFS Copy-On-Write Support Being Improved, Always CoW Option, phoronix, 19 Feb 2019
- [5] portage_util_file_copy_reflink_linux.c, Portage source code (3.0.49), 10 Jul 2023