How to Back Up Millions of Small Files on Linux
Backing up millions of small files can be a significant challenge, especially on Linux. Popular backup tools like rsync or restic, while effective for other use cases, often hit their limits in this specific situation. In this article, we will analyze why these tools may not always be suitable for backing up small files and explore existing solutions to address these challenges.
Why are small files problematic?
Small files stress backup systems differently from large files. Each file is processed individually, so the system repeats the same sequence of operations (open, read, write, close) for every one of them. With millions of files, this per-file overhead, rather than the volume of data, ends up dominating backup time and consumes significant disk and CPU resources.
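Before choosing a strategy, it can help to check how much of a dataset actually consists of small files. The following is a minimal bash sketch; the function name count_small_files and the 4096-byte default threshold are illustrative assumptions, not values taken from any particular tool.

```shell
# Hypothetical helper: count regular files under DIR that are smaller than
# MAX_BYTES (default 4096, an arbitrary illustrative threshold).
# Prints "<small_count> <total_count>".
count_small_files() {
  local dir=$1 max_bytes=${2:-4096} total small
  # -print0 plus counting NUL bytes stays correct for names containing newlines.
  total=$(find "$dir" -type f -print0 | tr -cd '\0' | wc -c)
  small=$(find "$dir" -type f -size "-${max_bytes}c" -print0 | tr -cd '\0' | wc -c)
  echo "$((small)) $((total))"
}
```

For example, count_small_files /source_directory prints two numbers; if the first is close to the second, per-file overhead is likely to dominate the backup time.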
Here are some limitations of traditional tools:
Rsync: Single-Threaded
While powerful, rsync has no built-in multithreading. A single rsync transfer processes files one after another, which becomes extremely slow when the number of files is large.
Restic: Block-Based Approach
Restic splits files into variable-size chunks and stores them in a deduplicated manner. While this method is efficient in reducing backup size, it can create a significant issue during restoration: rebuilding millions of files means locating and fetching a very large number of chunks, each of which may require an HTTP or disk request, and this slows the process considerably.
Solutions to optimize backing up small files
Fortunately, there are strategies and tools available to enhance backup performance for this use case.
1. Run Multiple Rsync Processes in Parallel
Rsync has no multithreaded mode of its own, but you can run several rsync processes in parallel, each handling a different subset of the files. Several third-party tools and scripts can partition the file list into batches for this purpose. For example:
- GNU Parallel: To parallelize rsync commands.
- Custom scripts to split files based on specific rules.
Example command for parallel rsync with GNU Parallel:
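cd /source_directory && find . -type f -print0 | parallel -0 -j 4 -X rsync -aR {} /destination_directory/
Here -X packs many file names into each rsync invocation instead of starting one process per file, -R (--relative) preserves the directory structure under the destination, and -print0 with -0 keeps file names containing spaces or newlines intact. While this improves performance, it also increases setup complexity.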
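If GNU Parallel is not available, the custom-script route can be sketched with standard tools: build a file list, split it into batches, and start one rsync per batch. The helper names make_batches and sync_batches are hypothetical, and the sketch assumes bash and file names without newlines; --files-from is a standard rsync option.

```shell
# Hypothetical helpers, assuming bash and file names without newlines.

# make_batches SRC JOBS WORKDIR: list files under SRC (relative paths) and
# distribute them round-robin into WORKDIR/batch.0 ... batch.(JOBS-1).
make_batches() {
  local src=$1 jobs=$2 workdir=$3
  ( cd "$src" && find . -type f ) |
    awk -v n="$jobs" -v prefix="$workdir/batch." '{ print > (prefix (NR % n)) }'
}

# sync_batches SRC DST WORKDIR: run one rsync per batch file in the background.
sync_batches() {
  local src=$1 dst=$2 workdir=$3 batch
  for batch in "$workdir"/batch.*; do
    [ -e "$batch" ] || continue   # no batch files: the glob stays literal
    rsync -a --files-from="$batch" "$src/" "$dst/" &
  done
  wait  # block until every background rsync has finished
}
```

A typical run would create a work directory with mktemp -d, call make_batches /source_directory 4 on it, then call sync_batches /source_directory /destination_directory with the same work directory. Round-robin distribution balances the batches by file count, not by total size.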
2. Choose a Backup System Designed for Small Files
Some backup tools are specifically built to handle a large number of small files. These include:
- BorgBackup: This tool intelligently deduplicates data while grouping multiple files into segments, reducing the number of disk operations.
- ZBackup: Deduplicates and compresses a single input stream, typically a tar archive piped into it, so the backup itself does not handle files one by one.
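As an illustration, a dated BorgBackup archive could be created along these lines. This is a sketch that assumes BorgBackup 1.x command syntax (Borg 2.x uses different commands); the helper names run and borg_backup, the DRY_RUN variable, and the "files-" archive prefix are hypothetical choices made for the example.

```shell
# Sketch using BorgBackup 1.x syntax; paths and names are placeholders.
# Set DRY_RUN=1 to print the command instead of running it.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then printf '%s\n' "$*"; else "$@"; fi
}

# borg_backup REPO SRC: create a dated archive of SRC in an existing repository.
# The repository must be created once beforehand, for example with:
#   borg init --encryption=repokey REPO
borg_backup() {
  local repo=$1 src=$2
  run borg create --stats "${repo}::files-$(date +%Y-%m-%d)" "$src"
}
```

Running DRY_RUN=1 borg_backup /path/to/repo /source_directory shows the command that would be executed without touching the repository.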
3. Group Small Files into Archives
Another strategy is to consolidate small files into compressed archives (e.g., using tar or zip) before backing them up. This reduces the number of operations and improves performance. However, this approach complicates individual file access.
Example command to create an archive:
tar -czf archive.tar.gz /source_directory
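One way to limit the loss of individual file access is to create one archive per top-level subdirectory rather than a single archive for everything, so a restore only has to open the relevant archive. The sketch below assumes bash and a tar with gzip support; the function name archive_subdirs is hypothetical, and files sitting directly in the source root, as well as hidden directories, are not handled.

```shell
# Hypothetical helper: create OUT/<name>.tar.gz for each top-level
# subdirectory <name> of SRC.
archive_subdirs() {
  local src=$1 out=$2 dir name
  mkdir -p "$out"
  for dir in "$src"/*/; do
    [ -d "$dir" ] || continue   # no subdirectories: the glob stays literal
    name=$(basename "$dir")
    # -C makes the paths inside the archive relative to SRC.
    tar -czf "$out/$name.tar.gz" -C "$src" "$name"
  done
}
```

The contents of an archive can then be listed with tar -tzf, and a single member restored by naming it, for example tar -xzf a.tar.gz a/path/to/file, without unpacking the rest.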
Our approach at Datashelter
At Datashelter, we faced the challenge of backing up a dataset containing over 100 million files, a task that existing solutions struggled to handle efficiently. This motivated us to develop Snaper, our own backup tool. Snaper offers advanced features such as file-level deduplication and multithreading, optimizing both backup and restoration performance.
File-Level Deduplication
Snaper identifies and eliminates duplicates by storing only one copy of identical files, reducing the required storage space. This approach is particularly beneficial when many small files are exact copies of one another.
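To illustrate the general idea of file-level deduplication (this is a generic sketch, not Snaper's actual implementation), identical files can be detected by hashing their contents: files with the same SHA-256 digest are treated as duplicates. The function name dedup_stats is hypothetical.

```shell
# Hypothetical helper: print "<unique_contents> <total_files>" for DIR.
# Assumes sha256sum (GNU coreutils) and file names without newlines or backslashes.
dedup_stats() {
  local dir=$1
  find "$dir" -type f -exec sha256sum {} + |
    awk '{ total++; if (!seen[$1]++) unique++ } END { print unique + 0, total + 0 }'
}
```

The gap between the two numbers gives a rough idea of how much storage file-level deduplication could save on a given dataset.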
Multithreading
To speed up the backup process, Snaper uses multithreading, allowing simultaneous processing of multiple files. This parallelization greatly enhances backup and restoration speed, especially for large datasets of small files.
Temporary File Management
During data compression or encryption, Snaper creates temporary files. By default, files smaller than 10 MB are processed in RAM, while larger files utilize the system’s temporary partition.
Example Command with Snaper
To create a backup with Snaper, use the following command:
snaper backup files --path /path/to/folder_to_backup
This command initiates the backup of the specified directory, applying deduplication, compression, and multithreading to optimize the process.
With Snaper, Datashelter provides a robust and efficient solution for backing up large sets of small files.
Conclusion
Backing up millions of small files is no simple task, but there are strategies to overcome the limitations of traditional tools. By choosing the right approach, such as parallel processing, archiving, or a specialized tool, you can optimize performance and simplify your backups.
If you have specific use cases or questions, feel free to reach out to us!