How to Back Up Millions of Small Files on Linux

Malo Paletou

Introduction

Backing up millions of small files on Linux is a common yet complex challenge. Traditional methods quickly reach their limits when faced with high I/O load and large-scale data volumes. Fortunately, modern backup tools are built to meet this need.

In this article, we’ll explain why small files are so difficult to back up, the limitations of traditional tools, and most importantly, why Restic and Snaper are the most advanced solutions available today for backing up millions of files on a daily basis.

Why small files are hard to back up

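Unlike large files, small files put a much higher load on the filesystem. Each file requires its own sequence of operations (metadata lookup, open, read, and close on the source; create, write, and close on the destination), and that fixed per-file cost adds up quickly: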

  • High CPU usage
  • Heavy disk I/O pressure
  • Significant slowdowns in both backup and restore operations
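
To make the per-file cost concrete, here is a minimal sketch that reads the same amount of data twice: once as many small files and once as a single large file. The function names, file count, and file size are illustrative assumptions, and the timings depend heavily on the filesystem and page cache, so treat the result as an indication and not as a benchmark.

```python
import os
import tempfile
import time


def read_many_small(directory: str) -> int:
    """Read every file in `directory`: one open/read/close cycle per file."""
    total = 0
    for entry in os.scandir(directory):
        if entry.is_file():
            with open(entry.path, "rb") as fh:
                total += len(fh.read())
    return total


def read_one_large(path: str) -> int:
    """Read a single file that holds the same amount of data."""
    with open(path, "rb") as fh:
        return len(fh.read())


def compare(file_count: int = 2000, size: int = 512) -> dict:
    """Write `file_count` small files plus one equivalent large file, then time both reads."""
    with tempfile.TemporaryDirectory() as root:
        small_dir = os.path.join(root, "small")
        os.mkdir(small_dir)
        payload = b"x" * size
        for i in range(file_count):
            with open(os.path.join(small_dir, f"f{i:06d}.bin"), "wb") as fh:
                fh.write(payload)
        large_path = os.path.join(root, "large.bin")
        with open(large_path, "wb") as fh:
            fh.write(payload * file_count)

        start = time.perf_counter()
        small_bytes = read_many_small(small_dir)
        small_secs = time.perf_counter() - start

        start = time.perf_counter()
        large_bytes = read_one_large(large_path)
        large_secs = time.perf_counter() - start

    return {
        "small_bytes": small_bytes,
        "small_secs": small_secs,
        "large_bytes": large_bytes,
        "large_secs": large_secs,
    }
```

Calling `compare()` returns the byte counts and elapsed times for both reads. The byte counts are identical by construction; any difference in elapsed time comes from the per-file operations.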

Limitations of traditional backup tools

rsync: reliable, but outdated for this use case

  • Single-threaded by default
  • No native deduplication
  • Poor performance at scale (millions of files)
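
As a rough illustration of what parallelism buys on a tree of small files, the sketch below hashes files with a pool of worker threads so that the wait for one file's I/O overlaps with work on another. This is not how rsync or any specific backup tool is implemented; `hash_tree`, `list_files`, and the default worker count are hypothetical names and values chosen for the example.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor


def hash_file(path: str) -> tuple[str, str]:
    """Return (path, SHA-256 hex digest) for one file."""
    with open(path, "rb") as fh:
        return path, hashlib.sha256(fh.read()).hexdigest()


def list_files(root: str) -> list[str]:
    """Collect every regular file path under `root`, in sorted order."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return sorted(paths)


def hash_tree(root: str, workers: int = 8) -> dict[str, str]:
    """Hash all files under `root` using `workers` threads.

    File reads release Python's global interpreter lock, so several
    threads can wait on the disk at the same time.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(hash_file, list_files(root)))
```

With `workers=1` the function behaves like a single-threaded scan; raising the worker count lets you observe how much of the runtime was spent waiting on per-file I/O.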

duplicity: functional, but not designed for scale

  • Stores incremental changes as rsync-style deltas inside encrypted volumes, which saves space between runs but does not deduplicate across files
  • Designed for traditional backup workflows (full + incremental)
  • Slower restore process on large volumes
  • Lacks fine-grained block-level deduplication
  • Limited parallelism

borgbackup: solid, but slower

  • Uses block-based deduplication
  • Performance-critical parts in C/Cython, but core logic in Python
  • Reliable and feature-rich, but largely single-threaded, so generally slower on very large file counts than Go-based tools like Snaper or Restic

The modern approach: block-based deduplication

A key innovation for efficient small-file backup is block-based deduplication. Instead of storing each file as a unit, the backup tool splits data into blocks and identifies each block by a hash of its content, enabling:

  • Fine-grained deduplication (even across different files)
  • Highly efficient incremental backups
  • Drastic reduction in storage space
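
The idea can be sketched in a few lines. The example below is a deliberately simplified model, not the storage format of any tool mentioned here: it uses fixed-size blocks and an in-memory dictionary, whereas tools such as Restic split data at content-defined boundaries and write to an encrypted repository. `BlockStore` and `BLOCK_SIZE` are hypothetical names.

```python
import hashlib

BLOCK_SIZE = 4  # tiny on purpose so the example is easy to follow


class BlockStore:
    """Content-addressed store: each unique block is kept exactly once."""

    def __init__(self) -> None:
        self.blocks: dict[str, bytes] = {}          # digest -> block data
        self.manifests: dict[str, list[str]] = {}   # file name -> ordered digests

    def put_file(self, name: str, data: bytes) -> int:
        """Store `data` under `name`; return how many new blocks were written."""
        new_blocks = 0
        digests = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.blocks:
                self.blocks[digest] = block
                new_blocks += 1
            digests.append(digest)
        self.manifests[name] = digests
        return new_blocks

    def get_file(self, name: str) -> bytes:
        """Reassemble a file from its block list."""
        return b"".join(self.blocks[d] for d in self.manifests[name])
```

Storing `b"AAAABBBBCCCC"` and then `b"BBBBCCCCDDDD"` writes three blocks for the first file and only one for the second, because the `BBBB` and `CCCC` blocks are already present even though they belong to a different file.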

Why block-based backup is a real advantage

Block-level processing fundamentally changes how backups and restores work at scale:

  • Fewer network calls: Rather than uploading millions of individual files, the tool transfers only new or changed blocks, typically bundled into larger pack files, which significantly reduces the number of S3 or HTTP requests.
  • Faster incremental backups: When only part of a file changes, only the modified blocks are saved; unchanged blocks are referenced, not uploaded again.
  • Optimized storage: Duplicate blocks are stored only once, even if they appear in different files.
  • Efficient restoration: Modern tools like Snaper and Restic reassemble files from blocks quickly, minimizing disk and network I/O during restore.

In environments with hundreds of millions of small files, this architecture isn’t just efficient; it’s essential.
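
The effect on request counts can be sketched as follows. This is a simplified model under stated assumptions: fixed-size blocks, a hypothetical `plan_upload` helper, and an arbitrary pack size; it only plans which blocks would be uploaded and how they would be grouped, and does not talk to any real backend.

```python
import hashlib

BLOCK_SIZE = 4096              # assumed block size for the sketch
PACK_TARGET = 4 * 1024 * 1024  # assumed pack size; real tools choose their own


def split_blocks(data: bytes, size: int = BLOCK_SIZE) -> list[bytes]:
    """Cut `data` into consecutive blocks of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def plan_upload(
    files: dict[str, bytes],
    known: set[str],
    pack_target: int = PACK_TARGET,
    block_size: int = BLOCK_SIZE,
) -> list[list[str]]:
    """Group blocks missing from the repository into packs.

    `known` holds digests already stored remotely. Each returned pack is a
    list of block digests that would travel in one upload request.
    """
    packs: list[list[str]] = []
    current: list[str] = []
    current_size = 0
    seen = set(known)
    for name in sorted(files):
        for block in split_blocks(files[name], block_size):
            digest = hashlib.sha256(block).hexdigest()
            if digest in seen:
                continue  # already stored: nothing to upload
            seen.add(digest)
            current.append(digest)
            current_size += len(block)
            if current_size >= pack_target:
                packs.append(current)
                current, current_size = [], 0
    if current:
        packs.append(current)
    return packs
```

For example, 50 distinct files of 100 bytes each, planned with `block_size=64` and `pack_target=1000`, produce 5 packs instead of 50 per-file uploads. If one of those files then changes and the plan is recomputed with the earlier digests passed as `known`, only that file's blocks appear in the new plan.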

The most advanced solutions: Restic & Snaper

restic

Restic is a modern open-source backup tool, designed to be:

  • Secure (native encryption)
  • Fast
  • Multithreaded
  • Compatible with many cloud backends (S3, Azure, Backblaze, etc.)
  • Based on block-level deduplication

It’s well-suited for cloud-native infrastructures and has a strong community and thorough documentation.
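
For readers who want to script it, here is a minimal sketch of driving the `restic` command line from Python. It assumes the `restic` binary is installed and on the `PATH` and that the repository has already been initialized with `restic init`; the helper names, the `daily` tag, and the example paths are hypothetical. The password is passed through the `RESTIC_PASSWORD` environment variable.

```python
import os
import subprocess


def restic_cmd(repo: str, *args: str) -> list[str]:
    """Build a restic argument list for the given repository."""
    return ["restic", "--repo", repo, *args]


def backup_cmd(repo: str, source: str, tag: str = "daily") -> list[str]:
    """Command that backs up `source` and labels the snapshot with `tag`."""
    return restic_cmd(repo, "backup", source, "--tag", tag)


def restore_cmd(repo: str, target: str, snapshot: str = "latest") -> list[str]:
    """Command that restores `snapshot` into the `target` directory."""
    return restic_cmd(repo, "restore", snapshot, "--target", target)


def run(cmd: list[str], password: str) -> subprocess.CompletedProcess:
    """Run a restic command, supplying the repository password via the environment."""
    env = dict(os.environ, RESTIC_PASSWORD=password)
    return subprocess.run(cmd, env=env, check=True)
```

A daily job could then call `run(backup_cmd("/srv/restic-repo", "/var/www"), password)`, where both paths are placeholders for your own repository and data directory.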

snaper (by Datashelter)

Snaper, our backup engine, powers all the features of our SaaS solution, Datashelter — a platform designed for distributed, high-performance, and fully automated backups.

Built for demanding environments, Snaper shares Restic’s block-based approach and adds several key advantages:

  • ✅ Block-level deduplication
  • ✅ Multithreaded backup and restore
  • ✅ Smart compression
  • ✅ In-memory or disk-based temporary file handling
  • ✅ Native support for database dumps
  • ✅ Seamless integration with the Datashelter SaaS platform

With Snaper + Datashelter, you can manage all your backups (files, databases, servers) through a centralized and automated web interface.

Backup tool comparison

| Tool | Deduplication | Multithreading | Database support | Block-based | SaaS interface | Primary language | Open source |
|---|---|---|---|---|---|---|---|
| Snaper | ✅ Block-based | ✅ Yes | ✅ Native | ✅ Yes | ✅ Yes (Datashelter) | Go | ❌ No |
| Restic | ✅ Block-based | ✅ Yes | ⚠️ Via scripts | ✅ Yes | ❌ No | Go | ✅ Yes |
| BorgBackup | ✅ Block-based | ⚠️ Partial | ❌ No | ✅ Yes | ❌ No | Python + C/Cython | ✅ Yes |
| Duplicity | ⚠️ Volume-based (encrypted) | ❌ No | ❌ No | ⚠️ Partial | ❌ No | Python | ✅ Yes |
| Rsync | ❌ None | ❌ No (single-threaded) | ❌ No | ❌ No | ❌ No | C | ✅ Yes |

Conclusion

Backing up millions of small files every day requires a purpose-built solution. Traditional tools quickly fall short under the weight of filesystem load and operational complexity.

Restic and Snaper represent the most modern and scalable backup tools available today, thanks to their block-level deduplication, incremental efficiency, and robust architecture.

👉 For modern infrastructure, Snaper via the Datashelter platform gives you centralized, automated, and secure backups — covering everything from files to full database dumps — all in a seamless SaaS environment.