How Structured Data Enhances Backup Reliability
If you're just starting to explore data backups, you may not have heard the term "structured data" yet. However, this term plays a critical role in defining your backup plan.
Let’s look at how to tell structured data from unstructured data, and what the difference means for choosing your backup solution.
What does structured data mean?
Structured data is stored in a file that follows a defined format, so software can parse and interpret it reliably.
For example, think of a spreadsheet file like .csv or .xls (Excel). The data in this file is arranged in a proper format that Excel can easily recognize.
Similarly, in the case of your SQL database, structured data corresponds to the .sql file that you get when you export your database.
What does unstructured data mean?
In contrast, unstructured data does not follow a format that software knows how to interpret. If you write a free-form list of words in a text file, Excel can’t load it as a meaningful table, because the content isn’t laid out in the rows and columns Excel expects.
Generally speaking, a file extension (.csv, .docx, .sql) is a good hint that a file holds structured data, because the extension names the format the content is expected to follow. If your file has no extension, no file structure is implied, and we treat it as unstructured data.
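To make the difference concrete, here is a minimal Python sketch. The column names (name, size_gb) and the sample contents are hypothetical, chosen only for illustration. The same parser accepts the structured text and rejects the free-form one.

```python
import csv
import io

# Structured: a header row, then rows with the same fields (hypothetical sample).
structured = "name,size_gb\nphotos,120\ndatabase,35\n"

# Unstructured: free text with no agreed layout.
unstructured = "photos are big, maybe 120?\nthe database is smaller\n"


def parse_inventory(text):
    """Return rows as dicts, or raise ValueError if the expected header is missing."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != ["name", "size_gb"]:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    return [{"name": row["name"], "size_gb": int(row["size_gb"])} for row in reader]


rows = parse_inventory(structured)
total = sum(row["size_gb"] for row in rows)

try:
    parse_inventory(unstructured)
    readable = True
except ValueError:
    readable = False

print(total, readable)
```

The point is not the CSV format itself: software can only rely on data whose layout it knows in advance.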
So, what’s the link with data backups? Why should I focus on backing up structured data?
Risks of Backing Up Unstructured Data
To understand why backing up unstructured data can be problematic, we need to talk about data restorability.
Ensuring the Restorability of a Backup
The guarantee of being able to restore data is generally provided by the software that produces it. This is the case, for instance, when you save an Excel file or ask your SQL client to export a database.
The software structures the data in a format it can handle and exports it while managing any concurrency issues. This ensures that the exported data is in a valid state and not in an intermediate state.
What’s an intermediate state?
Take an SQL file as an example. An intermediate state could mean that the last query was only half-written, so the file ends with something like SELECT * FROM.
This incomplete query doesn’t conform to SQL standards, which require that a query start with a keyword (SELECT, UPDATE, DELETE, etc.) and end with a semicolon.
The SQL file therefore does not have the correct structure, and it has to be repaired before your database can read it, for example by deleting the incomplete query. In this case, the file is considered corrupted.
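As a rough illustration, the Python sketch below checks whether an SQL file ends with a complete statement. It is a simplified heuristic that I am assuming for the example, not a real SQL parser: it only looks at the last line that is neither blank nor a "--" comment, and it would be fooled by multi-line comments or unusual formatting.

```python
def ends_with_complete_statement(sql_text):
    """Heuristic: the last non-blank, non-comment line must end with ';'."""
    for line in reversed(sql_text.splitlines()):
        stripped = line.strip()
        if not stripped or stripped.startswith("--"):
            continue  # skip blank lines and SQL line comments
        return stripped.endswith(";")
    return False  # nothing but blank lines and comments


# Hypothetical file contents, for illustration only.
complete = "CREATE TABLE t (id INT);\nINSERT INTO t VALUES (1);\n"
truncated = "CREATE TABLE t (id INT);\nINSERT INTO t VALUES (1);\nSELECT * FROM"

print(ends_with_complete_statement(complete))
print(ends_with_complete_statement(truncated))
```

A check like this can catch an obviously truncated file, but passing it does not prove that the file can be restored.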
What does this imply?
This means you should never copy the /var/lib/mysql directory (MySQL's default data directory on most Linux installations) while the server is running and assume you have backed up your database. Nothing guarantees that no write operation was in progress at the time of your backup.
Another operation could have been writing into a file within the /var/lib/mysql directory. Thus, when you copy the directory to your backup storage, you may be copying a file where a write operation is only half-completed, leading to an incorrect file structure.
The backed-up file is then inconsistent, and your database may not be able to read it when you try to restore it. You end up with a backup you can’t restore, which makes it useless.
That’s why you should always prioritize backing up structured data.
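The half-completed write described above can be reproduced with a small, deterministic Python simulation. This is a sketch under stated assumptions: the record format and file names are hypothetical, and a real database engine writes several files and logs in far more complex ways. It only shows that a copy taken in the middle of a write contains a truncated record.

```python
import os
import shutil
import tempfile

RECORD = b"id=42;name=alice;balance=100\n"  # hypothetical record format
half = len(RECORD) // 2

workdir = tempfile.mkdtemp()
live_path = os.path.join(workdir, "table.data")
copy_path = os.path.join(workdir, "table.data.bak")

with open(live_path, "wb") as live:
    live.write(RECORD[:half])  # the application starts writing a record
    live.flush()               # the first half is now visible in the file

    shutil.copyfile(live_path, copy_path)  # the backup job copies the file now

    live.write(RECORD[half:])  # the application finishes its write

with open(live_path, "rb") as f:
    live_bytes = f.read()
with open(copy_path, "rb") as f:
    copied_bytes = f.read()

shutil.rmtree(workdir)

print("live file complete:", live_bytes == RECORD)
print("backup complete:", copied_bytes == RECORD)
```

The live file ends up complete, while the copy holds only the first half of the record. Nothing in the copy itself signals that it was taken at the wrong moment.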
How to Back Up Structured Data?
To back up structured data, you must use the export or backup functions of the applications you wish to back up.
In the case of your database, this involves using tools like mysqldump (MySQL), pg_dump (PostgreSQL), and mongodump (MongoDB).
These tools control the export so that, in most cases, the data is consistently formatted and can be restored.
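Here is a minimal Python sketch of how a backup script might call these tools. The function names (build_dump_command, dump_database), the ".partial" suffix, and the output paths are my own assumptions for illustration. The sketch also assumes the tools are installed and that connection settings and credentials come from each tool's usual configuration, for example a client option file for MySQL. The flags are standard options of each tool, but check them against the version you run.

```python
import os
import subprocess


def build_dump_command(engine, database, out_path):
    """Return the argument list for the engine's own export tool."""
    if engine == "mysql":
        # --single-transaction gives a consistent snapshot of transactional (InnoDB) tables.
        return ["mysqldump", "--single-transaction",
                f"--result-file={out_path}", database]
    if engine == "postgresql":
        # The custom format is a compressed archive meant to be restored with pg_restore.
        return ["pg_dump", "--format=custom", f"--file={out_path}", database]
    if engine == "mongodb":
        return ["mongodump", f"--db={database}", f"--archive={out_path}", "--gzip"]
    raise ValueError(f"unsupported engine: {engine}")


def dump_database(engine, database, out_path):
    """Export to a temporary file and rename it only if the tool succeeds."""
    tmp_path = out_path + ".partial"
    cmd = build_dump_command(engine, database, tmp_path)
    try:
        subprocess.run(cmd, check=True)  # raises CalledProcessError on a non-zero exit
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # never leave a half-written dump behind
        raise
    os.replace(tmp_path, out_path)  # atomic when both paths are on the same filesystem
    return out_path
```

A call such as dump_database("mysql", "shop", "/backups/shop.sql") would then produce the final file only if mysqldump exited successfully, so a failed export never sits next to your valid backups under the final name.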
Taking Snapshots of Your Virtual Machines is Not Enough
Following the previous reasoning, you’ll understand that taking snapshots of your virtual machines isn’t sufficient. When the applications hosted on your virtual machines are more complex than a simple file repository, you need backups in a structured format.
That’s why I’m at war ⚔️ with people who claim they have reliable backups just because they’ve activated the automatic backup option on their virtual (or dedicated) server.
Beyond the fact that they have no control over these backups—the cloud provider remains in charge—their data is at risk because this is an unstructured data backup.
Snapshots are a first layer of protection, but they don’t replace the need for structured data backups. On their own, they offer no guarantee that application data, such as a database, was captured in a consistent state.
To use a metaphor, it’s like wearing armor without knowing if it’s made of foam or steel. You won’t have time to realize it’s foam until it’s too late.
Conclusion
I hope this article has helped you understand the importance of backing up structured data, and why taking snapshots of your virtual machines is only a first step in securing your data.
You now also understand why, at Datashelter, we make it a point to perform structured backups whenever possible.
Our CLI snaper integrates with your database through the database client, providing reliable backups with strong guarantees of restorability.
A backup that can’t be restored isn’t a backup.