
MentorMate’s Backup Script


The script starts by copying all of the important information into a staging area.  Because rsync does most of the copying, keeping the staging area intact between runs speeds up every backup after the first.  After the staging area is updated, we use rdiff-backup as our incremental backup system.  The rdiff-backup repository can then be copied to other local and offsite locations to increase redundancy.  The output is emailed to us every time the backup runs, so we are automatically updated on its status and can correct any errors that occurred.

Relevant Tools

Following is a description of the tools that are used to back up our data, as well as some of the tools that hold our data.  Different sets of data must be backed up in different ways in order to maximize efficiency and reliability.


We use a mixture of standard desktop hardware, server hardware, and virtual dedicated servers for our infrastructure.  Services requiring reliability are all run on the virtual dedicated servers providing uptime guarantees.  All of our backups are gathered and stored on two local computers, each with a raid 1 array, and copied to an offsite computer.

Bash

Bash is an interactive shell and a scripting language that is installed by default on most Linux distributions, and can be set up on many other operating systems.  We use it for the backup system because it makes it easy to call other programs that already do much of the work that the backup requires.  This gets us a full backup in less than 100 lines of code.  The small amount of code makes it easy to proofread and test for errors.

Rsync

Rsync is a very common program for copying data from one location to another when bandwidth is limited or when large amounts of data must be copied.  It first checks whether the file exists in the destination and, if it does, whether it is the same file (by default, by comparing size and modification time).  If the file already exists and is up to date, it is not copied.  For our backup, this saves a lot of time since generally only a small number of files change.
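As a rough illustration, one staging copy might look like the sketch below.  The function name, host, and paths are placeholders rather than values from our script, and the use of --delete (so that files removed at the source also disappear from staging) is an assumption.

```shell
# Copy one source directory into the staging area.
# -a preserves permissions, timestamps, and symlinks.
# --delete (an assumption) removes files from staging that are gone at the source.
# $1 = source, e.g. backup@fileserver:/srv/documents/   (hypothetical)
# $2 = destination, e.g. /backup/staging/documents/     (hypothetical)
stage_dir() {
    mkdir -p "$2" || return 1
    rsync -a --delete "$1" "$2"
}
```

The trailing slash on the source matters to rsync: with it, the contents of the directory are copied into the destination instead of the directory itself.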

rdiff-backup

After the staging area has been updated, we use rdiff-backup to keep track of 10 days of history.  rdiff-backup copies files from the source to an rdiff-backup repository.  Instead of deleting files that have been removed and overwriting files that have changed, it stores a backup of any changes.  It is possible to go into the repository later and restore any version that has not been removed.

OpenVZ

OpenVZ allows us to quickly and easily set up new virtualized environments, known as containers.  OpenVZ containers are not fully virtualized, so they do not require as many resources as traditional virtual machines.  We use our OpenVZ containers for testing environments, and to provide some of the services within our office that do not have the uptime requirements of the services running on the offsite virtual dedicated servers.

Logical Volume Management

Logical Volume Management (LVM) is an abstraction layer between the hard drives and the filesystems stored on them.  One or more hard drives or partitions, known as physical volumes in LVM terminology, can be combined into a volume group.  The volume group can then be divided into multiple logical volumes, which behave like partitions and can be formatted like a normal partition.  LVM has several advantages over normal partition management: logical volumes and volume groups can be resized while the filesystems are mounted, and logical volumes can be moved from one physical drive to another, also while the filesystems are mounted.

Our backup script takes advantage of the ability to create a snapshot of a filesystem.  A snapshot is another logical volume that presents an image of the original logical volume frozen in time.  The snapshot is given its own space in the volume group when it is created, and whenever a block on the original volume is about to change, the old contents are preserved in that space.  Business can continue as normal while we create a backup from a consistent set of files, without worrying about someone writing to the files while we are backing them up.
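The snapshot life cycle can be sketched roughly as follows.  The volume names, the 5G of snapshot space, and the mount point are all assumptions made for illustration, not values from our script.

```shell
# Back up a logical volume from a frozen snapshot, then drop the snapshot.
# $1 = volume group, $2 = logical volume, $3 = destination directory,
# $4 = mount point (defaults to /mnt/snap). All values are hypothetical.
snapshot_backup() {
    vg="$1"; lv="$2"; dest="$3"
    mnt="${4:-/mnt/snap}"
    snap="${lv}_snap"
    status=0

    # Reserve 5G (an assumed size) to hold blocks that change during the backup
    lvcreate --snapshot --size 5G --name "$snap" "/dev/$vg/$lv" || return 1
    mkdir -p "$mnt" || status=1

    # Mount the snapshot read-only and copy from it, not from the live volume
    if [ "$status" -eq 0 ] && mount -o ro "/dev/$vg/$snap" "$mnt"; then
        rsync -a --delete "$mnt/" "$dest/" || status=1
        umount "$mnt" || status=1
    else
        status=1
    fi

    # Always remove the snapshot: its reserved space fills up as the
    # original volume keeps changing
    lvremove -f "/dev/$vg/$snap" || status=1
    return "$status"
}
```

A call such as snapshot_backup vg0 vz /backup/staging/vz would need to run as root on the OpenVZ host.  The snapshot is removed even when the copy fails, so a failed backup does not leave a snapshot behind to fill up.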

MySQL

MySQL is a database that comes with its own set of backup tools to ensure that you get a consistent copy of the database.  mysqldump locks the database when necessary so that it does not copy the database while some other program is in the middle of making a change to it.
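A minimal sketch of a dated dump is shown below.  The function name and file layout are placeholders, and credentials are assumed to come from the user's MySQL option file rather than the command line.

```shell
# Dump all databases into a dated, compressed file inside directory $1.
# Relies on mysqldump's default locking behaviour; credentials are
# assumed to be available from ~/.my.cnf (an assumption).
dump_mysql() {
    sql="$1/mysql-$(date +%F).sql"
    mkdir -p "$1" || return 1
    # Write the dump first so a failed dump is never mistaken for a good one
    if ! mysqldump --all-databases > "$sql"; then
        rm -f "$sql"
        return 1
    fi
    gzip -f "$sql"
}
```

Writing the dump before compressing it, instead of piping straight into gzip, keeps the exit status of mysqldump visible to the script.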

Subversion

Subversion is a version control system.  We use it for all of our development projects to ensure that each team of developers is working with the most recent version of the code.  Because all of our projects use Subversion, we have built up a large collection of code that would take a long time to fully back up each day.  We use Subversion tools along with some Bash scripting to compare the most recent version of the code in version control with the version that we last backed up, and only back up the changes.

Grep, Sed, and Others

There are many other tools used throughout the script.  The two biggest examples are grep, which searches its input for lines matching a regular expression, and sed, a stream editor which takes some input and changes it as specified.  If you want to know more about any of these tools, you can look them up in the Linux manual pages, or you can find more information on them using your favorite search engine.


Following are the variables that are used in this script.  Some of the values have been changed for security reasons.

OpenVZ Containers

For each server that is running OpenVZ containers, we want to make a full backup of each of the containers.

  1. We use LVM to create a snapshot of the filesystem.  This gives us a version of the files that we can copy that is consistent with itself at a given point in time.
  2. In the case that MySQL is running in the container, we want to make sure that we do not get a corrupt copy of the database just in case it was in the middle of a commit when the LVM snapshot was taken.  For this, we will use the standard MySQL backup tool mysqldump.  This will also allow us to easily examine an old version of the database without fully restoring the entire container.
  3. Finally, we copy the configuration for the container, and all of the files within the container.

This will give us enough to restore any of our containers to the point at which the backup was taken.
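Steps 2 and 3 might be sketched per container as shown below.  The configuration directory, the staging layout, and the function names are assumptions, and the MySQL credentials are assumed to be configured inside each container.

```shell
# Location of container configuration files; /etc/vz/conf is the usual
# default but is treated as an assumption here.
VZ_CONF_DIR="${VZ_CONF_DIR:-/etc/vz/conf}"

# Save the MySQL dump (if any) and the configuration of one container.
# $1 = container ID, $2 = staging root (hypothetical layout: <root>/<ctid>/)
backup_container_meta() {
    ctid="$1"
    staging="$2/$1"
    mkdir -p "$staging" || return 1

    # Step 2: dump MySQL only when the container has mysqldump installed
    if vzctl exec "$ctid" which mysqldump >/dev/null 2>&1; then
        vzctl exec "$ctid" mysqldump --all-databases \
            > "$staging/mysql.sql" || return 1
    fi

    # Step 3 (configuration part): keep the container's config file
    cp "$VZ_CONF_DIR/$ctid.conf" "$staging/"
}

# Loop over every running container reported by vzlist.
backup_all_containers() {
    for ctid in $(vzlist -H -o ctid); do
        backup_container_meta "$ctid" "$1" ||
            echo "backup of container $ctid failed" >&2
    done
}
```

A failure in one container is reported and the loop moves on, so one broken container does not stop the others from being backed up.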

Version Control (SVN)

We use Subversion for version control.  It’s very easy to dump a full backup of a Subversion repository, but when the repositories get large, this can start to take a lot of time and bandwidth.  It is better to only back up the new revisions.

  1. Get a list of all Subversion repositories.
  2. Find the head of each repository.  The head is the revision that was most recently committed, and it can be read from the output of svn info.
  3. Read the head that was recorded during the previous backup, which was saved to a file.  Together with the current head, this gives us the range of revisions committed since the last backup.
  4. Take an incremental backup of only the new revisions, and write the current head to the file.  The name of the backup file tells us which revisions it contains.

When restoring this, the files can be combined and loaded into a new repository, or they can be loaded into a new repository one after the other.  If you decide that there are too many files for one repository, you can simply delete all of the files for that repository, including the last revision file, from the staging area, and a full backup will be taken the next time the backup script runs.
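The steps above can be sketched for a single repository as follows.  This is not our actual script: the file names are placeholders, and the sketch reads the head with svnlook youngest instead of parsing svn info, which is a substitution made to keep the example short.

```shell
# Dump only the revisions committed since the previous run.
# $1 = path to the repository on disk
# $2 = staging directory for this repository's dump files
# The state file name last_revision is a hypothetical choice.
svn_incremental_dump() {
    repo="$1"
    dump_dir="$2"
    state="$dump_dir/last_revision"

    mkdir -p "$dump_dir" || return 1
    head=$(svnlook youngest "$repo") || return 1

    if [ -f "$state" ]; then
        last=$(cat "$state")
        # Nothing to do when no new revisions were committed
        [ "$head" -gt "$last" ] || return 0
        start=$((last + 1))
        svnadmin dump --quiet --incremental -r "$start:$head" "$repo" \
            > "$dump_dir/$start-$head.svndump" || return 1
    else
        # No state file: first run, or the files were deleted to force a full dump
        svnadmin dump --quiet -r "0:$head" "$repo" \
            > "$dump_dir/0-$head.svndump" || return 1
    fi

    # Record the head only after the dump succeeded
    echo "$head" > "$state"
}
```

To restore, the dump files would be loaded in revision order into a new repository with svnadmin load.  Because the state file is written last, a failed dump is simply retried on the next run.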

Backup Router Configuration

We use pfSense for our router software.  This allows us great flexibility at a very low cost.  pfSense has a web page with a button for downloading the configuration file.  Wget can be set to send the appropriate information to request the configuration file.  In the event that the router hardware fails, we can quickly install pfSense on new hardware, and load the configuration.

Crontab from the Backup Server

If our main backup server goes down, we want to make sure that we know everything that it was doing.  All of our important backup jobs run as a single user, so we only need to get one crontab here.
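Saving the crontab takes very little code.  In this sketch the destination path is a placeholder, and the function is assumed to run as the user that owns the backup jobs.

```shell
# Save the current user's crontab into the staging area.
# $1 = destination file, e.g. /backup/staging/crontab.backup (hypothetical)
save_crontab() {
    mkdir -p "$(dirname "$1")" || return 1
    # crontab -l prints the calling user's crontab
    crontab -l > "$1"
}
```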

Staging to Incremental Backup

Now that we have everything in the staging area, we want to copy it to a longer-term location.  We will do this using rdiff-backup, which will store multiple revisions of our files without making a full copy of each one.  We then need to remove old revisions of the files so that the repository does not get too large.  In this case, we keep 10 versions of the files.
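A sketch of this step using the classic rdiff-backup command-line form is shown below.  The paths are placeholders, and the retention argument is an assumption: 10D removes increments older than ten days, while 10B would instead count backup sessions.

```shell
# Name of the rdiff-backup executable; overridable for testing
RDIFF_BACKUP="${RDIFF_BACKUP:-rdiff-backup}"

# Push the staging area into the repository, then prune old increments.
# $1 = staging directory, $2 = rdiff-backup repository (both hypothetical)
incremental_backup() {
    "$RDIFF_BACKUP" "$1" "$2" || return 1
    # --force allows more than one old increment to be removed in a single run
    "$RDIFF_BACKUP" --remove-older-than 10D --force "$2"
}
```

Pruning runs only after a successful backup, so a failing backup never shortens the history that is already stored.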

Offsite Backup

The final step is to make sure that we have multiple copies of our big backup.  The rdiff-backup repository is copied to a second onsite server, for easy restore in the event of hardware failure on the backup machine, and to an offsite server.
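Replication to several servers could be sketched as a simple loop.  The target names are placeholders, and copying over SSH with key-based login is an assumption.

```shell
# Mirror the rdiff-backup repository to every target given.
# $1 = local repository, remaining arguments = rsync targets,
# e.g. backup2:/backup/repo offsite.example.com:/backup/repo (hypothetical)
replicate_repo() {
    repo="$1"
    shift
    failed=0
    for target in "$@"; do
        # --delete keeps each copy identical to the repository, including
        # increments that were pruned locally
        rsync -a --delete -e ssh "$repo/" "$target/" || {
            echo "replication to $target failed" >&2
            failed=1
        }
    done
    return "$failed"
}
```

Every target is attempted even if an earlier one fails, and the non-zero return value lets the calling script report the problem in the emailed output.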


By gathering all of our information into one place, then taking an incremental backup and sending that to remote servers, we have easily created a backup script that stores incremental versions of all of our important data in one place.


Secure Data Backup and Recovery Best Practices

Reasons to back up:

  • Hardware failure: hard drives and other computer components fail regularly.
  • Destruction of your site: fire or flooding.
  • Accidental data loss: somebody unintentionally deletes one or more files.
  • Software error: a software error could corrupt one or more of your files.

The difference between backup and redundancy

Most modern servers use RAID for redundancy.  This usually means that when you put something on the server, it gets written to two hard drives.  If one hard drive fails, you can replace it with no disruption to service.  This is an example of redundancy, and it protects you against a very specific type of failure.  Some organizations, like Google, run multiple servers so that an entire server can go down without disruption to service.  Redundancy is important so that services to your company or to your clients are not disrupted, but it is not a replacement for backups.

Backups involve creating a second copy of your data.  Often, multiple backups from different points in time will be maintained.  Restoring a backup takes time, and can result in disruption of service.  Backups help protect against user errors like accidental deletion, and against software errors that could corrupt data.  Having more than one copy of your files is very important for recovery from these errors.

What to back up

Email

If you host your own email, this is probably one of the most important things to back up.  If your email is hosted, see if you can back up your email, or find out what your provider is doing to back it up.

Document Management System

If you have a document management system or some other type of server that you store your files on, this is also a very important group of files to back up.


If you have a lot of people, it can be very difficult to back up each desktop that you have.  A better solution is to make sure that all important files are stored on a server so you only need to back up a single server.  You should have a plan in place for what happens when a desktop goes down in order to reduce the impact on the productivity of the person who was using it.

Web Application Data

If your website collects data, or if you have any web applications that collect data, make sure that you back up the databases and other information stores.  SQL databases have backup programs that can be run.  You can set your operating system to run the backup regularly.  Also, if there is space where users can upload files, make sure to back up the files as well.

Custom Software

Custom software can be expensive to develop.  Make sure that you have a backup of any custom software that you have had built so that you can easily deploy it again if you need to.

Information in “The Cloud”

You may have documents or other information that is hosted on the internet.  If at all possible, you should take your own backups.  If you cannot take your own backups, how reliable is the service provider?  Think carefully about each of the services that you are subscribed to.  These could be Salesforce, Google Docs, email providers, Online stores (do you need your amazon.

Any other products that are generated

Each company produces different types of products.  Make sure that you go through the list of products that you have and ask yourself what would happen and how you would recover if you lost some data.  Make sure to think about digital components that are created to support physical products.


The number of backup programs that exist and techniques that are available are too numerous to count, so I will only cover a few concepts.

Important Concepts

Local and Offsite

It is a good idea to have a copy of your data locally and a copy of your data offsite.  The offsite backup is important in the event that your office is destroyed.  The location of the offsite backup could also be destroyed, so having an onsite backup will help protect you against that.  Another reason to keep an onsite backup is that if you need to restore an item, the backup is immediately available.

Automated Vs. Manual

Human memory is flawed, people take vacations and sick time, and staffing changes.  If your backup is automated, these events will have a smaller impact on your backup system.  An automated backup system can grab all of your data every night, and push it to an offsite location over the internet or a T1 pipe.  There may be parts of the backup that must be done manually, like changing media, or a few parts of the backup that are difficult to automate, but the more automatic it is, the better.  Still, make sure to check it periodically to make sure that it is still taking the backup and you are not getting errors.

Backup Media

Depending on how much you need to back up, you may want different types of media.  Tape drives are popular for large quantities of data.  CD or DVD media is a cheap way to back up small amounts of data, but make sure that you get high quality media, and check it from time to time to make sure that it is not deteriorating.  Hard drives are very convenient since they are rewritable.  Keep in mind that hard drives fail.  Using RAID for redundancy is a good way to help protect yourself from hard drive failure.

Ensure it works

After you have created your backup system and are making sure that it runs, make sure to test it by restoring data from the backup.  If you cannot restore a backup, it is not really a backup.


If you are backing up confidential or sensitive data, make sure that the backup is at least as secure as the data that you are backing up.  A compromised backup is just as bad as a compromise of the original data, since the backup contains a copy of everything.  If the backup is encrypted, make sure that enough people have the key, and that the key is backed up somewhere.  If you encrypt the data and lose the key, you lose the backup.


Make sure that you always know what backup is current, what backup is old, and when you’ve taken the backups.  Your software may take care of this for you.  If it does not, putting dates on the folders, or a file with the information within the folder can be good ways to keep organized.  You also need to know where it is going and how to restore it.  Make sure that this is documented in case the person who built the backup is not available when a restore is needed.  You can also streamline the backup process by making sure that the files that need backing up are organized in as few places as possible.

Human Redundancy

Make sure that multiple people know where the backup is and how to restore it.  It is a good idea to have some documentation on the backup system, how the backups are created, any manual steps and when they need to be performed, and how to restore each component of the backup, either fully or partially.


Virtual Machines

Virtual machines typically have a way to make a copy of the entire machine.  This can be used to back up an entire system, from all of the software that is installed to the data files that are on the machine.

rsync and rdiff-backup

rsync is used to copy files from one place to another, optionally over a network or the internet.  When it runs over SSH, the transfer is encrypted.  Only the files that have changed are copied, which reduces the bandwidth required in most cases.  rdiff-backup is similar, but it keeps incremental backups so that you can revert to a previous version of your backup.  To save space, an older version of a file is kept only if the file has changed.

dd

dd can make a copy of a hard drive.  This can be used to make an image of your operating system hard drive from time to time.  If you ever need to restore, you can reload the image to the drive, and all of your software will be ready to go.
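As a small sketch, imaging a drive comes down to one dd call.  The device and image paths are placeholders, and the drive is assumed not to be in use while it is copied.

```shell
# Copy a whole drive (or any file) block by block into an image file.
# $1 = source, e.g. /dev/sda (hypothetical); $2 = image file to write
image_disk() {
    # Read and write in 1 MiB blocks
    dd if="$1" of="$2" bs=1048576
}
```

Restoring is the same call with the arguments swapped, for example image_disk /backup/sda.img /dev/sda.  Because dd overwrites its target without asking, the device name is worth checking twice.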


There is enough backup software out there that I could not hope to cover all of it.  Look at your needs and find the group of software that fills your needs.


Make sure you think carefully about why you are backing up your data.  Your reasons for backing up your data will have an impact on how you back it up.  What would happen to you or your business if you lost some or all of your files?  Who is going to be able to restore your files?