
Rsync – A Top-View Introduction

Even with today’s ultra-fast fibre-optic data lines, the real-world throughput people get when transferring files is still limited. This is compounded by the fact that most users have low upload speeds compared to their download speeds. For example, a typical home internet connection might promise 4Mbit/s downstream while allocating only 256Kbit/s for upload. To make matters worse, today’s files are many times larger than their ancestors. If your hair has greyed (or fallen out) in sufficient quantities, you probably recall a time when a 1.2MB floppy disk would hold all your documents as well as the word processing program itself, with space to spare. Today, a single document with a few embedded images can easily be several times larger than that entire floppy disk!

In this environment, when one discusses online backups, the critical issue is to minimise the amount of data that has to be uploaded from the person’s computer (the source) to that person’s archive vault (the destination). Many people would consider online backups only if the lengthy wait periods are few and far between.

On a regularly backed-up system, the state of the source compared to the destination can be as follows:

* A new file on the source that does not exist on the destination – the file must be copied over to the archive vault;
* A file on the destination that no longer exists on the source – the file must be removed from the archive vault;
* A file on the source and destination that are identical – no need to do anything;
* A file on the source that differs from the one on the destination.

On a system that is regularly backed up, one would normally find few files that are new or that have been deleted. The majority of files would either be unchanged or have been altered. As I will demonstrate shortly, the Rsync algorithm for transferring data from source to destination is greatly suited to situations in which a file has been altered. There are two ways a system can handle altered files: resending the altered file in its entirety, or transferring only the changed pieces. The Rsync algorithm does the latter.

The Rsync algorithm was developed by Andrew Tridgell and Paul Mackerras, and backups taken with it are fast and efficient. Imagine a database, worksheet or word processing document in which the author changed only one record, cell or paragraph. Or take your PIM database: on a daily basis you receive new emails, delete junk and old mail, set up appointments and have the system remove expired ones. An Rsync backup will transmit a few megabytes of changes rather than make you wait until every modified file (possibly running into gigabytes) is uploaded in full.

The Rsync algorithm

The file on the archive server (the destination file that needs to be updated) is split into a number of blocks of equal size (the last block may be shorter). For each block a signature is generated. The signature consists of a quick-to-compute 32-bit checksum (I’ll explain its rolling capability later on) as well as a 128-bit checksum; current versions of Rsync use the MD5 (Message Digest) algorithm for the latter. The block number and the associated signatures for each block are transmitted to the source computer. The source computer takes the cheap 32-bit checksum of each block and derives a simple 16-bit hash value from it.
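To make the signature step concrete, here is a minimal Python sketch; it illustrates the idea rather than reproducing Rsync’s actual implementation. The two-part weak checksum mimics the structure of Rsync’s rolling 32-bit checksum, the strong checksum uses MD5 as current Rsync releases do, and the 700-byte block size and the XOR folding used for the 16-bit hash are illustrative assumptions only.

    import hashlib

    BLOCK_SIZE = 700  # Rsync picks a block size per file; 700 bytes is illustrative

    def weak_checksum(block):
        # Two 16-bit halves, in the spirit of Rsync's rolling checksum:
        # 'a' is a plain byte sum, 'b' is a position-weighted sum.
        a = sum(block) & 0xFFFF
        b = sum((len(block) - i) * byte for i, byte in enumerate(block)) & 0xFFFF
        return (b << 16) | a

    def strong_checksum(block):
        # 128-bit strong check; current Rsync releases use MD5 here.
        return hashlib.md5(block).digest()

    def block_signatures(data, block_size=BLOCK_SIZE):
        # Destination side: one (block index, weak, strong) triple per block.
        return [(i // block_size,
                 weak_checksum(data[i:i + block_size]),
                 strong_checksum(data[i:i + block_size]))
                for i in range(0, len(data), block_size)]

    def hash_table(signatures):
        # Source side: bucket the signatures by a cheap 16-bit hash of the
        # weak checksum, so candidate blocks can be looked up in O(1).
        table = {}
        for sig in signatures:
            key = (sig[1] & 0xFFFF) ^ (sig[1] >> 16)  # fold 32 bits down to 16
            table.setdefault(key, []).append(sig)
        return table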
Simplistically speaking, the source end transmits to the destination instructions for reconstructing the file, using either literal data sent from the source or a directive telling the destination to reuse a block it already holds.

At the source, a block of the same size as those computed at the destination is analysed and the low-overhead 32-bit checksum is generated. Its 16-bit hash is compared against those computed from the destination’s 32-bit checksums. If no match is found, the code that deals with a nonexistent block kicks in. If the hashes match, the full 32-bit checksums are compared and, if they too agree, the 128-bit checksum is computed on the source block and compared against the 128-bit checksums received from the destination that share that hash value (there may be more than one). If no match is found, the code that deals with a nonexistent block kicks in. If a destination block is matched, the source sends an instruction telling the destination to copy the block with that index into the new output file. The position at the source is then advanced by the length of a block and the process loops.

The logic behind the nonexistent block is as follows:

1. Transmit the character at the beginning of the block to the destination, to be appended to the file being reconstructed;
2. Advance the block window by one character and repeat the process.

Point 2 is where the rolling 32-bit checksum comes into play. Because the checksum rolls, the computational overhead of producing the checksum for the shifted block is minimal: all that is necessary is to subtract the value of the byte that just left the start of the block and add the value of the byte at the new (shifted) end of the block. The alternative, summing the whole range again, would consume far more computing cycles. (A sketch of the rolling update and the matching loop appears at the end of this article.)

The performance difference between a copy-everything backup and one using Rsync can be more than 110-fold. Rsync can also compress the data it transmits, and transfers are commonly secured by running them over SSH. For online backups, the Rsync method is the best way to guarantee that backups are fast, secure, non-intrusive and actually get done.

For more in-depth information on Rsync, you might want to visit the Rsync home page at http://Rsync.samba.org/ as well as spend a couple of evenings reading through Andrew Tridgell’s PhD thesis, Efficient Algorithms for Sorting and Synchronization.
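The rolling update and the matching loop described above can be sketched in Python as follows, reusing the helpers from the earlier snippet. Again, this is an illustration of the technique rather than Rsync’s real code: actual Rsync batches literal bytes into runs instead of emitting them one at a time, and its checksums differ in detail.

    def roll_weak(a, b, out_byte, in_byte, block_size):
        # Slide the weak-checksum window one byte in O(1): subtract the byte
        # leaving the window, add the byte entering it, adjust the weighted sum.
        a = (a - out_byte + in_byte) & 0xFFFF
        b = (b - block_size * out_byte + a) & 0xFFFF
        return a, b

    def generate_delta(source_data, table, block_size=BLOCK_SIZE):
        # Source side: emit ('copy', block_index) when the destination already
        # holds a matching block, ('literal', byte) when it does not.
        delta, pos = [], 0
        a = b = None  # weak-checksum halves for the current window
        while pos + block_size <= len(source_data):
            block = source_data[pos:pos + block_size]
            if a is None:  # fresh window at the start or right after a match
                a = sum(block) & 0xFFFF
                b = sum((block_size - i) * byte
                        for i, byte in enumerate(block)) & 0xFFFF
            weak = (b << 16) | a
            key = (weak & 0xFFFF) ^ (weak >> 16)
            # 16-bit hash hit -> confirm with the 32-bit sum, then the MD5.
            hit = next((idx for idx, w, s in table.get(key, [])
                        if w == weak and s == strong_checksum(block)), None)
            if hit is not None:
                delta.append(('copy', hit))
                pos += block_size       # jump a whole block on a match
                a = None                # force a fresh checksum for the next window
            else:
                delta.append(('literal', source_data[pos]))
                if pos + block_size < len(source_data):
                    a, b = roll_weak(a, b, source_data[pos],
                                     source_data[pos + block_size], block_size)
                pos += 1                # roll forward a single byte on a miss
        delta.extend(('literal', byte) for byte in source_data[pos:])
        return delta

For instance, generate_delta(new_bytes, hash_table(block_signatures(old_bytes))) yields an instruction list from which the destination can rebuild new_bytes using its own copy of old_bytes plus the literal bytes transmitted.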
