EMC Avamar – HOW SOURCE & GLOBAL DEDUPE MAGIC HAPPENS

EMC Avamar is magical, and if you don’t believe me, read this first. To understand how the dedupe magic works in an EMC Avamar solution, it helps to first understand the methods of data deduplication commonly found today. Below is a brief explanation of the top three.

File-level (the most common)

- File-level deduplication identifies duplicate files within a volume. It commonly hashes the file name, size, last-modified date, and similar metadata to produce a unique hash/fingerprint for each file. If two copies of a file exist and a user opens one and changes a single letter or adds a period, that copy’s hash changes, and the file will no longer be de-duplicated.

This is what you’d find on an EMC Celerra or VNX CIFS share. This is good for environments with lots of file server type data and users that tend to share data between them. Our customers seem to be in the 20-25% space savings range using this technology on their VNX / Celerra CIFS shares, but your mileage may vary.
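As a rough illustration of the file-level approach described above, here is a minimal Python sketch. The function names, the `|`-joined metadata record, and the choice of SHA-1 are assumptions made for the example, not how a Celerra or VNX actually builds its fingerprints.

```python
import hashlib

def file_fingerprint(name, size, mtime):
    """Hash a file's metadata (not its contents) into a hex fingerprint."""
    record = f"{name}|{size}|{mtime}".encode("utf-8")
    return hashlib.sha1(record).hexdigest()

def find_duplicates(files):
    """Return the names of files whose fingerprint was already seen."""
    seen = set()
    duplicates = []
    for name, size, mtime in files:
        fp = file_fingerprint(name, size, mtime)
        if fp in seen:
            duplicates.append(name)  # identical metadata: de-duplicated
        else:
            seen.add(fp)
    return duplicates

files = [
    ("report.doc", 1024, 1700000000),  # original
    ("report.doc", 1024, 1700000000),  # untouched copy
    ("report.doc", 1025, 1700000500),  # copy with one character added
]
print(find_duplicates(files))  # only the untouched copy is reported
```

The edited copy gets a new size and timestamp, so its fingerprint no longer matches and the whole file is stored again.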

Fixed Block Level (also called fixed length)

- Common in snapshot/replication technologies, this method breaks a file into blocks of a fixed size, and any block that matches one already stored is de-duplicated. If you have two files that break down into the following bits, then with a 4-bit block length the repeated 1111 blocks would be de-duplicated.

1111 1111 0001 / 1111 1110 0000

This will usually have a higher deduplication percentage than file level, as files of different types that have identical segments (sub-file level) will still see some level of deduplication. But this isn’t the best solution, and “fixed” itself is an antonym for flexible, which is what we all want our technology to be.
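The 4-bit example above can be worked through in a few lines of Python. This is only a sketch: real products use blocks of kilobytes rather than bits, and the function names here are made up for illustration.

```python
def split_fixed(bits, block_len=4):
    """Cut a bit string into consecutive blocks of block_len bits."""
    return [bits[i:i + block_len] for i in range(0, len(bits), block_len)]

def dedupe_blocks(files, block_len=4):
    """Return (unique blocks in first-seen order, total block count)."""
    unique = []
    seen = set()
    total = 0
    for bits in files:
        for block in split_fixed(bits, block_len):
            total += 1
            if block not in seen:
                seen.add(block)
                unique.append(block)
    return unique, total

# The two files from the example, with the spaces removed.
unique, total = dedupe_blocks(["111111110001", "111111100000"])
print(unique, total)  # ['1111', '0001', '1110', '0000'] 6
```

Six blocks go in and four are stored, because the 1111 block only needs to be kept once.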

Variable block level (what Avamar uses)

- This uses an intelligent method to determine segment length based on the data itself. It provides much greater granularity in finding duplicate data.

Avamar’s intelligence allows it to look at a file, read the header and say “this is an X type of file, so I’m going to break it into these variable segments”. This provides the greatest level of deduplication. So with that foundation, we’ll move into more details as to the process Avamar uses to dedupe at the source, and also globally across all clients.
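To show why letting the data pick the segment boundaries helps, here is a generic content-defined chunking sketch in Python. It is not Avamar’s algorithm, which is proprietary; the window size, the divisor, and the simple byte-sum test are all assumptions chosen to keep the example short.

```python
def chunk_variable(data, window=4, divisor=16):
    """Cut after any byte where the sum of the last `window` bytes
    is divisible by `divisor`."""
    chunks = []
    start = 0
    for i in range(window - 1, len(data)):
        # The cut decision depends only on the bytes inside the window,
        # never on the absolute offset in the file.
        if sum(data[i - window + 1:i + 1]) % divisor == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder
    return chunks

original = b"The quick brown fox jumps over the lazy dog. " * 3
edited = b"X" + original  # insert one byte at the front

orig_chunks = chunk_variable(original)
edit_chunks = chunk_variable(edited)
shared = [c for c in orig_chunks if c in edit_chunks]
print(f"{len(shared)} of {len(orig_chunks)} chunks survive the one-byte insert")
```

With fixed blocks, a single inserted byte shifts every block that follows, so nothing after the edit matches. Here the boundaries move with the content, so every chunk after the first one is found again unchanged.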

THE PROCESS

Note: Most of the following process happens on the CLIENT itself, which is where “source dedupe” happens. Keep this in mind when reading through this post and watching the video at the bottom. The benefits are faster backups and substantially less bandwidth consumption, which is ideal in environments that have smaller WAN pipes between offices or backups that run past their backup windows.

When a file is backed up to an Avamar server, the file is hashed using its metadata (file size, date modified, etc.), and this 20-byte SHA-1 hash is stored in the “file cache” file locally on the client. When the client next goes to back up its data, the file cache is loaded into RAM, and each file is hashed and compared to the hashes in the cache. If a file’s hash matches, the file has already been backed up and has not been modified, so Avamar moves on to the next file. Because this is done in memory, it happens VERY quickly, and it’s common to see 95-98% of data on a traditional file system be satisfied against the file cache.
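The file cache check can be pictured roughly like this. It is a sketch under assumptions: the in-memory cache is modeled as a Python set, and the exact metadata fields and their encoding are invented for the example rather than taken from Avamar.

```python
import hashlib

def metadata_hash(path, size, mtime):
    """20-byte SHA-1 digest of a file's metadata."""
    return hashlib.sha1(f"{path}|{size}|{mtime}".encode("utf-8")).digest()

def files_needing_backup(files, file_cache):
    """Return paths whose metadata hash is missing from the file cache."""
    changed = []
    for path, size, mtime in files:
        if metadata_hash(path, size, mtime) in file_cache:
            continue  # already backed up and unmodified: skip it
        changed.append(path)
    return changed

# Cache as it might look after last night's backup (hypothetical files).
file_cache = {
    metadata_hash("/home/a.txt", 10, 100),
    metadata_hash("/home/b.txt", 20, 200),
}
tonight = [
    ("/home/a.txt", 10, 100),  # untouched
    ("/home/b.txt", 25, 950),  # edited since the last backup
]
print(files_needing_backup(tonight, file_cache))  # ['/home/b.txt']
```

Only the files returned here move on to the segment-level steps that follow.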

Assume the file HAS been changed, and therefore the hash of the file no longer matches the hash in the file cache. From here, the data will be broken into variable segment lengths through Avamar’s patented sticky byte factoring process. These segments are then compressed and hashed, and the resulting hash is written to a HASH CACHE file locally on the client. This file is also loaded into RAM when a backup is initiated.
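The compress-then-hash step can be sketched as below, following the order given in the paragraph above. The use of zlib for compression and SHA-1 for the segment hash are assumptions for the example; the post only names SHA-1 for the file cache.

```python
import hashlib
import zlib

def process_segment(segment):
    """Compress a segment, then hash the compressed bytes.
    Returns (20-byte digest, compressed data)."""
    compressed = zlib.compress(segment)
    digest = hashlib.sha1(compressed).digest()
    return digest, compressed

hash_cache = set()  # stand-in for the on-disk hash cache loaded into RAM
for segment in (b"header block " * 8, b"payload block " * 8):
    digest, compressed = process_segment(segment)
    hash_cache.add(digest)  # remember the segment by its hash
    print(digest.hex(), len(segment), "->", len(compressed), "bytes")
```

Identical segments always produce the same digest, which is what lets the later steps recognize data that has already been seen.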

When the client begins its backup, the hash of each segment is compared against the hashes in the hash cache. If it matches any hash (which could be from a different file), the client will not send the data to the Avamar system, because it knows that specific piece of data has already been backed up.

If Avamar finds the specific segment has not been backed up (the hash doesn’t exist in the hash cache), it will take that hash and send ONLY the hash to the Avamar grid itself, and check to see if that segment has been backed up by ANY OTHER client in the environment. This reduces bandwidth, as only the hash gets transmitted, not the data itself, and is where the “global” dedupe magic happens. If some other client has already backed up that segment, Avamar adds it into the hash cache and marks it as backed up without having to send the data across the wire. Magic!
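Putting the local and global checks together, the decision for each segment might look like the following sketch. `FakeGrid` and `backup_segment` are hypothetical names, the grid is just a dictionary, compression is skipped for brevity, and the final "send the data" branch is the implied outcome when no client has the segment yet.

```python
import hashlib

class FakeGrid:
    """Stand-in for the Avamar grid: maps segment hashes to stored data."""
    def __init__(self):
        self.store = {}

    def has_hash(self, digest):
        return digest in self.store  # only the hash crosses the wire

    def put(self, digest, data):
        self.store[digest] = data    # the one case where data is sent

def backup_segment(segment, hash_cache, grid):
    digest = hashlib.sha1(segment).digest()
    if digest in hash_cache:
        return "local-hit"           # this client already backed it up
    if grid.has_hash(digest):
        hash_cache.add(digest)
        return "global-hit"          # some other client backed it up
    grid.put(digest, segment)
    hash_cache.add(digest)
    return "sent"

grid = FakeGrid()
cache_a, cache_b = set(), set()      # one hash cache per client
segment = b"common operating system block"

print(backup_segment(segment, cache_a, grid))  # sent
print(backup_segment(segment, cache_b, grid))  # global-hit
print(backup_segment(segment, cache_b, grid))  # local-hit
```

The segment’s data crosses the network once, for the first client. The second client learns from the grid that the segment exists, and on later backups its own hash cache answers without any network traffic at all.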

For the record, my MacBook Air keeps autocorrecting “deduplication” into “reduplication”, in case you see that in any of these posts. And as always, questions/comments/complaints/insults are welcome– @Rattacasa on Twitter.
