IT Is Power. Try Living Without A Single Breadboard For A Day.

Don MacVittie




Stop Repeating Yourself. Deduping WAN-Opt Style

Deduplication as Part of Your WAN Optimization Strategy

Ever hang out with the person who just wants to make their point and, no matter what the conversation is about, says the same thing over and over in slightly different ways? Ever want to tell them they were doing their favorite cause/point/whatever a huge disservice by acting like a repetitive fool? That’s what your data is doing when you send it across the WAN. Ever seen the data in a database file? Or in your corporate marketing documents? R E P E T I T I V E. And under a normal backup or replication scenario – or a remote office scenario – you are sending the same sequence of bytes over and over and over. Machines may be quad word these days, but your pipe is still measured in bits. That means even most of your large integers have 32 bits of redundant zeroes. Let’s not talk about all the places your corporate logo is in files, or how many times the word “the” appears in your documents.

It is worth noting for those of you just delving into this topic that WAN deduplication shares some features and even technologies with storage deduplication. But the WAN has to handle an essentially unlimited stream of data running through it, and it does not have to store that data or keep differentials moving forward, so it is a very different beast than disk-based deduplication. WAN deduplication is more along the lines of “fire and forget” (though forget is the wrong word, since it keeps duplicate info for future reference), while storage deduplication is “fire and remember exactly what we did”.

Thankfully, your data doesn’t have feelings, so we can offer a technological solution to its repetitive babbling. There are a growing number of products out there that tell your data “Hey! Say it once and move on!” These products either are, or implement, in-flight data deduplication. These devices require a system on each end – one to dedupe, one to rehydrate – and there are a variety of options the developer can choose, along with a few that you can choose, that make the deduplication higher or lower quality. Interestingly, some of these options are perfect for one customer’s data set and not at all high-return for others.
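To make the "one to dedupe, one to rehydrate" idea concrete, here is a minimal sketch in Python. It assumes fixed-size chunks and a key that is simply the order in which a chunk was first seen. The names `Deduper`, `Rehydrator`, and `CHUNK` are made up for illustration, and no vendor necessarily works this way.

```python
CHUNK = 8  # hypothetical fixed chunk size in bytes


class Deduper:
    """Sending side: replaces chunks it has already sent with a short key."""

    def __init__(self):
        self.seen = {}  # chunk bytes -> integer key

    def encode(self, data: bytes):
        tokens = []
        for i in range(0, len(data), CHUNK):
            chunk = data[i:i + CHUNK]
            if chunk in self.seen:
                # Already sent once: send only the key.
                tokens.append(("ref", self.seen[chunk]))
            else:
                # First sighting: remember it and send the raw bytes.
                self.seen[chunk] = len(self.seen)
                tokens.append(("raw", chunk))
        return tokens


class Rehydrator:
    """Receiving side: rebuilds the original bytes from raw chunks and keys."""

    def __init__(self):
        self.known = []  # index == key, filled in the same order as the sender

    def decode(self, tokens) -> bytes:
        parts = []
        for kind, value in tokens:
            if kind == "raw":
                self.known.append(value)
                parts.append(value)
            else:
                parts.append(self.known[value])
        return b"".join(parts)


sender, receiver = Deduper(), Rehydrator()
payload = b"ABCDEFGH" * 4 + b"12345678"
wire = sender.encode(payload)
restored = receiver.decode(wire)
```

Both ends build the same table in the same order, so the key never has to be negotiated separately. That only holds in this sketch because nothing is ever evicted; a real device has to keep the two caches in step as entries age out.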

So I thought we’d talk through them generically, giving you an idea of what to ask your vendor when you consider deduplication as part of your WAN Optimization strategy.


The type of cache can have a pretty big impact on the performance of your WAN deduplication system. Unlike disk-based deduplication, you need a place to store the original data and the key while processing the stream. Remember that your data flows through the device in a relatively non-stop manner, and the device has to make a best effort to dedupe it on the fly. To achieve this goal, a cache is created that includes the most recently used entries, the key that is actually sent in their stead, when (or how frequently) they were last used, and the actual bits replaced. The location/technology of this cache is what we’re considering. If it is on disk, it can be pretty huge, but will have disk access speed as a latency factor. If it is in memory, the size will be much smaller, but it will be faster. The hybrid model uses write-through caching to make the most of both. You still have lag when looking up something on disk, but then it is in the memory cache and future accesses are at memory speed.

Which version your vendor uses will impact how much benefit you get from deduplication. If the cache is in-memory, it is likely smaller, meaning that high-volume or highly variable data will cycle things out of the cache relatively frequently, and thus cache hits will be fewer. A large disk cache means that you’ll have plenty of entries, but they take longer to process. The hybrid version means that the most frequently (or recently, depending upon implementation) used bits will be in the memory cache and other stuff will have to come off of disk. There is still the possibility of thrashing in a high-change environment, but you get faster speeds for in-memory hits and only go to disk for less frequent hits. In theory, a winning combination.
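The hybrid model can be sketched roughly as follows. This is an illustration only: the "disk" tier is a plain dictionary standing in for a large on-disk store, the memory tier uses least-recently-used eviction (one of the possible policies mentioned above), and the class and attribute names are invented for the example.

```python
from collections import OrderedDict


class HybridCache:
    """Small, fast memory tier in front of a large, slow tier."""

    def __init__(self, memory_entries: int):
        self.memory = OrderedDict()  # fast tier, kept in LRU order
        self.disk = {}               # stand-in for the big on-disk cache
        self.memory_entries = memory_entries
        self.memory_hits = 0
        self.disk_hits = 0

    def put(self, key, chunk):
        # Write-through: every entry lands in the slow tier as well.
        self.disk[key] = chunk
        self._promote(key, chunk)

    def get(self, key):
        if key in self.memory:
            self.memory.move_to_end(key)  # mark as most recently used
            self.memory_hits += 1
            return self.memory[key]
        if key in self.disk:
            # Slow path: pay the disk latency once, then keep it in memory.
            self.disk_hits += 1
            chunk = self.disk[key]
            self._promote(key, chunk)
            return chunk
        return None

    def _promote(self, key, chunk):
        self.memory[key] = chunk
        self.memory.move_to_end(key)
        while len(self.memory) > self.memory_entries:
            self.memory.popitem(last=False)  # evict least recently used


cache = HybridCache(memory_entries=2)
for key, chunk in [(1, b"aaaa"), (2, b"bbbb"), (3, b"cccc")]:
    cache.put(key, chunk)

first = cache.get(1)   # evicted from memory earlier, so this comes from "disk"
second = cache.get(1)  # now served from memory
```

With only two memory slots and three entries, the first lookup of key 1 goes to the slow tier and the second is served from memory. Shrink the memory tier or widen the working set and the thrashing described above shows up as a growing `disk_hits` count.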


Hand-in-hand with the type of cache are two variables that some vendors make user-level – cache entry count and cache hit size. The entry count is the number of entries the cache maintains during processing. Because data is constantly flowing through it, and any resend of the same large volume of data is an anomaly not to be expected, the cache must be fluid enough to adapt to changing data sets flowing through it, but small enough that searching it is nearly trivial in terms of access times. If the vendor made the key that is sent in lieu of the data dependent upon the cache size, then growing the cache can grow the number of bits that must be sent, reducing the effectiveness of deduplication.
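A quick bit of arithmetic shows why a key tied to the cache size eats into your savings. The sketch below assumes the simplest possible scheme, a fixed-width key just wide enough to index every cache entry. That is an assumption for illustration, not a description of any particular product, and the function names are hypothetical.

```python
def key_bits(entry_count: int) -> int:
    """Smallest fixed key width, in bits, that can index entry_count entries."""
    return max(1, (entry_count - 1).bit_length())


def bits_saved_per_hit(match_bytes: int, entry_count: int) -> int:
    """Bits kept off the wire when a match_bytes-long match is replaced by a key."""
    return match_bytes * 8 - key_bits(entry_count)


# Same 4-byte match, three hypothetical cache sizes.
for entries in (256, 65_536, 16_777_216):
    print(entries, key_bits(entries), bits_saved_per_hit(4, entries))
```

Under that assumption, going from 256 entries to about 16.7 million grows the key from 8 bits to 24 bits, so a 4-byte match that used to save 24 bits now saves 8.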



The cache hit size is another performance variable that means a lot, and it is interrelated with the cache entry count. If the number of bytes that must match a cache entry is small, then the number of “hits” will be larger, but the benefit of any given hit will be reduced. If you can replace “Theatrical” with 0x01 on the pipe, you’ve saved 90% of the bandwidth that word required. On the other hand, if you replace “The” with 0x01, you’ve saved only two bytes, which is almost nothing. Considering that a ton of stuff doesn’t get deduplicated, saving what you can on the stuff that does is important… But what if you have the length of matching bytes set such that you rarely get a cache hit? Then you’re saving near zero on the pipe because deduplication isn’t being utilized. This cache hit size is important, and the best value will vary with your data and the volume of data you’re passing through it. I don’t have a magic formula for you. There has been considerable research done into the “perfect” cache hit size, and no one else has the magic formula either. Your vendor’s sales engineers can advise you on what they use, and if the settings are adjustable by you, what they recommend. That’s the best you’re going to get prior to deploying it on your WAN and testing different loads.
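The trade-off is easy to play with in a few lines of Python. This sketch assumes a one-byte key (as in the 0x01 example) and naive fixed-size chunking; the function names and the sample data are invented for illustration.

```python
def savings_fraction(match_len: int, key_len: int = 1) -> float:
    """Fraction of bytes saved when match_len bytes are replaced by a key."""
    return (match_len - key_len) / match_len


def bytes_on_wire(data: bytes, hit_size: int, key_len: int = 1) -> int:
    """Bytes sent when data is cut into hit_size chunks and repeats become keys."""
    seen = set()
    sent = 0
    for i in range(0, len(data), hit_size):
        chunk = data[i:i + hit_size]
        if chunk in seen:
            sent += key_len      # cache hit: only the key goes out
        else:
            seen.add(chunk)
            sent += len(chunk)   # cache miss: the raw bytes go out
    return sent


print(savings_fraction(len("Theatrical")))  # 10 bytes replaced by 1
print(savings_fraction(len("The")))         # 3 bytes replaced by 1

sample = b"abcd" * 8  # 32 bytes of very repetitive data
for hit_size in (2, 4, 32):
    print(hit_size, bytes_on_wire(sample, hit_size))
```

On this toy input, a hit size of 2 sends 18 bytes, a hit size of 4 sends 11, and a hit size of 32 never gets a hit and sends all 32. Too small and each hit saves little; too large and you stop getting hits at all. Where the sweet spot sits depends entirely on your data.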


Unlike Load Balancing, the actual guts of deduplication – the algorithms used – are not as important as the factors above unless your throughput is extremely high. This is simply because the algorithm used is a small fraction of the latency that a deduplication engine can introduce into your network. As such, asking engineers how they do it isn’t nearly as important as asking about the above three things. The algorithm can have a large impact, but largely as it relates to access times and seeking. It is definitely worth looking at, but remember that most vendors don't hand out their secret sauce, and the real key is how fast they can turn up duplicates.


There are a couple of things you can do to get the most out of your deduplication system, and once you hear them they will probably be obvious. But they’re worth mentioning.

Compressed and encrypted data gets fewer “hits” than raw data simply by virtue of being compressed and/or encrypted. While some compression algorithms include deduplication, others do not, so you may or may not be including duplicate data “under the covers”. I’m not suggesting that you never compress or encrypt anything, only that you not compress and/or encrypt simply because you’re sending those things over the WAN. It used to be straight-up common sense to zip a file before copying it over the WAN. That may not be the best choice in a deduplication environment.
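You can see the effect with nothing more than Python's standard `zlib` module. The sketch below builds two versions of a made-up document that differ by a single byte, then counts how many fixed-size chunks of the second version were already present in the first, once on the raw bytes and once on the compressed bytes. The chunk size, vocabulary, and helper names are all assumptions for the example.

```python
import random
import zlib

CHUNK = 16  # hypothetical match size in bytes
VOCAB = ["backup", "replica", "invoice", "customer", "logo", "the",
         "report", "quarter", "server", "office", "remote", "data"]


def chunks(data: bytes):
    """Cut data into fixed-size pieces (the last one may be shorter)."""
    return [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]


def hits_against(previous: bytes, current: bytes) -> int:
    """Count chunks of `current` that already appeared in `previous`."""
    cache = set(chunks(previous))
    return sum(1 for chunk in chunks(current) if chunk in cache)


rng = random.Random(0)
version_a = " ".join(rng.choice(VOCAB) for _ in range(800)).encode()
version_b = b"X" + version_a[1:]  # same length, only the first byte changed

raw_hits = hits_against(version_a, version_b)
zip_hits = hits_against(zlib.compress(version_a), zlib.compress(version_b))

print(raw_hits, "of", len(chunks(version_b)), "raw chunks already cached")
print(zip_hits, "of", len(chunks(zlib.compress(version_b))),
      "compressed chunks already cached")
```

On the raw bytes, every chunk except the one containing the edit is a cache hit. The compressed streams are much shorter to begin with, and a change near the start of the input tends to ripple through the rest of the compressed output, so far fewer chunks line up.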

Security, too, can reduce hits, and it is often accepted practice to encrypt things before sending them over the WAN. Some products – like our BIG-IP WOM and EDGE Gateway – will do encryption for you after deduplication. You could simulate the same functionality with a vendor that doesn’t do this by putting your encryption tool on the outside of the deduplication tool. That isn’t a huge stretch, since encryption is usually at the edge of the network. I’m just suggesting putting the deduplication product on the inside of it.
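Why the ordering matters can be shown with a toy. The `toy_encrypt` function below is NOT real encryption and must not be used as such; it is a stand-in built from SHA-256 and a random nonce, used only because it has the one property that matters here: the same plaintext comes out as different-looking bytes every time. All names and sizes are hypothetical.

```python
import hashlib
import os

CHUNK = 32  # hypothetical match size in bytes


def repeats(data: bytes) -> int:
    """Count fixed-size chunks that duplicate an earlier chunk in the stream."""
    seen = set()
    count = 0
    for i in range(0, len(data), CHUNK):
        chunk = data[i:i + CHUNK]
        if chunk in seen:
            count += 1
        else:
            seen.add(chunk)
    return count


def toy_encrypt(key: bytes, plaintext: bytes) -> bytes:
    """Toy stream cipher for illustration only: XOR with a hash-derived stream."""
    nonce = os.urandom(16)
    stream = b""
    counter = 0
    while len(stream) < len(plaintext):
        block = hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        stream += block
        counter += 1
    return nonce + bytes(p ^ s for p, s in zip(plaintext, stream))


record = b"corporate-logo".ljust(CHUNK, b"-")  # one 32-byte repeated record
payload = record * 50

dedupe_first = repeats(payload)                             # dedupe sees plaintext
encrypt_first = repeats(toy_encrypt(b"secret", payload))    # dedupe sees ciphertext

print("duplicates visible before encryption:", dedupe_first)
print("duplicates visible after encryption:", encrypt_first)
```

With deduplication on the inside, the engine sees 49 repeats of the same record. With encryption on the inside, the engine sees what looks like random bytes and has nothing to match.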


The return on investment of deduplication can be massive. Your network is sending a lot of repetitive data over that WAN link, and you have access to the tools to shut it up. The throughput reduction can be a huge percentage of your total bandwidth in many enterprise IT environments, which effectively translates to a pipe many times wider. Knowing what you’re getting can go a long way toward reaping those benefits. As we’d like to tell those repetitive people, in understanding, not repetition, is power found.
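The "many times wider pipe" claim is simple arithmetic: if deduplication removes a fraction of the bytes, the same link carries the original data as though it were wider by the inverse of what is left. A small sketch, with a hypothetical function name and made-up reduction figures rather than measured results:

```python
def effective_pipe_multiplier(reduction: float) -> float:
    """How many times wider the link effectively is for a given byte reduction.

    `reduction` is the fraction of bytes removed from the wire, e.g. 0.75.
    """
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be at least 0 and less than 1")
    return 1 / (1 - reduction)


# Hypothetical reduction levels, not measurements from any product.
for reduction in (0.5, 0.75, 0.9):
    print(f"{reduction:.0%} fewer bytes -> "
          f"{effective_pipe_multiplier(reduction):.1f}x effective pipe")
```

Halving the bytes doubles the effective pipe, and removing three quarters of them quadruples it. What reduction you actually see depends on your data and the cache settings discussed above.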

EDIT: 29 JUN 10 - Cleaned up Algorithms section on advice of specialists.


More Stories By Don MacVittie

Don MacVittie is founder of Ingrained Technology, a technical advocacy and software development consultancy. He has experience in application development, architecture, infrastructure, technical writing, DevOps, and IT management. MacVittie holds a B.S. in Computer Science from Northern Michigan University, and an M.S. in Computer Science from Nova Southeastern University.