Wayfair recently deployed Riverbed Steelhead WAN optimizers in our network.  We were unsatisfied with the 3 (2.5?) main deployment methods suggested by Riverbed, so we designed our own.  But first, the vendor-suggested methods:

Physical In-Path – Physically puts the device in-line with your traffic.  Deployment is easy, but if you ever need to remove the device you will lose connectivity (fail-to-wire still needs wires attached to work!).  You also lose the ability to dynamically remove the device from the traffic path entirely for troubleshooting purposes.

Virtual In-Path – Logically directs traffic through the Steelhead appliance, but depends on redirection mechanisms like WCCP, PBR, or Layer-4 switching/routing.  WCCP is proprietary, so right off the bat that's out the window.  PBR is incredibly powerful, but not dynamic.  You can make it "dynamic" by utilizing "verify-availability," but this feature is not widely supported, particularly on the hardware we run in our branch offices, where this will be deployed.  So that option is out as well.

Out-of-Path – This option is only supported for server-side Steelheads (in the datacenter) and requires the remote Steelheads to be in one of the aforementioned, already disqualified, configurations.

So with the predefined options not making the grade, we set out to find another way.  Our goal was an easy physical deployment, with the flexibility to manually or dynamically remove the Steelhead from the traffic path completely, using the same deployment model company-wide.  We ended up designing our own "Dynamic-Physical-In-Path" deployment (DPIP?).  We derived it from the classic, albeit ugly, way of moving data between two distinct VRFs with a physical cable; we just extended that cable to include the Steelhead as well.  This kept the design contained within the switch and the Steelhead, in keeping with our goal of a nice, compact, and relatively simple deployment.

Wayfair’s Dynamic-Physical-In-Path Deployment:

In our remote locations we're running this on 3560s with IOS 15.0(1)SE3 and the "IP Services" feature set.  In our datacenter we're running it on 4500Xs with IOS 3.3.0SG and the "Enterprise Services" license.  I normally wouldn't bring up this level of minutiae, but we ran into a number of issues that were IOS- and model-line-specific.

3560 CSCtr94182 – In 12.2(58)SE2, ARP was not working correctly between two interfaces on the same switch, even if they were in separate VRFs (but it DID work in earlier versions!).  To get around this we upgraded to IOS 15.0(1).  Version 15.0(1) was not without faults, as it has an issue with overlapping IP space between the implicit, default "global" table and our VRF, so we had to explicitly create another VRF to take the place of the default table.

4500X CSCue71580 – The 4500X reverts its unique interface MAC address to the system's Burned-In MAC Address (BIA) whenever that interface switches from L2 mode to L3 mode.  The Steelhead does its WAN-optimization magic with a blend of L2-transparent traffic and L3 traffic originated at the device, and in our situation the Steelhead resides on a link between VRFs on the same device, so it sees the same BIA MAC address on both interfaces.  The L2 traffic flows right through without an issue, but the traffic sourced from the IP address of the Steelhead hits a strange condition: it has the correct L3 addresses and routes for both sides of the link, but the MAC addresses for both IPs are the same.  When the Steelhead passes this traffic down to the NIC, it doesn't know the correct physical interface to use and just "picks" one at random.  By default Cisco gear sends ICMP redirects to correct a situation where you're talking to the wrong router on a segment, but since the redirect "updates" the Steelhead with the same MAC address it already knew, it has no effect beyond dropping traffic while the redirect is happening.  To "fix" this we disabled ICMP redirects on the interfaces.  Now, instead of sending an ICMP redirect to the Steelhead, the switch just forwards that traffic on to the correct next hop, even if that means sending it back out the interface it was received on.  When this traffic re-traverses the Steelhead it is in L2 form, and is pushed through to the other side without issue.  This "fix" results in a tangible amount of packet ricochet, but that is still preferable to packet loss.
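For reference, the workaround is just the interface-level command below (the interface name here is only an example; on the 4500X we apply it to the routed ports facing the Steelhead and the vrf-loop).  The branch-office configs later in this post include the same line on every routed port:

interface TenGigabitEthernet1/1
 description example routed port facing the Steelhead
 no ip redirects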


Now with all of that out of the way, let’s get to the actual config!

We start with a normal 3560-8PC operating in SDM "desktop routing" mode, and use the first 4 ports for this configuration.  This portion of the config just gets traffic through the Steelhead, or around it; to have a fully functional data path you'll obviously also need interfaces to your hosts in the "global" VRF, and an interface to your next hop in the "rb" VRF.
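Conceptually, the four ports form two parallel paths between the VRFs; the port numbers and addresses below match the config that follows:

      global VRF                                rb VRF
Fa0/2 (10.1.1.1) -- Steelhead LAN==WAN -- Fa0/1 (10.1.1.6)   <- path through the Steelhead
Fa0/4 (10.2.2.1) ----- loop cable ------- Fa0/3 (10.2.2.2)   <- fallback path around it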

Step 1: Create your VRFs

ip vrf global
 description global vrf for normal traffic
ip vrf rb
 description vrf for wan traffic going through the steelhead

Step 2: Create the primary data path through the Steelhead

interface FastEthernet0/1
 description Steelhead WAN
 no switchport
 ip vrf forwarding rb
 ip address 10.1.1.6 255.255.255.248
 no cdp enable
 no ip redirects
 spanning-tree portfast
 spanning-tree bpdufilter enable
!
interface FastEthernet0/2
 description Steelhead LAN
 no switchport
 ip vrf forwarding global
 ip address 10.1.1.1 255.255.255.248
 no cdp enable
 no ip redirects
 spanning-tree portfast
 spanning-tree bpdufilter enable
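With the Steelhead cabled between Fa0/1 and Fa0/2 and its in-path interfaces passing traffic, you can sanity-check the path with a VRF-aware ping from the global side to the rb side.  This is the same probe SLA 10 automates in Step 4:

ping vrf global 10.1.1.6 source 10.1.1.1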

Step 3: Create the fallback data path (vrf-loop)

interface FastEthernet0/3
 description vrf-loop (rb)
 no switchport
 ip vrf forwarding rb
 ip address 10.2.2.2 255.255.255.252
 no cdp enable
 no ip redirects
 spanning-tree portfast
 spanning-tree bpdufilter enable
!
interface FastEthernet0/4
 description vrf-loop (global)
 no switchport
 ip vrf forwarding global
 ip address 10.2.2.1 255.255.255.252
 no cdp enable
 no ip redirects
 spanning-tree portfast
 spanning-tree bpdufilter enable
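Once Fa0/3 and Fa0/4 are cabled together, the same style of check confirms the loop cable works.  This is the probe SLA 20 automates in Step 4:

ping vrf rb 10.2.2.1 source 10.2.2.2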

Step 4: Create the SLAs and track elements

ip sla 10
 icmp-echo 10.1.1.6 source-ip 10.1.1.1
 vrf global
 threshold 500
 timeout 1000
 frequency 1
ip sla schedule 10 life forever start-time now
ip sla 20
 icmp-echo 10.2.2.1 source-ip 10.2.2.2
 vrf rb
 threshold 500
 timeout 1000
 frequency 1
ip sla schedule 20 life forever start-time now
!
track 10 ip sla 10 reachability
track 20 ip sla 20 reachability
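With the SLAs scheduled, you can verify probe results and track state from exec mode; `show track` also lists which static routes are registered as clients of each object:

show ip sla statistics 10
show track 10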

Step 5: Add the routing that pulls it all together

! routes to whatever your host networks are
ip route vrf rb 10.100.100.0 255.255.255.0 10.1.1.1 10 track 10
ip route vrf rb 10.100.100.0 255.255.255.0 10.2.2.1 250 track 20
! whatever your next hop is
ip route vrf rb 0.0.0.0 0.0.0.0 10.3.3.1 10
!
ip route vrf global 0.0.0.0 0.0.0.0 10.1.1.6 10 track 10
ip route vrf global 0.0.0.0 0.0.0.0 10.2.2.2 250 track 20
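These floating statics only stay installed while their track object is up.  With the Steelhead healthy, track 10 is up and the AD-10 routes through it win; if the Steelhead dies (or we pull it), SLA 10 fails, track 10 goes down, and the AD-250 routes through the loop cable take over automatically.  A quick way to confirm which path is active at any moment:

show ip route vrf global 0.0.0.0
show ip route vrf rb 10.100.100.0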