Have you ever wanted to switch between Internet Service Providers (ISPs) dynamically based on network path reliability? Many network engineers will answer yes. This is a common ask in environments of any size.
At Wayfair, we rely heavily on the resilience and reliability of our Internet and WAN circuits, not just for customer traffic, but for VPN connectivity, back-up DIA for other sites and Data Centers, replication between sites, and Cloud connectivity. While vetting solutions, we decided to keep it simple, keep the cost down, and to innovate the solution on our own and at our pace. These factors led us to select a solution that is widely available as a base feature set on many network devices. There are different names for this feature set, but without getting vendor-specific we will call it Onboard Event Management (OEM).
For our first iteration, we decided to keep it extra simple. We created automated pings to various reliable hosts on the Internet and tuned the timers so that the pings fail once response times become suboptimal. The OEM monitors these pings and, when they fail, kicks off custom configuration scripts we set up for BGP. These scripts, in turn, cease advertisements out to that ISP and inbound into the lower layers of our datacenter network.
“Why not (insert favorite link load balancer/SDWAN solution)? It does that and we love it!”
It’s not that we don’t like SDWAN solutions, but to be clear, we are looking at our Data Center connectivity to the internet and potentially Data Center to Data Center (Physical and in the Cloud). There are certainly use cases for our campus sites where an SDWAN solution would be more appropriate, would help us shed costs associated with circuits, and improve traffic visibility/control, but that would be a different article.
We know there is a myriad of solutions that have this type of functionality, ranging from feature-based SDWAN solutions to big beefy pieces of hardware that can push a ton of traffic. Many of these solutions are significantly more advanced and could do most of what we want and more. However, they all come with compounding costs. The ones that are easy to call out are hardware, licensing, and annual maintenance. The cost of an aggregate multi-100 Gbps cloud SDWAN is also prohibitive. OEM, on the other hand, costs zero extra dollars.
Some of the commercial solutions on the market become cost-prohibitive as you scale into that 100 Gbps range. For our network, we would have had to spend a substantial amount of money on a hardware solution that could support our current traffic load and scale year over year. Google recently wrote a blog article that featured Wayfair as being among the first adopters of 100 Gbps dedicated interconnects. Last year we provisioned multiple 100 Gbps direct interconnects with Google for both our Google VPCs and any Google-destined public traffic.
In addition to upfront hardware costs, you must also consider the resource costs: your team needing to learn new (potentially proprietary) technology, consulting and training, a growing volume of devices and technology to manage, added architectural complexity, potential performance degradation from overlay encapsulation, and, lastly, having to refresh or renew yet another thing every few years and potentially start over again. Just about everyone on our team has felt the pain of trying to manage a proprietary solution, especially feature-set solutions added to typical network hardware. We have spent many hours troubleshooting problems that turned out to be new bugs, for example. This is where the extensibility and simplicity of OEM come in: based on the condition of a given poller, it can run ANY command on the box that a human administrator can run.
We have already begun to think about how to improve and evolve our solution. For instance, we could make the solution multi-layered by adding de-preferencing for specific routes, based on defined connection monitors to multiple points on the Internet, on top of our current cease-advertisement methodology. We are also looking at extending this idea into our WAN and Cloud architecture. There are additional advancements available via custom scripts that can be embedded into the hardware as well. Finally, we are looking at a deeper integration with our NOC and service owners through a “feature toggle” that lets them influence one of the monitored conditions and thereby affect path selection, adjust BGP parameters, or even adjust policy-based routing mechanisms.
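To make the layered idea concrete, here is a small Python sketch of how several connection monitors plus an operator toggle might be folded into a single decision. It is an illustration under stated assumptions, not a description of what we run: the `Action` names, the `depref_ratio` cutoff of one half, the rule that only a total failure triggers withdrawal, and the choice to model the toggle as an override that wins over the monitors are all hypothetical.

```python
from enum import Enum
from typing import Dict, Optional


class Action(Enum):
    NORMAL = "normal"              # leave advertisements untouched
    DEPREFERENCE = "depreference"  # keep advertising, but make the path less attractive
    WITHDRAW = "withdraw"          # cease advertisements entirely


def choose_action(
    monitor_ok: Dict[str, bool],
    operator_toggle: Optional[Action] = None,
    depref_ratio: float = 0.5,
) -> Action:
    """Pick a routing action from monitor health and an optional operator toggle.

    monitor_ok maps a monitor name to True when its probes are passing.
    depref_ratio is the assumed share of failed monitors that triggers
    de-preferencing; only a failure of every monitor triggers withdrawal.
    """
    # Assumption: a toggle set by the NOC or a service owner takes priority.
    if operator_toggle is not None:
        return operator_toggle
    if not monitor_ok:
        return Action.NORMAL       # nothing is monitored, so nothing to react to

    failed = sum(1 for ok in monitor_ok.values() if not ok)
    if failed == len(monitor_ok):
        return Action.WITHDRAW
    if failed / len(monitor_ok) >= depref_ratio:
        return Action.DEPREFERENCE
    return Action.NORMAL


if __name__ == "__main__":
    health = {"probe_a": False, "probe_b": False, "probe_c": True}
    print(choose_action(health))
```

Separating the decision from the commands that carry it out keeps the policy easy to reason about: the same function could drive a BGP attribute change, a withdrawal, or a policy-based routing adjustment, depending on what each action is mapped to on the device.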
In the end, the goal of any network engineer should be to choose the best solution for any specific use case, ensuring that it is supportable and reliable. Our team at Wayfair took these core tenets to heart in our evaluation, and so far we have had success with this solution. We are all looking forward to what new advancements we can coax out of our gear and will continue to innovate and adjust as needed to stay successful.
More Reading on OEM Solutions: