The Supply Chain Engineering organization at Wayfair helps power, connect, and optimize differentiated shipping and delivery programs, consistently executes against customer promises, and maintains effective stewardship of the bottom line. We are responsible for every step in the supply chain, from the moment you order a couch on our website until it arrives at your door. Our software is used by staff in our warehouses and consolidation facilities, and by our delivery agents. Our main priority? Making sure that the network is operating at its highest level.

One of the most challenging parts of building a fulfillment network from the ground up is knowing whether it is working properly. This sounds suspiciously easy, but the devil is in the details. If a customer gets the item they’ve ordered, that means it’s working, right?

Supply Chain 101

A supply chain network comprises multiple core components, including a supplier network, an order management system, warehouse operations, and a transportation network. The key to enabling a customer order to move through the supply chain is ensuring data flows freely between systems.

Information about the order must be available for warehouse staff the moment they’re required to pick the item, and it must be in the transportation network before the driver leaves the warehouse en route to their next facility. All of this information needs to be delivered to the customer in real time, with accurate tracking information to boot, ensuring they feel connected to their item along its journey.

When there’s only one order to consider, this task seems rather simple. But scale that one order up to the exceptional volume we experience during Cyber Week and things get interesting. Any system under extreme load demonstrates unpredictable behavior, and that includes both technical and human systems.

Preparing to fail

Building the proper alerting around failure cases is a common solution when it comes to monitoring a supply chain at scale. It can detect a system outage, corrupted data, or an untrained user who isn’t following protocol. However, just because the failure alerts aren’t firing doesn’t mean smooth sailing. This is the situation we found ourselves in when we first started building our network of consolidation facilities, referred to as cross docks, throughout the United States. Our alerting was quiet and all technical indicators were looking good. However, the story from our warehouse staff was completely different.

In this particular instance, the folks working on the cross dock floor reported that they were unable to find all items required for an order, and were unable to ship a truck using the software that had previously worked. All of this came as a surprise to our engineering teams: Our monitoring showed no significant problems. As the cross dock floor slowly filled with orders that couldn’t be shipped, we realized that our technical systems weren’t reflecting the reality in the building.

As we investigated the multiple root causes of these errors, we discovered that they all came down to data that was improperly synced between subsystems. However, the interesting lesson here was that we had no visibility into these problems – we had not explicitly monitored for these particular error types.

Better metrics for better results

During the implementation phase of a project like this, it is impossible to predict all the ways that something can go wrong. Detecting and handling every possible edge case also yields diminishing returns, even when those errors are caught during follow-up iterations of a project. This is why we chose to flip the monitoring scenario on its head and monitor for success instead. What does this allow us to do?

By using the ELK stack to monitor for success, we can spot trends and detect issues before they become large enough to impact operations. The earlier problem of the cross dock floor filling up with unshippable items is now caught by monitoring carton dwell time in a facility. If items sit too long in a building, we’re alerted that they have not flowed to their next destination in the expected timeframe. This kicks off an investigation, and we react before warehouse staff are even aware of the issue. There is still a level of reactivity required since we’re dealing with unforeseen circumstances, but speed to resolution is a key factor in minimizing the amount of bad data created, and that allows us to spend more time implementing new features.
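To make the dwell time idea concrete, here is a minimal sketch in plain Python. It is not our ELK configuration; it only illustrates the check itself. The `Carton` fields, the facility codes, the 24-hour threshold, and the function names are all hypothetical and chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Optional


@dataclass
class Carton:
    """Hypothetical record of a carton passing through a facility."""
    carton_id: str
    facility: str
    arrived_at: datetime
    departed_at: Optional[datetime] = None  # None while still in the building


def overdue_cartons(
    cartons: Iterable[Carton], now: datetime, max_dwell: timedelta
) -> List[Carton]:
    """Return cartons still in a facility that have sat longer than max_dwell."""
    return [
        c for c in cartons
        if c.departed_at is None and now - c.arrived_at > max_dwell
    ]


def facilities_to_alert(
    cartons: Iterable[Carton],
    now: datetime,
    max_dwell: timedelta,
    min_overdue: int = 1,
) -> Dict[str, int]:
    """Map each facility to its overdue carton count, keeping only
    facilities with at least min_overdue overdue cartons."""
    counts = Counter(c.facility for c in overdue_cartons(cartons, now, max_dwell))
    return {facility: n for facility, n in counts.items() if n >= min_overdue}


if __name__ == "__main__":
    now = datetime(2020, 1, 2, 12, 0)
    sample = [
        Carton("A", "XD1", now - timedelta(hours=30)),  # overdue
        Carton("B", "XD1", now - timedelta(hours=2)),   # within threshold
        Carton("C", "XD2", now - timedelta(hours=40),
               departed_at=now - timedelta(hours=35)),  # already shipped
    ]
    # Assumed threshold: 24 hours of dwell time before alerting.
    print(facilities_to_alert(sample, now, timedelta(hours=24)))
```

Notice that the check never asks why a carton is stuck. It only asks whether the carton reached its next step on time, which is why it can surface failure modes nobody thought to alert on.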


Building a supply chain network and scaling it to Wayfair’s massive growth has been an incredibly ambitious and formidable challenge. We’re solving real world business problems creatively with a good dose of innovation, operational curiosity, and hard work. And we’re never done!

Check out our open positions and apply if you’re interested in helping fulfill a zillion things home.