There’s storm in Wayfair! And yes, the article “a” before the word “storm” is purposely missing. When we refer to “storm” at Wayfair, we do not mean a conglomerate of barometric circumstances that leads to downpours from the skies and other natural phenomena (a storm); we mean real-time computation, horizontal scalability, and system robustness. We mean bleeding-edge technology (storm). Wayfair’s Order Management System (OMS) team introduced storm into our ever-growing technical infrastructure in February to implement event-driven processes.
Wayfair’s OMS is responsible for receiving orders from our storefront and guiding them through a pipeline of discrete steps that manages their state from “order placed” → “order sent”. The team accomplishes this by interacting with internal platforms (such as the Warehouse Management System) and with platforms that cooperate with external entities (think shipping methods, credit cards, etc.). This means that OMS does two things:
- Provides browser interfaces for our CSRs (Customer Service Representatives) to manipulate the orders manually.
- Creates automated processes that continuously take orders in a specific state, process them accordingly, and modify their state, much like a finite state machine (FSM).
For this entry, we will be referring to the OMS team as the royal “we” and focusing on the second bullet point.
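To make the FSM analogy concrete, here is a minimal Python sketch of an order moving between states. Only “order placed” and “order sent” come from the pipeline described above; the two intermediate states and every name in the code are hypothetical placeholders, not our actual states or implementation.

```python
from enum import Enum


class OrderState(Enum):
    PLACED = "order placed"
    # The two intermediate states are hypothetical placeholders.
    PAYMENT_CAPTURED = "payment captured"
    RELEASED_TO_WAREHOUSE = "released to warehouse"
    SENT = "order sent"


# Allowed transitions: in this sketch each automated process owns one arrow.
TRANSITIONS = {
    OrderState.PLACED: OrderState.PAYMENT_CAPTURED,
    OrderState.PAYMENT_CAPTURED: OrderState.RELEASED_TO_WAREHOUSE,
    OrderState.RELEASED_TO_WAREHOUSE: OrderState.SENT,
}


def advance(state):
    """Return the next state, or raise if the order is already terminal."""
    if state not in TRANSITIONS:
        raise ValueError(f"no transition out of {state.value!r}")
    return TRANSITIONS[state]


def run_to_completion(state):
    """Walk an order from its current state to the terminal state."""
    path = [state]
    while state in TRANSITIONS:
        state = advance(state)
        path.append(state)
    return path


# Usage: follow one order through the whole (hypothetical) pipeline.
path = run_to_completion(OrderState.PLACED)
```

Keeping the transitions in a single table means a new state is a data change and not a code change, which is one reason the FSM framing is convenient.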
Where We Were
OMS Then: Classic ASP + Stored Procedures + SSIS Jobs + Batch
All of our processes used to transition order states following the paradigm depicted in the figure above (OMS Then). Our processes ran every 5 minutes or so in the form of a job, which would collect n orders in a pre-determined state (0 < n < ~200) and hand them to an ASP script that would work through the batch sequentially on a single thread and transition each order to its succeeding state. Different jobs ran concurrently and virtually independently of each other. This was a good solution for processing our endless stream of incoming orders, but it posed potential issues for us, the main one being:
A single slow process or physical failure caused problems at two scopes across the platform:
- Micro: a bottleneck on a single process
- Macro: a hold-up of all subsequent processes
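The bottleneck is easy to see in a small simulation. The Python sketch below is an illustration of the batch paradigm only, not our ASP code: the function names, the dictionary layout of an order, and the per-order costs are all assumptions, and time is simulated with a counter so nothing actually sleeps.

```python
BATCH_LIMIT = 200  # batches held up to roughly 200 orders


def collect_batch(orders, state, limit=BATCH_LIMIT):
    """Pick up to `limit` orders currently in `state` (stands in for the job's query)."""
    return [o for o in orders if o["state"] == state][:limit]


def run_job(orders, from_state, to_state, cost_of):
    """Process one batch sequentially, as a single-threaded script would.

    Returns a dict mapping order id to its simulated completion time in seconds.
    """
    clock = 0.0
    finished_at = {}
    for order in collect_batch(orders, from_state):
        clock += cost_of(order)  # every order waits for all the ones before it
        order["state"] = to_state
        finished_at[order["id"]] = clock
    return finished_at


# Usage: five orders, where order 1 is (hypothetically) slow.
orders = [{"id": i, "state": "order placed"} for i in range(5)]
times = run_job(
    orders,
    "order placed",
    "next state",
    cost_of=lambda o: 30.0 if o["id"] == 1 else 0.1,
)
```

In this run, order 0 finishes almost immediately, while orders 2 through 4 each need only a tenth of a second of work but still finish more than 30 seconds in, because they sit behind the slow order on the same thread.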
Despite the potential issues, this paradigm has worked so far. But OMS wants more! OMS wants better!
The New Architecture
The figure above depicts a quick overview of a “processing entity” in our system. Firstly, our entity is decoupled from data sources via a RESTful service layer and secondly, we migrated our programming paradigm to OO PHP. Furthermore, we now define discrete “processing entities” in our systems as steps. These take inputs coming from other steps and also produce outputs for future steps to process. These outputs are events produced by these steps, each of which can be placed into specific streams. Notice that a step can attach itself to multiple streams, as well as emit events into multiple streams.
We can now intricately chain our steps and form a complex, event-driven FSM. The steps critical for processing the “order placed” → “order sent” pipeline are part of what we call the OMS core. We can also define auxiliary processes that feed off events produced from core steps. So in essence, we chain the core steps via streams of orders while simultaneously attaching auxiliary steps on secondary streams for more complex event processing (exemplified in the figure below).
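The relationship between steps and streams can be sketched in a few lines of Python. This is an in-memory illustration under our own assumptions, not storm’s API and not our PHP code: the `Step` and `Pipeline` classes, the stream names, and the three example steps are all hypothetical.

```python
from collections import defaultdict, deque


class Step:
    """A processing entity: consumes events from streams, emits events to streams."""

    def __init__(self, name, inputs, handler):
        self.name = name
        self.inputs = inputs      # names of the streams this step attaches to
        self.handler = handler    # handler(event) -> iterable of (stream, event)


class Pipeline:
    """Routes events between steps; a stand-in for what a topology would do."""

    def __init__(self, steps):
        self.subscribers = defaultdict(list)
        for step in steps:
            for stream in step.inputs:
                self.subscribers[stream].append(step)

    def run(self, stream, event):
        """Inject one event, drain everything it triggers, and return a trace."""
        trace = []
        queue = deque([(stream, event)])
        while queue:
            current_stream, current_event = queue.popleft()
            for step in self.subscribers[current_stream]:
                for out_stream, out_event in step.handler(current_event):
                    trace.append((step.name, out_stream, out_event))
                    queue.append((out_stream, out_event))
        return trace


def validate(order):
    # Core step: emits into the main stream and into a secondary stream.
    yield ("validated", order)
    yield ("audit", {"order": order, "note": "validated"})


def ship(order):
    # Core step: consumes the main stream and emits the final event.
    yield ("sent", order)


audit_log = []


def audit(event):
    # Auxiliary step: feeds off the secondary stream and emits nothing.
    audit_log.append(event)
    return []


pipeline = Pipeline([
    Step("validate", ["placed"], validate),
    Step("ship", ["validated"], ship),
    Step("audit", ["audit"], audit),
])
trace = pipeline.run("placed", 42)
```

Here `validate` and `ship` play the role of core steps chained on the main stream, while `audit` is an auxiliary step attached to a secondary stream; adding or removing it does not touch the core chain.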
How do we manage this? Well, this is where storm comes in. We use storm’s primitive entities to form topologies. Do you remember the problems we described in the old platform? Those are solved elegantly and effortlessly by our new scalable system: storm provides horizontal scalability by partitioning fault-tolerant applications across a cluster of nodes. But of course, that’s not the only reason we chose to integrate storm into our automated systems. Yes, we’ve reduced the risk that a physical failure becomes an issue. Yes, we can relieve bottlenecks by adding more workers. But aside from robustness in the physical layer, we can also embrace stream processing to create highly complex FSMs and easily implement them as topologies.
Where We Are
In short, we have migrated two of our five core process steps into the new architecture, and are on the brink of deploying a third. So far, the new architecture fits the needs of our OMS pipeline, which makes the pipeline a good showcase for it. Moreover, it can also serve as a platform for other applications: they can either be appended as sub-topologies to our core topology or created as new, independent topologies that follow the paradigms of our architecture.
Although the storm framework makes developing and deploying these applications easy, we also had to develop the following auxiliary modules to get our desired functionality: a PHP multilang adapter, a code packaging system, and a node configuration module.
Storm applications are natively developed in Java, but our core steps use Wayfair’s PHP codebase through our PHP implementation of storm’s multilang protocol. Although we rely on storm’s guaranteed message processing, we’ve added further guarantees at the application layer to gain even more visibility into our code. Remember, though, that our steps do not have to be implemented in PHP in order to interact with our systems, given our language-agnostic web services. Thank goodness we REST at Wayfair! And how do we seamlessly add more nodes and configure the physical cluster? Well, there’s a Puppet module we built for that. We’re solving some fun problems here at Wayfair, and our work is never done.
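For readers curious what a multilang adapter has to do, the sketch below shows the message framing in Python. It is not our PHP adapter. It assumes the framing described in storm’s multilang documentation, where each message is JSON followed by a line containing only `end`, and it leaves out the initial handshake, heartbeats, and the task-id replies that storm can send after an emit. The `handle_tuple` logic and the sample tuple are hypothetical.

```python
import io
import json


def read_message(stream):
    """Read one message: JSON text followed by a line containing only 'end'."""
    lines = []
    while True:
        line = stream.readline()
        if line == "":
            raise EOFError("stream closed before the 'end' marker")
        line = line.rstrip("\n")
        if line == "end":
            break
        lines.append(line)
    return json.loads("\n".join(lines))


def write_message(stream, message):
    """Write one message as JSON, then the 'end' terminator line."""
    stream.write(json.dumps(message) + "\nend\n")
    stream.flush()


def handle_tuple(tup, out):
    """Hypothetical step logic: emit onward anchored to the input, then ack it."""
    order_id = tup["tuple"][0]
    write_message(out, {
        "command": "emit",
        "anchors": [tup["id"]],           # ties the new tuple to the input tuple
        "tuple": [order_id, "processed"],
    })
    write_message(out, {"command": "ack", "id": tup["id"]})


# Usage with in-memory streams; a real component would use stdin and stdout.
inbound = io.StringIO(
    '{"id": "t1", "comp": "1", "stream": "default", "task": 9, "tuple": [1001]}\nend\n'
)
outbound = io.StringIO()
handle_tuple(read_message(inbound), outbound)
```

Anchoring the emitted tuple to the input and acking only after the work is done is what lets storm’s guaranteed message processing replay a tuple when a step fails partway through.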
OMS Platform Team: Daniel B, Jonathan L, Maura G, Saurabh C (Manager)