Multiple recommendation systems at Wayfair use collaborative filtering based models to understand user behavior and identify like-minded customers. Despite its success, collaborative filtering has a few significant drawbacks, such as the cold start problem and a limited scope of recommendable products. By leveraging image based product embeddings, Wayfair has created a new recommendation algorithm based on product visual similarity.
This post will begin by briefly describing Wayfair’s Visual Search feature. We will then summarize the difference between collaborative and content based filtering before describing how content based methods can be used to generate recommendations on Wayfair today.
Visual Search Overview
In the spring of 2017, Wayfair launched a new version of Visual Search. This update saw a 58% lift in repeat engagement with the service by customers over a 7-day period. Visual Search allows shoppers to take photos and search the Wayfair catalogue for similar items (Figure 1). Visual Search can also be used for discount shopping, allowing users to look for affordable alternatives to pricier products. Behind the scenes, Visual Search is a Siamese deep convolutional neural network. It generates complex content based features by performing calculations on user-supplied imagery on site in real time.
Visual Search embeds images into a 256-dimensional latent space. This latent space maps visually similar products close together, and dissimilar products further apart. Specifically, Wayfair makes use of the Inception V3 model proposed by Szegedy et al. [1], as well as the contrastive loss function defined by LeCun et al. [2]. This contrastive loss function is the driving factor that causes the embedding space to map similar products close together and dissimilar products further apart.
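To make the role of the contrastive loss concrete, here is a minimal sketch of the loss for a single pair of embeddings. The function name, the `similar` flag, and the margin value are illustrative assumptions, not Wayfair's training code.

```python
import math


def contrastive_loss(emb_a, emb_b, similar, margin=1.0):
    """Contrastive loss for one pair of embeddings (illustrative sketch).

    similar=True:  any distance between the pair is penalized, pulling them together.
    similar=False: the pair is penalized only while it is closer than `margin`,
                   pushing dissimilar items apart. The margin is an assumed value.
    """
    d = math.dist(emb_a, emb_b)  # Euclidean distance in the embedding space
    if similar:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2


# A similar pair that sits far apart incurs a large loss,
# while a dissimilar pair beyond the margin incurs none.
loss_similar_far = contrastive_loss([0.0, 0.0], [3.0, 4.0], similar=True)
loss_dissimilar_far = contrastive_loss([0.0, 0.0], [3.0, 4.0], similar=False)
```

Minimizing this quantity over many labelled pairs is what shapes the latent space so that distance tracks visual similarity.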
When an image is uploaded, Wayfair can calculate its embedding with a forward pass through the neural network and perform a nearest neighbor search over the pre-computed embeddings for products in our catalogue. The items closest to the uploaded image’s embedding are returned as the Visual Search results. For more information about our Visual Search feature, please refer to our previous post on the Tech Blog.
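The lookup step can be sketched as an exact nearest neighbor search over a dictionary of pre-computed embeddings. The product ids and the 2-dimensional toy vectors below are made up for illustration; the real embeddings are 256-dimensional and come from the network's forward pass.

```python
import heapq
import math


def nearest_products(query, catalogue, k=3):
    """Return the k (distance, product_id) pairs closest to `query`.

    `catalogue` maps a product id to its pre-computed embedding.
    """
    scored = ((math.dist(query, emb), pid) for pid, emb in catalogue.items())
    return heapq.nsmallest(k, scored)


# Toy embeddings standing in for the pre-computed catalogue embeddings.
catalogue = {
    "sofa_a": [0.0, 0.0],
    "sofa_b": [0.1, 0.0],
    "lamp_a": [5.0, 5.0],
}

# Stand-in for the embedding of an uploaded photo.
results = nearest_products([0.0, 0.05], catalogue, k=2)
result_ids = [pid for _, pid in results]  # ["sofa_a", "sofa_b"]
```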
Visual Search is an extremely useful feature, and the dataset generated by computing embeddings for products within Wayfair’s catalogue can be used in many different ways. We’ll now explore the difference between collaborative filtering and content based recommender systems and describe how Visual Search embeddings can be used to serve product recommendations.
Collaborative Filtering vs. Content Based Recommender Systems
Collaborative filtering is an extremely popular technique for recommender systems today. It has seen tremendous success, from the Netflix Prize competition to music recommendations on Spotify. In the context of Wayfair, collaborative filtering aims to make predictions about what a particular shopper will like based on how similar customers have behaved on site. Typically, this is done using a matrix factorization approach, where the matrix factors are latent representations of customers and products that together capture each customer’s affinity for particular products (Figure 2).
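As a rough illustration of how factorized representations are used at prediction time, the sketch below scores products by the dot product of a customer vector and each product vector. The factor values, product ids, and function names are invented for the example; in practice the factors would be learned from interaction data.

```python
def affinity(customer_vec, product_vec):
    """Predicted affinity: dot product of the two latent vectors."""
    return sum(c * p for c, p in zip(customer_vec, product_vec))


def recommend(customer_vec, product_factors, already_seen, k=2):
    """Rank products the customer has not seen by predicted affinity."""
    scores = {
        pid: affinity(customer_vec, vec)
        for pid, vec in product_factors.items()
        if pid not in already_seen
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical learned factors (2 latent dimensions for readability).
product_factors = {
    "modern_sofa": [0.9, 0.1],
    "modern_lamp": [0.8, 0.2],
    "rustic_table": [0.1, 0.9],
}
customer = [1.0, 0.0]

top = recommend(customer, product_factors, already_seen={"modern_sofa"}, k=2)
# top == ["modern_lamp", "rustic_table"]
```

Note that a product only appears in `product_factors` once customers have interacted with it, so a product without interactions has no vector to score.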
Collaborative filtering suffers from a few drawbacks. First, collaborative filtering is unable to recommend products that have had no customer interaction. This poses a problem for products that have been recently added to the catalogue. This is known as the cold start problem.
Another drawback of the collaborative filtering approach is the difficulty of recommending less popular products from the Wayfair catalogue. In the Wayfair catalogue, products are broken up into different classes (e.g., sofas, wall art, beds, dressers). Wayfair’s product catalogue contains millions of products across thousands of classes, but the most popular products account for only a very small percentage of it. As a result, collaborative filtering approaches tend to recommend popular products, at the expense of potentially showing a more relevant product.
Visual Recommendations aim to combat both of these problems with a content based recommendation approach.
Content based recommender systems rely solely on the content of the products that are being recommended. Instead of using customer interactions, a content based approach can find products that are directly similar to one another, and serve those products as recommendations. The features learned by the Visual Search model can be seen as purely content based features. They are learned based on the similarity of images of products in the Wayfair product catalogue, and can be used to create a content based recommender system.
The Visual Recommendations algorithm is a powerful, robust recommendation algorithm that leverages the Visual Search team’s work to serve high quality recommendations. The main assumption made by the Visual Recommendations algorithm is that customers want to see products that are visually similar to the products they have interacted with.
The motivation for the design of this recommendation algorithm is two-fold. The first goal is to tackle the inherent problems associated with collaborative filtering. The second is to develop an algorithm capable of serving meaningful recommendations that preserve customer style preferences across classes. Matrix factorization based collaborative filtering approaches struggle to preserve style preference across classes, because customer browsing behavior tends to stay within a class and the learned representations therefore separate by class.
The Visual Recommendations algorithm is used on Wayfair’s class browsing pages, known as Superbrowse pages (Figure 3). The algorithm takes in a list of products that a customer has interacted with on site, referred to as a browse context, and outputs a set of recommendations for a given class. In other words, the algorithm can be represented as a function that maps a browse context and a target class to a set of recommendations.
To calculate a set of recommendations for a given browse context, we perform an approximate nearest neighbors (ANN) search in the embedding space over the recommendation class for each product in the browse context. To maximize runtime efficiency, the ANN search is limited to the class of interest. Computing a full nearest neighbor search in real time over the entire embedded space would be computationally intensive, so we opt for a robust ANN approach.
For ANN, we use the method described by Malkov and Yashunin [3], using Hierarchical Navigable Small World Graphs. Small world graphs are highly connected, undirected graph structures where most nodes are not neighbors of one another, but the neighbors of a given node are likely to be neighbors of each other. This graph structure ensures that any node in the graph can be reached in a small number of hops from any other node. This type of graph structure is used in many places, from modeling social and genetic networks to modeling the underlying architecture of the Internet.
Building Hierarchical Navigable Small World Graphs
By constructing a hierarchy of small world graphs, we can search the space for nearest neighbors in sublinear time relative to the number of points in the embedded space. To construct the graph, we first sample a number from an exponential distribution for each point that represents the level at which that point will live in the hierarchy.
Where λ is a representation of the point to be embedded.
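As a point of reference, the HNSW paper by Malkov and Yashunin assigns each new point a level by drawing from an exponentially decaying distribution, l = floor(-ln(U) * m_L) with U uniform on (0, 1]. The sketch below follows that published formula and may differ from the exact formulation used at Wayfair; the values of `M` and `m_l` are illustrative assumptions.

```python
import math
import random


def sample_level(m_l, rng):
    """Draw the top layer for a new point: floor(-ln(U) * m_l), U ~ Uniform(0, 1]."""
    u = 1.0 - rng.random()  # rng.random() is in [0, 1), so u is in (0, 1] and log(u) is defined
    return int(-math.log(u) * m_l)  # value is non-negative, so int() acts as floor


rng = random.Random(0)
M = 16                      # assumed maximum number of neighbours per node
m_l = 1.0 / math.log(M)     # normalization suggested in the HNSW paper
levels = [sample_level(m_l, rng) for _ in range(10000)]

# Most points stay on the bottom layer; each higher layer holds roughly 1/M as many points.
share_on_layer_0 = levels.count(0) / len(levels)
```

With this choice of `m_l`, about 15 out of every 16 points live only on layer 0, which is what keeps the upper layers sparse and fast to traverse.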
A greedy nearest neighbor search is performed at each level, until level t is reached. A running list of the nearest neighbors to a given point is kept in memory upon insertion, and updated as the graph is traversed (Figure 4). A heuristic is then used to connect less related clusters to one another in order to preserve the globally connected property of small world graphs (Figure 5).
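The greedy descent through the hierarchy can be sketched as follows. This is a simplified illustration that tracks a single best node per layer; the HNSW algorithm keeps a list of candidates rather than one node. The graph layout and node names are invented for the example.

```python
import math


def greedy_search(graph, vectors, query, entry):
    """Walk one layer, always moving to the neighbour closest to `query`.

    graph:   node id -> list of neighbour ids on this layer
    vectors: node id -> embedding
    Stops at a local minimum, when no neighbour is closer than the current node.
    """
    current = entry
    current_dist = math.dist(vectors[current], query)
    while True:
        best, best_dist = current, current_dist
        for neighbour in graph[current]:
            d = math.dist(vectors[neighbour], query)
            if d < best_dist:
                best, best_dist = neighbour, d
        if best == current:
            return current, current_dist
        current, current_dist = best, best_dist


def search_hierarchy(layers, vectors, query, entry):
    """Descend from the sparsest layer to layer 0, reusing each result as the next entry point."""
    dist = math.dist(vectors[entry], query)
    for graph in layers:  # ordered top layer first
        entry, dist = greedy_search(graph, vectors, query, entry)
    return entry, dist


vectors = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [2.0, 0.0], "d": [3.0, 0.0]}
layers = [
    {"a": ["d"], "d": ["a"]},                                       # sparse top layer: long hops
    {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]},     # layer 0: all points
]
found, found_dist = search_hierarchy(layers, vectors, query=[2.2, 0.0], entry="a")
# The top layer jumps from "a" to "d"; layer 0 then refines the result to "c".
```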
The algorithm computes a set of nearest neighbors for each product in the browse context, and records the distance from each nearest neighbor to the product used to generate the set. This accumulates into a list of recommendations and associated distances (Figure 6).
The sets of recommendations are concatenated and sorted by ascending distance. This serves as a proxy for filtering down to the most relevant products to recommend, since the distance between objects in the embedding space correlates directly with visual similarity. For example, if the recommended class is nightstands, and the browse context contains beds, dressers, and kitchen plates, then the nightstands will be inherently closer to dressers and beds in the embedded space than to kitchen plates. This means the recommendations that are served will have been generated from similar looking items, rather than from obscure objects unrelated to the recommendation class. We also impose a threshold for class similarity to ensure that a customer has a class in their browse context that is similar enough to serve cross-class recommendations.
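Putting the steps together, a minimal sketch of the procedure might look like the following. It uses an exact nearest neighbor search in place of the ANN index, and the optional `max_distance` cutoff is a hypothetical stand-in for the class similarity threshold, whose actual form is not specified here. All product ids and coordinates are made up.

```python
import math


def visual_recommendations(browse_context, class_products, embeddings,
                           k=2, max_distance=None, top_n=5):
    """Recommend products from one class given the products a customer has browsed."""
    candidates = []
    for viewed in browse_context:
        # Neighbours of this browsed product, restricted to the target class.
        scored = sorted(
            (math.dist(embeddings[viewed], embeddings[pid]), pid)
            for pid in class_products
            if pid != viewed
        )
        candidates.extend(scored[:k])

    candidates.sort()  # concatenate, then sort by ascending distance
    recommendations, seen = [], set()
    for distance, pid in candidates:
        if max_distance is not None and distance > max_distance:
            break  # everything after this point is farther away still
        if pid not in seen:
            seen.add(pid)
            recommendations.append(pid)
    return recommendations[:top_n]


embeddings = {
    "bed": [0.0, 0.0], "dresser": [1.0, 0.0], "plate": [10.0, 10.0],
    "nightstand_1": [0.5, 0.0], "nightstand_2": [0.0, 2.0], "nightstand_3": [4.0, 4.0],
}
nightstands = ["nightstand_1", "nightstand_2", "nightstand_3"]
context = ["bed", "dresser", "plate"]

all_recs = visual_recommendations(context, nightstands, embeddings, k=2)
close_recs = visual_recommendations(context, nightstands, embeddings, k=2, max_distance=3.0)
```

In this toy data the candidates generated from the bed and dresser sort to the front, while those generated from the plate land at the back of the list and are dropped entirely once a cutoff is applied.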
This algorithm is capable of producing robust recommendations in both in-class and cross-class scenarios (Figure 7).
Visual Recommendations provide a robust content based recommender system that combats the cold start problem. The algorithm also addresses the issue of serving effective cross-class recommendations on site to users who have not yet browsed a particular class.
While the current visual embedding space is partitioned primarily based on class, Wayfair plans to generate an embedding space partitioned by style. This will allow cross-class recommendations that more effectively maintain customer style preferences.
For recommendations, we plan to adopt a hybrid recommender system that employs both collaborative filtering and content based filtering. This hybrid model will create a class agnostic embedding space partitioned by style. Similar types of models have proven effective in recognizing cross-class style, as seen in Veit et al. [4].
Finally, it is important to understand where our different recommendation algorithms perform best. By performing customer segmentation, we can decide on a piecewise approach to recommendations at Wayfair. We may discover that it makes more sense to use different algorithms for different types of customers. By analyzing where each algorithm performs best, we can continue to improve and personalize the customer experience.
[1] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna: Rethinking the Inception Architecture for Computer Vision. arXiv:1512.00567, 2015
[2] Y. LeCun, R. Hadsell, and S. Chopra: Dimensionality Reduction by Learning an Invariant Mapping. CVPR, 2006
[3] Y. Malkov and D. A. Yashunin: Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. arXiv:1603.09320, 2016
[4] A. Veit, B. Kovacs, S. Bell, J. McAuley, K. Bala, and S. Belongie: Learning Visual Clothing Style with Heterogeneous Dyadic Co-occurrences. ICCV, 2015