The Visual Complements Model (ViCs): Complementary Product Recommendations From Visual Cues
March 16, 2020
Creating a recommendation system for home decoration poses unique challenges: each customer has their own taste and wants to maintain a cohesive personal style across their home. We at Wayfair know that it is hard to describe that taste in words and then search through huge product catalogs to curate a cohesive look. We therefore use machine learning algorithms that rely on visual cues to narrow down our expansive catalog and help our customers find the perfect items to complete their homes.
When shopping for the home, mixing and matching furniture pieces is a must, since things that match don’t always come as a bundle. With this in mind, we work hard to help customers find matches tailored to their own tastes. Compatible product recommendation is an important tool for easing the process of finding complementary items that go well with each other. It enables us to satisfy the needs of style-conscious customers and help them maintain a cohesive style across the home.
Most existing recommendation algorithms, such as collaborative filtering, are built on customers’ browsing history. But the uncertainty, diversity, and timeliness of each customer’s profile, as well as the absence of history for new customers, make it challenging for such algorithms to be robust across all customers. Moreover, models based on customer interactions are often biased and have a strong tendency to recommend low-priced and popular items. This, in turn, can create a cold start problem for new products. At Wayfair, we wanted to find a better way.
In this post, we discuss our newest method for helping customers find complementary items: the Visual Complements model (ViCs). Rather than depending on customer input, this model uses a convolutional neural network (CNN) to understand compatibility from product imagery, thereby mimicking the way customers find the pieces they want and eliminating the cold start problem in the process. ViCs aims to provide an understanding of compatibility for all Wayfair product imagery and to deliver recommendations for complementary, stylistically similar items across product classes.
To provide complementary product recommendations, the outputs of our model needed to represent the relative compatibility between products. To accomplish this, our goal was to create an embedding space that keeps compatible products close together while pushing incompatible products apart. Triplet loss, first introduced for facial recognition (Schroff et al., 2015), can be used here to learn a representative embedding for each piece. The triplet loss minimizes the distance between an anchor and a positive that stylistically matches the anchor, and maximizes the distance between the anchor and a negative that is stylistically incompatible with it.
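The triplet objective can be sketched in a few lines of plain Python. This is a minimal illustration of the standard hinge-style formulation, not the ViCs training code: the function names, the margin of 0.2, and the toy 2-D embeddings are all assumptions made for the example.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss for one triplet.

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is. The margin value here is
    an illustrative assumption.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

# Toy 2-D embeddings (made-up values, not real model outputs).
sofa = [0.0, 0.0]
matching_table = [0.1, 0.0]
clashing_table = [1.0, 0.0]

# The matching table is already much closer, so nothing is penalized.
easy = triplet_loss(sofa, matching_table, clashing_table)
# Swapping the roles produces a positive loss that training would reduce.
hard = triplet_loss(sofa, clashing_table, matching_table)
```

Only triplets that violate the margin contribute to the loss, which is why the choice of training triplets matters as much as the loss itself.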
In contrast to facial recognition tasks, which typically work with imagery that all belongs to the same domain (faces), our use case requires examining different features for different pairs of product classes. For example, compatible sofas and accent chairs might be made of the same material, whereas compatible coffee tables and sofas most likely are not. Taking this into account, we added a cross-entropy loss for class prediction, so that the model could learn to pay attention to different criteria when looking at different pairings of product classes.
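One plausible way to combine the two objectives is a weighted sum of the triplet term and a softmax cross-entropy term on the predicted product class. The sketch below assumes that form; the `class_weight` parameter, the class list, and the logit values are hypothetical, since this post does not specify how the two losses are balanced.

```python
import math

def cross_entropy(logits, target_index):
    """Softmax cross-entropy for a single example (numerically stable)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target_index]

def combined_loss(triplet_term, class_logits, class_label, class_weight=1.0):
    """Triplet term plus a weighted class-prediction term.

    `class_weight` is a hypothetical knob for balancing the two losses.
    """
    return triplet_term + class_weight * cross_entropy(class_logits, class_label)

# Illustrative classes and logits for one product image.
CLASSES = ["sofa", "accent chair", "accent table"]
logits = [2.0, 0.5, -1.0]   # the model leans towards "sofa"
loss = combined_loss(triplet_term=0.3, class_logits=logits, class_label=0)
```

Because the class term depends on the same shared features as the embedding, it gives the network a signal about which product class it is looking at while it learns compatibility.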
The performance of a deep learning model often relies on the quality of its training data, so for our ViCs model we incorporated training data from multiple sources to avoid bias. First, we performed importance sampling on the recommendations from our context-based style model RoSE v2. We also mined triplets from 3D scene graphs, which 3D artists at Wayfair use to render realistic images for products with 3D models. We did this to approximate an expert’s stylistic perspective, our assumption being that products included within a given scene curated by a 3D artist are stylistically compatible. Last but not least, we wanted to include customers’ demonstrated bias toward purchasing popular products (as successfully captured in other recommendation algorithms that leverage customer data, e.g. Wayfair’s RecNet). We therefore took customer browsing history into consideration, including products added to lists by customers and co-ordered products.
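To make the scene-based mining concrete, here is a small sketch under stated assumptions. It treats each scene as a list of (product id, product class) pairs and draws the negative at random from target-class products that do not appear in the same scene. That negative-sampling rule, the data layout, and the name `mine_triplets` are all hypothetical; the post only states that products sharing a scene are assumed to be compatible.

```python
import random

def mine_triplets(scenes, anchor_class, target_class, seed=0):
    """Build (anchor, positive, negative) product-id triplets from scenes.

    Positives share a scene with the anchor. Negatives are sampled from
    target-class products outside that scene (an assumption of this sketch).
    """
    rng = random.Random(seed)
    # Every product of the target class, across all scenes.
    pool = sorted({pid for items in scenes.values()
                   for pid, cls in items if cls == target_class})
    triplets = []
    for scene_id in sorted(scenes):
        items = scenes[scene_id]
        anchors = [pid for pid, cls in items if cls == anchor_class]
        positives = [pid for pid, cls in items if cls == target_class]
        in_scene = {pid for pid, _ in items}
        candidates = [pid for pid in pool if pid not in in_scene]
        if not candidates:
            continue  # no out-of-scene product to use as a negative
        for a in anchors:
            for p in positives:
                triplets.append((a, p, rng.choice(candidates)))
    return triplets

# Two made-up scenes, each with one sofa and one coffee table.
scenes = {
    "scene_1": [("sofa_1", "sofa"), ("table_1", "coffee table")],
    "scene_2": [("sofa_2", "sofa"), ("table_2", "coffee table")],
}
triplets = mine_triplets(scenes, "sofa", "coffee table")
```

Randomly sampled negatives can happen to be compatible with the anchor, which is one reason a human review step on mined triplets is valuable.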
In our training data, labeled triplets are composed of an Anchor item from class A, a Positive item from class B that stylistically matches the Anchor, and a Negative item from class B that is not compatible with the Anchor. For example, an Anchor sofa image would be more compatible with a Positive coffee table than with a Negative coffee table. Human labelers trained in recognizing stylistic attributes confirmed the quality of the unsupervised triplets that we mined from the sources mentioned above, considering color, shape, material and other factors that contribute to compatibility, so that they could serve as training data for the ViCs model.
To learn the embedding space, we used a Siamese network architecture with triplet loss, so that compatible products lie close to each other and incompatible products lie far apart.
We used transfer learning in our network: the base of our Siamese network is RoSE v3, our team’s previous model, which learns the style of room images by comparison and which was itself transfer-learned from ResNet 50 (He et al., 2016). We took the second-to-last layer of RoSE v3 as the embedding layer and L2-normalized the embedding vectors to constrain them to the unit hypersphere in d dimensions. In our final implementation, we used the squared Euclidean distance instead of the Euclidean distance and increased the margin as training progressed, both to facilitate convergence.
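The three implementation details above (L2-normalization, squared Euclidean distance, and a growing margin) can be illustrated together. This is a sketch rather than the actual training loop: the linear margin schedule and its start and end values are assumptions, as the post does not describe the schedule that was used.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so it lies on the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def squared_distance(u, v):
    """Squared Euclidean distance (no square root)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def margin_at(epoch, total_epochs, start=0.1, end=0.5):
    """Hypothetical linear schedule: margin grows from `start` to `end`."""
    if total_epochs <= 1:
        return end
    frac = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    return start + frac * (end - start)

def triplet_loss_sq(anchor, positive, negative, margin):
    """Triplet loss on L2-normalized embeddings with squared distances."""
    a, p, n = (l2_normalize(x) for x in (anchor, positive, negative))
    return max(0.0, squared_distance(a, p) - squared_distance(a, n) + margin)

# Margin schedule over a made-up 10-epoch run.
margins = [margin_at(e, 10) for e in range(10)]
```

For unit vectors the squared distance equals 2 - 2·cos(θ), so it always falls between 0 and 4. That bounded range makes a fixed margin easier to interpret, and starting with a small margin gives the network an easier target early in training.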
Examination of ViCs Performance Against Model Objectives
To assess the ViCs model, we first evaluated its performance against the model objectives using offline metrics.
One of our primary objectives was for the model to be able to distinguish the compatible item within each triplet. The model performed well against this metric, as shown in the following figure, which showcases examples of triplets on which human expert labels and the model output agree. Results such as this validate that ViCs is able to learn domain expert knowledge about compatibility between certain product classes in terms of color, shape, material and style, and to mimic the choices that humans would make.
But binary classification of the positive and negative was not our only goal; with this model we also wanted to minimize the distance between an anchor and its positive, while pushing its negative as far away as possible in the embedding space. The following figure presents a distribution of distances between pairs of products from the test dataset in the embedding space created by ViCs. From the two-peak distribution for Positive and Negative products, we can tell that the model is learning to separate the positive and negative as expected.
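A simple offline metric for this objective is the fraction of labeled triplets in which the positive ends up closer to the anchor than the negative. The sketch below assumes squared Euclidean distance and uses made-up embeddings; the name `triplet_accuracy` is illustrative and the post does not name the exact metric that was reported.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_accuracy(triplets):
    """Share of (anchor, positive, negative) embedding triplets where the
    positive is strictly closer to the anchor than the negative."""
    if not triplets:
        raise ValueError("need at least one triplet")
    correct = sum(
        1 for a, p, n in triplets
        if squared_distance(a, p) < squared_distance(a, n)
    )
    return correct / len(triplets)

# Two toy triplets: the model agrees with the label on the first only.
labeled = [
    ([0.0, 0.0], [0.1, 0.0], [1.0, 0.0]),   # positive closer: agreement
    ([0.0, 0.0], [1.0, 0.0], [0.1, 0.0]),   # negative closer: disagreement
]
accuracy = triplet_accuracy(labeled)
```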
Analysis of ViCs Performance in Use Cases
Beyond its performance against our objectives, we of course wanted to evaluate the ViCs model in an actual use case at Wayfair: product recommendations. To do so, we used a single branch of the trained ViCs model to embed all of the product images in Wayfair’s catalog. The embeddings thus represent a set of visual features of the products that contribute to compatibility. As a result, by doing a nearest neighbor search in the embedding space, we were able to offer compatible product recommendations such as those shown in Fig. 5.
Below is a sample result of compatible product recommendations from the ViCs model, where, given a sofa, stylistically similar yet diverse accent chairs and accent tables are recommended.
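The retrieval step can be sketched as a brute-force nearest neighbor search restricted to a target product class. The catalog layout, product ids, and the `recommend` function are hypothetical, and a catalog of real size would more likely rely on an approximate nearest neighbor index, which this post does not describe.

```python
import heapq

def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def recommend(query_id, catalog, target_class, k=3):
    """Return the ids of the k target-class products nearest to the query.

    `catalog` maps product id -> (product class, embedding vector).
    """
    query_vec = catalog[query_id][1]
    candidates = (
        (squared_distance(query_vec, vec), pid)
        for pid, (cls, vec) in catalog.items()
        if cls == target_class and pid != query_id
    )
    return [pid for _, pid in heapq.nsmallest(k, candidates)]

# Made-up catalog with 2-D embeddings.
catalog = {
    "sofa_1": ("sofa", [0.0, 0.0]),
    "chair_a": ("accent chair", [0.1, 0.0]),
    "chair_b": ("accent chair", [0.9, 0.2]),
    "chair_c": ("accent chair", [0.3, 0.1]),
    "table_a": ("accent table", [0.0, 0.05]),
}
chairs_for_sofa = recommend("sofa_1", catalog, "accent chair", k=2)
```

Filtering by class before ranking is what turns a similarity search into a cross-class recommendation: the query sofa is compared only against accent chairs, never against other sofas.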
As shown above, the ViCs model is able to leverage compatibility in various attributes. It can capture consistent features across product classes and carry those features from piece to piece. The recommended accent chairs have features such as sharp-lined legs with metal accents, various colors and fabrics that do not overpower the leather sofa, and/or tufted cushions that match the sofa handle detailing. For accent tables, there is again no clashing of colors, and a generally square or rectangular shape mirrors the shapes and lines of the back of the sofa.
Furthermore, as you can see in the example above, rather than solely providing recommendations for products that are so similar as to be nearly identical, ViCs is able to provide a diverse range of recommendations. These recommendations vary in both color and shape, for example, while still adhering to general stylistic similarity. One way ViCs achieves this is by recognizing stylistic similarity based on product materials. Across our three target classes here (sofas, accent chairs, and accent tables), ViCs was able to do this particularly well for accent tables. For example, in Fig. 5, the recommendations center on tables made of mixed metals and acrylic (a common combination in minimalist modern style) as opposed to leather and marble (which are common in Victorian styling).
With the initial version of ViCs, we have seen good results in our first target classes. One of our first steps in refining our current ViCs model will be to increase class coverage. We will also be working on leveraging the ViCs embeddings to form visual clusters of products which reflect the features that complete the look of a stylistically cohesive yet diverse assortment. These clusters will serve as a starting point for a customer’s shopping journey and function as a complementary approach to the clusters based on visual search embeddings.
Schroff, F., Kalenichenko, D., & Philbin, J. (2015). FaceNet: A unified embedding for face recognition and clustering. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). doi: 10.1109/cvpr.2015.7298682. p. 5
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778.