Introducing RoSE: Wayfair’s Room Style Estimator

May 9, 2019

Esra Cansizoglu

Introduction:

Interior design and home decoration involve a high amount of guesswork. In addition to the visual appearance of each individual item, the arrangement of items in a room and how well they fit together are also highly important. These considerations combine to create the notoriously subjective concept of style. We at Wayfair know that finding the perfect home goods to fit your style can be difficult, whether you're moving into a new apartment and designing your space from scratch, or simply updating a few pieces of decor. You navigate a maze of showrooms, or search through massive online catalogs to find the pieces that are just right, all the while keeping in mind your stylistic preferences and the items you already have. At Wayfair, we want to do better.

Understanding the stylistic preferences of our customers is vital for us at Wayfair, since an accurate representation of each customer's interests enables us to provide better recommendations and site personalization. If you are a Wayfair customer, you might have noticed we have eight primary style tags on our website: modern & contemporary, traditional, eclectic, rustic, cottage / country, coastal, industrial, and glam. These stylistic terms are loosely defined and reflect trends that Wayfair has seen come to dominate the home goods space over time. We use these tags to help customers more easily find items that match their particular aesthetic. However, style-based shopping and browsing depend heavily on each customer's own interpretation of these style terms. It can be hard for customers to put their likes and dislikes into words, but when faced with a visual representation of a style, they can make more decisive choices about their preferences.

In this work, our goal is to communicate with our customers through images and inspire them with style-aware room recommendations. We provide customers an onboarding experience where they submit their preferences and see the photos most relevant to their needs. Our style quiz gives customers a personalized list of room ideas based on their likes and dislikes. Each room image is represented by the room style embedding inferred from our model: images close to liked images in this feature space are ranked higher, while images close to disliked images are pushed toward the end of the list.
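As a rough illustration of this ranking idea, the sketch below scores candidate rooms by how much closer their embeddings sit to the liked set than to the disliked set. The difference-of-mean-distances score and the toy 2-D embeddings are assumptions for illustration only; the post does not describe the exact scoring used in production.

```python
import numpy as np

def rank_rooms(candidates, liked, disliked):
    """Order candidate room embeddings so that rooms near liked images
    come first and rooms near disliked images are pushed back.
    Scoring: mean distance to liked minus mean distance to disliked
    (lower is better). This heuristic is an illustrative assumption."""
    def mean_dist(x, refs):
        return np.linalg.norm(refs - x, axis=1).mean() if len(refs) else 0.0
    scores = [mean_dist(c, liked) - mean_dist(c, disliked) for c in candidates]
    return np.argsort(scores)  # ascending score: best matches first

# Toy 2-D "embeddings": one liked room, one disliked room, three candidates.
liked = np.array([[0.0, 0.0]])
disliked = np.array([[10.0, 10.0]])
candidates = np.array([[9.0, 9.0], [1.0, 1.0], [5.0, 5.0]])

order = rank_rooms(candidates, liked, disliked)
print(order.tolist())  # [1, 2, 0] — candidate 1 is closest to the liked set
```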

In this post we describe how we trained the Room Style Estimator (RoSE), our in-house style model, to capture the distinctive visual features that reflect style.

 

Method:

Our goal is to build a room image retrieval framework that inspires customers by finding room ideas with similar stylistic characteristics to a given seed image. In other words, we want to train a model that measures the style similarity between room images. Given the subjectivity of the problem, we follow a deep learning-based classification approach that focuses on high-volume classes and high-agreement samples. Restricting training to such samples enables us to better capture the boundaries between classes.

Since training machine learning models requires a substantial amount of data, we started by curating a dataset of 800K room images. For labeling, we collaborated with style experts. To gather a range of opinions across the style spectrum, each image was tagged by 10 experts with one of the 8 master styles seen at Wayfair: modern & contemporary, traditional, eclectic, rustic, cottage / country, coastal, industrial, and glam. Our experts relied on guidelines in which each style is defined by characteristic fabrics, color schemes, materials, furniture styles, and flooring.

At this stage, we faced two major challenges. First, we observed a high amount of disagreement among experts. Second, the data displayed a large class imbalance. To address these issues, we decided to focus on high-volume classes and high-agreement samples. Consequently, our model was trained on the two main styles, modern & contemporary and traditional, which are highly popular and together cover more than 70% of our room images. For the training set we picked samples with high agreement among labelers, allowing the model to focus on highly discriminative features of each style.
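The high-agreement filtering step can be sketched as a simple majority-vote filter over the 10 expert labels per image. The 0.8 agreement threshold below is a hypothetical choice; the post does not state the exact cutoff used.

```python
from collections import Counter

def filter_high_agreement(labels_per_image, min_agreement=0.8):
    """Keep only images where a single style wins a large majority of
    the expert votes, and assign that winning style as the label.
    min_agreement=0.8 is an assumed threshold for illustration."""
    kept = {}
    for image_id, votes in labels_per_image.items():
        style, count = Counter(votes).most_common(1)[0]
        if count / len(votes) >= min_agreement:
            kept[image_id] = style
    return kept

votes = {
    "room_1": ["modern"] * 9 + ["glam"],              # 0.9 agreement -> kept
    "room_2": ["traditional"] * 5 + ["rustic"] * 5,   # 0.5 agreement -> dropped
}
print(filter_high_agreement(votes))  # {'room_1': 'modern'}
```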

For the first version of RoSE, we followed a classification approach using deep neural networks. We transfer-learned from a VGG architecture [1] pre-trained for scene classification on the Places365 dataset [2]. We removed the final softmax layer and added an 8-dimensional hidden layer followed by a binary cross-entropy loss, as seen in Figure 1. We used the second-to-last layer of the network as a feature extractor in our retrieval experiments.

 

Figure 1. VGG Architecture used in our model.

 

How we use the model:

RoSE achieves a classification accuracy of 88.7% on modern and traditional room images (see Figure 2). However, the major use case for our model is a room image retrieval framework in which the output of the second-to-last layer is treated as a visual embedding for each room image. We formed a retrieval test set covering all styles and measured a recall at 1 of 0.4, compared to 0.2 for a random baseline. Although the model is trained on only two styles, it still works well on the others, demonstrating its ability to represent style-specific features.
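A recall-at-1 evaluation of this kind can be computed with a leave-one-out nearest-neighbour check, as sketched below. Euclidean distance and the tiny synthetic embeddings are assumptions for illustration; the post does not specify the distance metric used.

```python
import numpy as np

def recall_at_1(embeddings, labels):
    """Leave-one-out recall@1: for each query image, check whether its
    nearest neighbour (Euclidean distance) shares the same style label."""
    hits = 0
    for i, q in enumerate(embeddings):
        d = np.linalg.norm(embeddings - q, axis=1)
        d[i] = np.inf  # exclude the query itself
        hits += labels[int(np.argmin(d))] == labels[i]
    return hits / len(embeddings)

# Synthetic 2-D "embeddings": two well-separated style clusters.
emb = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
lbl = ["modern", "modern", "traditional", "traditional"]
print(recall_at_1(emb, lbl))  # 1.0 — every nearest neighbour matches
```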

 

Figure 2. Prediction results on our test set. Images on the leftmost and rightmost sides are correctly predicted as modern and traditional, respectively, while the middle images yielded low confidence scores.

 

Future Work:

Now that we have a better understanding of the room style space, we would like to understand the mapping between room styles and products. This will be highly useful for customers looking for items that complete the look of a room alongside their existing furniture.

 

 

References:

[1] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[2] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
