Search has long been an essential tool at Wayfair, enabling users to discover products among millions. While we offer great text and faceted search, those features can only go so far. It can be difficult to precisely describe an item you have in mind, especially in a way search engines can understand. But there’s one type of search query that’s very explicit, specific, and interesting – an image that you upload from your phone after you snap something in the real world, like so:
Over the past few months we’ve been rolling out visual search, which finds visually similar products based on an uploaded image. Users upload a photo and crop around an item of interest before hitting the search button. In less than a second, the page is populated with a collection of visually similar products. This feature was recently covered on TechCrunch, with some great commentary from one of our product managers, and we figured we would share some details on how we have been implementing it.
The heart of visual search lies in our representation of images. Images are mapped onto a low-dimensional latent space where similar images are located nearby. To find visually similar images for a query image, we simply find its neighbors within the latent space. Mapping images onto a meaningful latent space is achieved with a deep convolutional neural network.
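At its core, finding neighbors in the latent space is just a distance computation. Here is a minimal NumPy sketch with a toy 2-D latent space and made-up embeddings (the real space is higher-dimensional and populated by the network):

```python
import numpy as np

# Hypothetical catalog embeddings: each row is one product image
# mapped into a low-dimensional latent space.
catalog = np.array([
    [0.9, 0.1],   # product A
    [0.8, 0.2],   # product B (visually similar to A)
    [0.1, 0.9],   # product C (visually different)
])

def nearest_neighbors(query, embeddings, k=2):
    """Return the indices of the k embeddings closest to the query."""
    dists = np.linalg.norm(embeddings - query, axis=1)
    return np.argsort(dists)[:k]

query = np.array([0.88, 0.12])            # embedding of the uploaded photo
print(nearest_neighbors(query, catalog))  # closest products come first
```

In production this brute-force scan is replaced by tree-based indexes, described below, so lookups stay fast over millions of images.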
The network has two outputs: an embedding vector for the latent space and product-type classification probabilities for known product categories. The classifier takes the embedding vector as input, so the two tasks share computation. Training data consist of positive image pairs, negative image pairs, and class labels. Positive pairs are different images of the same product, while negative pairs are randomly sampled. Batches of image pairs are iterated through the network to optimize the weights for both tasks: positive pairs are pushed closer together than negative pairs, and classification accuracy is improved. We loosely followed implementations presented in this paper.
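The pair-based part of that objective can be sketched with a margin-based contrastive loss. This is a NumPy illustration of the general technique, under assumed values for the margin, not Wayfair's exact formulation:

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, is_positive, margin=1.0):
    """Margin-based pair loss: positive pairs are pulled together,
    negative pairs are pushed apart until they exceed the margin."""
    d = np.linalg.norm(emb_a - emb_b)
    if is_positive:
        return d ** 2                     # penalize any separation
    return max(0.0, margin - d) ** 2      # penalize only pairs inside the margin

# Two images of the same product should embed nearby...
same_pair = contrastive_loss(np.array([1.0, 0.0]), np.array([0.9, 0.1]), True)
# ...while a random negative pair is only penalized if it falls within the margin.
neg_pair = contrastive_loss(np.array([1.0, 0.0]), np.array([0.0, 1.0]), False)
```

Minimizing this loss over many batches shapes the latent space so that visual similarity corresponds to proximity, which is exactly what the downstream nearest-neighbor search relies on.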
Image search is handled via a k-nearest neighbor search in the latent space. Multiple images of each product are embedded with the network and stored in multiple binary tree structures for fast k-nearest neighbor search. A query image is pushed through the network to obtain its embedding and classification scores. High-probability class predictions restrict the search space, so only a subset of the binary trees is searched for nearest neighbors of the embedding. The final search result is a merged list of product images from the binary tree searches.
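The restrict-and-merge step might look like the following sketch. Plain per-class brute-force search stands in for the binary tree indexes, and the class names, embeddings, and probability threshold are all made up for illustration:

```python
import numpy as np

# One embedding index per product class (stand-in for the binary trees).
indexes = {
    "sofa":  np.array([[0.9, 0.1], [0.8, 0.2]]),
    "lamp":  np.array([[0.1, 0.9]]),
    "table": np.array([[0.5, 0.5]]),
}

def search(query_emb, class_probs, k=2, prob_threshold=0.2):
    """Search only the indexes of high-probability classes, then
    merge the per-class hits into one distance-sorted result list."""
    hits = []
    for cls, prob in class_probs.items():
        if prob < prob_threshold:
            continue                      # skip unlikely classes entirely
        for emb in indexes[cls]:
            hits.append((np.linalg.norm(emb - query_emb), cls))
    return [cls for _, cls in sorted(hits, key=lambda h: h[0])[:k]]

probs = {"sofa": 0.7, "table": 0.25, "lamp": 0.05}  # classifier output
print(search(np.array([0.85, 0.15]), probs))        # → ['sofa', 'sofa']
```

Restricting the search to likely classes keeps latency low: most of the catalog's trees are never touched for a given query.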
Speeding It Up
Early challenges with model performance hindered the real-time experience. Looking at existing visual-similarity projects, many used VGG base networks to produce feature maps, the output of a fully convolutional network. Replicating these projects, we got great results but subpar response times. Swapping the VGG base network for an Inception base network improved speed by more than 40%! Inception achieves this by parallelizing multiple convolutions over intermediate feature maps, whereas VGG’s serialized convolutions limit concurrent processing. We also experimented with dimensionality reduction to speed up downstream network computations and nearest neighbor searches.

These performance enhancements allow us to run image processing services on either GPU or CPU. Response time for visual search is 200ms on servers with an NVIDIA Tesla P100 and a 32-core CPU, while our 12-core CPU-only servers return results in 700ms. These are not small boxes, but there is nothing exotic about them these days. Visual search runs as a Python web service using Keras with a Theano backend. We are working to switch to a TensorFlow backend, which allows us to do neat things offline like distributed training.
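Dimensionality reduction of the kind mentioned above can be sketched with a plain NumPy PCA. This is a toy illustration of the idea, with invented sizes, not the projection we actually ship:

```python
import numpy as np

def pca_reduce(embeddings, n_components):
    """Project embeddings onto their top principal components,
    shrinking the vectors that downstream kNN search must handle."""
    centered = embeddings - embeddings.mean(axis=0)
    # SVD of the centered data gives the principal directions in the rows of vt.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(0)
embs = rng.normal(size=(100, 64))   # hypothetical 64-d embeddings
reduced = pca_reduce(embs, 16)      # keep only the top 16 components
print(reduced.shape)                # → (100, 16)
```

Smaller vectors mean cheaper distance computations and shallower index structures, which is where the nearest-neighbor speedup comes from.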
Going forward, we’re making strides in object detection to eliminate the cropping step and expand use cases. Improvements to support scaling visual search services are in the works as well. Wayfair is really excited to innovate in this space and make it easy and fun for users to find what they’re searching for.