Wayfair at WSDM: Conference Recap
February 26, 2018
At the beginning of February, I attended one of ACM’s premier conferences, WSDM 2018 in Los Angeles. Apart from the Hollywood-style banquet dinner on a boat, I was looking forward to the opportunity to talk through ideas with some of the brightest minds in the world from both academia and industry. Here are my nuggets from the conference:
As humans, we leverage some subset of our senses (hearing, sight, touch, etc.) when we learn new skills. Recall the first time you learned how to ride a bike or play catch. You ‘listened’ to instructions, ‘balanced’ yourself using your feet, and ‘felt’ the environment around you before you could estimate or predict what the best response would be. For people, it’s second nature. But for machines, this kind of multi-sensor, or in other words multimodal, learning is a challenge. Machine learning algorithms currently do incredibly well at certain narrow tasks within a single modality, but they struggle when they have to combine information from multiple sources. With the advancement of both Deep Learning architectures and AI accelerators, researchers have come up with innovative ways to leverage multimodal information.
From Saeid Balaneshin-Kordan and Alexander Kotov. 2018. Deep Neural Architecture for Multi-Modal Retrieval based on Joint Embedding Space for Text and Images. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining (WSDM ’18). ACM, New York, NY, USA, 28-36. DOI: https://doi.org/10.1145/3159652.3159735
One piece of interesting work was Katrien Laenen’s paper on designing an architecture that can perform image-augmented text search. This is highly relevant to Wayfair’s e-commerce offerings, since they can also be described by both images and text. Essentially, if you are looking to buy a black dress that goes well with a pair of shoes, you’ll be able to upload a photo of the shoes, add a text-based query like ‘black dress’, and the search engine will return black dresses that match your shoes. Pinterest recently launched a similar capability called Lens Your Look.
What I learned: Multimodal learning is the next step of innovation, with the potential to create ML algorithms that get ever closer to how humans make decisions.
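To make the idea of image-augmented text search concrete, here is a minimal sketch, assuming that text and images have already been mapped into one shared embedding space. It is not the architecture from the paper: the weighted-average fusion, the function names (`fuse_query`, `search`), the `text_weight` parameter, and the toy 3-dimensional vectors are all hypothetical and chosen only for illustration.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))

def fuse_query(text_vec, image_vec, text_weight=0.5):
    """Blend the two query embeddings with a weighted average (an assumption,
    not the fusion used in the paper)."""
    t, i = normalize(text_vec), normalize(image_vec)
    blended = [text_weight * x + (1 - text_weight) * y for x, y in zip(t, i)]
    return normalize(blended)

def search(text_vec, image_vec, catalog, top_k=2):
    """Rank catalog items by cosine similarity to the fused query."""
    query = fuse_query(text_vec, image_vec)
    ranked = sorted(catalog, key=lambda name: cosine(query, catalog[name]), reverse=True)
    return ranked[:top_k]

# Made-up embeddings; the axes loosely mean (dress-ness, black-ness, formal style).
catalog = {
    "black evening dress": [0.9, 0.9, 0.8],
    "black casual dress": [0.9, 0.9, 0.1],
    "red evening dress": [0.9, 0.1, 0.8],
    "white sneakers": [0.0, 0.0, 0.1],
}
text_query = [1.0, 1.0, 0.0]   # stands in for the text 'black dress'
image_query = [0.0, 0.6, 1.0]  # stands in for a photo of black formal shoes

results = search(text_query, image_query, catalog)
print(results)
```

With these toy numbers, the text vector on its own would rank the casual dress first; blending in the image vector moves the evening dress to the top, which is the behavior the shoe example describes.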
Commercial Computer Vision and Computational Imaging have existed for a long time, but we’re now seeing Deep-Learning-powered Computer Vision being provided as a service. Ken Weiner, CTO of GumGum, who was at the conference, says there’s a huge market for such tech in almost all sectors, from creating augmented online advertising (one of his company’s focus areas), to placing instant filters on Snapchat, to providing mobile eye exams for poor communities (shoutout to EyeNetra).
You are now able to leverage state-of-the-art Computer Vision algorithms with minimal effort and drive value almost instantaneously. Google recently launched its machine-learning-as-a-service platform, Cloud AutoML. No prizes for guessing what their most performant model and service was. Here at Wayfair, our in-house Deep-Learning-powered Computer Vision capabilities serve Search with Photo, Out of Stock recommendations, Duplicate Review, and Automated Merchandising, among other image-based applications. In a short period of time, Computer Vision has become a cornerstone of innovation.
What I learned: Text is dead! Look around you: people connect and converse in images and GIFs now. (Okay, this may have to do with the popularity of memes!)
A Strategy to Address Bias in Crowdsourcing
Crowdsourcing has been a popular way for researchers and ML practitioners to collect data. Amazon’s Mechanical Turk is a great example of a cheap and efficient way to collect a lot of well-annotated data. However, setting up quality training data with current frameworks necessitates a very nuanced understanding of the subject matter. Even with comprehensive annotator training and guidance, bias remains prevalent. One innovative idea in this space was Ballpark Crowdsourcing: The Wisdom of Rough Group Comparisons. The authors of the paper suggest a ballpark annotation setting for crowdsourcing. Instead of asking the annotators to annotate every image, one could construct bags of images based on some simple attributes and ask people to guess which bag had images that go well together. The assumption here is that annotators’ guesses on simple groups require less expertise than individual labels.
At Wayfair Computer Vision, we often crowdsource image annotations for various projects such as visual similarity, object detection, style detection, etc. One of our tools is designed on the same principle described above: annotators are shown images of the same product (different colors, finishes, styles, etc.) and are asked to group together the images that are visually similar.
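As a rough illustration of how grouping judgments like these could become training data, the sketch below converts each annotator’s groups into pairwise similar/not-similar labels and averages them across annotators. This is an assumption about one plausible post-processing step, not a description of Wayfair’s actual tool; the function names and image ids are hypothetical.

```python
from itertools import combinations

def groups_to_pairs(groups):
    """Turn one annotator's grouping into pairwise labels.

    groups: list of lists of image ids; ids in the same list were judged similar.
    Returns a dict mapping a sorted (id_a, id_b) tuple to 1 (similar) or 0 (not).
    """
    group_of = {img: idx for idx, group in enumerate(groups) for img in group}
    labels = {}
    for a, b in combinations(sorted(group_of), 2):
        labels[(a, b)] = 1 if group_of[a] == group_of[b] else 0
    return labels

def aggregate(annotations):
    """Average pairwise labels across annotators into an agreement score in [0, 1]."""
    totals, counts = {}, {}
    for groups in annotations:
        for pair, label in groups_to_pairs(groups).items():
            totals[pair] = totals.get(pair, 0) + label
            counts[pair] = counts.get(pair, 0) + 1
    return {pair: totals[pair] / counts[pair] for pair in totals}

# Hypothetical image ids for one product, as grouped by three annotators.
annotations = [
    [["oak_front", "oak_side"], ["white_front"]],
    [["oak_front", "oak_side"], ["white_front"]],
    [["oak_front"], ["oak_side", "white_front"]],
]
scores = aggregate(annotations)
for pair, score in sorted(scores.items()):
    print(pair, round(score, 2))
```

Pairs with scores near 0 or 1 reflect strong agreement, while scores in the middle flag images where annotators disagree and where bias or ambiguity may deserve a closer look.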
What I learned: While getting data annotated is easy and cheap, creating quality training data involves thinking through ways to mitigate bias introduced by annotators.