Better Lucene/Solr searches with a boost from an external naive Bayes classifier
October 23, 2012
Me: Doug, what are you doing?
Doug: Solving the problem of class struggle with one of Greg’s classifiers.
Me: Karl Marx should call his office. What do you mean by that?
Doug: Let me explain…
Class struggle at Wayfair search used to manifest itself as searches for ‘red cups’ that returned lists of beer pong tables, and other unlikely and embarrassing results. In this particular example, our prospective customer was clearly looking for something in the ‘cups’ class of things, but the pack of search monkeys behind the scenes was misinterpreting the request. Why, you may ask? Easy answer. There happens to be a ‘Red Cup’ brand of beer pong tables, which forms a serendipitous linguistic pairing with these search terms, of a type that bedevils search engineers the world over. But not at Wayfair, not any more.
When you type something into the search box on Wayfair or AllModern, your request will most likely be processed by the Solr search platform. It’s a great platform, but in some cases vanilla Solr doesn’t do exactly what we want. In some of these situations we hack Solr, and in others we augment its functionality with a pre-processor. This blog post is about a classifier that we use as a pre-processor, which has dramatically improved the quality of our Solr search results.
The way we did this is not the only way to do such things. If you want to do everything in Java, there’s an emerging pattern (perhaps already common?) of Hadoop+Mahout+Solr, which Grant Ingersoll describes here (PowerPoint) and here (techie how-to), with particular attention to the index-time aspects of such setups. We don’t currently have a lot of needs in that area, because our catalogue is pretty well classified before it makes its way to Solr. Trey Grainger of CareerBuilder gives a good overview of various plain-Solr and search-time machine-learning techniques here. I think you could reasonably work along those lines by integrating some Mahout libraries into a custom search handler. You might get the same results we did.
To us, that seemed like a lot of trouble. We already have some machine-learning components, which we wrote for non-search purposes, in C, C++ and Python. That stack is closer to the metal than Java anyway, and we prefer to stick with it for most of the mathematical operations that we need in the machine-learning area. When we’re writing our own compiled code, or when we need C or C++ libraries like BLAS, GSL, etc., for the hard or fast math in our processing stream, an expressive scripting language like Python feels to us like a better choice for wrapper programs than Java. In a Cython/JNI bake-off, Cython wins hands down.
But enough about platform choice. The flow of control for the pre-processors is this:
- User types search terms into box.
- PHP code pre-processes the search terms, sometimes (as in this case) by calling out to a Python service.
- PHP code generates the Solr request, executes it, and displays the results.
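For readers who think better in code, here is a rough sketch of that flow. It is written in Python rather than PHP purely for illustration, and every name in it (`preprocess`, `build_request`, `handle_search`, the stand-in classifier and its made-up return values) is hypothetical. The fallback to a plain query when the pre-processor fails is also an assumption, not a description of our production code.

```python
def preprocess(terms, classifier=None):
    """Step 2: optionally ask an external service for hints about the terms."""
    if classifier is None:
        return {}
    try:
        return {"classification": classifier(terms)}
    except Exception:
        # Assumption: a failing pre-processor should never block the search.
        return {}


def build_request(terms, hints):
    """Step 3, first half: combine the raw terms with any hints."""
    request = {"q": terms}
    request.update(hints)
    return request


def handle_search(terms, execute, classifier=None):
    """Steps 1 to 3: search terms in, results out."""
    hints = preprocess(terms, classifier)
    return execute(build_request(terms, hints))


# Stand-ins so the sketch runs without a classifier service or a Solr server.
def fake_classifier(terms):
    # Made-up values, only here to show the shape of a hint.
    return {"confidence": 0.6, "class_ids": [283]}


def fake_execute(request):
    # A real implementation would send the request to Solr and return results.
    return request


print(handle_search("red cups", fake_execute, fake_classifier))
```

The point of the shape is that the pre-processor only ever adds hints to a request. If it is skipped or falls over, the search still runs, just without the extra smarts.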
Let’s return to our prospective customer who has typed ‘red cups’ into the search box, because he wants to buy red cups. Our unclassified search would reach for the ‘Red Cup’ beer pong tables, but we can school it a bit. We use the pre-processing step to prioritize items in the ‘cups’ class, or whatever we have that most closely resembles a ‘cups’ class. We send ‘red+cup’ to a naive Bayesian classifier trained on our product catalogue, and we get this: “Confidence: 0.605732999499 4087, Plates, Bowls & Mugs| 283, Cups & Accessories|5027, Food Wrap & Containers|…”. That output represents a confidence score and the identifiers and names of some classes in our catalogue. If the confidence score is above a certain threshold, we then generate a Solr query that boosts items in those classes.
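To make that step concrete, here is a minimal sketch in Python of parsing the classifier output above and turning it into boosted Solr parameters. Treat all of it as assumption: the field name `class_id`, the threshold of 0.5, the boost factor of 10, and the use of the edismax parser's `bq` (boost query) parameter are illustrative choices, not our actual schema or query.

```python
import re
from urllib.parse import urlencode

# Assumed value; the post only says "a certain threshold".
CONFIDENCE_THRESHOLD = 0.5

# The classifier output quoted in the post, including its trailing ellipsis.
SAMPLE = ("Confidence: 0.605732999499 4087, Plates, Bowls & Mugs| 283, "
          "Cups & Accessories|5027, Food Wrap & Containers|…")


def parse_classifier_output(line):
    """Return (confidence, [(class_id, class_name), ...]) from one output line."""
    match = re.match(r"\s*Confidence:\s*([0-9.]+)\s+(.*)", line)
    if match is None:
        raise ValueError("unrecognized classifier output: %r" % line)
    confidence = float(match.group(1))
    classes = []
    for entry in match.group(2).split("|"):
        # Class names may contain commas, so split only on the first one.
        class_id, sep, name = entry.strip().partition(",")
        if sep and class_id.strip().isdigit():
            classes.append((int(class_id), name.strip()))
    return confidence, classes


def build_solr_params(user_query, confidence, classes, boost=10.0):
    """Build request parameters, boosting the classes only when confident."""
    params = {"q": user_query, "defType": "edismax"}
    if confidence >= CONFIDENCE_THRESHOLD and classes:
        ids = " OR ".join(str(class_id) for class_id, _ in classes)
        # 'class_id' is a hypothetical field name for the catalogue class.
        params["bq"] = "class_id:(%s)^%s" % (ids, boost)
    return params


confidence, classes = parse_classifier_output(SAMPLE)
params = build_solr_params("red cups", confidence, classes)
print(params["bq"])
print(urlencode(params))
```

Because `bq` adds to the score rather than filtering, the beer pong tables are still in the result set. They just sink below the things that actually look like cups. Below the threshold, the sketch leaves the query alone.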
Boosting is a small opening into Solr, through which a big bucket of smarts from external systems can be poured. Of course, this only works if you have an intelligently classified catalogue, or whatever you call your body of material. Creating such a thing is a domain-specific black art if there ever was one. That’s a barrier to entry, to be sure. But if you can figure all that out (wink!), this technique can really help you.