On the search and recommendations team at Wayfair, we like to think of ourselves as sophisticated men and women of the world.  We have speakers of French, Chinese, Hebrew, German and Persian in the crew.  Some of us host couchsurfers at our swanky apartments, and then turn around and couchsurf all over Europe.  Others travel to Singapore, attend music festivals in Thailand, etc., etc.  But until recently, when it came to giving our German-speaking customers a decent search experience, we could fairly have been characterized as inhospitable xenophobes.

What is it about German that was tripping us up? It’s easy enough to explain. Compared to English, German has many more words that are composites of smaller words, which you can write either as the individual words with spaces between them or as one word with all the parts run together. For example, ‘lautsprecher wand halter’ and ‘lautsprecherwandhalter’ are both acceptable ways to say ‘audio-speaker wall-mounting arm’.

It’s not that we never do this in English.  We take ‘dish’ and ‘washer’ and turn them into ‘dishwasher.’  But if we want to say ‘dishwasher safe’, that’s two words rather than one.  In German it’s ‘spülmaschinenfest’.  At Wayfair, we have plenty of spülmaschinenfest cookware, and we want our customers to be able to find it.
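Just to make the mechanics concrete, here is a toy illustration of the splitting problem: a naive, greedy, dictionary-driven splitter in Java. This is emphatically not how the real splitter we describe below works (that one is driven by German morphological data compiled into finite state transducers, and it knows about linking letters like the ‘n’ in the middle of ‘spülmaschinenfest’, which would trip up a greedy matcher like this one):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy example only: greedy longest-prefix matching against a word list.
public class ToyCompoundSplitter {

    private final Set<String> dictionary;

    public ToyCompoundSplitter(Set<String> dictionary) {
        this.dictionary = dictionary;
    }

    public List<String> split(String compound) {
        List<String> parts = new ArrayList<String>();
        int start = 0;
        while (start < compound.length()) {
            // Try the longest possible prefix first, then shrink it.
            int end = compound.length();
            while (end > start && !dictionary.contains(compound.substring(start, end))) {
                end--;
            }
            if (end == start) {
                // No known word here; keep the rest of the string whole and stop.
                parts.add(compound.substring(start));
                break;
            }
            parts.add(compound.substring(start, end));
            start = end;
        }
        return parts;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<String>(Arrays.asList("lautsprecher", "wand", "halter"));
        // Prints [lautsprecher, wand, halter]
        System.out.println(new ToyCompoundSplitter(dict).split("lautsprecherwandhalter"));
    }
}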

So we scoured the Solr documentation for ways to deal with this, and couldn’t find a solution.  Fortunately, as is so often the case with popular open-source platform software, we were not the first to encounter this problem.  A couple of fellows named Daniel Naber and Dawid Weiss got there first. A while back, they wrote an Apache-licensed German compound splitter, available here: https://github.com/dweiss/compound-splitter.  We needed to hack it a bit to get it to work with the latest Solr, which Jon from our team did.  Our patches are here: https://github.com/wayfair/compound-splitter, and Dawid has been good enough to accept our pull request and merge the changes back into master.

Let’s look at some examples of the behavior.

Before this change, ‘lautsprecher wand halter’ gave a perfectly good list of audio-speaker wall-mounting arms:

[screenshot: search results for ‘lautsprecher wand halter’]

But if we ran those words together, which is a reasonable thing for a German speaker to do, we got ‘keine relevanten Suchergebnisse’, ‘no relevant search results’:

[screenshot: ‘keine relevanten Suchergebnisse’ for ‘lautsprecherwandhalter’]

If you’re aushangin’ in your supercool Berlin loft, planning the hottest techno dance party of 2013, we want you to be able to buy the hardware to set up your speakers without making you guess how the Wayfair search team wants you to split up your words!

With the compound splitter, we’re back to good results, even with the words run together:

[screenshot: search results for ‘lautsprecherwandhalter’ with the compound splitter enabled]

Great!  Our compounds are working at this point, and the revisions to the compound splitter have been released back to the open-source community.

While I’m on this topic, I’ll give a little recap of some other internationalization tips and tricks for your Solr-backed sites.  This will be a HOWTO that encompasses some XML configuration and also some Java hacking.  I don’t think the hacks I’m going to describe are ready for us to make a Solr JIRA for general use, but they might help you if you have similar requirements and you’re up for patching your Solr install.

Let’s take German plurals.  They are of course constructed differently from English ones.  You can’t just add an ‘s’ to things and hope for the best.  Let’s examine the results for ‘washing machine’ and ‘washing machines’ (‘waschmaschine’ and ‘waschmaschinen’), before and after.  If you were looking for a washing-machine-shaped storage sack / hamper, ‘waschmaschine aufbewahrungssack’ was good enough to get you to this page:

[screenshot: product page found by ‘waschmaschine aufbewahrungssack’]

But try ‘waschmaschinen’, and no such luck:

Those mats are machine washable, so they’re showing up because of some language in the product descriptions.  But we do have better results, and we want to make sure they float to the top.
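
The plural half of the fix is stemming.  Solr’s SnowballPorterFilterFactory (which you’ll see in the field type below) wraps the Snowball German stemmer that ships with Lucene, and that stemmer reduces ‘waschmaschine’ and ‘waschmaschinen’ to the same stem, so either form of the query matches either form in the index.  Here is a quick sketch of what it does, assuming the Lucene snowball classes are on your classpath; the exact stem string is an implementation detail, what matters is that the two forms collide:

import org.tartarus.snowball.ext.GermanStemmer;

public class GermanStemDemo {
    public static void main(String[] args) {
        GermanStemmer stemmer = new GermanStemmer();
        for (String word : new String[] { "waschmaschine", "waschmaschinen" }) {
            // Snowball stemmers are stateful: set the input, stem, read the result.
            stemmer.setCurrent(word);
            stemmer.stem();
            System.out.println(word + " -> " + stemmer.getCurrent());
        }
        // Both should print the same stem, which is why the plural query
        // matches documents that only ever said "waschmaschine".
    }
}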

So let’s get to work on our Solr schema, and configure it properly for German plurals. We can show the compound splitter configuration snippets at the same time:

<fieldType name="parsedtext_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    ...
    <filter class="org.apache.lucene.analysis.de.compounds.GermanCompoundSplitterTokenFilterFactory" compileDict="true" dataDir="<path-to-your-compound-splitter-input-files(morphy.txt files in the code)-and-fsts>"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="name-of-stopword-file (words that won't be indexed)" enablePositionIncrements="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German" protected="name-of-protected-words-file (words it won't stem)"/>
  </analyzer>
  <analyzer type="query">
    ...
    <filter class="org.apache.lucene.analysis.de.compounds.GermanCompoundSplitterTokenFilterFactory" compileDict="false" dataDir="<path-to-your-compound-splitter-input-files(morphy files in the code)-and-fsts>"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="name-of-stopword-file (words that won't be queried)" enablePositionIncrements="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German" protected="name-of-protected-words-file (words it won't stem)"/>
  </analyzer>
</fieldType>

Notice that ‘compileDict’ is ‘true’ at index time and ‘false’ at query time.  The compileDict property directs Solr to distill a text configuration file down into a finite state transducer.  Finite state transducers are slow to create, but very quick to read, and you can pack a lot of reference information into them.  FSTs FTW! It makes me feel more intelligent just to write ‘finite state transducer’, let alone use it properly in a sentence, or make them work in my cluster of Solrs!
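
If you want to see the build-once, read-fast pattern in code, Lucene’s own FST API is a reasonable stand-in for what compileDict does with the morphy files.  The sketch below is modeled on the example in the Lucene FST package javadoc; the scratch and helper classes shift around a bit between Lucene versions, so treat it as a sketch rather than something to paste into your build:

import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IntsRef;
import org.apache.lucene.util.fst.Builder;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.PositiveIntOutputs;
import org.apache.lucene.util.fst.Util;

public class FstSketch {
    public static void main(String[] args) throws Exception {
        // Keys must be added in sorted order; here they map to arbitrary longs.
        String[] words = { "halter", "lautsprecher", "wand" };
        long[] values = { 1, 2, 3 };

        // Building the FST is the slow part...
        PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
        Builder<Long> builder = new Builder<Long>(FST.INPUT_TYPE.BYTE1, outputs);
        BytesRef scratchBytes = new BytesRef();
        IntsRef scratchInts = new IntsRef();
        for (int i = 0; i < words.length; i++) {
            scratchBytes.copyChars(words[i]);
            builder.add(Util.toIntsRef(scratchBytes, scratchInts), values[i]);
        }
        FST<Long> fst = builder.finish();

        // ...and lookups against the finished FST are the fast part.
        System.out.println(Util.get(fst, new BytesRef("wand"))); // 3
    }
}

The slow part is the sorted insertion loop; once finish() returns, a lookup is just a cheap walk over the automaton, and that asymmetry is the whole reason for compiling the dictionary once and reusing the result.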

While he was at it, Jon also added a conditional copy transformer to our fork of lucene-solr.  This is where you copy data from a source field to different places depending on circumstances.  It’s a fairly common thing to want to do in Solr, and there has been a lot of discussion of it over the years.  There’s a wiki article on how to do it in the context of a posted document update here.  But that doesn’t help if you want to declare behavior in configuration files to govern database exports to Solr.  There was a discussion of that point on the Lucene mailing list, with some encouragement for movement in this direction from no less than Grant Ingersoll here.

We do a lot of imports from authoritative data sources in a relational database.  We have a few fields that are common across the languages and countries that we support, but which need different index and query behavior based on the language or country.  So we make different derived fields for the different types of derived data.  We have a pretty big schema, but there are only a handful of fields that need this treatment, and since we support only two languages and a handful of countries, we don’t need to engineer this, at least for now, for arbitrarily large combinations of fields and languages.  We settled on some naming conventions, and now we just have to add something like this to the DataImportHandler configuration file:


<document>
  <entity name="name-of-index" dataSource="name-of-datasource" query="select stuff from source table" transformer="ClobTransformer,HTMLStripTransformer,RegexTransformer,ConditionalCopyTransformer">
    ....
    <field column="DataSourceTableColumnName1" name="SolrSchemaFieldName1" stripHTML="true" clob="true" sourceColName="DataSourceTableColumnName1" copyTo="de" fieldsToCheck="dispatchField" valuesToMatch="3"/>
    <field column="DataSourceTableColumnName2" name="SolrSchemaFieldName2" clob="true" sourceColName="DataSourceTableColumnName2" copyTo="uk,de" fieldsToCheck="dispatchField,dispatchField" valuesToMatch="2,3" keepOrig="false"/>
    ....
  </entity>
</document>

Then back in schema.xml, we declare these fields:

<field name="SolrSchemaFieldName1" type="parsedtext" indexed="true" stored="true" omitNorms="true"/>
<field name="SolrSchemaFieldName1_de" type="parsedtext_de" indexed="true" stored="true" omitNorms="true"/>
<field name="SolrSchemaFieldName2" type="spellcheckquery" indexed="true" stored="true" omitNorms="true"/>
<field name="SolrSchemaFieldName2_de" type="spellcheckquery" indexed="true" stored="true" omitNorms="true"/>
<field name="SolrSchemaFieldName2_uk" type="spellcheckquery" indexed="true" stored="true" omitNorms="true"/>

The first field is an example of an ordinary text field in our schema that needs to be tokenized with the compound splitter. The second field contains words for use in spell checking.  We have separate spelling dictionaries by country, even when the language is nominally the same.  It has come to our attention that we don’t exactly spell the Queen’s English on the Boston side of the Atlantic.  The ‘dispatchField’ is the column whose value varies from row to row; based on that value, we copy the text into one target field or another.
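
Jon’s actual transformer lives in our fork of lucene-solr, but to give you the shape of the thing, here is a simplified sketch built on the DataImportHandler’s Transformer and Context API.  The package layout and the details of the matching logic are made up for illustration; the copyTo / fieldsToCheck / valuesToMatch / keepOrig attribute names follow the configuration above:

import java.util.Map;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Transformer;

// Simplified sketch, not the production code: copy a column's value into
// language/country-suffixed fields when a dispatch field matches.
public class ConditionalCopyTransformer extends Transformer {

    @Override
    public Object transformRow(Map<String, Object> row, Context context) {
        for (Map<String, String> field : context.getAllEntityFields()) {
            String copyTo = field.get("copyTo");
            if (copyTo == null) {
                continue; // this <field> doesn't ask for conditional copying
            }
            String name = field.get("name");
            String column = field.get("column");
            String[] suffixes = copyTo.split(",");
            String[] dispatchFields = field.get("fieldsToCheck").split(",");
            String[] valuesToMatch = field.get("valuesToMatch").split(",");
            Object value = row.get(column);

            for (int i = 0; i < suffixes.length; i++) {
                Object dispatchValue = row.get(dispatchFields[i]);
                if (dispatchValue != null && valuesToMatch[i].equals(dispatchValue.toString())) {
                    // e.g. SolrSchemaFieldName1 copied into SolrSchemaFieldName1_de
                    row.put(name + "_" + suffixes[i], value);
                }
            }
            if ("false".equals(field.get("keepOrig"))) {
                row.remove(column); // drop the original when keepOrig="false"
            }
        }
        return row;
    }
}

The real version has to be more careful (multi-valued columns, missing attributes, whitespace in the comma-separated lists), but Context.getAllEntityFields() is the key move: it hands your transformer the raw attributes of every <field> element, including custom ones like copyTo.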

Happy patching and searching!