One of the coolest things about Solr/Lucene is the ability to apply any number of token filters when indexing (or querying) a piece of text. For those new to Lucene: when a piece of text is indexed, a stream of tokens is created from it during the analysis stage. Those tokens (along with some other information – their positions, offsets, etc.) are what actually get stored in the Lucene index, not the text itself. When you query the Lucene index, the same tokenization process is applied to your query string – which is why it is so important to specify the same token filters for querying and indexing (unless you really know what you’re doing … and bad things will still probably happen).
Now, some of the tokenizers and filters have names that are synonymous with their function: WhitespaceTokenizer (a tokenizer that divides text at whitespace), LowerCaseFilter (normalizes token text to lower case), StopFilter (removes stop words from a token stream), etc.
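In Solr, you wire these tokenizers and filters together into an analysis chain in schema.xml. As a sketch (the field type name here is my own invention), a chain using the three components above might look like:

```xml
<!-- hypothetical field type; "text_example" is just an illustrative name -->
<fieldType name="text_example" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- split the text on whitespace -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- normalize each token to lower case -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- drop common stop words listed in stopwords.txt -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
```

Since no separate query-time analyzer is declared, the same chain runs at both index and query time – exactly the symmetry described above.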

Others are a bit more confusing. Take, for instance, the ASCIIFoldingFilter – which, at face value, is a pretty opaque name. The Lucene documentation provides a bit more insight: “This class converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the ‘Basic Latin’ Unicode block) into their ASCII equivalents, if one exists” … but you’re probably still thinking, what the hell does that mean?!?

So, let’s suppose we indexed a document containing the text “this sentence is über”, and analysis produced the tokens {‘sentence’, ‘über’} (a stop filter having removed ‘this’ and ‘is’). Now, suppose we query the Lucene index with the term ‘über’. Will it find a match to our document? Of course it will.

BUT, what if we queried the lucene index with the query term ‘uber’ (note the difference). Will it find a match then? … pause for effect … it will NOT. So, how do we get our document to match regardless of whether the user entered ‘über’ or ‘uber’? That’s where the ASCIIFoldingFilter comes to the rescue!

If you re-read its description: the filter converts every character outside the standard ASCII block into its ASCII equivalent if, and only if, one exists (otherwise, it leaves the character alone). So, if we apply the ASCIIFoldingFilter during analysis, our sentence “this sentence is über” produces the tokens {‘sentence’, ‘uber’} … and querying with either ‘über’ or ‘uber’ yields a document match! Remember, query strings go through the same tokenization process as well, so your query for ‘über’ is actually converted to ‘uber’.

Now, I leave it up to you to find out which characters have reasonable ASCII alternatives (è -> e, ß -> ss, etc.). Have fun!
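If you want to experiment outside of Lucene, here is a rough Python approximation of ASCII folding using Unicode NFKD decomposition. To be clear, this is not what ASCIIFoldingFilter actually does internally – Lucene uses an explicit mapping table, which also covers cases like ß -> ss that plain decomposition misses:

```python
import unicodedata

def ascii_fold(text):
    """Approximate ASCII folding: decompose each character (NFKD),
    then drop the combining marks (accents), keeping what's left.
    Characters with no decomposition are left alone."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

print(ascii_fold('this sentence is über'))  # this sentence is uber
print(ascii_fold('crème brûlée'))           # creme brulee
print(ascii_fold('ß'))                      # ß (no decomposition, so it survives unchanged)
```

Note the last case: unlike Lucene’s filter, this sketch cannot map ß to ‘ss’, since that substitution isn’t a decomposition but a hand-maintained mapping.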

See Lucene’s documentation for more info.