ASCIIFoldingFilter, say what?

One of the coolest things about Solr/Lucene is the ability to apply any number of token filters when indexing (or querying) a piece of text. For those new to Lucene: when a piece of text is indexed, a stream of tokens is created from it during the analysis stage. Those tokens (along with some other information - their positions, offsets, etc.) are what is actually stored in the Lucene index, not the text itself. When you query the Lucene index, the same tokenization process is applied to your query string, which is why it is so important to specify the same token filters for querying and indexing (unless you really know what you’re doing … and even then, bad things will probably happen).
  
Now, some of the tokenizers and filters have names that are synonymous with their function: WhitespaceTokenizer (a tokenizer that divides text at whitespace), LowerCaseFilter (normalizes token text to lower case), StopFilter (removes stop words from a token stream), etc.

Others are a bit more confusing. Take, for instance, the ASCIIFoldingFilter - which, at face value, is a very ambiguous name. Lucene's documentation provides a bit more insight: "This class converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the 'Basic Latin' Unicode block) into their ASCII equivalents, if one exists" … but you’re probably still thinking, what the hell does that mean?!?

So, let’s suppose we indexed a document containing the text "this sentence is über", and it produced the tokens {'sentence', 'über'}. Now, suppose we query the Lucene index with the query term 'über'. Will it find a match for our document? Of course it will.

BUT, what if we queried the Lucene index with the query term 'uber' (note the difference)? Will it find a match then? … pause for effect … it will NOT. So, how do we get our document to match regardless of whether the user entered 'über' or 'uber'? That’s where the ASCIIFoldingFilter comes to the rescue!

If you re-read its description: it converts each character outside the standard ASCII block into its ASCII equivalent, if and only if one exists (otherwise, it leaves the character alone). So, if we apply the ASCIIFoldingFilter during analysis, our sentence "this sentence is über" produces the tokens {'sentence', 'uber'} … and querying with either 'über' or 'uber' yields a document match! Remember, query strings go through the same tokenization process as well, so your query for 'über' is actually converted to 'uber'.
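If you want to play with the general idea outside of Lucene, here is a rough sketch using only the JDK. Note this is NOT Lucene's actual implementation - the real filter is a large hand-written character mapping that also covers cases Unicode decomposition alone won't catch (ß -> ss, for example):

```java
import java.text.Normalizer;

// A rough approximation of accent folding: decompose each character
// (NFD splits 'ü' into 'u' + a combining diaeresis), then strip the
// combining marks. Characters with no decomposition are left alone.
public class AsciiFoldSketch {

    static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        // \p{M} matches Unicode combining marks (the accents split off by NFD)
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        System.out.println(fold("this sentence is über")); // this sentence is uber
    }
}
```

Again, this only handles the decomposable accents; Lucene's filter covers far more of the Unicode space, which is exactly why you should use it instead of rolling your own.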

Now, I leave it up to you to find out which characters have reasonable ASCII alternatives (è -> e, ß -> ss, etc.). Have fun!

See Lucene’s documentation for more info


MongoDB - Spring Integration

Interesting … Spring provides integration with MongoDB through its spring-data project: http://www.springsource.org/spring-data/mongodb

Some very interesting ideas and concepts that they have implemented:

  • Spring configuration support using Java based @Configuration classes or an XML namespace for a Mongo driver instance and replica sets.
  • MongoTemplate helper class that increases productivity performing common Mongo operations. Includes integrated object mapping between documents and POJOs.
  • Annotation based mapping metadata but extensible to support other metadata formats
  • Persistence and mapping lifecycle events
  • Low-level mapping using MongoReader/MongoWriter abstractions
  • Java based Query, Criteria, and Update DSLs
  • Automatic implementation of Repository interfaces including support for custom finder methods.
  • Map-Reduce integration
  • JMX administration and monitoring
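To give a flavor of the template and the Query/Criteria DSL from the list above, here is a minimal sketch based on the spring-data-mongodb documentation. It assumes the spring-data-mongodb jar on the classpath and a mongod running locally; the `Person` POJO and the database name "mydb" are made up for illustration:

```java
import java.util.List;

import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.data.mongodb.core.query.Criteria;
import org.springframework.data.mongodb.core.query.Query;

import com.mongodb.Mongo;

public class MongoTemplateSketch {

    // A plain POJO; spring-data maps it to and from a Mongo document.
    public static class Person {
        String name;
        int age;
        public Person() {}
        public Person(String name, int age) { this.name = name; this.age = age; }
    }

    public static void main(String[] args) throws Exception {
        MongoTemplate mongo = new MongoTemplate(new Mongo("localhost"), "mydb");

        mongo.save(new Person("Joe", 34));

        // The Java-based Query/Criteria DSL mentioned above
        List<Person> over30 =
            mongo.find(Query.query(Criteria.where("age").gte(30)), Person.class);
    }
}
```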

To read more, check out their reference documentation


The 10gen Blog on MongoDB and NoSQL: Free Webcast: MongoDB Schema Design: How to Think Non-Relational

Grails external logging

Grails has some fantastic built-in facilities for logging. If you haven’t seen them already, you should check them out.

There is only one problem: what happens if you want to configure your logging outside of Grails, at the app server layer? Grails is only one of the apps in our stack, so we prefer to control our logging configuration at the app server JVM level. Looking at the Grails user guide’s logging section, there is seemingly no clear way of doing this. But if you know one thing about Grails, you know that you can do anything with it; you just have to do some digging. After you do, you’ll find that Grails uses a servlet context listener to wrap and control logging in a WAR deployment: org.codehaus.groovy.grails.plugins.log4j.web.util.Log4jConfigListener. This is configured inside the web.xml that is deployed with the app, so you need to modify it before deployment. There are two ways to do this. You can run

grails install-templates

and edit out the Log4j listener element from src/templates/war/web.xml:

<listener>
   <listener-class>org.codehaus.groovy.grails.plugins.log4j.web.util.Log4jConfigListener</listener-class>
</listener>

Alternatively, you can create a file scripts/_Events.groovy under your app with these contents:

// Add this code to scripts/_Events.groovy under your Grails app
// Removes Log4jConfigListener from Grails web.xml

import groovy.xml.DOMBuilder
import groovy.xml.XmlUtil
import groovy.xml.dom.DOMCategory

eventCreateWarStart = { warName, stagingDir ->

   def webXmlFile = new File(stagingDir, '/WEB-INF/web.xml') 
   def wxml = DOMBuilder.parse(new StringReader(webXmlFile.text)).documentElement

   String className = 'org.codehaus.groovy.grails.plugins.log4j.web.util.Log4jConfigListener'
   use (DOMCategory) {
      def listenerNodes = wxml.'listener'
      for (n in listenerNodes) {
         if (n.'listener-class'.text() == className) {
            wxml.removeChild n
         }
      }
   }

   webXmlFile.withWriter { it << XmlUtil.serialize(wxml) }
}

That’s it!


Arya, a MongoDB-based Search Engine

Solr 3.5 Upgrade

Over the weekend, we upgraded our search boxes to Solr 3.5, which now offers grouping support for distributed searches (SOLR-2066 & SOLR-2776). Due to our ever-increasing set of search data, we have been indexing our search content to multiple Solr shards for quite some time now. As a result, we had not been able to experiment with this powerful feature … but not anymore!
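For reference, result grouping is driven by a couple of request parameters, and as of 3.5 they work when the request fans out to shards. A hypothetical sharded query might look like this (the field name `product_type` and the hosts are made up for illustration):

```
http://localhost:8983/solr/select?q=ipod
    &group=true
    &group.field=product_type
    &shards=localhost:8983/solr,localhost:8984/solr
```

`group=true` turns grouping on and `group.field` names the field to collapse on; the `shards` parameter is the usual distributed-search fan-out list.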

To read more about result grouping, check out Solr’s wiki: http://wiki.apache.org/solr/FieldCollapsing


UTF-8 all the way

Recently we have been tracking the various integration points in the Traackr technology stack and making sure we are 100% UTF-8 compliant all the way through. Each layer has its own setup and required configuration: sometimes it works right out of the box, sometimes it needs more tweaking.

Ran into this article while doing some research that shows how to make your web app UTF-8 enabled. A lot of configuration for something seemingly simple.

http://www.itnewb.com/tutorial/UTF-8-Enabled-Apache-MySQL-PHP-Markup-and-JavaScript
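For the record, the usual knobs are along these lines (a sketch of the kind of settings the article covers; exact directives vary by version, so treat these as starting points):

```
# Apache (httpd.conf): serve responses as UTF-8 by default
AddDefaultCharset utf-8

# MySQL (my.cnf): default the server to UTF-8
[mysqld]
character-set-server = utf8

# PHP: declare the charset on the response
header('Content-Type: text/html; charset=utf-8');

# HTML: declare it in the markup too
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
```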


Effective Scala

Using Traackr API

Today Engage121 announced they are launching a new version of their product that integrates with Traackr: Engage121 Launches Version 2.1

How do they do that, you might ask? Well, very easy: they are using our awesome API. I thought I would show you how you can do it too. We are going to build a little Traackr widget from one of our alpha lists, Cloud Computing. The widget will display random posts from the list on a web page.

First off, the HTML for the page. Let’s keep it simple. We load jQuery because we will need it later to load the A-List via the API and display the posts.

The body contains a simple DIV and TABLE where we will display the image for the influencer and the text of the post.

<!DOCTYPE html>
<html>
    <head>
        <title>AList Widget</title>
        <script type="text/javascript"
            src="https://ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js">
        </script>
    </head>
    <body>
        <h1>A-List Widget</h1>

        <!-- alist title -->
        <div id="alist-title"><i>Loading</i></div>

        <!-- random post to display -->
        <div id="alist-post" style="display: none; margin-top: 15px;">
            <table><tr>
                <td>
                    <!-- author's image -->
                    <img id="author" src=""/>
                </td>
                <td>
                    <!-- post text -->
                    <div id="post"></div>
                </td>
            </tr></table>
        </div>

    </body>
</html>

Now, the fun part. The trick is to load the A-List via our API; here is the link for it. If you are a Traackr customer, this link is accessible from your campaign’s settings.

Once we have loaded the A-List, we simply call the JavaScript function show_post() every 5 seconds to load a new post. We select each post by randomly picking one influencer from the list, then randomly picking one of that influencer’s channels, and finally one random post from that channel. Here is what it looks like:

<script type="text/javascript">
            $(document).ready(function(){
                $.ajax({
                    url: 'http://alist.traackr.com/influencers/all/4233.json',
                    data: {sec: '2728ea00020714632aa811e6f4a89e3a'},
                    dataType: 'jsonp',
                    jsonp: 'jsonpcallback',
                    success: function(data) { show_alist(data); }
                });
            });

            var alist = null;

            var show_alist = function(data) {
                // read list and display title
                alist = data;
                $('#alist-title').html(alist.name);
                setTimeout(show_post, 5000);
            } // End function show_alist()

            var show_post = function() {
                // Pick a random influencer. Math.random() is always < 1,
                // so multiplying by the full length covers every index.
                var current_influencer = Math.floor(Math.random() * alist.list.length);
                var influencer = alist.list[current_influencer];
                // Pick a random channel
                var current_channel = Math.floor(Math.random() * influencer.channels.length);
                var channel = influencer.channels[current_channel];
                // Check that the channel has posts
                if ( channel.posts.length > 0 ) {
                    // Pick a random post
                    var current_post = Math.floor(Math.random() * channel.posts.length);
                    // get data
                    var img  = influencer.pics.small;
                    var post = channel.posts[current_post].title;
                    var url  = channel.posts[current_post].url;
                    // display
                    $('#alist-post').hide();
                    $('#author').attr('src', img);
                    $('#post').html('<a target="_blank" href="' + url + '">' + post + '</a>');
                    $('#alist-post').fadeIn(750);
                    setTimeout(show_post, 5000);
                }
                else {
                    setTimeout(show_post, 100);
                }
            }
        </script>

15 min in the oven at 350 and we are done. Check out the final result.

And the best part about it? Traackr’s A-Lists refresh automatically every week, so you don’t have to do anything: just come back each week and discover new content.

That’s all folks!