Tuesday, June 11, 2013

SOLR facet component customizations

Here is the use case: assume you are developing an e-commerce search system, and one fine day someone from the marketing department says they want to sort the brand facet in a custom order, neither the increasing-count order nor the lexicographical order supported by the SOLR facet module (for example, show Samsung right after Apple in the tablet category's brand facet). In this case the development team usually ends up customizing the sort at the application layer for each facet field. I did this instead as a custom SOLR component. The following recipe shows how. I am not including all of the component logic, just the part that parses the facet component output and applies a different order.

public void process(ResponseBuilder rb) throws IOException {

    NamedList facet_counts = (NamedList) rb.rsp.getValues().get("facet_counts");
    if (facet_counts == null) return;

    NamedList<NamedList<Number>> facet_fields =
            (NamedList<NamedList<Number>>) facet_counts.get("facet_fields");
    if (facet_fields == null) return;

    TreeMap<String, Long> sortedMap = new TreeMap<String, Long>();
    for (Map.Entry<String, NamedList<Number>> facet : facet_fields) {
        String key = facet.getKey();
        // customFacetCache holds the per-field facet configuration loaded elsewhere in the component
        if (customFacetCache.containsKey(key)) {
            FacetObj facetDetails = customFacetCache.get(key);
            if (facetDetails != null) {
                String sortOrder = facetDetails.sortOrder;
                sortedMap.clear();
                for (Map.Entry<String, Number> entry : facet.getValue()) {
                    sortedMap.put(entry.getKey(), entry.getValue().longValue());
                }

                if (sortOrder.equalsIgnoreCase("custom")) {
                    // now introduce your own custom order (see the sketch below)
                }
            }
        }
    }
}
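
For the "custom" branch above, here is a minimal sketch of what the reordering might look like. It assumes FacetObj also carries a hypothetical customOrder list holding the facet values in the preferred sequence (that field name is my illustration, not part of the component shown); the write-back uses the stock NamedList API.

    if (sortOrder.equalsIgnoreCase("custom")) {
        // Build a new facet list: preferred values first, everything else after.
        NamedList<Number> reordered = new NamedList<Number>();
        for (String value : facetDetails.customOrder) {   // e.g. ["Apple", "Samsung", ...]
            Long count = sortedMap.remove(value);
            if (count != null) {
                reordered.add(value, count);
            }
        }
        for (Map.Entry<String, Long> rest : sortedMap.entrySet()) {
            reordered.add(rest.getKey(), rest.getValue());
        }

        // Swap the reordered list back into the response for this facet field.
        int idx = facet_fields.indexOf(key, 0);
        if (idx >= 0) {
            facet_fields.setVal(idx, reordered);
        }
    }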

Tuesday, May 28, 2013

SOLR 4.2 features: REST API to fetch SOLR schema, fields etc. in JSON

Most SOLR business-logic developers end up fetching the schema with a Luke request and parsing the SOLR response structures in their code. If you are a pure JSON geek, this is a turn-off.

However, starting with Solr 4.2, SOLR supports a REST API to request the schema in JSON format. Not only the entire schema file: one can request just a few fields, field types, dynamic fields, copy fields, etc. I wish it supported wild cards; maybe in a future release.

For now this is a solid beginning.

 Entire schema:
 http://localhost:8983/solr/collection1/schema?wt=json 

Request price field:
 http://localhost:8983/solr/collection1/schema/fields/price?wt=json

 Request the dynamic field ending with _i:
http://localhost:8983/solr/collection1/schema/dynamicfields/*_i?wt=json

 Request the date field type:
 http://localhost:8983/solr/collection1/schema/fieldtypes/date?wt=json

Old style:

LukeRequest request = new LukeRequest();
request.setShowSchema(true);
request.setMethod(METHOD.GET);
LukeResponse response = request.process(getServer());
if (response != null && response.getFieldInfo() != null) {
    Map<String, LukeResponse.FieldInfo> fieldInfoMap = response.getFieldInfo();
    Set<String> fieldKeys = fieldInfoMap.keySet();
    for (String fieldKey : fieldKeys) {
        // do your stuff
    }
}
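
And the new style: a minimal sketch (mine, not from the Solr docs) that pulls the same price-field definition from the REST endpoint shown above using plain java.net classes; parse the returned JSON with whatever JSON library you already use.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class SchemaRestExample {
    public static void main(String[] args) throws Exception {
        // Same endpoint as the "price" example above.
        URL url = new URL("http://localhost:8983/solr/collection1/schema/fields/price?wt=json");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        StringBuilder json = new StringBuilder();
        BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"));
        String line;
        while ((line = reader.readLine()) != null) {
            json.append(line);
        }
        reader.close();
        conn.disconnect();

        // Raw JSON describing the "price" field; parse it with your preferred JSON library.
        System.out.println(json.toString());
    }
}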

Friday, May 24, 2013

Sample R code to count the number of terms in end-user queries & plot


dups <- function(path) {
  # read the input query-term CSV, trimming whitespace ('path' replaces the original hard-coded file name)
  df <- read.csv(path, strip.white = TRUE)
  # cleanup of any redirect term (the original pattern did not survive; "redirect" is a stand-in)
  df[[1]] <- gsub("redirect", "", df[[1]], fixed = TRUE)
  # filter out duplicate queries
  ind <- duplicated(df)
  new.df <- df[!ind, ]
  # number of terms in a query = number of spaces + 1
  myh <- nchar(gsub("[^ ]", "", new.df[[1]])) + 1
  # buckets
  one   <- length(myh[myh == 1])
  two   <- length(myh[myh == 2])
  three <- length(myh[myh == 3])
  four  <- length(myh[myh == 4])
  five  <- length(myh[myh == 5])
  six   <- length(myh[myh == 6])
  seven <- length(myh[myh == 7])
  eight <- length(myh[myh == 8])
  cvec  <- c(one, two, three, four, five, six, seven, eight)
  result.frame = as.data.frame(matrix(ncol=2, nrow=10))
  names(result.frame) = c("Number", "Total")
  # following is OK for now

  result.frame = rbind(result.frame, c(1, one))
  result.frame = rbind(result.frame, c(2, two))
  result.frame = rbind(result.frame, c(3, three))
  result.frame = rbind(result.frame, c(4, four))
  result.frame = rbind(result.frame, c(5, five))
  result.frame = rbind(result.frame, c(6, six))
  result.frame = rbind(result.frame, c(7, seven))
  result.frame = rbind(result.frame, c(8, eight))

  plot(result.frame$Number,result.frame$Total,pch=19,col="blue" , xlab="Number of terms in a query" ,ylab="Total")


  lines(result.frame$Number, result.frame$Total, lwd = 4, col = "red")
  lm1 <- lm(Total ~ Number, data = result.frame)
  abline(lm1, lwd = 4, col = "green")


}

Wednesday, May 15, 2013

Moving from FAST to SOLR: Project Updates

Some background: The decision to move to SOLR had already been made. The team hardly knew anything about FAST ESP. However, the business team knew what they wanted in the new SOLR platform: a clone of the FAST-based search system that also addresses their pain points.

  Current status: Phase I of the project is complete & the system is already in production. (Despite all the complexities, we wrapped up the project in record time.)

  Positives: SOLR is fast in terms of content indexing & searches. (Overall QPS is higher than with FAST, and the earlier sizing issues are resolved.)

  Challenges: 1) During implementation we found heavy customization around business rules in the existing system. This functionality is not available out of the box in SOLR/Lucene, so I did some domain-specific customizations. 2) We are replacing a search product with decent relevancy (thanks to all those business rules) & we started late on relevancy. The relevancy ecosystem includes fine-tuning of similarity algorithms (tf/idf, BM25, etc.) plus fine-tuning of the synonym/spell-check modules. The SOLR synonym/spell-check modules need more improvements/core bug fixes, so again I did more customizations to meet our needs. 3) Dynamic range facets & site taxonomy traversal/updates need future work. The basic stuff is working; however, if the taxonomy changes often, doing incremental updates is a complex issue. For now, we have a workaround in place. To some extent, the business-rules work was invented to get around some of these problems. MapReduce & graph DB frameworks seem to address the issues around dynamic range facets/dynamic taxonomies, so we are exploring simple integration approaches between Hadoop & SOLR.

  Luck factor: The existing FAST-based search was not fully leveraging FAST's strong document processing capabilities (linguistic normalization/sentiment/taxonomy etc.), so we managed with minimal customizations around the Lucene analyzers.