Tuesday, June 11, 2013

SOLR facet component customizations

Here is the use case: assume you are developing an e-commerce search system, and one fine day someone from the marketing department asks to sort the brand facet differently, in neither the count order nor the lexicographical order supported by the SOLR facet module (for example, bring Samsung after Apple in the brand facet of a tablet category). In this case the development team usually ends up customizing the sort at the application layer for each facet field. I did this instead as a custom SOLR component, and the following recipe shows the idea. I am not including all of the component logic, only the part that parses the facet component output and applies a different order.

@SuppressWarnings("unchecked")
public void process(ResponseBuilder rb) throws IOException {

    NamedList<Object> facet_counts = (NamedList<Object>) rb.rsp.getValues().get("facet_counts");
    if (facet_counts == null) return;

    NamedList<NamedList<Number>> facet_fields =
            (NamedList<NamedList<Number>>) facet_counts.get("facet_fields");
    if (facet_fields == null) return;

    TreeMap<String, Long> sortedMap = new TreeMap<String, Long>();
    for (Map.Entry<String, NamedList<Number>> facet : facet_fields) {
        String key = facet.getKey();
        if (!customFacetCache.containsKey(key)) continue;

        FacetObj facetDetails = customFacetCache.get(key);
        if (facetDetails == null) continue;

        String sortOrder = facetDetails.sortOrder;

        // copy this field's facet values and counts
        sortedMap.clear();
        for (Map.Entry<String, Number> entry : facet.getValue()) {
            sortedMap.put(entry.getKey(), entry.getValue().longValue());
        }

        if (sortOrder.equalsIgnoreCase("custom")) {
            // now introduce your own custom order.
        }
    }
}
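As a rough sketch of what that custom order could look like: assume FacetObj also carries a hypothetical customOrder list (not shown above) holding the desired brand sequence, e.g. Apple before Samsung. Values from that list are emitted first, everything else keeps its existing order, and the rebuilt list is written back into facet_fields:

        if (sortOrder.equalsIgnoreCase("custom")) {
            // Hypothetical: facetDetails.customOrder is a List<String> with the desired sequence.
            NamedList<Number> reordered = new NamedList<Number>();
            for (String value : facetDetails.customOrder) {
                Long count = sortedMap.remove(value);
                if (count != null) {
                    reordered.add(value, count);   // configured values first, in configured order
                }
            }
            for (Map.Entry<String, Long> rest : sortedMap.entrySet()) {
                reordered.add(rest.getKey(), rest.getValue());   // any remaining values after that
            }
            // replace this field's facet entry with the re-ordered list
            facet_fields.setVal(facet_fields.indexOf(key, 0), reordered);
        }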

Tuesday, May 28, 2013

SOLR 4.2 features: REST API to fetch the SOLR schema, fields etc. in JSON

Most SOLR business-logic developers end up fetching the schema with a Luke request
and parsing the SOLR response structures in their own code.
If you are a pure JSON geek, that approach is a turn-off.

However, starting with Solr 4.2, SOLR supports
a REST API to request the schema in JSON format.
And not only the entire schema file: one can request just a few fields or field types,
dynamic fields, copy fields, etc.
I wish it supported wild cards; maybe in the future.

For now this is a solid beginning.

 Entire schema:
 http://localhost:8983/solr/collection1/schema?wt=json 

Request price field:
 http://localhost:8983/solr/collection1/schema/fields/price?wt=json

 Request dynamic fields ending with _i:
http://localhost:8983/solr/collection1/schema/dynamicfields/*_i?wt=json

 Request the date field type:
 http://localhost:8983/solr/collection1/schema/fieldtypes/date?wt=json
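If you want to pull one of these endpoints from Java without SolrJ, here is a minimal sketch with the plain JDK HTTP client (the URL is just the price-field example above):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class SchemaRestClient {
    public static void main(String[] args) throws Exception {
        // same endpoint as the "price field" example above
        URL url = new URL("http://localhost:8983/solr/collection1/schema/fields/price?wt=json");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"));
        StringBuilder json = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            json.append(line);
        }
        in.close();

        // raw JSON; hand it to whatever JSON parser you already use
        System.out.println(json.toString());
    }
}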

Old style:

LukeRequest request = new LukeRequest();
request.setShowSchema(true);
request.setMethod(METHOD.GET);
LukeResponse response = request.process(getServer());
if (response != null && response.getFieldInfo() != null) {
    Map<String, LukeResponse.FieldInfo> fieldInfoMap = response.getFieldInfo();
    for (String fieldKey : fieldInfoMap.keySet()) {
        // do your stuff
    }
}

Friday, May 24, 2013

Sample R code to count the number of terms in end-user queries & plot


dups <- function(input) {
  # read the query CSV, stripping surrounding whitespace from each term
  df <- read.csv(input, strip.white = TRUE)

  # cleanup of any redirect term in the query column
  df[[1]] <- gsub("redirect", "", df[[1]], fixed = T)

  # filter out duplicate queries
  ind <- duplicated(df)
  new.df <- df[!ind, , drop = FALSE]

  # number of terms per query = number of spaces + 1
  myh <- nchar(gsub("[^ ]", "", new.df[[1]])) + 1

  # buckets
  one   <- length(myh[myh == 1])
  two   <- length(myh[myh == 2])
  three <- length(myh[myh == 3])
  four  <- length(myh[myh == 4])
  five  <- length(myh[myh == 5])
  six   <- length(myh[myh == 6])
  seven <- length(myh[myh == 7])
  eight <- length(myh[myh == 8])
  cvec  <- c(one, two, three, four, five, six, seven, eight)
  result.frame = as.data.frame(matrix(ncol=2, nrow=10))
  names(result.frame) = c("Number", "Total")
  # following is OK for now

  result.frame = rbind(result.frame, c(1, one))
  result.frame = rbind(result.frame, c(2, two))
  result.frame = rbind(result.frame, c(3, three))
  result.frame = rbind(result.frame, c(4, four))
  result.frame = rbind(result.frame, c(5, five))
  result.frame = rbind(result.frame, c(6, six))
  result.frame = rbind(result.frame, c(7, seven))
  result.frame = rbind(result.frame, c(8, eight))

  plot(result.frame$Number,result.frame$Total,pch=19,col="blue" , xlab="Number of terms in a query" ,ylab="Total")


  lines(result.frame$Number, result.frame$Total, lwd = 4, col = "red")
  lm1 <- lm(Total ~ Number, data = result.frame)
  abline(lm1, lwd = 4, col = "green")


}

Wednesday, May 15, 2013

Moving from FAST to SOLR: Project Updates

Some background: the decision to move to SOLR was already made. The team hardly knew anything about FAST ESP. However, the business team knew what they wanted in the new SOLR platform: a clone of the FAST-based search system, while addressing their pain points.

  Current status: Phase I of the project was completed & the system is already in production. (With all the complexities, we wrapped up the project in record time.)

  Positives: SOLR is fast in terms of content indexing & searches. (Overall QPS is greater than FAST's, and the earlier sizing issues are resolved.)

  Challenges:
  1) During implementation we noticed heavy customizations around business rules. This functionality is not available in SOLR/Lucene out of the box, so I did some domain-specific customizations.
  2) We are replacing a search product with decent relevancy (largely because of all those business rules) & we started late on relevancy. The relevancy ecosystem includes fine-tuning of similarity algorithms (tf/idf, BM25, etc.) plus fine-tuning of the synonym/spell-check modules. The SOLR synonym/spell-check modules need more improvements/core bug fixes; again, I did more customizations to meet the needs.
  3) Dynamic range facets & site taxonomy traversal/updates need future work. The basic stuff is working, but if the taxonomy changes often, doing incremental updates is a complex issue; for now we have a workaround in place, and to some extent the business-rules work was invented to get around these problems. Map-reduce & graph DB frameworks seem to address dynamic range facets/dynamic taxonomies, so I am exploring simple integration approaches between Hadoop and SOLR.

  Luck factor: the existing FAST-based search was not fully leveraging FAST's strong document-processing capabilities (linguistic normalization/sentiment/taxonomy etc.), so we managed with small customizations around the Lucene analyzers.

Tuesday, May 15, 2012

Lucene revolution conference 2012

After a long time, I attended the Lucene Revolution conference. Overall it was a good experience. A few sessions were really good & a few missed the mark.

Personally I liked the following sessions because of the content and the presenters' energy/passion for search technology.

Automata Invasion
Challenges in Maintaining a High Performance Search Engine Written in Java
Updateable Fields in Lucene and other Codec Applications
Solr 4: The SolrCloud Architecture

Also, the “Stump the Chump” questions were interesting & I learned quite a bit.
I won small prizes too. I also missed a few good sessions in the Big Data area.
In general, Lucid Imagination uploads the conference videos at the following location; keep watching.

http://www.lucidimagination.com/devzone/videos-podcasts/conference-videos

Friday, May 04, 2012

Mapping hierarchical data into SOLR

Experiments with Solr 4.0 (beta)

While modeling hierarchical data in XML is easy (for example org charts or bill-of-materials structures), mapping it to persistent storage is very challenging. Relational SQL can manage it, but fetching/updating the hierarchies is very difficult. Entire books have been written on mapping trees/graphs into the RDBMS world.

Consider a simple hierarchy that looks like this:
Satya
Saketh
Dhanvi
Venkata
Dhasa

The most common and familiar method is the adjacency model, which works because every node knows its adjacent (parent) node. (In the SOLR world, the ID field holds the unique value & the parent field holds the parent's ID. Assume for the root node it is null or the same as its own ID.)

In SOLR each row is a document:

SOLRID Name Parent
01 Satya NULL
02 Saketh 01
03 Dhanvi 01
04 Venkata 01
05 Dhasa 04
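
As a quick SolrJ indexing sketch of the rows above (assuming id, name and parent fields exist in the schema and a SolrServer instance named solr, as in the SolrJ examples later in this blog):

// Index the adjacency-list rows above; each row becomes one SOLR document.
String[][] rows = {
        {"01", "Satya",   null},
        {"02", "Saketh",  "01"},
        {"03", "Dhanvi",  "01"},
        {"04", "Venkata", "01"},
        {"05", "Dhasa",   "04"},
};
for (String[] row : rows) {
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", row[0]);
    doc.addField("name", row[1]);
    if (row[2] != null) {
        doc.addField("parent", row[2]);   // in this sketch the root node simply omits the parent field
    }
    solr.add(doc);
}
solr.commit();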

The latest SOLR (4.0 beta) join functionality gives you the full hierarchy (or any piece of it) very quickly and easily.

Example queries:

1) Give me the complete hierarchical list: q={!join from=id to=parent}id:*
2) Give me the immediate children of Satya: q={!join from=id to=parent}name:Satya
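
And a small SolrJ sketch of issuing the second query (again assuming a SolrServer instance named solr):

// Fetch the immediate children of "Satya" via the join query parser.
SolrQuery query = new SolrQuery();
query.setQuery("{!join from=id to=parent}name:Satya");
QueryResponse response = solr.query(query);
for (SolrDocument child : response.getResults()) {
    System.out.println("child: " + child.getFieldValue("name"));
}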

An example SOLR query component to pull other objects

Consider a LinkedIn-style example, i.e. I have a set of connections. With a single SOLR request, the following SOLR component brings back all first-level connection information. In the RDBMS world, a typical example is an employee table, i.e. bring me the first-level reports of a manager, where a single table contains both employee and manager information (a typical adjacency list).
In terms of the physical SOLR mapping, every document contains a connections field which holds the list (adjacency list) of connections (root ids).

From a configuration point of view, register the component in solrconfig.xml.
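A minimal registration sketch (the class package is an assumption, and the component is appended to whichever request handler you use):

<searchComponent name="example" class="org.example.solr.ExampleComponent"/>

<requestHandler name="/select" class="solr.SearchHandler">
  <arr name="last-components">
    <str>example</str>
  </arr>
</requestHandler>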



public class ExampleComponent extends SearchComponent
{
    public static final String COMPONENT_NAME = "example";

    @Override
    public void prepare(ResponseBuilder rb) throws IOException
    {
    }

    @SuppressWarnings("unchecked")
    @Override
    public void process(ResponseBuilder rb) throws IOException
    {
        DocSlice slice = (DocSlice) rb.rsp.getValues().get("response");
        SolrIndexReader reader = rb.req.getSearcher().getReader();
        SolrDocumentList rl = new SolrDocumentList();
        int docId = 0; // at this point consider only one rootid.
        for (DocIterator it = slice.iterator(); it.hasNext(); ) {
            docId = it.nextDoc();
            Document doc = reader.document(docId);
            String id = (String) doc.get("id");
            String connections = (String) doc.get("contains");
            System.out.println("\n id:" + id + " contains-->" + connections);

            // split the comma-separated adjacency list into individual ids
            List<String> list = new ArrayList<String>();
            list.add(id); // add rootid too. If we have joins in solr4.0
            int pos = 0, end;
            while ((end = connections.indexOf(',', pos)) >= 0) {
                list.add(connections.substring(pos, end));
                pos = end + 1;
            }

            // build an OR query over all the connection ids
            BooleanQuery bq = new BooleanQuery();
            Iterator<String> cIter = list.iterator();
            while (cIter.hasNext()) {
                String anExp = cIter.next();
                TermQuery tq = new TermQuery(new Term("id", anExp));
                bq.add(tq, BooleanClause.Occur.SHOULD);
            }

            SolrIndexSearcher searcher = rb.req.getSearcher();
            DocListAndSet results = new DocListAndSet();
            results.docList = searcher.getDocList(bq, null, null, 0, 100, rb.getFieldFlags());
            System.out.println("\n results.docList-->" + results.docList.size());
            rl.setNumFound(results.docList.size());

            // replace the original response with the expanded connection list
            rb.rsp.getValues().remove("response");
            rb.rsp.add("response", results.docList);
        }
    }

    @Override
    public String getDescription() {
        return "Information";
    }

    @Override
    public String getVersion() {
        return "Solr gur";
    }

    @Override
    public String getSourceId() {
        return "Satya Solr Example";
    }

    @Override
    public String getSource() {
        return "$URL: $";
    }

    @Override
    public URL[] getDocs() {
        return null;
    }
}

Wednesday, March 28, 2012

Date format conversion in Java

The basic need: we used to get the XML date data type in different formats from different data sources, but in our federated search engine we store dates in one neutral format (for simplicity's sake). After analyzing a couple of formats, I added this code back in 2005. Nowadays the team is using some Apache package utils, but I don't think they do anything different; I am not seeing many options.



public static final String[] date_format_list = {
"yyyy-MM-dd'T'HH:mm:ss'Z'",
"yyyy-MM-dd'T'HH:mm:ss",
"yyyy-MM-dd",
"yyyy-MM-dd hh:mm:ss",
"yyyy-MM-dd HH:mm:ss",
"EEE MMM d hh:mm:ss z yyyy"
/// add your own format here
};

public static Date parseDate(String d) throws ParseException {
return parseInputWithFormats(d, date_format_list);
}

public static Date parseInputWithFormats(
String dateValue,
String[] formatList
) throws ParseException {
if (dateValue == null || formatList == null || formatList.length == 0) {
throw new IllegalArgumentException("dateValue is null");
}

if (dateValue.length() > 1
&& dateValue.startsWith("'")
&& dateValue.endsWith("'")
) {
dateValue = dateValue.substring(1, dateValue.length() - 1);
}

SimpleDateFormat dateParser = null;
for(int i=0;i < formatList.length;i++){
String format = (String) formatList[i];
if (dateParser == null) {
dateParser = new SimpleDateFormat(format, Locale.US);
} else {
dateParser.applyPattern(format);
}
try {
return dateParser.parse(dateValue);
} catch (ParseException pe) {
//pe.printStackTrace();
}
}
throw new ParseException("Unable to parse the input date " + dateValue, 0);
}

public static void main(String[] args) {
String fromDt="";
String nPattern = "yyyy-MM-dd'T'HH:mm:ss'Z'";
SimpleDateFormat sdf = new SimpleDateFormat(nPattern);

String currentValue="Fri Jul 22 04:22:14 CEST 2011";

try{
fromDt = sdf.format(parseDate(currentValue.toString() ) );
} catch (Exception e) {
System.out.print("\n Case1: date format exception"+e.getMessage()+ " SOLR currentValue:"+currentValue);
fromDt="";
}
System.out.println("Case1. date as str---"+fromDt);

currentValue="2011-07-21 21:22:14";
try{
fromDt = sdf.format(parseDate(currentValue.toString() ) );
} catch (Exception e) {
System.out.print("\n Case2: date format exception"+e.getMessage()+ " SOLR currentValue:"+currentValue);
fromDt="";
}
System.out.println("\n Cse2. date as str---"+fromDt);
}

Monday, March 26, 2012

Latest family picture from Disney World.


This is from our latest Disney World trip. (The photo is from the Disney park photo service. We took a good number of photos with Disney theme characters. The photos are not brilliant by professional-photographer standards, but we must not blame the photographers; they are shooting nonstop. The photos are not bad either.)

Friday, March 23, 2012

Parsing a complex XSD file with Java code.

This is nearly 4+ year old code. The team now uses Apache Xerces XSModel.
However, this code is still there as a sanity check, and it still works fine.
(You need xsom.jar + relaxngDatatype.jar; Google them and you will find the jars.)


public class XsdReader {
public static void main (String args[])
{
XsdReader rr = new XsdReader();
rr.parseSchema();
}

public void parseSchema()
{
File file = new File("D:\\tmp\\books.xsd");
try {
XSOMParser parser = new XSOMParser();
parser.parse(file);
XSSchemaSet sset = parser.getResult();
XSSchema mys = sset.getSchema(1);
Iterator itr = sset.iterateSchema();
while( itr.hasNext() ) {
XSSchema s = (XSSchema)itr.next();
System.out.println("Target namespace: "+s.getTargetNamespace());
XSComplexType ct = mys.getComplexType("books");
int ctr=0;
if ( ct != null){
Collection<? extends XSAttributeUse> c = ct.getAttributeUses();
Iterator<? extends XSAttributeUse> i = c.iterator();
while (i.hasNext()) {
XSAttributeDecl attributeDecl = i.next().getDecl();

System.out.print("ctr="+ctr++ +"name:"+ attributeDecl.getName());
System.out.print(" type: "+attributeDecl.getType());
System.out.println("");
}
}
Iterator jtr = s.iterateElementDecls();
while( jtr.hasNext() ) {
XSElementDecl e = (XSElementDecl)jtr.next();

System.out.print( e.getName() );
if( e.isAbstract() )
System.out.print(" (abstract)");
System.out.println();
}
}
}
catch (Exception exp) {
exp.printStackTrace(System.out);
}
}
}

Monday, February 06, 2012

Lucene Standard Analyzer vs. Lingpipe EnglishStop Tokenizer Analyzer

This is old code from my prototypes directory.
For some odd reason, I ended up prototyping different analyzers for PLM-space content vs. third-party analyzers. (The basic need is to see which gives better control over STOP words. At least based on my quick prototype, SOLR has the easier constructs.)

Small sample code comparing both analyzers is included.
I did not see much difference for small input text.



public class AnalyzerTest {

private static Analyzer analyzer;
private static long perfTime = 0;

public static void main(String[] args) {
try {

analyzer = new StandardAnalyzer(org.apache.lucene.util.Version.LUCENE_34);

String str = "PLM technology refers to the group of software applications that create and manage the data that define a product and the process for building the product. Beyond just technology, PLM is a discipline that defines best practices for product definition, configuration management, change control, design release, and many other product and process-related procedures.";

perfTime -= System.currentTimeMillis();
displayTokensWithLuceneAnalyzer(analyzer, str);
perfTime += System.currentTimeMillis();

System.out.println("Lucene Analyzer: " + perfTime + " msecs.");

perfTime -= System.currentTimeMillis();
displayTokensWithLingpipeAnalyzer(str);
perfTime += System.currentTimeMillis();

System.out.println("Lingpipe Analyzer: " + perfTime + " msecs.");

} catch (IOException ie) {
System.out.println("IO Error " + ie.getMessage());
}
System.out.println("Time: " + perfTime + " msecs.");
System.out.println("Ended");
}

private static void displayTokensWithLingpipeAnalyzer(String text)
throws IOException {

System.out.println("Inside LingpipeAnalyzer ");

TokenizerFactory ieFactory
= IndoEuropeanTokenizerFactory.INSTANCE;

TokenizerFactory factory
= new EnglishStopTokenizerFactory(ieFactory);
// = new IndoEuropeanTokenizerFactory();

char[] cs =text.toCharArray();
Tokenizer tokenizer = factory.tokenizer(cs, 0, cs.length);
String[] tokens = tokenizer.tokenize();
for (int i = 0; i < tokens.length; i++)
System.out.println(tokens[i]);

System.out.println("Total no. of Tokens: " +tokens.length );

}
private static void displayTokensWithLuceneAnalyzer(Analyzer analyzer, String text)
throws IOException {
System.out.println("Inside LuceneAnalyzer ");
TokenStream tokenStream = analyzer.tokenStream("contents",new StringReader(text) );
OffsetAttribute offsetAttribute = tokenStream.getAttribute(OffsetAttribute.class);
CharTermAttribute charTermAttribute = tokenStream.getAttribute(CharTermAttribute.class);
int length=0;

while (tokenStream.incrementToken()) {
int startOffset = offsetAttribute.startOffset();
int endOffset = offsetAttribute.endOffset();
String term = charTermAttribute.toString();
System.out.println("term->"+term+ " start:"+startOffset+" end:"+endOffset);
length++;
}
System.out.println("Total no. of Tokens: " + length);
}

}

Tuesday, January 31, 2012

Java Set Operations

As you know, for a while I've been playing with Solr, basically implementing a federated search framework to aggregate different ERP, CRM, and PLM-space data sources.
This post is not about federated search, but I keep using a bunch of set operations to compare search results, compare unique doc ids, etc.

public class SetOperations {
public static <T> Set<T> union(Set<T> setA, Set<T> setB) {
Set<T> tmp = new TreeSet<T>(setA);
tmp.addAll(setB);
return tmp;
}

public static <T> Set<T> intersection(Set<T> setA, Set<T> setB) {
Set<T> tmp = new TreeSet<T>();
for (T x : setA)
if (setB.contains(x))
tmp.add(x);
return tmp;
}

public static <T> Set<T> difference(Set<T> setA, Set<T> setB) {
Set<T> tmp = new TreeSet<T>(setA);
tmp.removeAll(setB);
return tmp;
}

public static void main(String[] args) {

SortedSet<String> s1 = new TreeSet<String>();
s1.add("one");
s1.add("two");

SortedSet<String> s2 = new TreeSet<String>();
s2.add("two");
s2.add("three");
s2.add("four");

SortedSet<String> result = (SortedSet<String>) union(s1, s2);

Iterator<String> it = result.iterator();
System.out.print("union result -->");
while (it.hasNext()) {
String value = it.next();
System.out.print(value+", ");
}
System.out.println("\n");

result = (SortedSet) intersection(s1,s2);

it = result.iterator();
System.out.print("intersection result-->");
while (it.hasNext()) {
String value = it.next();
System.out.print(value+ ", ");
}
System.out.println("\n");

result = (SortedSet) difference(s1,s2);

it = result.iterator();
System.out.print("difference result-->");
while (it.hasNext()) {
String value = it.next();
System.out.print(value+", ");
}
System.out.println("\n");

/*
SortedSet i1= new TreeSet();
i1.add(new Integer("1"));

SortedSet i2= new TreeSet();
i2.add(new Integer("2"));

SortedSet iresult = (SortedSet) union(i1,i2);

Iterator iit = iresult.iterator();
System.out.println("Integer union result");
while (iit.hasNext()) {
Integer value = iit.next();
System.out.println(value+",");
}
*/
}

}

Monday, October 24, 2011

XSL recursion sample

In my use case, I am receiving part numbers as one big comma-separated string, and I need to split them into an array of elements.
Here is the sample code.





<xsl:template name="printChildObjects">
<xsl:param name="inputString"/>
<xsl:param name="delimiter"/>

<xsl:choose>
<xsl:when test="contains($inputString, $delimiter)">
<xsl:variable name="aChild">

<xsl:value-of select="substring-before($inputString,$delimiter)"/>
</xsl:variable>
<xsl:element name="field">
<xsl:attribute name="name">
<xsl:text> childObject</xsl:text>
</xsl:attribute>
<xsl:value-of select="$aChild" />
</xsl:element>
<xsl:call-template name="printChildObjects">
<xsl:with-param name="inputString" select="substring-after($inputString,$delimiter)"/>
<xsl:with-param name="delimiter"

select="$delimiter"/>
</xsl:call-template>
</xsl:when>
<xsl:otherwise>
<xsl:choose>
<xsl:when test="$inputString != ''">
<xsl:element name="field">
<xsl:attribute name="name">
<xsl:text> childObject</xsl:text>
</xsl:attribute>
<xsl:value-of select="$inputString" /> </xsl:element>
</xsl:when>
<xsl:otherwise>
</xsl:otherwise>
</xsl:choose>
</xsl:otherwise>
</xsl:choose>
</xsl:template>


Thursday, July 28, 2011

xsl:text must not contain child elements

Some info from an old XSL cheat sheet.

Often there is a requirement to print a variable value in the output.
The following XSL code will produce compilation errors.
A simple workaround is to pass the value to a template.


<!-- This will NOT compile: xsl:text must not contain child elements -->
<xsl:text>
  <xsl:value-of select="$objType"/>
</xsl:text>

<!-- Workaround: call a named template instead -->
<xsl:call-template name="varValue">
  <xsl:with-param name="value" select="$objType"/>
</xsl:call-template>

<xsl:template name="varValue">
  <xsl:param name="value"/>
  <xsl:choose>
    <xsl:when test="$value = ''">
      <xsl:text>n/a</xsl:text>
    </xsl:when>
    <xsl:otherwise>
      <xsl:value-of select="$value"/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>


Tuesday, May 24, 2011

Typical industry search input requirements/ patterns

An object string property can contain all kinds of values.
Now I want to filter the list with different kinds of input patterns (aka single-field search), in particular patterns containing wildcards (escaping *, ?, + etc. is fun in Java).
Sample code:

String str="*+PATAC+*";
Pattern pat=Pattern.compile(".*\\+*\\+.*");

Matcher matcher=pat.matcher(str);
boolean flag=matcher.find(); // true;

Logger.println("1) matcher result->"+flag);
if ( flag == true)
Logger.println("pattern found"+str);

str = "adjkfh+PATAC+ajdskfhhk";

matcher=pat.matcher(str);
flag=matcher.find(); // true;

Logger.println("2) matcher result->"+flag);
if ( flag == true)
Logger.println("pattern found"+str);


str = "PATAC";

matcher=pat.matcher(str);
flag=matcher.find(); // true;

Logger.println("3) matcher result->"+flag);
if ( flag == true)
Logger.println("pattern found"+str);

str = "adjkfh+PATAC+";

matcher=pat.matcher(str);
flag=matcher.find(); // true;

Logger.println("4) matcher result->"+flag);
if ( flag == true)
Logger.println("pattern found"+str);

str = "+PATAC+testingsuffixchars";

matcher=pat.matcher(str);
flag=matcher.find(); // true;

Logger.println("5) matcher result->"+flag);
if ( flag == true)
Logger.println("pattern found"+str);

Sample code to create SOLR documents from a CSV file

A colleague stopped by the office & asked me to index some legacy CSV data. So the quick & dirty solution is the following. (Maybe Perl etc. could do it, but this one gives me more flexibility.)



public class CsvToSolrDoc
{
    public String columnName(int i)
    {
        // map CSV column positions to SOLR field names; extend for additional columns
        if (i == 0) return "id";
        if (i == 1) return "what ever you want as field name";
        return null;
    }

    public void csvToXML(String inputFile, String outputFile) throws java.io.FileNotFoundException, java.io.IOException
    {
        BufferedReader br = new BufferedReader(new FileReader(inputFile));
        String line = null;
        FileWriter fw = new FileWriter(outputFile);

        // Write the XML declaration and the root element
        fw.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        fw.write("<add>\n");

        while ((line = br.readLine()) != null)
        {
            String[] values = line.split(",");
            fw.write("<doc>\n");
            for (int j = 0; j < values.length; j++)
            {
                fw.write("<field name=\"" + columnName(j) + "\">");
                fw.write(values[j].trim());
                fw.write("</field>\n");
            }
            fw.write("</doc>\n");
        }

        // Now we're at the end of the file, so close the XML document,
        // flush the buffer to disk, and close the newly-created file.
        fw.write("</add>\n");
        fw.flush();
        fw.close();
        br.close();
    }

    public static void main(String argv[]) throws java.io.IOException
    {
        CsvToSolrDoc cp = new CsvToSolrDoc();
        cp.csvToXML("c:\\tmp\\m2.csv", "c:\\tmp\\m2.xml");
    }
}

SOLR project stories: lack of SOLR post-filter support

For the past few quarters, I have been working on a project to implement security on object documents. My goal is to decorate every SOLR document with an ACL field. This ACL field is used to determine which users have access to the document; the ACL syntax is something like +u(dave)-g(support) etc. My thought is to process these ACL fields after the search, i.e. I want to run the query component results through some kind of post filter. However, SOLR does not offer any direct mechanism to specify such a post filter along with the search request. At the Lucene level there is an option to specify a filter, but the current as-is implementation is poor.
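To make the ACL syntax concrete, here is a tiny, purely illustrative check in plain Java (the helper name, the exact grammar and the deny-wins rule are my assumptions, not the project code):

// Hypothetical helper: returns true if the user (or one of the groups) is
// granted by an ACL string such as "+u(dave)-g(support)".
// An explicit deny (-) wins over an allow (+); everything unmentioned is denied.
public static boolean isAllowed(String acl, String user, Set<String> groups) {
    boolean allowed = false;
    java.util.regex.Matcher m = java.util.regex.Pattern
            .compile("([+-])([ug])\\s*\\(([^)]+)\\)")
            .matcher(acl);
    while (m.find()) {
        boolean grant = m.group(1).equals("+");
        boolean matches = m.group(2).equals("u")
                ? m.group(3).trim().equals(user)
                : groups.contains(m.group(3).trim());
        if (matches) {
            if (!grant) return false;   // an explicit deny always wins
            allowed = true;
        }
    }
    return allowed;
}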
That filter iterates over the entire document set, which hurts for large result sets, and we also need SOLR's distributed capabilities. Also, while computing the ACL fields I tried to encode user names etc. with Base64, URLEncoder.encode and the like. For a small set of strings this works OK, but for large sets it is a pain and ultimately affects search performance.
Another blocker.
Encoder/decoder test code:
long startTime = System.currentTimeMillis();
String inputText = "Hello#$#%^#^&world";
String encodedText, decodedText;
for (int i = 0; i < 50000; i++)
{
    String baseString = i + " " + inputText;
    encodedText = URLEncoder.encode(baseString, "UTF-8");
    decodedText = URLDecoder.decode(encodedText, "UTF-8");
}
long endTime = System.currentTimeMillis();

long elapsedTime = endTime - startTime;

System.out.println("\n URLEncoder/decoder Elapsed Time = " + elapsedTime + "ms");

Output:
Elapsed Time = 2246ms

Monday, November 08, 2010

A Few Inspiring Quotes from The Great Ones: The Transformative Power of a Mentor

Adopt the pace of nature: her secret is patience. – Emerson

Watch your thoughts; they become words
Watch your words; they become actions
Watch your actions; they become habits
Watch your habits; they become character
Watch your character; it becomes your destiny. – Unknown

I hope I shall always possess firmness and virtue enough to maintain what I consider the most enviable of all titles, the character of an honest man. – George Washington

The person who makes a success of living is the one who sees his goal steadily and aims for it unswervingly. That is dedication. – DeMille

What we say and what we do
Ultimately comes back to us
So let us own our responsibility
Place it in our hands and carry it with dignity and strength. – Anzaldua

Friday, November 05, 2010

Simple SOLR example code to index & search metadata

Roughly three years back, I did a prototype to index/search unstructured content inside the enterprise. At that time we were using Autonomy, so nobody cared. Basically, all search engines do some kind of structured-content search, with the major focus on unstructured content, i.e. PDF files, MS Word documents, etc. Coming to the point: I was again asked to do some prototyping. This time I am looking at the Solr framework because my C++ business tier makes requests to index/search content (mostly unstructured content). After downloading Solr and looking for some good examples, I ended up writing the following. I will port this kind of example to use CURL.


protected URL solrUrl;
public CommonsHttpSolrServer solr = null;

public SolrServer startSolrServer() throws MalformedURLException,
SolrServerException
{
solr = new CommonsHttpSolrServer("http://localhost:8983/solr/");
solr.setParser(new XMLResponseParser());

return solr;
}

public TestQueryTool()
{
//will add more junk later
}


public static void main(String[] args)
{

TestQueryTool t = new TestQueryTool();
try
{
//1)start & work with existing SOLR instance
t.startSolrServer();

// 2)Now index content ( files later. metadata)
t.indexBOAsDocument("uid1","bolt", "0001");
t.indexBOAsDocument("uid2","nut", "0002");

//3)now perform search
t.performSearch("uid");

} catch (MalformedURLException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (SolrServerException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}


}


private void indexBOAsDocument(String uid, String name, String id) throws SolrServerException
{
SolrInputDocument doc = new SolrInputDocument();

doc.addField("id", uid); // unique key
doc.addField("part_name", name); // e.g. "bolt", "nut"
doc.addField("part_id", id); // e.g. "0001"

try {
solr.add(doc);
solr.commit();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}


public void performSearch(String qstring) throws SolrServerException
{
SolrQuery query = new SolrQuery();
//query.setQuery( "*:*" );

query.setQuery("id:uid* OR part_name:nut");

query.addSortField( "part_name", SolrQuery.ORDER.desc);
QueryResponse response = solr.query(query);
SolrDocumentList docs = response.getResults();
System.out.println("docs size is =" + docs.size());
Iterator<SolrDocument> iter = docs.iterator();

while (iter.hasNext()) {
SolrDocument resultDoc = iter.next();

String part_name = (String) resultDoc.getFieldValue("part_name");
String id = (String) resultDoc.getFieldValue("id"); //id is the uniqueKey field
// Now you got id, part_name. End of story
}

}

Monday, October 04, 2010

Nice little book "The Great Ones" by Ridgely Goldsborough

The book comes with two different threads. On one side, the author's own journey before and after a mentor, the changes in his life, and a simple set of guidelines; on the other side, a simple fictional story. Through the tales, the author shows the importance of mentoring and its benefits in virtually every aspect of one's life. Many inspiring quotes, a compact size & the fictional story is an effective device too.

Following is the simple code of conduct:

1.Make a decision & commitment
2.Conceive and execute a plan
3.Take full responsibility
4.Embrace patience and temperance
5.Act with courage
6.Cultivate passion
7.Exercise discipline
8.Remain single minded
9.Demand integrity
10.Let go of past failures (mostly, learn from them)
11.Pay the price