Tuesday, May 15, 2012

Lucene Revolution conference 2012

After a long time, I attended the Lucene Revolution conference. Overall it was a good experience. A few sessions were excellent, and a few missed the mark.

Personally, I liked the following sessions because of their content and the presenters' energy and passion for search technology:

Automata Invasion
Challenges in Maintaining a High Performance Search Engine Written in Java
Updateable Fields in Lucene and other Codec Applications
Solr 4: The SolrCloud Architecture

Also, the “Stump the Chump” questions were interesting, and I learned quite a bit.
I won small prizes too. In general, Lucid Imagination uploads the
conference videos at the following location. Keep watching.
I also missed a few good sessions in the Big Data area.

http://www.lucidimagination.com/devzone/videos-podcasts/conference-videos

Friday, May 04, 2012

Mapping hierarchical data into SOLR

Experiments with Solr 4.0 (beta)

While modeling hierarchical data in XML is easy (for example, org charts or bill-of-materials structures), mapping it to persistent storage is very challenging. Relational SQL can manage it, but fetching and updating the hierarchies is very difficult. Entire books have been written on mapping trees and graphs into the RDBMS world.

Consider a simple hierarchy that looks like this (indentation shows parent/child):
Satya
    Saketh
    Dhanvi
    Venkata
        Dhasa

The most common and familiar method is the adjacency model, and it works because every node knows its adjacent (parent) node. (In the SOLR world, the ID field holds the unique value and the parent field holds the parent's ID. Assume that for the root node it is null or the same as its own ID.)

In SOLR each row is a document:

SOLRID  Name     Parent
01      Satya    NULL
02      Saketh   01
03      Dhanvi   01
04      Venkata  01
05      Dhasa    04
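To load the same rows from code, here is a quick SolrJ sketch (the core URL and the class name are assumptions; field names mirror the table above):

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexHierarchy {
    public static void main(String[] args) throws Exception {
        // Assumed local Solr core; adjust the URL to your setup.
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
        String[][] rows = {
            {"01", "Satya",   null},
            {"02", "Saketh",  "01"},
            {"03", "Dhanvi",  "01"},
            {"04", "Venkata", "01"},
            {"05", "Dhasa",   "04"},
        };
        for (String[] row : rows) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", row[0]);
            doc.addField("name", row[1]);
            if (row[2] != null) doc.addField("parent", row[2]); // root has no parent
            server.add(doc);
        }
        server.commit();
    }
}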

The latest SOLR (4.0 beta) join functionality gives you the full hierarchy (or any piece of it) very quickly and easily.

Example queries:

1) Give me the complete hierarchical list (every node that has a parent): q={!join from=id to=parent}id:*
2) Give me the immediate children of Satya: q={!join from=id to=parent}name:Satya (joining on the name field, since the IDs in the table are numeric)
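And a minimal SolrJ sketch of issuing the child query from Java (again, the core URL and class name are assumptions):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class JoinQueryExample {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");

        // Immediate children of Satya: match Satya's doc, take its id,
        // and return every doc whose parent field holds that id.
        SolrQuery query = new SolrQuery("{!join from=id to=parent}name:Satya");
        QueryResponse rsp = server.query(query);

        for (SolrDocument doc : rsp.getResults()) {
            System.out.println(doc.getFieldValue("id") + " -> " + doc.getFieldValue("name"));
        }
    }
}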

An example SOLR query component to pull related objects

Consider the LinkedIn example: I have a set of connections, and with a single SOLR request the following SOLR component brings back all first-level connection information. In the RDBMS world the typical example is an employee table: bring me the first-level employees of a manager, assuming a single table contains both employee and manager information (a typical adjacency list).
In terms of the physical SOLR mapping, every document contains a connections field that holds the adjacency list of connections (root IDs).

From the configuration point of view, register the component in solrconfig.xml.
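A minimal sketch of that registration (the handler name is illustrative, and the class attribute should be the fully qualified name of the component below):

<searchComponent name="example" class="ExampleComponent"/>

<requestHandler name="/example" class="solr.SearchHandler">
  <arr name="last-components">
    <str>example</str>
  </arr>
</requestHandler>

With that in place, a request like /example?q=id:01 runs the normal query first and then lets the component rewrite the response.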



import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;
import org.apache.solr.search.DocIterator;
import org.apache.solr.search.DocListAndSet;
import org.apache.solr.search.DocSlice;
import org.apache.solr.search.SolrIndexReader;
import org.apache.solr.search.SolrIndexSearcher;

public class ExampleComponent extends SearchComponent
{
    public static final String COMPONENT_NAME = "example";

    @Override
    public void prepare(ResponseBuilder rb) throws IOException
    {
    }

    @SuppressWarnings("unchecked")
    @Override
    public void process(ResponseBuilder rb) throws IOException
    {
        // Take the documents matched by the main query.
        DocSlice slice = (DocSlice) rb.rsp.getValues().get("response");
        SolrIndexReader reader = rb.req.getSearcher().getReader();
        int docId = 0; // at this point consider only one root id
        for (DocIterator it = slice.iterator(); it.hasNext(); ) {
            docId = it.nextDoc();
            Document doc = reader.document(docId);
            String id = doc.get("id");
            String connections = doc.get("contains");
            System.out.println("\n id:" + id + " contains-->" + connections);
            if (connections == null) {
                continue; // no adjacency list on this doc
            }

            // Split the comma-separated adjacency list into individual ids.
            List<String> list = new ArrayList<String>();
            list.add(id); // add the root id too, if we have joins in solr4.0
            int pos = 0, end;
            while ((end = connections.indexOf(',', pos)) >= 0) {
                list.add(connections.substring(pos, end));
                pos = end + 1;
            }
            if (pos < connections.length()) {
                list.add(connections.substring(pos)); // don't drop the id after the final comma
            }

            // Build one boolean OR query over all connection ids and
            // replace the original response with the expanded doc list.
            BooleanQuery bq = new BooleanQuery();
            Iterator<String> cIter = list.iterator();
            while (cIter.hasNext()) {
                bq.add(new TermQuery(new Term("id", cIter.next())), BooleanClause.Occur.SHOULD);
            }
            SolrIndexSearcher searcher = rb.req.getSearcher();
            DocListAndSet results = new DocListAndSet();
            results.docList = searcher.getDocList(bq, null, null, 0, 100, rb.getFieldFlags());
            System.out.println("\n results.docList-->" + results.docList.size());
            rb.rsp.getValues().remove("response");
            rb.rsp.add("response", results.docList);
        }
    }

    @Override
    public String getDescription() {
        return "Information";
    }

    @Override
    public String getVersion() {
        return "Solr gur";
    }

    @Override
    public String getSourceId() {
        return "Satya Solr Example";
    }

    @Override
    public String getSource() {
        return "$URL: $";
    }

    @Override
    public URL[] getDocs() {
        return null;
    }
}

Wednesday, March 28, 2012

Date format conversion in Java

The basic need: we used to get XML date data types in different formats from different data sources, but our federated search engine stores them in a single neutral format (for simplicity's sake). After analyzing a couple of formats, I added this code back in 2005. Nowadays the team is using some Apache package utils, but I don't think they do anything different; I am not seeing many other options. (See the Commons Lang sketch after the listing.)



import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

// Wrapper class name is illustrative; the original post listed bare methods.
public class DateFormatUtil {

    public static final String[] date_format_list = {
        // Order matters: SimpleDateFormat matches prefixes, so the more
        // specific date+time patterns must come before the bare "yyyy-MM-dd".
        "yyyy-MM-dd'T'HH:mm:ss'Z'",
        "yyyy-MM-dd'T'HH:mm:ss",
        "yyyy-MM-dd HH:mm:ss",
        "yyyy-MM-dd hh:mm:ss",
        "yyyy-MM-dd",
        "EEE MMM d hh:mm:ss z yyyy"
        // add your own format here
    };

    public static Date parseDate(String d) throws ParseException {
        return parseInputWithFormats(d, date_format_list);
    }

    public static Date parseInputWithFormats(String dateValue, String[] formatList)
            throws ParseException {
        if (dateValue == null || formatList == null || formatList.length == 0) {
            throw new IllegalArgumentException("dateValue or formatList is null/empty");
        }

        // Strip surrounding single quotes, if any.
        if (dateValue.length() > 1
                && dateValue.startsWith("'")
                && dateValue.endsWith("'")) {
            dateValue = dateValue.substring(1, dateValue.length() - 1);
        }

        // Try each pattern in turn; the first one that parses wins.
        SimpleDateFormat dateParser = null;
        for (int i = 0; i < formatList.length; i++) {
            String format = formatList[i];
            if (dateParser == null) {
                dateParser = new SimpleDateFormat(format, Locale.US);
            } else {
                dateParser.applyPattern(format);
            }
            try {
                return dateParser.parse(dateValue);
            } catch (ParseException pe) {
                // fall through to the next pattern
            }
        }
        throw new ParseException("Unable to parse the input date " + dateValue, 0);
    }

    public static void main(String[] args) {
        String fromDt = "";
        String nPattern = "yyyy-MM-dd'T'HH:mm:ss'Z'";
        SimpleDateFormat sdf = new SimpleDateFormat(nPattern);

        String currentValue = "Fri Jul 22 04:22:14 CEST 2011";
        try {
            fromDt = sdf.format(parseDate(currentValue));
        } catch (Exception e) {
            System.out.print("\n Case1: date format exception " + e.getMessage()
                    + " SOLR currentValue:" + currentValue);
            fromDt = "";
        }
        System.out.println("Case1. date as str---" + fromDt);

        currentValue = "2011-07-21 21:22:14";
        try {
            fromDt = sdf.format(parseDate(currentValue));
        } catch (Exception e) {
            System.out.print("\n Case2: date format exception " + e.getMessage()
                    + " SOLR currentValue:" + currentValue);
            fromDt = "";
        }
        System.out.println("\n Case2. date as str---" + fromDt);
    }
}
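As mentioned above, the team now uses Apache utilities for this. With Commons Lang 2.x on the classpath the equivalent is essentially a one-liner (a sketch, not our production code):

import java.util.Date;
import org.apache.commons.lang.time.DateUtils;

public class CommonsDateExample {
    public static void main(String[] args) throws Exception {
        // DateUtils.parseDate tries each pattern in turn, same as the code above.
        String[] patterns = {
            "yyyy-MM-dd'T'HH:mm:ss'Z'",
            "yyyy-MM-dd HH:mm:ss",
            "EEE MMM d hh:mm:ss z yyyy"
        };
        Date d = DateUtils.parseDate("2011-07-21 21:22:14", patterns);
        System.out.println(d);
    }
}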

Monday, March 26, 2012

Latest family picture from Disney World


This is from our latest Disney World trip. (Photo from the Disney PhotoPass service. We took a good number of photos with Disney theme characters. The photos are not brilliant by professional-photographer standards, but we must not blame them; they are shooting nonstop. They are not bad either.)

Friday, March 23, 2012

Parsing a complex XSD file with Java code

This is nearly 4-year-old code; the team is now using Apache Xerces' XSModel.
However, this code is still around as a sanity check, and it still works fine.
(You need xsom.jar plus relaxngDatatype.jar. Google them; you will find the jars.)
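For a sanity run, a minimal books.xsd along these lines works (illustrative; the original schema is not shown):

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://example.com/books"
           xmlns:b="http://example.com/books">
  <xs:complexType name="books">
    <xs:attribute name="isbn" type="xs:string"/>
    <xs:attribute name="copies" type="xs:int"/>
  </xs:complexType>
  <xs:element name="library" type="b:books"/>
</xs:schema>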


import java.io.File;
import java.util.Collection;
import java.util.Iterator;

import com.sun.xml.xsom.XSAttributeDecl;
import com.sun.xml.xsom.XSAttributeUse;
import com.sun.xml.xsom.XSComplexType;
import com.sun.xml.xsom.XSElementDecl;
import com.sun.xml.xsom.XSSchema;
import com.sun.xml.xsom.XSSchemaSet;
import com.sun.xml.xsom.parser.XSOMParser;

public class XsdReader {

    public static void main(String args[]) {
        XsdReader rr = new XsdReader();
        rr.parseSchema();
    }

    public void parseSchema() {
        File file = new File("D:\\tmp\\books.xsd");
        try {
            XSOMParser parser = new XSOMParser();
            parser.parse(file);
            XSSchemaSet sset = parser.getResult();
            // Schema 0 is the built-in XML Schema namespace; schema 1 is ours.
            XSSchema mys = sset.getSchema(1);
            Iterator itr = sset.iterateSchema();
            while (itr.hasNext()) {
                XSSchema s = (XSSchema) itr.next();
                System.out.println("Target namespace: " + s.getTargetNamespace());

                // Dump the attributes of the "books" complex type.
                XSComplexType ct = mys.getComplexType("books");
                int ctr = 0;
                if (ct != null) {
                    Collection c = ct.getAttributeUses();
                    Iterator i = c.iterator();
                    while (i.hasNext()) {
                        XSAttributeDecl attributeDecl = ((XSAttributeUse) i.next()).getDecl();
                        System.out.print("ctr=" + ctr++ + " name:" + attributeDecl.getName());
                        System.out.print(" type: " + attributeDecl.getType());
                        System.out.println("");
                    }
                }

                // Dump the top-level element declarations.
                Iterator jtr = s.iterateElementDecls();
                while (jtr.hasNext()) {
                    XSElementDecl e = (XSElementDecl) jtr.next();
                    System.out.print(e.getName());
                    if (e.isAbstract())
                        System.out.print(" (abstract)");
                    System.out.println();
                }
            }
        } catch (Exception exp) {
            exp.printStackTrace(System.out);
        }
    }
}

Monday, February 06, 2012

Lucene Standard Analyzer vs. Lingpipe EnglishStop Tokenizer Analyzer

This is old code from my prototypes directory.
For some odd reason, I ended up prototyping different analyzers for PLM-space content against third-party analyzers. (The basic need was to see which gives better control over STOP words; at least based on my quick prototype, SOLR has the easier constructs.)

A small sample comparing both analyzers is included below.
I did not see much difference for small input texts.
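Before the comparison code, here is how little it takes on the Lucene side to swap in a custom stop list with the 3.4 API (a sketch; the stop words are just examples):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

public class CustomStopWords {
    public static void main(String[] args) {
        // The (Version, Set) constructor replaces the default English stop list.
        Set<String> stops = new HashSet<String>(Arrays.asList("plm", "product", "the"));
        Analyzer custom = new StandardAnalyzer(Version.LUCENE_34, stops);
        System.out.println("Analyzer ready: " + custom);
    }
}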



import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

import com.aliasi.tokenizer.EnglishStopTokenizerFactory;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.Tokenizer;
import com.aliasi.tokenizer.TokenizerFactory;

public class AnalyzerTest {

    private static Analyzer analyzer;

    public static void main(String[] args) {
        try {
            analyzer = new StandardAnalyzer(org.apache.lucene.util.Version.LUCENE_34);

            String str = "PLM technology refers to the group of software applications that create and manage the data that define a product and the process for building the product. Beyond just technology, PLM is a discipline that defines best practices for product definition, configuration management, change control, design release, and many other product and process-related procedures.";

            // Time each analyzer separately so each label reports its own cost.
            long luceneTime = -System.currentTimeMillis();
            displayTokensWithLuceneAnalyzer(analyzer, str);
            luceneTime += System.currentTimeMillis();
            System.out.println("Lucene Analyzer: " + luceneTime + " msecs.");

            long lingpipeTime = -System.currentTimeMillis();
            displayTokensWithLingpipeAnalyzer(str);
            lingpipeTime += System.currentTimeMillis();
            System.out.println("Lingpipe Analyzer: " + lingpipeTime + " msecs.");

            System.out.println("Total time: " + (luceneTime + lingpipeTime) + " msecs.");
        } catch (IOException ie) {
            System.out.println("IO Error " + ie.getMessage());
        }
        System.out.println("Ended");
    }

    private static void displayTokensWithLingpipeAnalyzer(String text)
            throws IOException {
        System.out.println("Inside LingpipeAnalyzer ");

        // Wrap the Indo-European tokenizer with an English stop-word filter.
        TokenizerFactory ieFactory = IndoEuropeanTokenizerFactory.INSTANCE;
        TokenizerFactory factory = new EnglishStopTokenizerFactory(ieFactory);

        char[] cs = text.toCharArray();
        Tokenizer tokenizer = factory.tokenizer(cs, 0, cs.length);
        String[] tokens = tokenizer.tokenize();
        for (int i = 0; i < tokens.length; i++)
            System.out.println(tokens[i]);

        System.out.println("Total no. of Tokens: " + tokens.length);
    }

    private static void displayTokensWithLuceneAnalyzer(Analyzer analyzer, String text)
            throws IOException {
        System.out.println("Inside LuceneAnalyzer ");
        TokenStream tokenStream = analyzer.tokenStream("contents", new StringReader(text));
        OffsetAttribute offsetAttribute = tokenStream.getAttribute(OffsetAttribute.class);
        CharTermAttribute charTermAttribute = tokenStream.getAttribute(CharTermAttribute.class);
        int length = 0;

        tokenStream.reset();
        while (tokenStream.incrementToken()) {
            int startOffset = offsetAttribute.startOffset();
            int endOffset = offsetAttribute.endOffset();
            String term = charTermAttribute.toString();
            System.out.println("term->" + term + " start:" + startOffset + " end:" + endOffset);
            length++;
        }
        tokenStream.end();
        tokenStream.close();
        System.out.println("Total no. of Tokens: " + length);
    }
}

Tuesday, January 31, 2012

Java Set Operations

As you know, I have been playing with Solr for a while, basically implementing a federated search framework to aggregate different ERP, CRM, and PLM data sources.
This post is not about federated search, but I keep using a bunch of set operations to compare search results, compare the unique doc IDs, and so on.

import java.util.Iterator;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

public class SetOperations {

    public static <T> Set<T> union(Set<T> setA, Set<T> setB) {
        Set<T> tmp = new TreeSet<T>(setA);
        tmp.addAll(setB);
        return tmp;
    }

    public static <T> Set<T> intersection(Set<T> setA, Set<T> setB) {
        Set<T> tmp = new TreeSet<T>();
        for (T x : setA)
            if (setB.contains(x))
                tmp.add(x);
        return tmp;
    }

    public static <T> Set<T> difference(Set<T> setA, Set<T> setB) {
        Set<T> tmp = new TreeSet<T>(setA);
        tmp.removeAll(setB);
        return tmp;
    }

    public static void main(String[] args) {

        SortedSet<String> s1 = new TreeSet<String>();
        s1.add("one");
        s1.add("two");

        SortedSet<String> s2 = new TreeSet<String>();
        s2.add("two");
        s2.add("three");
        s2.add("four");

        Set<String> result = union(s1, s2);

        Iterator<String> it = result.iterator();
        System.out.print("union result -->");
        while (it.hasNext()) {
            System.out.print(it.next() + ", ");
        }
        System.out.println("\n");

        result = intersection(s1, s2);

        it = result.iterator();
        System.out.print("intersection result-->");
        while (it.hasNext()) {
            System.out.print(it.next() + ", ");
        }
        System.out.println("\n");

        result = difference(s1, s2);

        it = result.iterator();
        System.out.print("difference result-->");
        while (it.hasNext()) {
            System.out.print(it.next() + ", ");
        }
        System.out.println("\n");

        /*
        SortedSet<Integer> i1 = new TreeSet<Integer>();
        i1.add(Integer.valueOf(1));

        SortedSet<Integer> i2 = new TreeSet<Integer>();
        i2.add(Integer.valueOf(2));

        Set<Integer> iresult = union(i1, i2);

        Iterator<Integer> iit = iresult.iterator();
        System.out.println("Integer union result");
        while (iit.hasNext()) {
            System.out.println(iit.next() + ",");
        }
        */
    }

}