After long time, I attended Lucene revolution conference. Overall this is good experience. Few sessions are too good & few missed the mark.
Personally I like the following sessions because of the contents/presenters energy/passion behind the search technology.
Automata Invasion
Challenges in Maintaining a High Performance Search Engine Written in Java
Updateable Fields in Lucene and other Codec Applications
Solr 4: The SolrCloud Architecture
Also “Stump the Chump” questions are interesting & learned quite a bit.
I won small prizes too. In general, Lucid imagination uploads the
conference vedios at the following location. Keep watching.
I also missed few good sessions in Big data area.
http://www.lucidimagination.com/devzone/videos-podcasts/conference-videos
Daily I help teams with solution engineering aspect of connected vehicle data projects. (massive datasets & always some new datasets with new car models aka new technologies.) Lately in the spare time, applying some of the ML/Deep learning techniques on datasets (many are create based on observations of real datasets)To Share some thoughts on my work (main half of this blog) and the other half will be about my family and friends.
Tuesday, May 15, 2012
Friday, May 04, 2012
Mapping hierarchical data in to SOLR
Experiments with Solr 4.0 (beta)
While modeling hierarchical data in XML easy, (for example org charts, Bill of materials structures) mapping to a persistent storage is very challenging. Relational SQL manages it however fetching/ updating the hierarchies is very difficult. Even books are written on mapping Tree/Graphs in to RDBMS world.
Consider simple hierarchy list looks like this:
Satya
Saketh
Dhanvi
Venkata
Dhasa
The most common and familiar known method is adjacency model, and it usually works as every node knows the adjacency node. (In SOLR world, ID contains in the unique value & parent field contains parent. Assume for the root node it is null or same.)
In SOLR each row is a document:
SOLRID Name Parent
01 Satya NULL
02 Saketh 01
03 Dhanvi 01
04 Venkata 01
05 Dhasa 04
Latest SOLR(4.0 beta?) Join functionality gives you the full hierarchy (or any piece of it) very quickly and easily.
Example queries:
1) Give me complete hierarchical list: q= {!join from=id to=parent}id:*
2) Give me immediate childs of Satya: q={!join from=id to=parent}id:Satya
While modeling hierarchical data in XML easy, (for example org charts, Bill of materials structures) mapping to a persistent storage is very challenging. Relational SQL manages it however fetching/ updating the hierarchies is very difficult. Even books are written on mapping Tree/Graphs in to RDBMS world.
Consider simple hierarchy list looks like this:
Satya
Saketh
Dhanvi
Venkata
Dhasa
The most common and familiar known method is adjacency model, and it usually works as every node knows the adjacency node. (In SOLR world, ID contains in the unique value & parent field contains parent. Assume for the root node it is null or same.)
In SOLR each row is a document:
SOLRID Name Parent
01 Satya NULL
02 Saketh 01
03 Dhanvi 01
04 Venkata 01
05 Dhasa 04
Latest SOLR(4.0 beta?) Join functionality gives you the full hierarchy (or any piece of it) very quickly and easily.
Example queries:
1) Give me complete hierarchical list: q= {!join from=id to=parent}id:*
2) Give me immediate childs of Satya: q={!join from=id to=parent}id:Satya
An example SOLR query component to pull other objects
Consider Linked-in example. i.e. I have a set of connections. With single SOLRrequest, following SOLR component brings all first level connections information. In RDBMS world, typical example is employee table. i.e. bring me first level employees of manager. Assume single table contains both employee and manager information. (typical Adjacency list)
In terms of physical SOLR mapping, every document contains a connection field which contains the list(Adjacency list) of connections (root ids)
Configuration point of view, add the following to solrconfig.xml
public class ExampleComponent extends SearchComponent
{
public static final String COMPONENT_NAME = "example";
@Override
public void prepare(ResponseBuilder rb) throws IOException
{
}
@SuppressWarnings("unchecked")
@Override
public void process(ResponseBuilder rb) throws IOException
{
DocSlice slice = (DocSlice) rb.rsp.getValues().get("response");
SolrIndexReader reader = rb.req.getSearcher().getReader();
SolrDocumentList rl = new SolrDocumentList();
int docId=0;//// at this point consider only one rootid.
for (DocIterator it = slice.iterator(); it.hasNext(); ) {
docId = it.nextDoc();
Document doc = reader.document(docId);
String id = (String)doc.get("id");
String connections = (String)doc.get("contains");
System.out.println("\n id:"+id+" contains-->"+connections);
List<String> list = new ArrayList<String>();
list.add(id);//add rootid too. If we have joins in solr4.0
int pos = 0, end;
while ((end = connections.indexOf(',', pos)) >= 0) {
list.add(connections.substring(pos, end));
pos = end + 1;
}
BooleanQuery bq = new BooleanQuery();
Iterator<String> cIter = list.iterator();
while (cIter.hasNext()) {
String anExp = cIter.next();
TermQuery tq = new TermQuery(new Term("id",anExp));
bq.add(tq, BooleanClause.Occur.SHOULD);
}
SolrIndexSearcher searcher = rb.req.getSearcher();
DocListAndSet results = new DocListAndSet();
results.docList = searcher.getDocList(bq, null, null,0, 100,rb.getFieldFlags());
System.out.println("\n results.docList-->"+results.docList.size() );
rl.setNumFound(results.docList.size());
rb.rsp.getValues().remove("response");
rb.rsp.add("response", results.docList);
}
}
@Override
public String getDescription() {
return "Information";
}
@Override
public String getVersion() {
return "Solr gur";
}
@Override
public String getSourceId() {
return "Satya Solr Example";
}
@Override
public String getSource() {
return "$URL: }
@Override
public URL[] getDocs() {
return null;
}
}
Wednesday, March 28, 2012
Data format conversion in java
Basic need is we used to get xml date data type in different formats for data sources. However our federated search engine we are storing in a neutral format. (For simplicity sake.) After analyzing couple of formats, added this code in year 2005. Nowadays Team is using some apache package utils. But I don’t think they will something different. I am not seeing many options.
public static final String[] date_format_list = {
"yyyy-MM-dd'T'HH:mm:ss'Z'",
"yyyy-MM-dd'T'HH:mm:ss",
"yyyy-MM-dd",
"yyyy-MM-dd hh:mm:ss",
"yyyy-MM-dd HH:mm:ss",
"EEE MMM d hh:mm:ss z yyyy"
/// add your own format here
};
public static Date parseDate(String d) throws ParseException {
return parseInputWithFormats(d, date_format_list);
}
public static Date parseInputWithFormats(
String dateValue,
String[] formatList
) throws ParseException {
if (dateValue == null || formatList == null || formatList.length == 0) {
throw new IllegalArgumentException("dateValue is null");
}
if (dateValue.length() > 1
&& dateValue.startsWith("'")
&& dateValue.endsWith("'")
) {
dateValue = dateValue.substring(1, dateValue.length() - 1);
}
SimpleDateFormat dateParser = null;
for(int i=0;i < formatList.length;i++){
String format = (String) formatList[i];
if (dateParser == null) {
dateParser = new SimpleDateFormat(format, Locale.US);
} else {
dateParser.applyPattern(format);
}
try {
return dateParser.parse(dateValue);
} catch (ParseException pe) {
//pe.printStackTrace();
}
}
throw new ParseException("Unable to parse the input date " + dateValue, 0);
}
public static void main(String[] args) {
String fromDt="";
String nPattern = "yyyy-MM-dd'T'HH:mm:ss'Z'";
SimpleDateFormat sdf = new SimpleDateFormat(nPattern);
String currentValue="Fri Jul 22 04:22:14 CEST 2011";
try{
fromDt = sdf.format(parseDate(currentValue.toString() ) );
} catch (Exception e) {
System.out.print("\n Case1: date format exception"+e.getMessage()+ " SOLR currentValue:"+currentValue);
fromDt="";
}
System.out.println("Case1. date as str---"+fromDt);
currentValue="2011-07-21 21:22:14";
try{
fromDt = sdf.format(parseDate(currentValue.toString() ) );
} catch (Exception e) {
System.out.print("\n Case2: date format exception"+e.getMessage()+ " SOLR currentValue:"+currentValue);
fromDt="";
}
System.out.println("\n Cse2. date as str---"+fromDt);
}
Monday, March 26, 2012
Latest family picicture from Disney world.
Friday, March 23, 2012
Parsing complex Xsd file with Java code.
This one nearly 4+ year old code. Now Team is using apache xerces XSModel.
However this code is there for sanity check. Still this works fine.
(You need xsom.jar + relaxngDatatype.jar file. Google it. You will find the jars.)
However this code is there for sanity check. Still this works fine.
(You need xsom.jar + relaxngDatatype.jar file. Google it. You will find the jars.)
public class XsdReader {
public static void main (String args[])
{
XsdReader rr = new XsdReader();
rr.parseSchema();
}
public void parseSchema()
{
File file = new File("D:\\tmp\\books.xsd");
try {
XSOMParser parser = new XSOMParser();
parser.parse(file);
XSSchemaSet sset = parser.getResult();
XSSchema mys = sset.getSchema(1);
Iterator itr = sset.iterateSchema();
while( itr.hasNext() ) {
XSSchema s = (XSSchema)itr.next();
System.out.println("Target namespace: "+s.getTargetNamespace());
XSComplexType ct = mys.getComplexType("books");
int ctr=0;
if ( ct != null){
Collection c = ct.getAttributeUses();
Iterator i = c.iterator();while(i.hasNext()){
XSAttributeDecl attributeDecl = i.next().getDecl();
System.out.print("ctr="+ctr++ +"name:"+ attributeDecl.getName());
System.out.print(" type: "+attributeDecl.getType());
System.out.println("");
}
}
Iterator jtr = s.iterateElementDecls();
while( jtr.hasNext() ) {
XSElementDecl e = (XSElementDecl)jtr.next();
System.out.print( e.getName() );
if( e.isAbstract() )
System.out.print(" (abstract)");
System.out.println();
}
}
}
catch (Exception exp) {
exp.printStackTrace(System.out);
}
}
}
Monday, February 06, 2012
Lucene Standard Analyzer vs. Lingpipe EnglishStop Tokenizer Analyzer
This is old code from prototypes directory.
For some odd reason, I end up prototyping different analyzers for PLM space content vs 3ed party analyzers. (Basic need is which got better control on STOP words. At least based on my quick proto type, SOLR got easy constructs.)
Small sample code comparing both analyzers is included.
I did not see much difference for small input text.
For some odd reason, I end up prototyping different analyzers for PLM space content vs 3ed party analyzers. (Basic need is which got better control on STOP words. At least based on my quick proto type, SOLR got easy constructs.)
Small sample code comparing both analyzers is included.
I did not see much difference for small input text.
public class AnalyzerTest {
private static Analyzer analyzer;
private static long perfTime = 0;
public static void main(String[] args) {
try {
analyzer = new StandardAnalyzer(org.apache.lucene.util.Version.LUCENE_34);
String str = "PLM technology refers to the group of software applications that create and manage the data that define a product and the process for building the product. Beyond just technology, PLM is a discipline that defines best practices for product definition, configuration management, change control, design release, and many other product and process-related procedures.";
perfTime -= System.currentTimeMillis();
displayTokensWithLuceneAnalyzer(analyzer, str);
perfTime += System.currentTimeMillis();
System.out.println("Lucene Analyzer: " + perfTime + " msecs.");
perfTime -= System.currentTimeMillis();
displayTokensWithLingpipeAnalyzer(str);
perfTime += System.currentTimeMillis();
System.out.println("Lingpipe Analyzer: " + perfTime + " msecs.");
} catch (IOException ie) {
System.out.println("IO Error " + ie.getMessage());
}
System.out.println("Time: " + perfTime + " msecs.");
System.out.println("Ended");
}
private static void displayTokensWithLingpipeAnalyzer(String text)
throws IOException {
System.out.println("Inside LingpipeAnalyzer ");
TokenizerFactory ieFactory
= IndoEuropeanTokenizerFactory.INSTANCE;
TokenizerFactory factory
= new EnglishStopTokenizerFactory(ieFactory);
// = new IndoEuropeanTokenizerFactory();
char[] cs =text.toCharArray();
Tokenizer tokenizer = factory.tokenizer(cs, 0, cs.length);
String[] tokens = tokenizer.tokenize();
for (int i = 0; i < tokens.length; i++)
System.out.println(tokens[i]);
System.out.println("Total no. of Tokens: " +tokens.length );
}
private static void displayTokensWithLuceneAnalyzer(Analyzer analyzer, String text)
throws IOException {
System.out.println("Inside LuceneAnalyzer ");
TokenStream tokenStream = analyzer.tokenStream("contents",new StringReader(text) );
OffsetAttribute offsetAttribute = tokenStream.getAttribute(OffsetAttribute.class);
CharTermAttribute charTermAttribute = tokenStream.getAttribute(CharTermAttribute.class);
int length=0;
while (tokenStream.incrementToken()) {
int startOffset = offsetAttribute.startOffset();
int endOffset = offsetAttribute.endOffset();
String term = charTermAttribute.toString();
System.out.println("term->"+term+ " start:"+startOffset+" end:"+endOffset);
length++;
}
System.out.println("Total no. of Tokens: " + length);
}
}
Tuesday, January 31, 2012
Java Set Operations
As you know, for a while I’ve been playing with Solr. Basically implementing federated search framework to aggregate different ERP, CRM, PLM space data sources.
This post is not about federated search but I keep using bunch of set operations to compare the search results, comparing the unique doc id etc.
This post is not about federated search but I keep using bunch of set operations to compare the search results, comparing the unique doc id etc.
public class SetOperations {
public staticSet union(Set setA, Set setB) {
Settmp = new TreeSet (setA);
tmp.addAll(setB);
return tmp;
}
public staticSet intersection(Set setA, Set setB) {
Settmp = new TreeSet ();
for (T x : setA)
if (setB.contains(x))
tmp.add(x);
return tmp;
}
public staticSet difference(Set setA, Set setB) {
Settmp = new TreeSet (setA);
tmp.removeAll(setB);
return tmp;
}
public static void main(String[] args) {
SortedSets1= new TreeSet ();
s1.add("one");
s1.add("two");
SortedSets2= new TreeSet ();
s2.add("two");
s2.add("three");
s2.add("four");
SortedSetresult = (SortedSet ) union(s1,s2);
Iteratorit = result.iterator();
System.out.print("union result -->");
while (it.hasNext()) {
String value = it.next();
System.out.print(value+", ");
}
System.out.println("\n");
result = (SortedSet) intersection(s1,s2);
it = result.iterator();
System.out.print("intersection result-->");
while (it.hasNext()) {
String value = it.next();
System.out.print(value+ ", ");
}
System.out.println("\n");
result = (SortedSet) difference(s1,s2);
it = result.iterator();
System.out.print("difference result-->");
while (it.hasNext()) {
String value = it.next();
System.out.print(value+", ");
}
System.out.println("\n");
/*
SortedSeti1= new TreeSet ();
i1.add(new Integer("1"));
SortedSeti2= new TreeSet ();
i2.add(new Integer("2"));
SortedSetiresult = (SortedSet ) union(i1,i2);
Iteratoriit = iresult.iterator();
System.out.println("Integer union result");
while (iit.hasNext()) {
Integer value = iit.next();
System.out.println(value+",");
}
*/
}
}
Subscribe to:
Posts (Atom)