Wednesday, January 23, 2008

Integrate Full Text search functionality in to your applications with Lucene (indexing)

Part 1) Indexing your data.

Any Full Text search functionality involves indexing the data first. Lucene was no different in this approach. By indexing your data, it can perform high-performance full-text searching very fast. I did indexed 17,000 html files (my product documentation) in less than 5 minutes.

Creating Index writer & adding documents methods are key.
Rest f the methods for book keeping.
Following code indexes html, htm files in a folder. (It recursively iterates the nested folders & indexes each file)


////you data
public static final String dataDir = "D:\\webapps\\help";
//the directory that is used to store lucene index
private final String indexDir = "D:\\help_index";

public static String src1 = "";
public IndexWriter indexWriter;
public static int numF;
public static int numD;

public void openIndexWriter()throws IOException
{
Directory fsDirectory = FSDirectory.getDirectory(indexDir);
Analyzer analyzer = new StandardAnalyzer();
indexWriter = new IndexWriter(fsDirectory, true, analyzer);
indexWriter.setWriteLockTimeout(IndexWriter.WRITE_LOCK_TIMEOUT * 100 );
}

public void closeIndexWriter()throws IOException
{
indexWriter.optimize();
indexWriter.close();
}

public void indexFiles(String strPath) throws IOException
{
File src = new File(strPath);
if (src.isDirectory())
{
numD++;
String list[] = src.list();
try
{
for (int i = 0; i < list.length; i++)
{
src1 = src.getAbsolutePath() + File.separatorChar + list[i];
File file = new File(src1);
/*
* Try check like read/write access check etc.
*/
if ( file.isDirectory() )indexFiles(src1);
else
{
numF++;
if(src1.endsWith(".html") src1.endsWith(".htm")){
addDocument(src1, indexWriter);
}
}
}
}catch(java.lang.NullPointerException e){}
}
}
public boolean createIndex() throws IOException{
if(true == ifIndexExist()){
return true;
}
File dir = new File(dataDir);
if(!dir.exists()){
return false;
}
File[] htmls = dir.listFiles();
Directory fsDirectory = FSDirectory.getDirectory(indexDir);
Analyzer analyzer = new StandardAnalyzer();
IndexWriter indexWriter = new IndexWriter(fsDirectory, analyzer, true);
for(int i = 0; i < htmls.length; i++){
String htmlPath = htmls[i].getAbsolutePath();
if(htmlPath.endsWith(".html") htmlPath.endsWith(".htm")){
addDocument(htmlPath, indexWriter);
}
}
return true;
}
/**
* Add one document to the lucene index
*/
public void addDocument(String htmlPath, IndexWriter indexWriter){
//System.out.println("\n adding file to index "+htmlPath );
HTMLDocParser htmlParser = new HTMLDocParser(htmlPath);
String path = htmlParser.getPath();
String title = htmlParser.getTitle();
Reader content = htmlParser.getContent();
Document document = new Document();
document.add(new Field("path",path,Field.Store.YES,Field.Index.NO));
document.add(new Field("title",title,Field.Store.YES,Field.Index.TOKENIZED));
document.add(new Field("content",content));
try {
indexWriter.addDocument(document);
} catch (IOException e) {
e.printStackTrace();
}
}

im.openIndexWriter();
File src = new File(dataDir);
if(!src.exists()){
System.out.println("\n DATA DIR DOES NOT EXISTS" );
return;
}
long start = System.currentTimeMillis();
System.out.println("\n INDEXING STARTED" );
im.indexFiles(dataDir);
im.closeIndexWriter();
long end = System.currentTimeMillis();
long diff = (end-start)/1000;
System.out.println("\n Time consumed in Index the whole help=" +diff );
System.out.println("Number of files :\t"+numF);
System.out.println("Number of dirs :\t"+numD);
}

Friday, January 11, 2008

Populating MySql table from MS Excel { aka .csv} file

I am working on small project for a non profit organization.
{I felt Apache, PHP & MySQL combination fits their need.
I will explain about that application little later.}

I have already received some data in MS Excel file.
Strangely some trailing columns are missing in some records after saving Excel file in to csv file.

Following command fails saying column truncated MySQL errors.

mysql> load data infile 'C://bea//temple//dpexport.csv' into table donar
_info_4 fields terminated by ',' OPTIONALLY ENCLOSED BY '"' Lines terminated by
'\n';

After spend little more time with MySQL documentation,
I found out that IGNORE option does the magic &
I am able to load the csv files in my SQL table. Just add IGNORE next to infile.

Wednesday, January 02, 2008

Amazon Kindle is the next IPOD?

Few days back, I was little early to movie & I was waiting, sitting on a sofa, outside the theater.
Suddenly a young man sat next to me, seriously browsing web with his Amazon Kindle. (What made me curious was, he was reading one of my favorite web site, New York Times.) As I was paying attention to gadget , slowly he started talking all the good things about his new gadget & offered me to feel it. Quickly I checked the weight & look and feel of a blog & an e-book. Not so heavy & look and feel was very good & natural. I liked it a lot. I was so tempted after coming home, I checked Amazon website for Kindle.
(Little pricy, It was sold out & seems to be there is lot of demand for the gadget) Most of the reviews are positive. This incident remembers me another old incident with first experience with IPOD. Nearly4+ year's back, while coming back from my vacation, (India->London-> US), one person sitting next to me was explaining & talking & proud of owning a new IPOD. (If I remember correctly, it was the second month after IPOD launch.) Now I am seeing the same thing happening for Amazon's Kindle. I liked the way it was designed (automatic download content to the device)& convenience & thought process behind the Kindle. I was so surprised that Amazon came up with this kind of device. {After its initial launch as a major on line book store, this was the best thing from Amazon. My personal opinion. } Hoping that I will own Amazon Kindle very soon. It fits in to my taste.