Tuesday, May 24, 2011

Typical industry search input requirements/patterns

An object's string property contains all the possible values.
Now I want to filter that list with different kinds of input patterns (aka single-field search), in particular accepting wildcards (*, ?, + etc.; escaping is fun in Java).
Sample code:

String str="*+PATAC+*";
Pattern pat=Pattern.compile(".*\\+*\\+.*");

Matcher matcher=pat.matcher(str);
boolean flag=matcher.find(); // true;

Logger.println("1) matcher result->"+flag);
if ( flag == true)
Logger.println("pattern found"+str);

str = "adjkfh+PATAC+ajdskfhhk";

matcher=pat.matcher(str);
flag=matcher.find(); // true;

Logger.println("2) matcher result->"+flag);
if ( flag == true)
Logger.println("pattern found"+str);


str = "PATAC";

matcher=pat.matcher(str);
flag=matcher.find(); // false: the pattern requires at least one literal '+', and "PATAC" has none

Logger.println("3) matcher result->"+flag);
if ( flag == true)
Logger.println("pattern found"+str);

str = "adjkfh+PATAC+";

matcher=pat.matcher(str);
flag=matcher.find(); // true;

Logger.println("4) matcher result->"+flag);
if ( flag == true)
Logger.println("pattern found"+str);

str = "+PATAC+testingsuffixchars";

matcher=pat.matcher(str);
flag=matcher.find(); // true;

Logger.println("5) matcher result->"+flag);
if ( flag == true)
Logger.println("pattern found"+str);
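The hand-written regex above only checks that the input contains a '+'. For a single-field search it may be easier to translate the user's wildcard input into a regex mechanically. The following is a minimal sketch under assumptions that are not in the original post: '*' means "any run of characters", '?' means "exactly one character", and everything else (including '+') is taken literally. The class and method names (WildcardFilter, toPattern, filter) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class WildcardFilter {

    // Translate user input into a regex: '*' -> ".*", '?' -> ".".
    // Every other run of characters goes through Pattern.quote, so
    // regex metacharacters such as '+' are matched literally.
    static Pattern toPattern(String userInput) {
        StringBuilder regex = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        for (char c : userInput.toCharArray()) {
            if (c == '*' || c == '?') {
                if (literal.length() > 0) {
                    regex.append(Pattern.quote(literal.toString()));
                    literal.setLength(0);
                }
                regex.append(c == '*' ? ".*" : ".");
            } else {
                literal.append(c);
            }
        }
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
        }
        return Pattern.compile(regex.toString());
    }

    // Keep only the values that match the whole wildcard expression.
    static List<String> filter(List<String> values, String userInput) {
        Pattern p = toPattern(userInput);
        List<String> hits = new ArrayList<String>();
        for (String v : values) {
            if (p.matcher(v).matches()) {
                hits.add(v);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList(
                "adjkfh+PATAC+ajdskfhhk", "PATAC", "+PATAC+testingsuffixchars");
        System.out.println(filter(values, "*+PATAC+*"));
    }
}
```

Note that matches() is used instead of find(), so the wildcard expression has to cover the whole value. A user who wants a "contains" search types the leading and trailing '*' explicitly.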

Sample code to create SOLR document from CSV file

Someone stopped by the office and asked me to index some legacy data, so the quick and dirty solution is the following. (Perl etc. could do this too, but this one gives me more flexibility.)



import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public class CsvToSolrDoc
{
    // Map a CSV column index to a SOLR field name.
    public String columnName(int i)
    {
        //workarounds workarounds
        if ( i == 0) return "id";
        if ( i == 1) return "what ever you want as field name";
        return null;
    }

    public void csvToXML(String inputFile, String outputFile) throws IOException
    {
        BufferedReader br = new BufferedReader(new FileReader(inputFile));
        String line = null;
        FileWriter fw = new FileWriter(outputFile);
        // Write the XML declaration and the root element
        // (<add> is the root of a SOLR XML update message).
        fw.write("<?xml version=\"1.0\"?>\n");
        fw.write("<add>\n");
        while ((line = br.readLine()) != null)
        {
            // Naive split: quoted values that contain commas are not handled.
            String[] values = line.split(",");
            fw.write(" <doc>\n");
            for ( int j=0; j<values.length; j++)
            {
                String colName = "field name=\""+columnName(j)+"\"";
                fw.write("<" + colName + ">");
                // Values are written as-is, without XML escaping.
                fw.write(values[j].trim());
                fw.write("</field>\n");
            }
            fw.write(" </doc>\n");
        }
        // Now we're at the end of the file, so close the XML document,
        // flush the buffer to disk, and close the newly-created file.
        fw.write("</add>\n");
        fw.flush();
        fw.close();
        br.close();
    }

    public static void main(String argv[]) throws IOException
    {
        CsvToSolrDoc cp = new CsvToSolrDoc();
        cp.csvToXML("c:\\tmp\\m2.csv", "c:\\tmp\\m2.xml");
    }
}
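One thing the quick and dirty converter does not do is escape the CSV values, so a value containing & or < would produce XML that SOLR rejects. A small helper along the following lines could be applied to each value before it is written. This is only a sketch: the names XmlEscape, escape and field are hypothetical and not part of the original code.

```java
final class XmlEscape {

    // Replace the characters that are special in XML text and
    // double-quoted attribute values with their entity references.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    // Build one <field> element of a SOLR XML update message.
    static String field(String name, String value) {
        return "<field name=\"" + escape(name) + "\">" + escape(value) + "</field>";
    }

    public static void main(String[] args) {
        System.out.println(field("id", "A&B <1>"));
    }
}
```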

SOLR project stories. Lack of SOLR post filter support

For the past few quarters I have been working on a project to implement security on object documents. My goal is to decorate every SOLR document with an ACL field, which is used to determine which users have access to the document. The ACL syntax is something like +u (dave)-g(support) etc. My thought is to process these ACL fields after the search, i.e. I want to run the query component's results through some kind of post filter. However, SOLR does not offer any direct mechanism to specify this kind of post filter along with the search request. At the Lucene level there is an option to specify a filter, but the current as-is implementation sucks.
It iterates over the entire document set, which is far too slow for large document sets. We also need SOLR's distributed capabilities. While computing the ACL fields I also tried encoding user names etc. with Base64, URLEncoder.encode etc. For a small set of strings this works OK, but for large sets it is a pain and ultimately hurts search performance.
Another blocker.
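To make the post-filter idea concrete, here is a rough sketch of how a single document's ACL string could be checked against a user. The semantics are assumptions, since the post only says the syntax is "something like" +u (dave)-g(support): '+' is read as grant, '-' as deny, 'u' as user, 'g' as group, an explicit deny wins over any grant, and no matching entry means no access. AclCheck and isAllowed are hypothetical names, not SOLR or Lucene APIs.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AclCheck {

    // One ACL entry: sign, kind (u or g), then a name in parentheses.
    // Optional whitespace is tolerated, as in "+u (dave)".
    private static final Pattern ENTRY =
            Pattern.compile("([+-])\\s*([ug])\\s*\\(([^)]*)\\)");

    static boolean isAllowed(String acl, String user, Set<String> groups) {
        boolean allowed = false;
        Matcher m = ENTRY.matcher(acl);
        while (m.find()) {
            boolean grant = m.group(1).equals("+");
            boolean isUser = m.group(2).equals("u");
            String name = m.group(3).trim();
            boolean applies = isUser ? name.equals(user) : groups.contains(name);
            if (!applies) {
                continue;
            }
            if (!grant) {
                return false; // assumed rule: an explicit deny always wins
            }
            allowed = true;
        }
        return allowed; // assumed rule: no matching grant means no access
    }

    public static void main(String[] args) {
        String acl = "+u (dave)-g(support)";
        Set<String> support = new HashSet<String>(Arrays.asList("support"));
        System.out.println(isAllowed(acl, "dave", Collections.<String>emptySet()));
        System.out.println(isAllowed(acl, "dave", support));
    }
}
```

A check like this has to run once per candidate document, which is exactly why iterating over the entire result set becomes the bottleneck.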
Encoder/decoder test code:
// Needs java.net.URLEncoder and java.net.URLDecoder;
// encode/decode can throw UnsupportedEncodingException.
long startTime = System.currentTimeMillis();
String inputText = "Hello#$#%^#^&world";
String encodedText = null;
String decodedText = null;
for (int i =0;i<50000;i++)
{
String baseString = i+ " "+inputText;
encodedText = URLEncoder.encode(baseString,"UTF-8");
decodedText = URLDecoder.decode(encodedText, "UTF-8");
}
long endTime = System.currentTimeMillis();

long elapsedTime = endTime - startTime;

System.out.println( "\n URLEncoder/decoder Elapsed Time = " + elapsedTime + "ms");

>>>>
Elapsed Time = 2246ms
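For comparison, the same round trip can be timed with Base64, which the text above also mentions. The sketch below assumes java.util.Base64, which only exists from Java 8 onward (so it was not available when this post was written), and the class name Base64Timing is made up. No timing result is shown, because it depends entirely on the machine and JVM.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class Base64Timing {

    // Encode to Base64 and decode back, mirroring the URLEncoder test.
    static String roundTrip(String text) {
        String encoded = Base64.getEncoder()
                .encodeToString(text.getBytes(StandardCharsets.UTF_8));
        byte[] decoded = Base64.getDecoder().decode(encoded);
        return new String(decoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String inputText = "Hello#$#%^#^&world";
        long startTime = System.nanoTime();
        for (int i = 0; i < 50000; i++) {
            roundTrip(i + " " + inputText);
        }
        long elapsedMs = (System.nanoTime() - startTime) / 1000000L;
        System.out.println("Base64 encoder/decoder Elapsed Time = " + elapsedMs + "ms");
    }
}
```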