Today at Kiva we celebrate YOU!
10 years ago today, you joined Kiva to change lives around the world.
Daily, I help teams with the solution engineering aspects of connected vehicle data projects (massive datasets, and always some new datasets arriving with new car models, aka new technologies). Lately, in my spare time, I have been applying ML/deep learning techniques to datasets (many are created based on observations of real datasets). This blog shares some thoughts on my work (the main half); the other half is about my family and friends.
Tuesday, July 18, 2017
My Kiva updates after a decade
Friday, July 14, 2017
PDF to CSV file generator
A few years back, I was working on an Apache Tika POC/evaluation/benchmark for document indexing with Solr.
At the same time, I was exploring PDF document parsing with the Apache PDFBox framework.
At that time, I wrote sample code to see what PDFBox could do.
Exactly a year ago, my niece asked for some help parsing a PDF file.
I adapted the Java file for her needs. Keeping it here for any later use.
import java.io.*;
import java.time.Duration;
import java.time.Instant;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public class PdfToCsvGenerator {
    public static void main(String[] args) {
        BufferedWriter out = null;
        PDDocument pdDoc = null;
        COSDocument cosDoc = null;
        PDFTextStripper pdfStripper = null;
        try {
            System.out.println("Processing PDF file. Please wait.");
            Instant start = Instant.now();
            PDFParser parser = new PDFParser(new FileInputStream("C:\\xxx\\files\\final_result.pdf")); // TODO: input PDF path
            parser.parse();
            cosDoc = parser.getDocument();
            FileWriter fstream = new FileWriter("C:\\xxx\\files\\out.csv"); // TODO: output CSV path
            out = new BufferedWriter(fstream);
            pdfStripper = new PDFTextStripper();
            pdDoc = new PDDocument(cosDoc);
            pdfStripper.setStartPage(1);
            pdfStripper.setEndPage(3); // TODO: for now parse only the first three pages; once everything is ready, extend it to all pages
            String parsedText = pdfStripper.getText(pdDoc);
            String lineSep = pdfStripper.getLineSeparator();
            String[] lines = parsedText.split(lineSep);
            for (String aLine : lines) {
                if (aLine != null) {
                    String trimLine = aLine.trim();
                    // Lines starting with an upper-case letter are headers or titles, not result rows.
                    if (trimLine.matches("^[A-Z].*$")) {
                        System.out.println("Ignoring input line:" + trimLine);
                    } else {
                        String htno = null, subcode = null, subname = null, internal = null, ext = null, credit = null;
                        StringBuilder buf = new StringBuilder();
                        if (trimLine.contains(" ")) {
                            String[] fields = trimLine.split(" ");
                            boolean isSubject = true;
                            int i = 0;
                            for (String s : fields) {
                                // First token: hall ticket number.
                                if (i == 0) {
                                    htno = s;
                                    i++;
                                    continue;
                                }
                                // Second token: subject code.
                                if (i == 1) {
                                    subcode = s;
                                    i++;
                                    continue;
                                }
                                if (isSubject) {
                                    // The subject name is every following token made only of letters, '&', '/' or '-'.
                                    if (s.matches("[A-Za-z&/-]+")) {
                                        buf.append(s + " ");
                                        i++;
                                        continue;
                                    } else {
                                        // First token that is not part of the name: the internal marks.
                                        subname = buf.toString().trim();
                                        isSubject = false;
                                        internal = s;
                                        i++;
                                        continue;
                                    }
                                }
                                // Token after the internal marks: external marks; the last token is the credit.
                                ext = s;
                                credit = fields[fields.length - 1];
                                break;
                            }
                            out.write(htno + "," + subcode + "," + subname + "," + internal + "," + ext + "," + credit + "\n");
                            // System.out.println("htno:" + htno + " subcode:" + subcode + " subject name:" + subname + " internal:" + internal + " ext:" + ext + " credit:" + credit);
                        } else {
                            System.out.println("Ignoring 2nd stage input line:" + trimLine);
                        }
                    }
                }
            }
            if (out != null) {
                out.flush();
                out.close();
            }
            System.out.println("Processing is complete. Check output file.");
            Instant end = Instant.now();
            Duration timeElapsed = Duration.between(start, end);
            System.out.println("Time taken: " + timeElapsed.getSeconds() + " seconds");
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Release the writer and the PDF document even if parsing failed.
            try {
                if (out != null) {
                    out.close(); // no effect if it was already closed above
                }
                if (pdDoc != null) {
                    pdDoc.close(); // also closes the underlying COSDocument
                } else if (cosDoc != null) {
                    cosDoc.close();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
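For quick experiments without a PDF at hand, the field-splitting step of the generator above can be pulled out into a small method. This is only a sketch under assumptions: the class name ResultLineSketch, the method toCsvRow, and the sample line are all made up for illustration, and the real layout of the result PDF may differ. Unlike the generator, it splits on any run of whitespace and skips incomplete lines instead of writing "null" fields.

```java
class ResultLineSketch {

    // Converts one space-separated result line into a CSV row.
    // Returns null for lines that are skipped (headers, incomplete rows).
    static String toCsvRow(String line) {
        String trimmed = line.trim();
        // Same filters as the generator: skip lines starting with an
        // upper-case letter and lines without any space.
        if (trimmed.matches("^[A-Z].*$") || !trimmed.contains(" ")) {
            return null;
        }
        String[] fields = trimmed.split("\\s+");
        if (fields.length < 2) {
            return null;
        }
        String htno = fields[0];
        String subcode = fields[1];

        // The subject name is every following token made only of letters, '&', '/' or '-'.
        StringBuilder name = new StringBuilder();
        int i = 2;
        while (i < fields.length && fields[i].matches("[A-Za-z&/-]+")) {
            if (name.length() > 0) {
                name.append(' ');
            }
            name.append(fields[i]);
            i++;
        }

        // Assumption: at least two more tokens must follow (internal and external marks).
        if (i + 1 >= fields.length) {
            return null;
        }
        String internal = fields[i];
        String ext = fields[i + 1];
        String credit = fields[fields.length - 1]; // the last token is taken as the credit

        return String.join(",", htno, subcode, name.toString(), internal, ext, credit);
    }

    public static void main(String[] args) {
        // Hypothetical sample line, not taken from the real PDF.
        System.out.println(toCsvRow("12345 CS101 DATA STRUCTURES 25 60 4"));
    }
}
```

With the made-up sample line, main prints 12345,CS101,DATA STRUCTURES,25,60,4. Keeping this step in its own method makes it easy to try new line shapes before running the whole PDF again.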
A few of the dumbest SOLR/Cassandra implementations
1. I was on the Architecture Review Board when a project came up for review.
It was a kind of upgrade, or tech refresh, project.
The core idea: they were moving a simple web application off old versions of the OS and WebLogic (yes, companies still use this kind of heavyweight container).
The IT BOM contained Solr too (moving from Solr 3.2 to Solr 6.2).
Since I am not well versed in the business domain, I asked my questions after the meeting.
What is the motivation to go to Solr 6.x?
The answer was: a simple refresh, and maybe we will use new features, SolrCloud etc.
I asked how it is deployed. The answer was that Solr runs in 3 different containers separately, and they are load balanced. I asked whether it is master/slave etc. The answer was no.
"We index all the content separately, and monthly we update the content on each WebLogic instance." I said this is the incorrect way to do it. Are you going to fix it with the new cloud architecture etc.?
The answer was no: still 3 isolated Solr instances, each running separately in its own elephant container and load balanced. I was shell-shocked.
Pure dumbness to the core.
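One possible reason this setup is risky, sketched with plain Java collections: if each instance is indexed on its own, a single missed monthly update is enough for the load balancer to hand out different answers to the same query. Everything below is hypothetical (IsolatedIndex, RoundRobinBalancer, the documents, and the assumed failed update); it does not use Solr or SolrJ and says nothing about how the reviewed project was actually built.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for one standalone search instance (not the Solr API).
class IsolatedIndex {
    private final Map<String, String> docs = new HashMap<>();

    void index(String id, String body) {
        docs.put(id, body);
    }

    // Returns the sorted ids of all documents whose body contains the term.
    List<String> search(String term) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> e : docs.entrySet()) {
            if (e.getValue().contains(term)) {
                hits.add(e.getKey());
            }
        }
        Collections.sort(hits);
        return hits;
    }
}

// Hypothetical round-robin load balancer in front of the isolated instances.
class RoundRobinBalancer {
    private final List<IsolatedIndex> nodes;
    private int next = 0;

    RoundRobinBalancer(List<IsolatedIndex> nodes) {
        this.nodes = nodes;
    }

    List<String> search(String term) {
        IsolatedIndex node = nodes.get(next);
        next = (next + 1) % nodes.size();
        return node.search(term);
    }
}

class IndexDriftSketch {
    // Sends the same query through the balancer three times and collects the answers.
    static List<List<String>> run() {
        List<IsolatedIndex> nodes = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            IsolatedIndex node = new IsolatedIndex();
            node.index("doc1", "solr upgrade notes"); // the initial load reaches every instance
            nodes.add(node);
        }
        // Assumed failure: the monthly update reaches only the first two instances.
        nodes.get(0).index("doc2", "solr cloud features");
        nodes.get(1).index("doc2", "solr cloud features");

        RoundRobinBalancer balancer = new RoundRobinBalancer(nodes);
        List<List<String>> answers = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            answers.add(balancer.search("solr"));
        }
        return answers;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Running main prints [[doc1, doc2], [doc1, doc2], [doc1]]: the third request lands on the stale instance and silently misses doc2.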
I will expand this post later with more dumb roll-outs.
Walmart (Free Pickup + Discount) rocks
Lately I end up buying too many items from Walmart with the Free Pickup + Discount option.
For most standard products, walmart.com prices are good, and the pickup discount on top is too good.
Amazon has become more of a Prime-members-only store.
A few example buys:
1) Product Title
Schlage FE595VCAM716ACC
Amazon offers this one to Prime members for ~$80, and Walmart offered the same product with the free pickup/discount offer.
Similarly, the CATAN game.
(The free pickup/discount is too good: in all my purchases it is matching or beating Amazon prices. At the same time, it is fast for me.)
Nowadays, Amazon has become too clever with non-Prime members: for free shipping it says 5 to 8 business days, and then it ships on day 7 or 8. Kind of luring you to buy Prime.
At this point, I am still debating with myself: $100 for Prime and $100 for a Sam's Club membership. The list goes on.