Tuesday, July 18, 2017

My Kiva updates after a decade

Today at Kiva we celebrate YOU!
10 years ago today, you joined Kiva to change lives around the world.











Friday, July 14, 2017

PDF to CSV file generator

A few years back, I was working on an Apache Tika POC/evaluation/benchmark for document indexing with Solr.
At the same time, I was exploring PDF document parsing with the Apache PDFBox framework.
At that time, I wrote some sample code to see what PDFBox could do.
Last year, my niece asked for help parsing a PDF file.
I adapted the Java file to her needs. I am keeping it here for later use.


import java.io.*;
import java.time.Duration;
import java.time.Instant;

import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public class PdfToCsvGenerator {

    public static void main(String[] args) {
        String string = null;
        BufferedWriter out = null;
        PDDocument pdDoc = null;
        COSDocument cosDoc = null;
        PDFTextStripper pdfStripper = null;
        try {
            System.out.println("Processing PDF file. please wait.");

            Instant start = Instant.now();

            PDFParser parser = new PDFParser(new FileInputStream("C:\\xxx\\files\\final_result.pdf")); //TODO
            parser.parse();
            cosDoc = parser.getDocument();


            FileWriter fstream = new FileWriter("C:\\xxx\\files\\out.csv"); ///TODO this is output file
            out = new BufferedWriter(fstream);

            pdfStripper = new PDFTextStripper();
            pdDoc = new PDDocument(cosDoc);
            pdfStripper.setStartPage(1);
            pdfStripper.setEndPage(3); //TODO for now parse only the first few pages (1 to 3). once everything is ready, extend it to all pages
            String parsedText = pdfStripper.getText(pdDoc);
            String lineSep = pdfStripper.getLineSeparator();
            String[] lines  = parsedText.split(lineSep);
            for ( String aLine : lines){
                if ( aLine!= null){
                    String trimLine = aLine.trim();
                    if (trimLine.matches("^[A-Z].*$")) {
                        System.out.println("Ignoring input line:"+trimLine);
                    }else{
                        String htno=null,subcode=null, subname=null, internal=null, ext=null, credit = null;
                        StringBuffer buf = new StringBuffer();
                        if (trimLine.contains(" ")){
                            String[] fields = trimLine.split(" ");
                            boolean isSubject = true;
                            int i = 0;
                            for ( String s : fields) {
                                if (i == 0) {
                                    htno = s;
                                    i++;
                                    continue;
                                }
                                if (i == 1) {
                                    subcode = s;
                                    i++;
                                    continue;
                                }
                                if (isSubject == true) {
                                    //StringBuffer buf = new StringBuffer();
                                    if (s.matches("[A-Za-z-&/]+")) {
                                        buf.append(s+" ");
                                        i++;
                                        continue;
                                    } else {
                                        subname = buf.toString().trim(); // drop the trailing space added while joining words
                                        isSubject = false;
                                        internal = s;
                                        i++;
                                        continue;
                                    }
                                }
                                ext = s;
                                credit = fields[fields.length - 1];
                                break;
                            }
                            out.write(htno + "," + subcode + "," + subname + "," + internal + "," + ext + "," + credit + "\n");
                           // System.out.println("htno:" + htno + " subcode:" + subcode + " subject name:" + subname + " internal:" + internal + " ext:" + ext + " credit:" + credit);
                        }else{
                            System.out.println("Ignoring 2nd stage input line:"+trimLine);
                        }
                    }
                }
            }
            out.flush();
            System.out.println("processing is complete. Check output file.");
            Instant end = Instant.now();
            Duration timeElapsed = Duration.between(start, end);
            System.out.println("Time taken: "+ timeElapsed.getSeconds() +" seconds");
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Close the writer and the PDF document even if parsing failed.
            try {
                if (out != null) {
                    out.close();
                }
                if (pdDoc != null) {
                    pdDoc.close();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }

    }

}
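
The field-splitting loop in the listing can also be pulled out into a small method that is easier to try on its own, without PDFBox. This is only a minimal sketch: the class name ResultLineParser, the method name toCsvRow and the sample line are made up for illustration, and the assumed line layout (hall ticket number, subject code, subject name words, internal marks, external marks, then credit as the last token) is inferred from the variable names in the listing. It splits on runs of whitespace, trims the subject name, and returns null for lines it cannot fill completely.

```java
class ResultLineParser {

    /**
     * Turns one text line into a CSV row, or returns null for lines to skip.
     * Assumed layout: htno subcode subject-name-words... internal ext ... credit
     */
    static String toCsvRow(String line) {
        String trimmed = line.trim();
        // Lines starting with a capital letter are treated as headers;
        // lines without a space cannot hold several fields.
        if (trimmed.matches("^[A-Z].*$") || !trimmed.contains(" ")) {
            return null;
        }
        String[] fields = trimmed.split("\\s+");
        String htno = fields[0];
        String subcode = fields[1];

        // Collect subject-name words until the first token that is not purely
        // letters, '&', '/' or '-'.
        StringBuilder subname = new StringBuilder();
        int i = 2;
        while (i < fields.length && fields[i].matches("[A-Za-z&/-]+")) {
            if (subname.length() > 0) {
                subname.append(' ');
            }
            subname.append(fields[i]);
            i++;
        }

        // Internal and external marks must follow the subject name.
        if (i + 1 >= fields.length) {
            return null;
        }
        String internal = fields[i];
        String ext = fields[i + 1];
        String credit = fields[fields.length - 1]; // last token on the line

        return String.join(",", htno, subcode, subname, internal, ext, credit);
    }

    public static void main(String[] args) {
        // Hypothetical sample line, not taken from a real result sheet.
        System.out.println(toCsvRow("1201 CS101 DATA STRUCTURES 18 52 4"));
        // prints 1201,CS101,DATA STRUCTURES,18,52,4
    }
}
```

Keeping the parsing in a plain String-to-String method means sample lines can be checked before running the whole PDF through it.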

A few of the dumbest Solr/Cassandra implementations

1. I was on the architecture review board when a project came in for review.
It was a kind of upgrade or tech-refresh project.

The core idea was to move a simple web application off old versions of the OS and WebLogic (yes, companies still use this kind of heavyweight container).
The IT BOM included Solr too (moving from Solr 3.2 to Solr 6.2).
Since I am not well versed in the business domain, after the meeting I asked:
what is the motivation for going to Solr 6.x?
The answer was that it is a simple refresh, and maybe they will use new features such as SolrCloud.
I asked how it was deployed. The answer was that Solr runs separately in 3 different containers, and they are load balanced. I asked whether it was a master/slave setup or similar. The answer was no.
"We index all the content separately, and every month we update the content on each WebLogic instance." I said this was the wrong approach, and asked whether they were going to fix it with the new SolrCloud architecture.

The answer was no. Still 3 isolated Solr instances, each running in its own elephant container and load balanced. I was shell-shocked.
Pure dumbness to the core.

I will expand this post later with more dumb roll-outs.

Walmart (Free Pickup + Discount) rocks

Lately I have ended up buying a lot of items from Walmart with the Free Pickup + Discount option.
For most standard products, walmart.com prices are good, and the pickup discount on top makes them even better.
Amazon has become more of a Prime-members-only store.

A few example buys:
1) Schlage FE595VCAM716ACC

Amazon offers this one to Prime members for ~$80, and Walmart
offered the same product
with the free pickup/discount offer.

Similarly for the CATAN game.
The free pickup discount is very good: in all my purchases
it has matched or beaten Amazon's prices.
At the same time, it is fast for me.

Nowadays Amazon has become too clever with non-Prime members: for free shipping it says 5 to 8 business days,
and then it ships on day 7 or 8. It is a kind of lure to buy Prime.
At this point I am still debating with myself: $100 for Prime and
$100 for a Sam's Club membership. The list goes on.