SQL MR: documentParser

Blog Post created by jthuma on Nov 15, 2016

documentParser is a map function that pulls a variety of document files stored in HDFS (Hadoop Distributed File System) or Aster (as of 2013-04-25) and parses them using nCluster. The parsed files can be output in one of four modes by specifying a parseMode of 'text', 'tokenize', 'email', or 'image'. 'text' extracts the plain text portion of a given document and outputs it as a single varchar field. 'tokenize' is like 'text', but it emits each word as a separate row. 'email' parses out the TO, FROM, CC, BCC, SUBJECT, and BODY fields and outputs each as a separate column; it works on plain text RFC emails as well as Outlook .msg files. 'image' parses out EXIF metadata such as focal length, exposure time, and ISO.
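To make the 'email' column layout concrete, here is a minimal sketch in Python that splits a plain text RFC message into the same six fields plus a filename. It uses the standard library email module, not Tika, and it is not the documentParser implementation. The function name, the lowercase column names, the sample path, and the sample message are all hypothetical.

```python
from email import message_from_string


def parse_email_fields(filename, raw_message):
    """Return one row (a dict) holding the header fields and body of a message.

    Illustration only: handles simple, non-multipart plain text messages.
    """
    msg = message_from_string(raw_message)
    # For a non-multipart message, get_payload() returns the body as a string.
    body = "" if msg.is_multipart() else msg.get_payload()
    return {
        "filename": filename,
        "to": msg.get("To"),
        "from": msg.get("From"),
        "cc": msg.get("Cc"),
        "bcc": msg.get("Bcc"),  # None when the header is absent
        "subject": msg.get("Subject"),
        "body": body.strip(),
    }


# Hypothetical sample message, used only to exercise the function.
SAMPLE = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Cc: carol@example.com\n"
    "Subject: Q3 report\n"
    "\n"
    "Numbers attached.\n"
)

row = parse_email_fields("/data/mail/q3.eml", SAMPLE)
print(row)
```

Each message becomes a single row with one column per field, which mirrors the one-row, many-columns shape described for 'email' mode.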

In addition to the columns described above, all four modes output a "filename" column, which contains the full path and filename of the HDFS file. Thus, 'text' emits one row and two columns ("filename" and "content") for each document being parsed; 'email' and 'image' emit one row and multiple columns for each document or image; 'tokenize' emits many rows and two columns ("filename" and "word") for each document being parsed.
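The difference in row shape between 'text' and 'tokenize' can be sketched in a few lines of Python. This is only an illustration of the output shapes described above: the function names and the sample path are hypothetical, and splitting on whitespace is an assumption, since the post does not say how documentParser tokenizes.

```python
def emit_text(filename, content):
    """'text' shape: one row per document with columns (filename, content)."""
    return [(filename, content)]


def emit_tokenize(filename, content):
    """'tokenize' shape: one row per word with columns (filename, word).

    Assumption: words are separated by whitespace.
    """
    return [(filename, word) for word in content.split()]


# Hypothetical document path and already-extracted plain text.
doc_name = "/user/demo/report.pdf"
doc_text = "quarterly revenue grew"

text_rows = emit_text(doc_name, doc_text)
token_rows = emit_tokenize(doc_name, doc_text)

print(text_rows)
print(token_rows)
```

Because every 'tokenize' row carries the filename, the words can be grouped back to their source document with an ordinary GROUP BY on that column.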

Under the covers, documentParser uses Apache Tika to parse documents. This simplifies the MR implementation, since Tika both detects the document format and extracts the plain text portion of the document. Tika also supports a wide variety of formats, including Microsoft Office documents (Office 97 through 2010, including doc, docx, ppt, pptx, xls, xlsx), Outlook msg files, Outlook pst files, PDF, HTML, ePub, RDF, plain text, Apple iWork, image files (TIFF, JPEG, etc.), and more. See the Apache Tika documentation for the complete list of supported formats.



See the attached PPTX for full documentation.


See the ncluster_pdfloader bash script for the base64 encoding and ncluster loader steps. NOTE: the script may need some refactoring to work in your environment.
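The script itself is not reproduced here. As a rough illustration of the base64 step only, the Python sketch below encodes a binary document as single-line base64 text so that it can pass through a text-based loader. The function names, the tab delimiter, and the two-field line layout are assumptions, not details taken from ncluster_pdfloader.

```python
import base64


def encode_document(path):
    """Read a file as bytes and return its base64 encoding as ASCII text."""
    with open(path, "rb") as handle:
        return base64.b64encode(handle.read()).decode("ascii")


def to_load_line(path, delimiter="\t"):
    """Build one delimited line: the file path, then its base64 content.

    Assumption: the target table has a filename column followed by a
    column holding the encoded document.
    """
    return f"{path}{delimiter}{encode_document(path)}"
```

base64.b64encode produces no line breaks, so each document stays on one line of the load file.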