omri.shiv@thinkbiganalytics.com

Underground: Document_From_Table (formerly Document_From_AFS)

Blog Post created by omri.shiv@thinkbiganalytics.com on Jul 20, 2016


Document_From_Table is an Aster function that automatically extracts text and metadata from 300+ document formats. It relies on the Apache Tika library for the bulk of the extraction work. The function can also OCR images using Google's Tesseract library. OCR is only enabled if Tesseract is installed on every worker in the cluster.

 

Installation

Download Document_From_Table, log in to the Aster cluster using ACT, and \install the file.

 

Installation of Tesseract

The easiest way to install Tesseract is to get the leptonica, tesseract, and tesseract language data RPMs from the ifad repository. Add it as a new repository on each worker: http://download.opensuse.org/repositories/home:/vjt:/ifad/openSUSE_11.4/

Then run 'zypper in tesseract', which should install everything needed. Otherwise, you will have to download the liblept5, leptonica, libtesseract, and tesseract-ocr RPMs and install them manually on all the workers.
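
The repository steps above can be sketched as a small dry-run script. It only prints the commands for each worker so they can be reviewed before anything is run. The repository alias "ifad", the worker hostnames, and the use of ssh as root are assumptions for illustration and are not part of the original instructions.

```shell
#!/bin/sh
# Dry-run sketch: prints the per-worker install commands, executes nothing.
# ASSUMPTIONS: the alias "ifad", the hostnames worker1/worker2, and root ssh
# access are placeholders; adjust them to your cluster.
REPO_URL="http://download.opensuse.org/repositories/home:/vjt:/ifad/openSUSE_11.4/"
REPO_ALIAS="ifad"

tesseract_install_cmds() {
  worker="$1"
  # Register the ifad repository with zypper.
  printf 'ssh root@%s zypper --non-interactive addrepo %s %s\n' "$worker" "$REPO_URL" "$REPO_ALIAS"
  # Refresh metadata; this option trusts the repository signing key automatically.
  printf 'ssh root@%s zypper --gpg-auto-import-keys refresh\n' "$worker"
  # Install tesseract; zypper should pull in its dependencies.
  printf 'ssh root@%s zypper --non-interactive install tesseract\n' "$worker"
  # Confirm the binary is present on the worker.
  printf 'ssh root@%s tesseract --version\n' "$worker"
}

# Example: print the commands for two hypothetical workers.
for w in worker1 worker2; do
  tesseract_install_cmds "$w"
done
```

Printing rather than executing keeps the sketch safe to run anywhere. Once the output looks right for your environment, the printed lines can be run by hand on each worker.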

 

Usage

Document_From_Table takes a table of files stored as bytea values as input and extracts or OCRs the text from them. To load files into Aster in this format, use the Aster Loader tool. After populating the table with files, you can call the document_from_table function with the following SQL-MR syntax:

 

SELECT * FROM document_from_table(
  ON public.file_load
  PARTITION BY filename
);
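
As a rough illustration of the input the query above expects, the following sketch prints a possible table definition alongside the query for a given table name. The column name "content", the VARCHAR type for filename, and the DISTRIBUTE BY HASH clause are assumptions. The only column the original names is filename, so check the function's documentation for the layout it actually requires.

```shell
#!/bin/sh
# Sketch: prints SQL text only, it does not connect to Aster.
# ASSUMPTIONS: the "content" column name, the VARCHAR type, and the
# distribution key are hypothetical; only "filename" appears in the original.
file_table_ddl() {
  table="$1"
  cat <<EOF
CREATE TABLE ${table} (
  filename VARCHAR,
  content  BYTEA
) DISTRIBUTE BY HASH(filename);
EOF
}

# Build the SQL-MR call from the post for an arbitrary input table.
document_from_table_query() {
  table="$1"
  cat <<EOF
SELECT * FROM document_from_table(
  ON ${table}
  PARTITION BY filename
);
EOF
}

file_table_ddl public.file_load
document_from_table_query public.file_load
```

The printed statements can be pasted into an ACT session once the table has been populated with the Aster Loader tool.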

Outcomes