What Is a Data Science Pipeline?
Data science pipelines could be the next big thing in big data, as the term has lately become a buzzword. At Teradata we have been engineering and refining technologies and frameworks for data science pipelines for at least the last 6 years, resulting in products like Aster, Listener, Presto, Hadoop Appliance, Teradata Cloud, Teradata QueryGrid™, Think Big, and Kylo.
So what is a data science pipeline, and why is embracing it important? Big data no longer means just the four V's: Volume for scale of data, Variety for different forms of data, Velocity for streaming and evolving data, and Veracity for uncertainty of data. There is a fifth V that stands for Value - the ultimate promise of big data. Once the technologies that tame the challenges represented by the four V's have been acquired and deployed in the enterprise, the problem of extracting value from that infrastructure inevitably presents itself. Data science pipelines are both a framework and a tool set to create, deploy, execute and maintain complete applications that deliver real value by utilizing the underlying big data technologies. They aim to make data science look and behave just like the rest of enterprise software: repeatable, consistent, testable, deployable, robust, scalable and functional.
Teradata's commitment to all five V's, its proven big data track record, and its support for open source have created a unique opportunity to deliver value using data science pipelines. By employing powerful big data technologies like Hadoop Appliance, Aster, Teradata Cloud, QueryGrid and others, paired with a thoroughly integrated open source R programming environment, data scientists can now create pipelines across data sources, business channels and use cases, at scale, to produce repeatable, meaningful and actionable results.
Teradata Aster R
R has a huge following for many reasons: some historical, as it was one of the first open source environments for statistical computing; some computational, as it is a first-class functional programming language and execution environment; and some for its unparalleled support in the statistical and machine learning communities. Not coincidentally, major players like Microsoft and Teradata chose R as their de facto standard for integrating open source with their big data platforms. In the case of Teradata, we have both in-database and R desktop environments for Aster, developed and supported by Teradata:
Package TeradataAsterR consists of a collection of functions available to R programmers, data scientists and analysts who can use the powerful features of in-database analytics available in the Teradata Aster Database. Data scientists can use this rich set of functions to store, manipulate and explore big data sets to solve business problems. As large data is processed using Teradata Aster Massive Parallel Processing (MPP) and shared-nothing architecture, the single-node processor/memory bottlenecks are eliminated when executing R code. The processing of data using Teradata Aster R package functions occurs on each vWorker where data resides and data is not moved to the client during processing. Due to the design and architecture of the Teradata Aster Database Platform and software support for in-database analytics, applications and queries written using Teradata Aster R package are scalable and perform well even on a very large volume of data. Data transformation and subset operations using the Teradata Aster R package do not result in multiple copies of data, and client memory is preserved as most processing is done on worker nodes.
Source: Teradata Aster R: Building Scalable R Applications
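To make the idea of processing data where it resides more concrete, here is a minimal sketch in plain Python. It is not Aster R code and does not reflect how Aster is implemented; the names `partition`, `local_summary`, `combine` and `distributed_mean` are hypothetical and exist only for this illustration. Each simulated worker reduces its own partition to a small summary, and only those summaries (not the rows) reach the client.

```python
from functools import reduce

def partition(rows, n_workers):
    """Assign each row to a worker by its id (stand-in for a distribution key)."""
    parts = [[] for _ in range(n_workers)]
    for row in rows:
        parts[row["id"] % n_workers].append(row)
    return parts

def local_summary(part):
    """Work done on a single worker: reduce its rows to a tiny summary."""
    values = [row["value"] for row in part]
    return {"n": len(values), "total": sum(values)}

def combine(a, b):
    """Merge two partial summaries; only these small dicts travel to the client."""
    return {"n": a["n"] + b["n"], "total": a["total"] + b["total"]}

def distributed_mean(rows, n_workers=4):
    parts = partition(rows, n_workers)
    # In a real MPP system these calls would run in parallel, one per worker;
    # here they run sequentially to keep the sketch self-contained.
    summaries = [local_summary(part) for part in parts]
    overall = reduce(combine, summaries)
    return overall["total"] / overall["n"]

rows = [{"id": i, "value": i} for i in range(100)]
print(distributed_mean(rows))
```

The point of the design is that the client only ever holds one small summary per worker, however many rows each partition contains.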
Thus Teradata Aster R (or just Aster R), the Aster database for big data analytics running on Hadoop Appliance or on its own, and Teradata QueryGrid connecting it to other data sources together offer a best-of-breed environment for running data science pipelines.
Building a Data Science Pipeline
Consider a classification problem where training data (and later unlabeled data) is available but resides across multiple data sources. The following steps are necessary, and nearly sufficient, to build a complete classification solution using a supervised learning algorithm:
- connect to all data sources to bring data in
- transform data from sources to consistent format
- create analytical data set by merging data
- enhance existing and engineer new features on data
- prepare training, testing, and validation data sets
- run supervised classification that creates and evaluates a model (repeat as many times as necessary to reduce classification error)
- validate the final model to obtain a realistic error estimate
- deploy the model in production to run on a schedule, source new data, and score it
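The steps above can be sketched end to end in a few dozen lines. The following is a toy illustration in plain Python, not Aster R and not the workflow used in the examples referenced later. The two in-memory "sources", the synthetic data, the feature names and the nearest-centroid classifier are all assumptions made for this sketch. What it aims to show is the shape of a pipeline: each step is a small function, and one entry point runs them in order so the whole thing is repeatable.

```python
import random
from statistics import mean

FEATURES = ("age", "visits", "visits_to_age_ratio")

def load_sources():
    """Steps 1: two hypothetical sources with synthetic, seeded data."""
    rng = random.Random(42)
    profiles, activity = [], []
    for i in range(200):
        label = i % 2
        profiles.append({"id": i, "age": rng.gauss(30 + 15 * label, 5)})
        # The second source stores ids as strings, so formats disagree.
        activity.append({"id": str(i), "visits": rng.gauss(5 + 10 * label, 2),
                         "label": label})
    return profiles, activity

def transform(activity):
    """Step 2: bring ids to a consistent (integer) format."""
    return [{**row, "id": int(row["id"])} for row in activity]

def merge(profiles, activity):
    """Step 3: build the analytical data set by joining on id."""
    by_id = {row["id"]: row for row in activity}
    return [{**p, **by_id[p["id"]]} for p in profiles if p["id"] in by_id]

def add_features(rows):
    """Step 4: engineer a new feature from existing ones."""
    return [{**r, "visits_to_age_ratio": r["visits"] / r["age"]} for r in rows]

def split(rows, seed=0):
    """Step 5: 60/20/20 split into training, testing and validation sets."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    a, b = int(0.6 * len(rows)), int(0.8 * len(rows))
    return rows[:a], rows[a:b], rows[b:]

def train(rows):
    """Step 6: nearest-centroid model, one mean feature vector per class."""
    model = {}
    for label in {r["label"] for r in rows}:
        members = [r for r in rows if r["label"] == label]
        model[label] = [mean(r[f] for r in members) for f in FEATURES]
    return model

def predict(model, row):
    """Score one row: pick the class whose centroid is closest."""
    def dist(centroid):
        return sum((row[f] - c) ** 2 for f, c in zip(FEATURES, centroid))
    return min(model, key=lambda label: dist(model[label]))

def error_rate(model, rows):
    return sum(predict(model, r) != r["label"] for r in rows) / len(rows)

def run_pipeline():
    profiles, activity = load_sources()
    data = add_features(merge(profiles, transform(activity)))
    train_set, test_set, validation_set = split(data)
    model = train(train_set)
    return {"model": model,
            "test_error": error_rate(model, test_set),            # step 6
            "validation_error": error_rate(model, validation_set)}  # step 7

result = run_pipeline()
print(result["test_error"], result["validation_error"])
```

Deployment (the last step) would then amount to running `transform`, `merge` and `add_features` on newly arrived unlabeled rows on a schedule and calling `predict` with the stored model. Because every random choice is seeded, calling `run_pipeline()` twice yields the same model and the same error estimates, which is the repeatability the pipeline approach is after.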
For many data scientists these steps are familiar, yet they present multiple challenges: connecting and merging separate data sources, executing unrelated processes in different execution environments, and developing and evolving multiple programs, all of which lead to a diffusion of technologies, knowledge, and ultimately results. Not so with Teradata Aster, where all steps can be maintained and executed within an integrated R and Aster environment that is connected to the necessary data sources and feeds a scoring system at the end. With the pipeline defined and driven by R and executed in the Aster database, the data scientist controls and maintains all logic in the R language and stores data in systems like Hadoop or Aster appliances.
How-To's and Examples
This introduction to data science pipelines will be followed by more posts that discuss and build solutions within the Teradata ecosystem, covering workflows, main principles and steps with multiple examples. Meanwhile, you can find illustrations and examples of the data science pipeline with Aster R in the following RPubs presentation. The examples use the Lahman baseball database, which is available for Aster to download here. The R source code is available as an attachment.
Image source: Red Hat Developer Program