
Learn Aster


Harnessing an analytical technique known as text clustering, companies in multiple industries can analyze customer call center data to find key word trends and phrases that may quickly alert them to potential customer service problems, manufacturing defects or negative sentiment.

 

Video featuring Karthik.Guruswamy - Principal Consultant & Data Scientist

 


 

art_of_analytics, manufacturing, safety_cloud, text_analytics, sentiment_analysis, customer_satisfaction

Safety Cloud transforms multiple types of text data through analytics into a visualization that leads to significant innovation.  Applying natural language processing to these analytical techniques allows for sentiment analysis, giving businesses insight without having to read every document the dots represent.

 

 

____________

Other Works by Karthik

The Star – lines thick and thin, seemingly simple but revealing critical insights and behaviors hidden amongst the data, discovered only with analytics.

 

Using Hidden Markov Models, an analytical technique perfectly suited to time-series data, data scientists uncovered hidden states in the data.

 

Michelle Tanco, Data Scientist

 

_______

Related Links:

Other Community Content by Michelle Tanco


The explosion of interest in Artificial Intelligence (AI) is triggering widespread curiosity about its importance as a driver of business value. Likewise, Deep Learning, a subset of AI, is bringing new possibilities to light. Can these technologies significantly reduce costs and drive revenue? How can enterprises use AI to enhance customer experiences, create more insight across the supply chain, and refine predictive analytics?

PARTNERS 2017 offers the curious visionary and the creative executive plenty of education on the pragmatic business value of AI.  Here are a few of the sessions on the topic:

Autonomous Decision-Making, ML, & AI: What It Means & Why Should You Care

We’ve entered a new era of analytics, with machine learning and artificial intelligence algorithms beginning to deliver on the long-promised advancement into self-learning systems.  Their appetite for vast amounts of data and their ability to derive intelligence from diverse, noisy data allow us to go far beyond the previous capabilities of what used to be called advanced analytics.  To succeed, we need to understand both capabilities and limitations, and develop new skills to harness the power of deep learning to create enterprise value. This session covers the future of AI, emerging capabilities available today, relevant techniques, and the ‘Think Deep’ framework for automating the generation and deployment of machine learning and deep learning models. Wednesday, October 25, 2:00 PM-3:00 PM

Fighting Financial Fraud at Danske Bank with Artificial Intelligence

Fraud in banking is an arms race, with criminals using machine learning to improve the effectiveness of their attacks. Danske Bank is fighting back with deep learning and innovating with AI to curb fraud. The session spans topics such as model effectiveness, real-time integration, TensorFlow vs. boosted decision tree predictive models, operational considerations in training and deploying models, and lessons learned. Monday, October 23, 11:30 AM-12:15 PM.

Artificial Intelligence: What’s Possible For Enterprises Today

The sci-fi notion of AI, pure AI, is still a long way off. However, pragmatic AI technology is here today, and enterprises are using AI building-block technologies such as machine learning to achieve amazing business results. In this session, Forrester Research VP & Principal Analyst Mike Gualtieri will demystify AI and explain what enterprise use cases are possible today and how to get started. Tuesday, October 24, 9:00 AM-9:45 AM. Presenter: Mike Gualtieri, Principal Analyst, Forrester Research.

Artificial Intelligence and the Teradata Unified Data Architecture (UDA)

Artificial Intelligence has entered a renaissance. Underlying this progress is Deep Learning, driven by significant improvements in Graphics Processing Units and by computational models inspired by the human brain that excel at capturing structures hidden in massive datasets. Learn how AI is impacting enterprise analytics today in applications like fraud detection, mobile personalization, and predicting failures for IoT. The session focuses on ways to leverage and extend the Teradata Unified Data Architecture today, along with a new AI reference architecture, to produce business benefits. Monday, October 23, 2:00 PM-3:00 PM.

Employing Deep Neural Nets for Recognition of Handwritten Check Payee Text

The handwritten check is a primary linchpin of the customer relationship at Wells Fargo, and it represents an enormous personnel cost when the bank attempts to resolve the payee field and other transaction information in handwritten form. Currently, Automated Teller Machines (ATMs) operated by Wells Fargo can recognize monetary amounts (numerical digits) on checks using neural networks trained on a standard handwritten numeral dataset. This session details the latest image recognition and deep learning techniques to extend recognition capability to the payee field, and a new capability to deploy deep neural networks with Aster and TensorFlow through a SQL interface. Tuesday, October 24, 11:30 AM-12:15 PM. Presenters: Gary Class, Wells Fargo, and Kyle Grove, Senior Data Scientist, Teradata.

Dig in Deep into a Data Fabric Implementation Using Teradata and SAS

Banco Itaú-Unibanco S.A. is one of the largest banks in Latin America and a global Top 50 bank by market cap. It operates in retail, wholesale, private and investment banking, private equity, asset management, insurance, and the credit card business. The session will outline a new data fabric platform based on Teradata and SAS integration, which brought new capabilities to credit risk analysts in terms of the amount and complexity of data that can be used in their models. With this platform the risk teams are able to manipulate, in a dynamic and productive way, different sources of data, higher volumes (about 30 times more), and new algorithms (e.g. neural networks) to improve model performance. The results are amazing and will be shared in detail. Wednesday, October 25, 10:30 AM-11:15 AM. Presenters: Dalmer Sella, Data Engineer, Itaú, and Fabiano Yasuda, Credit Modeling Manager, Itaú-Unibanco S.A.

Please be sure to check out the Session Catalog for more, and try to register early to join the “Meet-Up” sessions!

 

Original Blog Post: AI and Deep Learning Session Highlights for PARTNERS 2017 - Data Points 

Data Science can be an adventure in every possible way - just ask any employee who has been tasked with solving data science problems. Did you know that the whole zen of data science thrives on Trial/Error AND a culture of failing fast?

If you talk to people in a company's analytics department, you are going to find two ways of approaching business problems:

  1. I will try stuff that I know and can probably produce a visual. I've done it many times before, so it should work. I know I could try this new stuff, but I'm not sure what would come out of it, so I'd skip the crazy idea. Just by looking at the data I can tell the visual insight will suck or I will fail miserably.
  2. I don't know how the visual will look when I try this, but I'm willing to go for it anyway! I want to see what happens and don't want to guess. The data doesn't look that interesting to me at the outset, but I'm willing to create a visual just for the fun of it, even if it comes out boring. What if it's something useful?

Approach #2 is what makes up the Trial/Error and Fail Fast culture. Imagine putting a super-fast iterative tool (Teradata Aster Analytics or Apache Spark) in the hands of a person who practices approach #2!

A Trial/Error and Fail Fast culture doesn't mean data scientists are unwilling to try time-tested methods. It only means they are willing to take a lot of 'quickfire' risks for better results!

With just a bit of luck, it's off to the next iteration of failing fast and building on what works!

A bit more on Trial/Error and Fail Fast. What exactly is it?

The Trial/Error and Fail Fast approach means trying things with very few expectations about the outcome. Of course, the business outcome should be the eventual goal. We are willing to try something quickly and expect not to be rewarded immediately, and we don't give up just because the first outcome isn't interesting. Nine times out of ten we are fumbling, but we keep going until we get lucky once, and that one attempt often proves to be the most valuable and actually works. Most successful data scientists will choose a fail-fast tool for their trial-and-error pursuits. The more we allow ourselves to fail quickly, the sooner we are going to stumble into something incredibly useful.


Causality Detection vs Quantitative Modeling

From a 10,000-foot point of view, most data science problems have two aspects to them:

  • Causality Detection - find the root cause of the problem or situation.
  • Quantitative Modeling - try to predict an outcome after learning from a bunch of data. You don't need to know the cause of the problem for prediction, just a model built from different variables. Done correctly, algorithms take care of mapping inputs to the outcome and will do the prediction automatically.

Both of the above require a bit of creativity. Causality Detection is by far the harder of the two, as it requires a lot of domain knowledge and some cleverness. It's great to know that I can predict a part failure 8 out of 10 times, but knowing why, and getting to the root cause, is a completely different animal. You can get away with not being a domain expert in Quantitative Modeling; with Causality Detection, only a domain expert can say definitively that A leads to B.

Applying the Trial/Error and Fail Fast approach to Quantitative Modeling means trying different algorithms, model parameters, features in the data, and new sources iteratively until we reach our accuracy goal *REALLY QUICKLY*. There is a systematic method to some of the techniques now, but it still requires TRYING many things before something works.

Causality Detection, as mentioned earlier, is a bit different. We can try and fail fast on a few A/B testing approaches, but it requires careful navigation through multiple steps, with each step taken ever so carefully and surely. Causality Detection is about eliminating uncertainty as we get really close to the root cause.

Working in an uncertain environment

In unknown situations or problems, most folks want a cookie-cutter approach; unfortunately, data science brings a lot of uncertainty to the table. Even with techniques like Deep Learning, which work out of the box with a random starting configuration, getting to the next level often proves challenging and tricky. As architectures become more complex, the science often depends on the trial/error art form, relying on the creative data scientist's efforts in addition to best practices developed over time.

Hope you enjoyed the blog post.

Using an agile approach, a cross-functional team of doctors, cancer researchers, data scientists, data visualization experts, and technologists set out on a mission to understand over 1,000 genetic patterns of cancer in order to develop personalized medical treatments aligned to the genetic makeup of humans.

 

Decoding the human genome is the next frontier in science and discovery in medicine. Today, the combination of data, analytics, and visualization tools is cutting-edge innovation in life sciences.  View the video below.

 

Genome World Window - Stephen Brobst and Andrew Cardno

Art of Analytics - Genome World Window - YouTube 

 

________________

Related Links:

Data Scientist

Data Visualization

Combining the collaborative expertise of data scientists, geophysicists, and data visualization experts, an integrated oil company developed new understandings of complex reservoir management with data and analytics. This business case easily transcends multiple industries focused on asset utilization and optimization.

 

The Sailor - Duncan Irving 

Art of Analytics: The Sailor - YouTube 

 

______

Related Links

Data Visualization

Fusing business acumen, data science, and creative visualization, the Burning Leaf of Spending enabled a major bank to detect anomalies in customer spending patterns that indicate major life events, and provided artful insights into the personalized service required to enhance the customer experience, improving lifetime value.

 

Burning Leaf of Spending - Tatiana Bokareva 

 

 

________________

Related Link:

Detecting Anomalies

Advanced analytic techniques, combined with innovative data visualization, identify suspected claims fraud, enabling a property and casualty insurer to automate new business rules to improve the fraud investigation process and mitigate risk.

 

Text Mining

Graph Analysis

Cluster Analysis

Predictive Analytics

 

These are the techniques behind Fraud Invaders, delivering business value in risk mitigation to any industry that needs to identify exposure across the enterprise.  They give companies the ability to detect suspected fraud at higher rates with new information sources and increased automation, avoiding costs and liability.

 

________________

Related Links:

 

Text Mining 

Graph Analysis

Cluster Analysis

Predictive Analysis

Detecting Fraud

Using analytic techniques that normally follow the "Customer Journey," Teradata Think Big consultants and data scientists use data and analytics to visualize and identify "The Human Journey," allowing Buttle UK to identify and fulfill the needs of those at risk.

 

The Teradata Certified Professional Program is excited to announce the release of the first Teradata Aster Certification Exam – Teradata Aster Basics 6.10. Passing this exam will earn you the Teradata Aster Certified Professional Credential. Start preparing today and you’ll be one of the first to get Teradata Aster Certified. Click here for details, exam objectives and recommended training courses.

 

 

Introducing the New Teradata Aster Basics Certification Study Guide

 

The Teradata Aster Basics Certification Study Guide has been released!  Through simplified examples and explanations, this new guide helps certification candidates prepare for the Teradata Aster Basics 6.10 exam and achieve the Teradata Aster Certified Professional Credential.  The Teradata Certification Study Guides are the only Teradata-authorized study publications. This guide is designed to complement Teradata training and deepen your knowledge of the defined certification objectives. All Certification Study Guides can be purchased at Amazon.com.

 

Pursue Teradata Certification with Confidence

Aster's nPath function is a powerful tool for uncovering patterns in sequential data.  There are many examples here on the Aster Community for writing those queries and visualizing the output.  But that's not all Aster has to offer for path analysis.  There are more steps that you can take to create even more value.  I'm going to show you a few easy queries that will take the output from nPath and start to build out a model that may be used to predict the next step on a path.

 

I'll use the demo data set that ships with Aster Express, bank_web_clicks, though really any output from your own nPath analysis will work just as well.  Here's a simple Tree diagram that I've created from this data, showing a handful of paths that customers have taken on this web site.
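(In case you're starting from scratch, here's a rough sketch of the kind of nPath query that could produce the bank_paths table used below. The bank_web_clicks column names, customer_id, datestamp, and page, are assumptions about the demo schema, so adjust them to match your data.)

-- Hypothetical starting point: accumulate each customer's click sequence into a path
create table bank_paths
   distribute by hash(customer_id)
as
select customer_id, path
from nPath(
    on bank_web_clicks
    partition by customer_id
    order by datestamp
    mode (nonoverlapping)
    pattern ('A+')
    symbols (true as A)
    result (
        first (customer_id of A) as customer_id,
        accumulate (page of A) as path
    )
);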

 

 

These Tree diagrams are incredibly useful for visualizing paths and helping to uncover patterns for deeper analysis.  Additionally, having an underlying table with metrics around parent nodes, child nodes, counts and percentages could begin to form the foundation for a model to predict likely outcomes for other customers on similar journeys, allowing you to answer such questions as "what's the next likely step my customer will take?"  In a few steps, I'll show how to build out a table with rows like this, showing parent paths and next steps with a calculated probability:

 

Parent | Next | Probability
ACCOUNT SUMMARY, BILL MANAGER FORM | BILL MANAGER ENROLLMENT | 1
ACCOUNT SUMMARY, FUNDS TRANSFER, CUSTOMER SUPPORT | FAQ | .59
ACCOUNT SUMMARY, FUNDS TRANSFER, ONLINE STATEMENT ENROLLMENT | ACCOUNT HISTORY | .54
ACCOUNT SUMMARY, FAQ, CUSTOMER SUPPORT | ACCOUNT HISTORY | .42

 

Introducing the Aster path_analyzer function.  This takes the output from your nPath query, with 'group by' counts for each path, and then breaks out all the paths and counts from each parent node, working its way down the full depth of the tree.  Getting the "top paths and counts" from nPath is a pretty standard part of our normal nPath workflow and is done easily with a simple query.  Let's build a new table, called 'top_bank_paths'.  Here I'll limit this to the top 1,000 paths.  (Since these tables are small, I'm using 'replication' as my distribution.)

 

create table top_bank_paths
   distribute by replication
as
select path,count(*) as cnt
from bank_paths
group by 1
order by 2 desc
limit 1000
;

 

With results:

 

path | cnt
[ACCOUNT SUMMARY, FAQ] | 2854
[ACCOUNT SUMMARY, VIEW DEPOSIT DETAILS] | 2830
[ACCOUNT SUMMARY, FUNDS TRANSFER] | 2746
[ACCOUNT SUMMARY, ACCOUNT SUMMARY] | 2715

 

We will now use these as input to path_analyzer:

 

SELECT * FROM Path_Analyzer
(
    ON (SELECT 1)
    PARTITION BY 1
    InputTable ('top_bank_paths')
    OutputTable ('bank_path_analyzer_output')
    --SeqColumn ('path')  Aster v7.0
    --CountColumn ('cnt') Aster v7.0
    Seq('path')  --Aster v6.20
    Cnt('cnt')   --Aster v6.20
);

 

The syntax did change slightly for the Seq and Cnt parameters with Aster Analytics 7.0, so I've included both above. Simply comment/uncomment based on your library version.  Also, if you want to rerun path_analyzer, you'll need to first drop your output table.  In this example, that would be:

     drop table bank_path_analyzer_output;

 

The output is a bit verbose, so here are the important fields:

 

parent | children | subpath_cnt | depth
^,[ACCOUNT SUMMARY | [(^,[ACCOUNT SUMMARY, BILL MANAGER, ACCOUNT HISTORY]),(^,[ACCOUNT SUMMARY, BILL MANAGER, ACCOUNT SUMMARY]),(^,[ACCOUNT SUMMARY, BILL MANAGER, FAQ]),(^,[ACCOUNT SUMMARY, BILL MANAGER, FUNDS TRANSFER]),(^,[ACCOUNT SUMMARY, BILL MANAGER, VIEW DEPOSIT DETAILS])] | 70 | 2

 

With this 'not too complex' SQL, we can now build a table for each parent node with corresponding 'next child node' and percentage calculation.  I'm doing a little string manipulation; skipping those first 4 characters in the parent format and likewise in the 'next_node' field, and trimming the trailing ']' character, which represents 'end of path'.  I'm also using a window function, row_number(),  to rank each child by percentage so that we can later filter by top 1 or 2 child nodes:

 

create table top_bank_next_path
    distribute by replication
as
select substring(N.parent from 4) as parent
     ,P.depth
     ,replace(substring(N.node from (3 + length(N.parent))),']','') as next_node
     ,P.subpath_cnt as parent_cnt
     ,N.subpath_cnt as next_cnt
     ,(1.0 * N.subpath_cnt / P.subpath_cnt)::numeric(3,2) as probability
     ,row_number() over (partition by N.parent order by N.subpath_cnt desc) as rank
from bank_path_analyzer_output N
    ,bank_path_analyzer_output P
where P.depth > 0
and   N.parent = P.node
;

 

 

parent | depth | next_node | parent_cnt | next_cnt | probability | rank
ACCOUNT SUMMARY, ACCOUNT SUMMARY, FAQ, ONLINE STATEMENT ENROLLMENT | 4 | VIEW DEPOSIT DETAILS | 11 | 11 | 1 | 1
ACCOUNT SUMMARY, FUNDS TRANSFER, FUNDS TRANSFER, FUNDS TRANSFER | 4 | ACCOUNT HISTORY | 32 | 18 | 0.56 | 1
ACCOUNT SUMMARY, FUNDS TRANSFER, FUNDS TRANSFER, FUNDS TRANSFER | 4 | ACCOUNT SUMMARY | 32 | 14 | 0.44 | 2
ACCOUNT SUMMARY, VIEW DEPOSIT DETAILS, FUNDS TRANSFER, ACCOUNT HISTORY | 4 | FUNDS TRANSFER | 48 | 19 | 0.4 | 1
ACCOUNT SUMMARY, FAQ, FAQ, FAQ | 4 | ACCOUNT SUMMARY | 58 | 22 | 0.38 | 1

 

And finally, drum roll, here are some answers to the question "what is my customer likely to do next based on where they are in their journey?"  In this next query, I'm using some parameters like 'probability greater than 30%', 'more than 20 similar paths have occurred' and, perhaps most importantly, 'where rank equals 1', which means it's the most likely next step for each parent path.

 

select parent,next_node
     ,parent_cnt,probability
from demo.top_bank_next_path
where rank = 1
and probability > .3
and parent_cnt > 20
order by probability desc
limit 10
;

 

parent | next_node | parent_cnt | probability
ACCOUNT SUMMARY, BILL MANAGER FORM | BILL MANAGER ENROLLMENT | 33 | 1.0
ACCOUNT SUMMARY, FUNDS TRANSFER, CUSTOMER SUPPORT | FAQ | 29 | 0.59
ACCOUNT SUMMARY, FUNDS TRANSFER, FUNDS TRANSFER, FUNDS TRANSFER | ACCOUNT HISTORY | 32 | 0.56
ACCOUNT SUMMARY, ACCOUNT SUMMARY, ONLINE STATEMENT ENROLLMENT | FUNDS TRANSFER | 27 | 0.56
ACCOUNT SUMMARY, CUSTOMER SUPPORT, ONLINE STATEMENT ENROLLMENT | ACCOUNT HISTORY | 27 | 0.56
ACCOUNT SUMMARY, PROFILE UPDATE, CUSTOMER SUPPORT | ACCOUNT HISTORY | 27 | 0.56
ACCOUNT SUMMARY, FUNDS TRANSFER, ONLINE STATEMENT ENROLLMENT | ACCOUNT HISTORY | 24 | 0.54
ACCOUNT SUMMARY, FAQ, CUSTOMER SUPPORT | ACCOUNT HISTORY | 43 | 0.42
ACCOUNT SUMMARY, VIEW DEPOSIT DETAILS, FUNDS TRANSFER, ACCOUNT HISTORY | FUNDS TRANSFER | 48 | 0.4
ACCOUNT SUMMARY, PROFILE UPDATE, FUNDS TRANSFER | FUNDS TRANSFER | 64 | 0.39

 

There it is: a pretty straightforward and, I think, very powerful new step to add to all your Aster nPath workflows, creating a very useful new table that answers many questions from your pathing data.
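As a quick usage sketch (the filter value below is just a hypothetical current path), a lookup like this would return the most likely next step for a customer at that exact point in their journey:

-- Hypothetical lookup: most likely next step for a customer currently on this path
select next_node, probability, parent_cnt
from top_bank_next_path
where parent = 'ACCOUNT SUMMARY, FAQ, CUSTOMER SUPPORT'
and rank = 1;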

Enjoy!

Did you miss the live webinar?  You can listen to the replay featuring Matt Mazzarell, Data Scientist, Teradata. 

 

 

Unable to View Video here?  Use this link:  Combining Text and Graph Analytics for More Context in Your Networks featuring Matt Mazzarell - YouTube 

 


Video Details:

    Adding more techniques in text analytics can better inform you of the working relationships in a network. Topic analysis, sentiment extraction and predictive analytics have helped large organizations solve complex problems using large volumes of communication data from various sources. This webinar session explores some examples of how customers have used these techniques to add more context to their networks.

 

Matt Mazzarell is a Data Scientist in the Teradata Aster Solutions team with experience implementing analytic solutions for many clients in the Fortune 1000 who seek to solve business problems using large and varied data sets. He has experience collecting customer requirements and prescribing specific analytic techniques across all industry verticals. Matt is responsible for several pre-packaged solutions including Communications Compliance, Product Recommender, and Healthcare Fraud. He is dedicated to enabling the non-technical user to perform analytics at scale that drive significant value for the business.

 

 

 


Organization Network Analysis (ONA) In Action -  Connected Commons Webinar Series

Some of you at Partners 2016 wanted to play with the QGRAM function for text matching, in addition to the IdentityMatch() function. I explain here how to invoke it so you can add it to your environment and try it out.

 

Algorithm Background

Read more about QGRAMs, Cosine Similarity etc., in my previous blog post to understand different text similarity techniques.

 

Data Science - Fuzzy Matching & Similarity Measures

 

In this blog post, I only explain a simple use case of matching short text, especially merchant names in transactions.

 

Overview

If a business has transaction data containing merchant names, it's a common problem to 'disambiguate' those names and map them to a yellow-page/registered business name.  If you have millions of business names that mutate and morph over time in the data, this is next to impossible with traditional techniques such as string matching, regular expressions, or SQL LIKEs.  This post explains how we solved it using a technique called 'Weighted QGRAMs with Cosine Similarity'.  The SQL/MR function QGRAM was created by the Aster PS team and is combined with the 6.10 analytic function VectorDistance(), which computes the cosine similarity metric.  The approach breaks the merchant names down into q-grams (letter n-grams), weights each q-gram based on its position in the string, and then applies cosine similarity to find the best match.
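If you want a feel for what the weighted q-grams look like before running the full workflow, a quick peek like the sketch below can help. It reuses the same QGRAM() call as the workflow further down and simply filters to one dictionary name; the assumption here is that QGRAM() emits one (id, qgram, weight) row per letter 3-gram, with weights decaying by position via decay(0.95).

-- Illustrative only: inspect the weighted 3-grams for a single dictionary name
select id, qgram, weight
from qgram(
    on bizname_yellow_pages
    decay(0.95)
    gram(3)
)
where id = 6      -- 'Teradata Corp' in the sample data below
order by weight desc;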

 

SQL/MR Functions Used:

 

QGRAM() must be installed in your cluster schema.  Download the SQL MR: QGRAM and install it in your schema, or ask your DBA to install it in your cluster using:

 

\install qgram.zip

from act. 

 

VectorDistance() is available in the later Analytic foundation libraries and should be used in conjunction with QGRAM.

 

Inputs:

 

Yellow Page Dictionary (the standardized names you want to roll up to):

 

id | merchant
1 | Ace Mortgage Inc
2 | Ace Credit Card Inc
3 | Ace Credit Bank
4 | Ceasar Frozen Yogurt LLC
5 | 8/12 Convienence Store
6 | Teradata Corp
7 | Walmart Super Market
8 | Apple
9 | Wholefoods
10 | San Mateo City

 

Merchant names that need to be 'disambiguated' or fuzzy-joined with the above table:

 

id | bizname
1 | Ace MG
2 | ACC
3 | ACB
4 | Ceasar Frozen Yogurt
5 | 8/12 Store
6 | Teradata Corporation
7 | Walmart Super M
8 | apple
9 | Whole Foods Market
10 | San Mateo City

Final Output (after the Weighted QGRAM / Cosine Similarity workflow in the next section):

 

 

bizname | merchant | cos_score (or fuzzy score/confidence)
Ace MG | Ace Mortgage Inc | 0.435152
ACC | Ace Credit Card Inc | 0.0549411
ACB | Ace Credit Bank | 0.0149411
Ceasar Frozen Yogurt | Ceasar Frozen Yogurt LLC | 0.960521
8/12 Store | 8/12 Convienence Store | 0.566962
Teradata Corporation | Teradata Corp | 0.844375
Walmart Super M | Walmart Super Market | 0.890216
apple | Apple | 0.757017
Whole Foods Market | Wholefoods | 0.545401
San Mateo City | San Mateo City | 1.0

SQL, SQL/MR Workflow code to try in your Teradata Studio:

 

drop table if exists bizname_yellow_pages;
create table bizname_yellow_pages
(
 id bigint,
 merchant varchar
) distribute by hash(id);

 

insert into bizname_yellow_pages values (1,'Ace Mortgage Inc');
insert into bizname_yellow_pages values (2,'Ace Credit Card Inc');
insert into bizname_yellow_pages values (3,'Ace Bank Inc');
insert into bizname_yellow_pages values (4,'Ceasar Frozen Yogurt LLC');
insert into bizname_yellow_pages values (5,'8/12 Convienence Store');
insert into bizname_yellow_pages values (6,'Teradata Corp');
insert into bizname_yellow_pages values (7,'Walmart Super Market');
insert into bizname_yellow_pages values (8,'Apple');
insert into bizname_yellow_pages values (9,'Wholefoods');
insert into bizname_yellow_pages values (10,'San Mateo City');

 

drop table if exists operational_biznames;
create table operational_biznames
(
 id bigint,
 bizname varchar
) distribute by hash(id);

 

insert into operational_biznames values(1,'Ace MG');
insert into operational_biznames values(2,'ACC');
insert into operational_biznames values(3,'ACB');
insert into operational_biznames values(4,'Ceasar Frozen Yogurt');
insert into operational_biznames values(5,'8/12 Store');
insert into operational_biznames values(6,'Teradata Corporation');
insert into operational_biznames values(7,'Walmart Super M');
insert into operational_biznames values(8,'apple');
insert into operational_biznames values(9,'Whole Foods Market');
insert into operational_biznames values(10,'San Mateo City');

 

-- Break each standardized (yellow page) name into weighted 3-grams
drop table if exists qgram_yp;
create table qgram_yp distribute by hash(qgram)
as
select id, qgram, weight
   from qgram(
       on bizname_yellow_pages
       decay(0.95)
       gram(3)
);

-- Do the same for the operational (messy) merchant names
drop table if exists qgram_ops;
create table qgram_ops distribute by hash(qgram)
as
select id, qgram, weight
 from qgram(
     on operational_biznames
     decay(0.95)
     gram(3)
);

-- Find the closest dictionary name for each operational name using cosine similarity
drop table if exists cosine_similarity_match;
create table cosine_similarity_match distribute by hash(target_id)
as
select * from VectorDistance(
   on qgram_ops as target partition by id
   on qgram_yp as ref dimension
   targetIdColumns('id')
   targetFeatureColumn('qgram')
   targetValueColumn('weight')
   measure('Cosine')
   topK(1)
);

-- Join back to the source tables; cosine score = 1 - distance
select b.id id, b.bizname, c.merchant, 1- a.distance as cosine_score
 from
cosine_similarity_match a
   inner join operational_biznames b on (a.target_id = b.id)
   inner join bizname_yellow_pages c on (a.ref_id = c.id)
order by id
;

 

Previously in this series

In my last post I started exploring a concrete implementation of data science pipelines with Aster and R. We covered the programming environment, loading data into Aster, and the importance and ease of testing every step in the pipeline. In this post we'll focus on perhaps the most prevalent part of every workflow - data manipulation. To make the examples more translatable we will embrace a dual approach, illustrating each technique with two equivalent examples: one using TeradataAsterR and another using the dplyr package. Finally, in the spirit of embracing "test everything", each example will end by comparing the two results for equivalency.

 

The Grammar of Data Manipulation

dplyr's function design follows a clear separation of concerns based on the split-apply-combine process and the grammar of data manipulation:

  • filter: keep rows with matching conditions;
  • select: select, drop or rename attributes by name;
  • arrange: order rows by attributes;
  • mutate: add new attributes;
  • summarise: reduce multiple attribute values to a single value;
  • group_by: group data by one or more variables to apply a function.

Graph demonstrating Split-Apply-Combine.

 

TeradataAsterR functions are a mix of two approaches: classic R and the split-apply-combine process embraced by dplyr. So having examples for both will add clarity and understanding of how things get done in both cases.

 

But before jumping into the grammar of data manipulation functions, we begin with a complementary operation that combines data from two data sets into one.

 

Joining Tables

Aster is a relational data store (sorry if I never made this important point before), so its support for joining tables comes naturally with a fully SQL-compliant SELECT statement that includes:

  • Inner join
  • Left, right, and full outer joins
  • Semi- and anti-join (same as filtering joins in dplyr)
  • Cross join (Cartesian product)

In Aster R this translates into two functions that both perform joins on two virtual data frames and return a virtual data frame of query type:

  • ta.merge with R style merge parameters that will be translated to SQL
  • ta.join that offers SQL-like parameters including choice of join type

I suggest using the latter as it offers full functionality with better transparency and compatibility with actual operations taking place in Aster.

 

For our examples we need to combine the batting stats in Batting with player attributes from Master. This requires an inner join, and for additional clarity we demonstrate how to accomplish the same operation using three methods: R base function merge(), dplyr function inner_join(), and Aster R ta.join():

# R
big.df = merge(Batting, Master, by="playerID")

# dplyr
big.tbl = Batting %>%
  inner_join(Master, by="playerID")

# Aster R
big.ta = ta.join(batting.ta, master.ta, by="playerID")
ta.colnames(big.ta)[c('x.row_names','x.playerID')] = c('row_names','playerID')

Aster R required some post-processing to remove prefixes in the attribute names (ta.join() prefixes only attributes found in both tables). Other than that, all three look boringly similar (if you ignore the syntactic sugar of magrittr pipes; yes, I had the option of not using them, but the goal was to have each example as clear as possible).

As always, we test that all three methods produced the same results (notice that we run two comparisons, taking advantage of the transitive property of equivalency):

compareDataInRandAster(big.df, big.tbl, 
  key = c("playerID", "yearID", "stint"))

compareDataInRandAster(big.tbl, big.ta,
  key = c("playerID", "yearID", "stint"))

Note that playerID is both a primary key in Master and a foreign key in Batting, which was sufficient for the join. But the resulting data set inherits its key from Batting, which consists of three attributes: playerID, yearID, and stint.

 

We'll use the big.xxx data sets to illustrate the grammar of data manipulation functionality as part of the pipeline workflow (we will limit examples to dplyr and TeradataAsterR functions from now on).

 

Transform

Feature engineering is one of the top reasons new attributes get introduced in data science pipelines, and it is often a key ingredient of successful models. For this purpose the grammar of data manipulation contains a mutate function that adds new attributes or changes existing ones: dplyr calls it mutate(), while Aster R has ta.transform().

 

The following example creates two attributes:

  • key containing concatenated elements of composite table key that uniquely references each row,
  • age with player's age calculated using year of birth and current year:
# dplyr
big.prepped.1.tbl = big.tbl %>%
  mutate(key = paste(playerID, yearID, stint, sep="-"),
         age = yearID - birthYear)

# Aster R
big.prepped.1.ta = ta.transform(big.ta,
    key = paste(playerID, yearID, stint, sep="-"),
    age = yearID - birthYear)

Again, the examples contain strikingly similar code for both. Still, underneath, there are some differences to be aware of. For example, while mutate() can immediately use newly introduced variables inside the same call, ta.transform() can't.

 

As before (and after as well) we test results of both functions for equivalency:

compareDataInRandAster(big.prepped.1.tbl, big.prepped.1.ta,
    key = c("key"))

Note how the new attribute key, being a unique reference, was immediately put to work in the test.

 

Subset

Arguably the most common data manipulation operation is reducing a data set to a smaller one, which may take two forms:

  • selecting attributes (vertical subset);
  • selecting rows (horizontal subset).

While dplyr uses the functions select() and filter() respectively, Aster R combines conventional R indexing to select attributes with the function ta.subset() to select rows.

We continue our example by selecting the batting metrics identified by name and stored in the variable batting_metrics, while removing all rows before 1995 (yearID >= 1995) and rows with missing home run values (!is.na(HR), which Aster translates to the IS NOT NULL operator):

batting_metrics = c("G", "AB", "R", "H", "HR", "BB", "SO")

# dplyr
big.prepped.2.tbl = big.prepped.1.tbl %>%
  select_(quote(c(key, playerID, yearID, stint, age)),
          .dots=batting_metrics) %>%
  filter(!is.na(HR) & yearID >= 1995)

# Aster R
big.prepped.2.ta = ta.subset(big.prepped.1.ta[,c("key","playerID","yearID","stint","age",
    batting_metrics, "row_names")],
    !is.na(HR) & yearID >= 1995)

One distinct feature of dplyr is its use of non-standard evaluation (NSE; see vignette("nse") for details), which is great for translating R to SQL and for interactive use, but is hard to program with. For that reason dplyr offers SE versions of its functions, designated with the suffix _. With Aster R we use NSE inside ta.subset() but not with attribute selection, which uses data frame syntax.

 

And our standard test comparing results in dplyr and Aster R completes this example:

compareDataInRandAster(big.prepped.2.tbl, big.prepped.2.ta, key = "key")

Summarization

A data manipulation operation that reduces multiple attribute values to a single value is usually referred to as summarization. With Aster R we have at least three approaches to data summarization:

  • aggregates using SQL GROUP BY, e.g. MAX, MIN, AVG, etc. with ta.summarise() or ta.summarise.each(),
  • SQL window functions available with ta.transform(),
  • and in-database R using ta.tapply() or ta.by() that offers the most flexible  option of pushing arbitrary R code to run in parallel over partitioned data inside the Aster database.

Note that summarizing doesn't necessarily imply collapsing the data, as reduced values can be attached to the original rows (the case with window functions). What summarizing always includes is dividing the operation into two steps:

  1. a grouping of data based on the values of certain attribute(s) (GROUP BY clause in SQL or INDEX argument in tapply() and ta.tapply())
  2. a computation step that computes or reduces data with aggregation, window or arbitrary R functions within each group.

In fact, dplyr embraces this concept fully by always invoking a group_by() function first, followed by an appropriate value-reduction function.

 

Using SQL GROUP BY

The SQL GROUP BY clause should always come to mind first. Aster R implements it using ta.summarise(), which encompasses both grouping and computation, while dplyr, as mentioned before, invokes these steps explicitly by composing group_by() and summarise():

# dplyr
summary.1.tbl = big.prepped.2.tbl %>%
     group_by(playerID, yearID, age) %>%
     summarise_each_(funs(sum), batting_metrics)

# Aster R
summary.1.ta = ta.summarise(big.prepped.2.ta,
     group.by = c("playerID", "yearID", "age"),
     G=sum(G), AB=sum(AB), R=sum(R), H=sum(H), HR=sum(HR),
     BB=sum(BB), SO=sum(SO)) 

compareDataInRandAster(summary.1.tbl, summary.1.ta,
     key = c("playerID","yearID"))

In this example we consolidated each player's records within a single year into one record for simplicity. While it's not common for players to change teams within the same season, it does happen, and Lahman designates a record for each player's team stint, which for our purpose is unnecessary detail. So aggregates were used to consolidate stats to one record per player per year by summing the player's stats from all teams he played for that season. Again, with dplyr we used the SE version summarise_each_() to take advantage of the existing list of metrics in batting_metrics. With Aster R we included all aggregates explicitly in ta.summarise().
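Since ta.summarise() is ultimately expressed as a SQL GROUP BY in Aster, the equivalent plain SQL would look roughly like the sketch below (the table name big_prepped_2 is hypothetical and stands in for the virtual data frame):

-- Rough SQL equivalent of the consolidation above (one row per player per year)
select playerID, yearID, age,
       sum(G) as G, sum(AB) as AB, sum(R) as R, sum(H) as H,
       sum(HR) as HR, sum(BB) as BB, sum(SO) as SO
from big_prepped_2
group by playerID, yearID, age;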

 

Using Window Functions

We could devote a whole blog post (or a series, or a book, easily) to window functions. But let me instead quote three sources - one from Aster, one from PostgreSQL, and lastly the dplyr vignette:

Window functions allow the calculation of aggregates and other values on a per-row basis as opposed to a per-set or per-group basis... For example, a window function may be used to compute a running sum, a moving average, a delta between values in neighboring rows, or to apply ranking and row numbering to a table.

- Teradata Aster Database User Guide

A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities. Behind the scenes, the window function is able to access more than just the current row of the query result.

- PostgreSQL 9.1 Manual

A window function is a variation on an aggregation function. Where an aggregation function, like sum() and mean(), takes n inputs and return a single value, a window function returns n values. The output of a window function depends on all its input values, so window functions don’t include functions that work element-wise, like + or round(). Window functions include variations on aggregate functions, like cumsum() and cummean(), functions for ranking and ordering, like rank(), and functions for taking offsets, like lead() and lag().

- Window functions vignette

Besides its per-row application, another important feature of window functions is support for ordering within groups, which allows for enumerating rows and computing lags, cumulative values, and other order-dependent functions. Thus, the syntax supports both grouping and ordering options, depending on the specific window function.
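For readers who think in SQL, a rough sketch of the same idea with window functions on Aster is shown below; the table name summary_1 is hypothetical and mirrors the summarised data from the previous step.

-- Rough SQL sketch of the per-player window calculations used in the R examples
select playerID, yearID, age, HR,
       row_number() over (partition by playerID order by yearID) as currentSeason,
       max(HR)      over (partition by playerID order by yearID
                          rows between unbounded preceding and current row) as currentTopHR,
       max(HR)      over (partition by playerID) as topHR,
       min(yearID)  over (partition by playerID) as startCareerYear,
       max(yearID)  over (partition by playerID) as endCareerYear
from summary_1;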

For our example we will compute the following values:

  • the current season number for each player with row_number();
  • the current career-high HR value (up to the current year) for each player with cummax();
  • the career-high HR for each player with max() (we could compute the same with summarise(), but not without collapsing each group into a single row, whereas a window function preserves every row);
  • the player's career start year with min();
  • the player's career end year with max().
# dplyr
summary.2.tbl = summary.1.tbl %>% group_by(playerID) %>%
  mutate(seasonsTotal=n(),
    currentSeason=with_order(order_by = yearID, fun = row_number, x = yearID),
    currentTopHR=with_order(order_by = yearID, fun = cummax, x = HR),
    topHR=max(HR), startCareerYear = min(yearID), endCareerYear = max(yearID))

# Aster R
summary.2.ta = ta.transform(summary.1.ta,
  seasonsTotal=n(partition = playerID),
  currentSeason=row_number(order = yearID, partition = playerID),
  currentTopHR=cummax(HR, order = yearID, partition = playerID),
  topHR=max(HR, partition = playerID),
  startCareerYear = min(yearID, partition=playerID),
  endCareerYear = max(yearID, partition=playerID))

compareDataInRandAster(summary.2.tbl, summary.2.ta,
  key = c("playerID", "yearID"))

Again, let's go over the differences and similarities between the two implementations. As noted before, dplyr always divides the operation into a composition of group_by() and mutate(), where the latter invokes window and aggregate functions that inherit the grouping and retain each row. Aster R uses ta.transform() and applies window and aggregate functions inside the call.

 

Using In-Database R

At last we arrive at the most powerful option available with Aster R: in-database R execution. Looking at the previous examples, one may think that R programming played a limited role, as there was not much R code. Indeed, our Aster R constructs closely resembled SQL or the grammar of data manipulation, with both SQL GROUP BY and window functions inside the grouping and aggregation (data reduction) steps. While the same constructs preside over the in-database R option, the logic transitions to an R-centric paradigm based on the R tapply() function and R's functional features.

 

With dplyr not much changes, as its functionality naturally supports functional R and doesn't leave the client R environment (at least without involving extensions). On the other hand, Aster R provides the function ta.tapply(), which does two important things:

  1. it expands the reach of tapply() onto Aster tables by enabling parallel and distributed execution within the Aster database;
  2. and it transports R code for execution from the Aster R client to the Aster R sandbox environments that are configured and available with a proper installation on the Aster database.

The availability and version of R on the Aster database can easily be checked with the commands:

> ta.is.R.installed()
[1] TRUE
> ta.R.Version()

Please consult the Teradata Aster R documentation for more details on how to configure and test the in-database R environment.

 

For our example we want to answer the following question: for each player, find his age when he had his career HR season (i.e. the season when the player hit the most HRs in his career). For simplicity we'll use the topHR attribute from the previous example when calculating the player's age in that season:

# dplyr
summary.3.tbl = summary.2.tbl %>%
  group_by(playerID) %>%
  summarize(
    seasons_total = seasonsTotal[[1]],
    max_hr = topHR[[1]],
    top_hr_age = age[HR == max_hr][[1]])

# Aster R
summary.3.ta = ta.tapply(summary.2.ta, INDEX=summary.2.ta$playerID,
  FUN=function(x){c(x$seasonsTotal[[1]], # seasons_total
                    x$topHR[[1]],        # max_hr
                    x$age[x$HR == x$topHR[[1]]][[1]])}, # top_hr_age
  out.tadf = list(columns=c("playerID", "seasons_total",
                            "max_hr", "top_hr_age")))

compareDataInRandAster(summary.3.tbl, summary.3.ta, key = "playerID")

Because the Aster R version does things almost letter for letter like dplyr, it is advisable to review the latter first. Our goal is to calculate per-player stats, so summarization will reduce the number of rows for each player to one; hence, we use the group_by() with summarize() construct. Next, seasons_total and max_hr have the same values within each player's group, so we always pick the very first value of each. Things become R-stylish in the expression that assigns top_hr_age: given the vector age, we pick the element whose corresponding HR value equals the career maximum (i.e. equals max_hr). With Aster R we repeat the same logic but within a tapply-style function call: each player's rows are designated by the variable x in the anonymous function assigned to the argument FUN. Lastly, the argument out.tadf controls how Aster saves the results in the database: in this case we explicitly rename the columns using the columns argument.

 

What really happened when we called ta.tapply() is radically different from all previous examples: the anonymous function assigned to FUN was serialized and sent over the network to the Aster worker nodes running R sandboxes, where it was deserialized and loaded for execution. After that (you may think that was just the prep), Aster read the specification of how to run it from two arguments: the virtual data frame (the first argument) referring to the underlying table, and INDEX identifying the column used to partition the data. At this point Aster has all the ingredients to execute the anonymous R function in-database, on distributed data, and in parallel. It saves the results into a new table and returns control to the Aster R client that was waiting on ta.tapply(), which then returns a virtual data frame with the results. Lastly, see compareDataInRandAster(), which tests that dplyr and Aster R produced the same results.

 

Note that we could pass any R function (including references to 3rd-party packages, provided they are installed on Aster as well) with FUN. This method elevates Aster R to both a framework and a platform for executing R code in parallel across distributed data, in a manner consistent with the R tapply() function. With proper configuration and algorithm design, running parallel R should become a standard part of the daily data scientist routine with Aster.

 

In the next post we shift gears to show other elements of the data science workflow, such as exploratory analysis, principal component analysis (PCA), and predictive modeling.

 

Image sources:

Red Hat Developer Program

The Programming Historian

Originally published at Path + Text = Peanut Butter + Chocolate.

 

The text that your customers share when they chat online, post on social media, or call your call centers provides important context about their experiences as they traverse their so-called customer journeys.

 

Historically though, path (or journey) analysis and text analysis have been distinct domains. Valuable, yes, but difficult or impractical to do in concert. There has been a dearth of tools that could enable you to investigate text in the context of the journey or path from which it came.

 

That changes today. Yes, you can analyze text and paths together. It’s not only practical and powerful – it’s extremely easy!

 

The screenshots and copy below walk you through the new text functionality built into our Path Analysis Guided Analytics Interface.

 

Highlight a path to view text records

First, you click to “Highlight a Path” and then “View Text Records.”

 

Text on a path

In the Text tab, events or nodes that contain text are filled in orange in the path at the top of the page. The page also contains a table that displays customer IDs, time stamps, and the text of the record.

 

Filter by account

You can filter the results by ID or term, and move between events that contain text.

 

Filter by term

If you filter by term, the term is highlighted in the resulting records, which should help you get through your analysis a bit quicker.

 

Top tokens

You may want to check out the “Top Tokens” tab to identify some keywords for filtering.

 

Hopefully the images and workflow above show how easy and powerful this contextual text analysis can be. As a user, you will feel like these text capabilities are seamlessly integrated into your path analytics with the guided analytics interface.

 

If you’d like to learn more or see a demo, please feel free to send a note to ryan.garrett@thinkbiganalytics.com.