
This post outlines an approach to automatically recreate non-persistent objects in an Aster Execution Engine (AX) environment on startup.

 

Background

The Aster Execution Engine (also known as Aster-on-Hadoop) exists as a collection of YARN-managed compute services without any persistent user data - all user data is held in temporary or analytic tables only as long as the AX instance is running, and is only persisted by storing it back into the underlying Hadoop instance or onto some other system (e.g., load_to_teradata()).

 

In addition to user data, there are a number of database objects that do NOT persist between restarts of an AX instance. Per the 7.00.00.01 User Guide:

Persistent Objects:
  • users
  • roles
  • databases
  • schemas
  • foreign server definitions
  • packaged analytics models and functions
  • granted privileges on persistable objects

Non-Persistent Objects:
  • tables
  • views
  • constraints
  • indexes
  • R scripts installed on the server side
  • user-installed files and SQL/MR functions
  • user scripts for vacuum or daily jobs

 

To make life easier for myself, my idea is to detect when the AX instance has been restarted, and to recreate the objects that I want to be there every time. These might be views into my data on the HDFS, custom functions and anything else that I use regularly.

 

An Example Approach

To make this happen, you need 2 things:

  1. A way to detect that the AX instance has restarted
  2. A way to create the objects

 

I believe that "Indecision is the Basis of Flexibility", so the pieces I created are full of user-customizable components.

  • To detect that the instance has restarted, AX provides a token that changes every time the instance restarts. To read this token, use the query:
         select * from nc_system.nc_instance_token;
  • Keep the token in a file, and when it changes, you've got a restart; in which case, you simply run a bunch of stuff that you can mix-and-match as needed

 

A sample script that does this:

 

#! /bin/sh
UPSTARTDIR=/root/upstart                 # the directory of this script
TOKENFILE=${UPSTARTDIR}/.currtoken       # where I store a copy of the instance token
SCANINT=10                               # how often to scan for a change (in seconds)
UPSTARTPROCDIR=${UPSTARTDIR}/upstart.d   # a dir of scripts to run when the instance restarts

source /home/beehive/config/asterenv.sh

while true ; do
     OLDTOKEN="$(cat $TOKENFILE 2>/dev/null)"

     TOKEN="$(act -U beehive -w password -q -A -t -c "/* NOAMC */ select * from nc_system.nc_instance_token" 2>/dev/null)"

     # unable to get new token - aster instance is not available
     [ "${TOKEN}" ] || {
          echo "### UNABLE TO GET TOKEN FROM ASTER SERVICE INSTANCE. WAITING..."
          sleep ${SCANINT}
          continue
     }

     # tokens have not changed - no restart detected
     [ "${TOKEN}" != "${OLDTOKEN}" ] || {
          sleep ${SCANINT}
          continue
     }

     echo "=== TOKENS HAVE CHANGED. ASTER SERVICE INSTANCE HAS RESTARTED"

     echo ">>> running executables from ${UPSTARTPROCDIR}"
     for F in $(ls ${UPSTARTPROCDIR} 2>/dev/null) ; do
          [ -x ${UPSTARTPROCDIR}/${F} ] || {
               echo "--- ${F} is not executable. skipping..."
               continue
          }
          echo ">>> Running startup file ${F}"
          ${UPSTARTPROCDIR}/${F} 2>&1 | sed -e 's/^/ | /'
     done

     echo "=== STORING CURRENT TOKEN TO ${TOKENFILE}"
     echo "${TOKEN}" > ${TOKENFILE}

     sleep ${SCANINT}
done

Notes:

  • The script could be run in the system's inittab, as a startup process, or in any way that seems fitting. It does NOT need to run as root, but things you want it to do may need certain privileges.
  • You may want to make sure that there is only one copy of this script running.
  • The phrase '/* NOAMC */' prevents this query from showing up in the Aster Management Console process history. Use this with discretion.

When the script detects that the AX instance has restarted, it will look in the directory ${UPSTARTPROCDIR} and simply run any file there that is executable, in 'ls' order. I made it modular, so I can mix-and-match components, and made it run only the executables so I can turn off pieces just by toggling the execute bit.

These scripts may do anything you like on restart, using act, ncli or even other things in the environment. For example, I might want to maintain a set of views into Hadoop data sets; given a script that creates these views (create_sqlh_views.sql), a setup script 10-create_demo.sh might look like this: 

#!/bin/sh
. /home/beehive/config/asterenv.sh
act -U db_superuser -w db_superuser -d playbook \
-f /asterfs/poc_playbook/poc_scripts_kb/sqlh/create_sqlh_views.sql

 

There are plenty of ways to achieve this result. I hope this one proves helpful.

Some time ago I came across "Spurious Correlations", an interesting book by Tyler Vigen. This book is full of unusual and nonsensical examples where a strong (Pearson) correlation is showcased between unexpected variable combinations.

 

Placing blind trust in the correlation between Microsoft revenue and political action committees will tell you that Bill Gates has been controlling Congress for several years. Who knew that the number of UK citizens immigrating to the US correlates 87% with US uranium exports? The marriage rate in Wyoming apparently has a 97.6% correlation with the number of domestically produced passenger cars sold in the US. Plenty of other absurd examples can be created by performing automated comparisons between unrelated data sets. (Tip: Aster can do this with a single SQL statement)

 

Many papers have been written about the Theory of the Stork, supported by various data sets that show significant correlation between baby births and the size of the stork population in a specific area such as Northern Europe. 

 


 

The storks typically fly south for the winter and return north in early spring. Babies born in March-April were typically conceived in June of the previous year.  Midsummer celebrations ("solstice") have a different format depending on culture and location. A common thread between all of them is a focus on fertility, family and a new beginning. This explains why so many weddings are scheduled in June and why there are so many kid birthday parties to attend in spring/early summer!    In this historic example the weather acts as a hidden variable and results in a non-causal correlation.

 

So what happened to the story behind our headline?  First we build a database table based on the number of movies that Bruce Willis starred in according to imdb.com and include boiler-related fatal accident data available on wonder.cdc.gov.
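As a sketch of what that table could look like in Aster's Postgres-derived SQL, with the caveat that the schema, table, and column names below are illustrative assumptions rather than the actual objects used in this post:

-- Illustrative sketch only: names are assumptions.
-- One row per year, pairing the two series described above.
CREATE TABLE demo.willis_boilers (
    yearid        INTEGER,   -- calendar year
    willis_movies INTEGER,   -- films Bruce Willis appeared in that year (imdb.com)
    boiler_deaths INTEGER    -- boiler-related fatal accidents that year (wonder.cdc.gov)
)
DISTRIBUTE BY REPLICATION;
-- Rows would then be loaded from the two sources (for example with ncluster_loader).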

 

Next we check the correlation between our two variables:
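Using the table sketched above (names are assumptions), the standard SQL corr() aggregate gives the Pearson correlation:

SELECT corr(willis_movies, boiler_deaths) AS pearson_r
FROM demo.willis_boilers;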

 

Correlation is symmetric (A is correlated with B and B is correlated with A).  Causality is much more interesting and useful (A causes B and B does not cause A).  

 

We can perform a statistical test that was developed by George Sugihara of the Scripps Institution of Oceanography to review causality. Convergent Cross Mapping (CCM) tests the cause and effect relationship between two time series variables. 

 

Takens' theorem is the delay embedding theorem by Floris Takens. In the study of dynamic systems, a delay embedding theorem specifies how a chaotic dynamical system can be reconstructed from a sequence of observations.

 

CCM leverages this approach to determine causality between two time series. If variable C is the cause of variable E then information in the time series C is also available in time series E.   Historical observations from time series E can be used to estimate the state of time series C.

 

Why is the algorithm called Convergent Cross Mapping?

  • CCM uses the concept of cross mapping: the process of using the historical record from one series to predict variable states in another series. 
  • CCM uses the property of convergence. The first step of the algorithm is to choose a library of short time series from the effect variable. If the cross-mapped estimates become more accurate as the library size (for example, 3 or 10 observations) increases, that convergence points to a real-world causal relationship.

 

Basic Steps:

  1. A library of short time series is constructed from time series E. This is called a "shadow manifold".
  2. The library is used to predict values of the cause variable using a k-nearest neighbors approach.
  3. The correlation between the predictions of the cause time series and the actual values in time series C is computed.
  4. The size of the library is increased to check convergence and determine if there is a causal relationship.

 

We can use the Aster CCM function to determine the optimal value for "EmbeddingDimensions", the number of lags or past values that we will use to predict a given value in the time series. 

 

Note there are a number of requirements to allow the function to correctly determine the optimal lags:

  1. the cause and effect columns must specify the same column
  2. the SelfPredict argument is set to true
  3. the LibrarySize argument is not specified
  4. only a single cause and single effect column is allowed

 

We can specify one or more EmbeddingDimensions. If we omit the parameter the function will default to using two lags.
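As a hedged sketch of this lag search: the cause/effect columns, SelfPredict, LibrarySize, and EmbeddingDimensions parameters are described in this post, but the exact argument names and the input table below are assumptions that should be checked against the Aster Analytics function reference.

-- Hedged sketch: find the optimal embedding dimension (number of lags).
-- Cause and effect reference the same column, SelfPredict is true,
-- and LibrarySize is omitted, per the requirements listed above.
SELECT * FROM CCM (
    ON (SELECT 1) PARTITION BY 1
    InputTable ('demo.willis_boilers_ts')   -- assumption: time series input table
    SequenceIdColumn ('series_id')          -- assumption: exact argument name
    TimeColumn ('yearid')                   -- assumption: exact argument name
    CauseColumns ('willis_movies')
    EffectColumns ('willis_movies')
    SelfPredict ('true')
    EmbeddingDimensions ('2', '3', '4', '5')
);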

 

 

Result: the optimum number of lags is 2.

 

Now that we know the proper value for EmbeddingDimensions we can execute our CCM function against our input data. It is not required to specify the LibrarySize parameter. By default the function will try libraries of  size  "embedding dimension + 1" and "100" (assuming we have that many observations).
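Under the same naming assumptions as the sketch above, the causality run itself might look roughly like this, with LibrarySize left out so the defaults just described apply:

-- Hedged sketch: score the cause/effect pair with 2 lags.
SELECT * FROM CCM (
    ON (SELECT 1) PARTITION BY 1
    InputTable ('demo.willis_boilers_ts')   -- assumption
    SequenceIdColumn ('series_id')          -- assumption
    TimeColumn ('yearid')                   -- assumption
    CauseColumns ('willis_movies')
    EffectColumns ('boiler_deaths')
    EmbeddingDimensions ('2')
    -- LibrarySize omitted: defaults to "embedding dimension + 1" and 100
);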

 

Result:

 

In the output the two columns of interest are:

  • correlation: correlation between the values predicted by the effect attribute and the actual value of the cause attribute. 
  • effect_size: estimated effect size of increasing library value from smallest value to the largest value. An effect_size greater than approximately 0.25 indicates a causal relationship.

 

Conclusion:

The effect size is greater than 0.25 for both cause-and-effect directions.  This indicates that there is causality between our two time series. The number of Bruce Willis movies is a more important cause of exploding boilers than the reverse direction, due to the greater effect size (0.59 > 0.27).

 

Of course we have to keep in mind that intuition has to play a big role and we cannot blindly trust statistics. Our starting point was a spurious correlation!

 

 

Now that we have gone through a simple exercise to get a feel for the CCM function let us review a second data set that contains monthly sales of bathing suits and the average high temperatures for the Texas region.

 

First we check the correlation between sales and temperature.  

 

Next we determine the ideal number of EmbeddingDimensions:

 

And we execute CCM with 3 EmbeddingDimensions:

 

 

We find out that sales does not cause higher temperatures: the effect size is only 0.011.

The temperature does cause higher sales of bathing suits: the effect size is 0.39, which indicates we have convergence and causality. 

 

Try it out for yourself.  I always wondered about correlation and causality of the cost of bananas and the revenue generated by ski areas in the USA. 

 

 

 

 

 

 

References:

 

Tyler Vigen, Spurious Correlations.

George Sugihara et al., "Detecting Causality in Complex Ecosystems," Science 338, 496 (2012). DOI: 10.1126/science.1227079.

Amir E. BozorgMagham and Shane D. Ross, "Dynamical System Tools and Causality Analysis," Engineering Science and Mechanics (ESM), Virginia Tech.

We know that understanding the customer journey is important, but how can we be sure we are telling that story to the best of our ability? It could be impossible to completely understand everything that affects your customer. BUT! Therein lies nearly unlimited ability to improve how we model the customer experience. Free-form text in customer surveys and reviews, for instance, can be tricky to analyze. Even so, it represents one of the clearest pictures of the customer's perspective attainable. In this example, we will be using customer survey data, topic modeling, and sentiment analysis to create a contextual customer touchpoint.

 

--Create a view over the surveys in Hadoop, then a compressed local table. Analyze.
CREATE VIEW tb.survey_raw
AS
SELECT * FROM load_from_hcatalog (
     ON public.mr_driver
     SERVER ('myserv.teradata.com')
     USERNAME ('hive')
     DBNAME ('default')
     TABLENAME ('web_surveys')
);

DROP TABLE IF EXISTS tb.survey;
CREATE TABLE tb.survey
DISTRIBUTE BY HASH(survey_id)
COMPRESS LOW
AS
SELECT survey_id
     , user_id
     , nps
     , survey_text
     , survey_timestamp
     FROM tb.survey_raw
;
ANALYZE tb.survey;


--Tokenize surveys with Text Parser
DROP TABLE IF EXISTS tb.survey_tp;
CREATE TABLE tb.survey_tp
DISTRIBUTE BY HASH (survey_id)
COMPRESS LOW
AS
SELECT * FROM Text_Parser (
ON tb.survey
Text_Column ('survey_text')
Case_Insensitive ('true')
Stemming ('true')
Total ('true')
Punctuation ('[\\\[.,?\!:;~()\\\]]+')
Remove_Stop_Words ('true')
Accumulate ('survey_id')
List_Positions ('true')
)
;
ANALYZE tb.survey_tp;


--Check top words
--Words that occur too frequently may not be useful in our analysis
--It is usually a good practice to remove these
--This will depend on your own use case and data
SELECT token, SUM(frequency)
FROM tb.survey_tp
GROUP BY 1
ORDER BY 2 DESC
LIMIT 200;


--Check lowest word counts
--Words that only occur very few times should also be considered for removal
SELECT token, SUM(frequency)
FROM tb.survey_tp
GROUP BY 1
HAVING SUM(frequency) <= 2500
ORDER BY 2 ASC
LIMIT 2000;


--Removing tokens with less than 2500 appearances
--or those in the top 19 appearances.
--Before we reach this point, stop words should
--already be removed.
DELETE FROM tb.survey_tp
WHERE token IN(
     SELECT token
     FROM(
          SELECT token, SUM(frequency)
          FROM tb.survey_tp
          GROUP BY 1
          HAVING SUM(frequency) <= 2500) a
     )
OR token IN(
     SELECT token
     FROM(
          SELECT token, SUM(frequency)
          FROM tb.survey_tp
          GROUP BY 1
          ORDER BY 2 DESC
          LIMIT 19) a
     )
;
VACUUM tb.survey_tp;
ANALYZE tb.survey_tp;


--Create LDA Model
--Training converged after 37 iterate steps with delta 9.028843880595204E-5
--There are 1274580 documents with 23101982 words in the training set, the perplexity is 317.822466
--Elapsed Time: 03:52.518
DROP TABLE IF EXISTS tb.survey_lda_model;
SELECT * FROM LDATrainer (
ON (SELECT 1) PARTITION BY 1
InputTable ('tb.survey_tp')
ModelTable ('tb.survey_lda_model')
TopicNumber ('20')
DocIDColumn ('survey_id')
WordColumn ('token')
CountColumn ('frequency')
Seed (3)
);
ANALYZE tb.survey_lda_model;

--Show model summary
--The model table is trained with the parameters: topicNumber:20, vocabularySize:1155, alpha:0.100000, eta:0.100000
--There are 1274580 documents with 23101982 words in the training set, the perplexity is 317.822466
SELECT * FROM LDATopicPrinter (
ON tb.survey_lda_model PARTITION by 1
ShowSummary ('true')
);

--Show model detail
--Create table of highly weighted words
DROP TABLE IF EXISTS tb.survey_lda_topics;
CREATE TABLE tb.survey_lda_topics
DISTRIBUTE BY REPLICATION
COMPRESS LOW
AS
SELECT * FROM LDATopicPrinter (
ON tb.survey_lda_model PARTITION by 1
ShowWordWeight ('true')
ShowWordCount ('true')
)
WHERE wordweight > .025
ORDER BY topicid, wordweight DESC;


--Test Model
--There are 424487 valid documents with 7712740 recognized words in the input, the perplexity is 318.322719
--Outputtable "oap"."tb_lda_inf" is created successfully.
DROP TABLE IF EXISTS tb.survey_lda_inf;
SELECT * FROM LDAInference (
ON (SELECT 1) PARTITION by 1
InputTable ('tb.survey_tp')
ModelTable ('tb.survey_lda_model')
OutputTable ('tb.survey_lda_inf')
DocIDColumn ('survey_id')
WordColumn ('token')
CountColumn ('frequency')
);
ANALYZE tb.survey_lda_inf;

--Find and explore strong topic matches
SELECT * FROM tb.survey_lda_inf LIMIT 500;

--Accumulate topic words for easy analysis
DROP TABLE IF EXISTS tb.survey_lda_acc;
CREATE TABLE tb.survey_lda_acc
DISTRIBUTE BY REPLICATION
COMPRESS LOW
AS
SELECT * FROM NPATH(
     ON tb.survey_lda_topics
     PARTITION BY topicid
     MODE (NONOVERLAPPING)
     PATTERN ('T*')
     SYMBOLS (
          TRUE AS T
          )
     RESULT (
          FIRST (topicid OF T) AS topicid,
          ACCUMULATE (word OF ANY(T)) AS words
     )
);

--Sometimes even if you include a customer satisfaction score
--in your survey, the actual customer text may tell a
--different story!
--We can join sentiment scores to LDA topic scores
--to infer positive and negative elements about
--a customer's experience.
--We categorize our documents here using the highest
--ranking LDA topic score.
--Add sentiment and topic columns
DROP TABLE IF EXISTS tb.survey_se;
CREATE TABLE tb.survey_se
DISTRIBUTE BY HASH(survey_id)
COMPRESS LOW
AS
SELECT se.*, lda.topicid, lda.topicweight FROM ExtractSentiment (
ON tb.survey
Text_Column ('survey_text')
Accumulate ('survey_id', 'user_id', 'nps', 'survey_timestamp')
Level ('DOCUMENT')
) se
, (SELECT docid, topicid, topicweight
     , RANK() OVER (PARTITION BY docid ORDER BY topicweight DESC)
     topicrank
     FROM tb.survey_lda_inf) lda
WHERE se.survey_id = lda.docid
AND lda.topicrank = 1;
ANALYZE tb.survey_se;

SELECT * FROM tb.survey_se LIMIT 500;

--What relationship does text sentiment have with NPS?
DROP TABLE IF EXISTS oap.tb_nps_se;
CREATE TABLE oap.tb_nps_se
DISTRIBUTE BY HASH(survey_id)
COMPRESS LOW
AS
SELECT survey_id
     , nps
     , out_polarity
     , CASE
          WHEN nps > 0
          AND out_polarity = 'POS'
          THEN 1
          ELSE 0
          END both_pos
     , CASE
          WHEN nps < 0
          AND out_polarity = 'NEG'
          THEN 1
          ELSE 0
          END both_neg
     , CASE
          WHEN nps > 0
          AND out_polarity = 'NEG'
          THEN 1
          WHEN nps < 0
          AND out_polarity = 'POS'
          THEN 1
          ELSE 0
          END conflict
FROM tb.survey_se;
ANALYZE oap.tb_nps_se;

SELECT * FROM oap.tb_nps_se LIMIT 200;

--How often do the metrics agree? Not taking into account neutral measures.
SELECT pos_total, (100.0 * pos_total / cnt)::DECIMAL(4,1) pos_pct,
     neg_total, (100.0 * neg_total / cnt)::DECIMAL(4,1) neg_pct,
     con_total, (100.0 * con_total / cnt)::DECIMAL(4,1) con_pct,
     cnt
FROM
(SELECT SUM(both_pos) pos_total, SUM(both_neg) neg_total, SUM(conflict) con_total, COUNT(*) cnt
FROM oap.tb_nps_se) a;

--Does the inferred topic of a document imply a change in sentiment?
SELECT a.*,b.words
FROM
(SELECT topicid, AVG(nps), AVG(out_strength)
FROM tb.survey_se
GROUP BY 1
ORDER BY 1) a,
tb.survey_lda_acc b
WHERE a.topicid = b.topicid;


--Finally, we can create a table of contextual touchpoints.
--These records can easily be used with nPath logic
--to craft a complete customer experience.
DROP TABLE IF EXISTS tb.survey_events;
CREATE TABLE tb.survey_events
DISTRIBUTE BY HASH(user_id)
COMPRESS LOW
AS
SELECT survey_id
     , user_id
     , topicid
     , survey_text
     , out_polarity sen_polarity
     , survey_timestamp
FROM tb.survey_se
;

Very Good Bad Model

Posted by mt186048 Apr 17, 2017

I hate to say this, but we build models and discover business insights off of faulty data. We like to think that the data we work with is pristine, flawless, perfect, but often it's just plain dirty. The exceptions to this rule are often similar to the Iris Data Set: complete and probably perfect, but minuscule in comparison to the size of modern-day data spaces. 

 

There are three main things that make data dirty:

  1. Missing Values 
  2. Incorrect Domain Values
  3. False Data

 

We can often identify the first two during data exploration. Nulls and instances like a work tenure of 2107 years jump out fairly easily. Sometimes we're able to update these with a ground truth from elsewhere in the data space. Other times we do data cleansing by removing rows or replacing values as appropriate. 

 

It is much harder to identify or fix the third type of dirty data: data that is non-observably incorrect. This may be fine in the day-to-day life of our data, but it is especially bad when we think about supervised learning problems. 

 

Imagine that we poll 100 people, ask them "Do you think you're happier than the average person?" and get the below distribution:

 

We have 71 hypothetical people who see themselves as more than averagely happy and 29 who see themselves as less than averagely happy. But how many of those 71 were actually just uncomfortable saying they were unhappy in a survey? And are any of our less-than-averagely-happy people simply humble and uncomfortable saying they were better than average in a survey?

 

                          What They Felt
                          Yes       No        Total
What They Said   Yes       ?         ?          71
                 No        ?         ?          29
                 Total     ?         ?

 

We could build a model on this data set that is perfect in prediction and has a huge F-Score. But depending on how deceptive our poll-takers were feeling we may end up with a model that is bad at actually predicting happiness. Perhaps too existential of an example. But in our businesses do we want to predict the real answers or the answers skewed by false data?

 

This gives us something to think about in all of our predictive models. When we have a model with recall of 70 percent, is it that we are missing 30% of the category or that 30% of the category isn't actually that category? I hypothesize that it's likely somewhere in between.

 

So what can we do?

  1. Think about how much you trust your training data set. Do you think most of the values are right? Was anything done by hand when assigning those values? Are there assumptions being made to assign the predictive variable?
  2. Cluster! Without including what you are trying to predict, cluster your data. Then look at how many of each predictive value end up in each cluster. Have a cluster with 90% churn customers and 10% non-churn? Those non-churners are probably worth looking into.
  3. Look at what is going wrong. Digging into instances where your model and the ground truth disagree is a great way to improve the model, improve the ground truth, and discover new business insights.

Data Science is a pretty overloaded term. We have many platforms, languages, and algorithms, and as practitioners we all have our favorite tools. In this blog post, I will explore two of the main tenets behind every data science exercise. Sometimes they are intertwined, but it is super important to call them out for clarity.

The two popular goals are:

  • Discovery & Finding Insights
  • Modeling & Prediction (dev or production/operationalization)

with analytics powering both. Every single project we do is one, the other, or a mix of the two. In some cases we do mostly discovery and almost no modeling/prediction; sometimes modeling/prediction is done with known, already-discovered elements; sometimes we go back and forth! Modeling/prediction can happen both in dev and in production.

Isn't Discovery part of the Modeling/Prediction exercise?

Yes and no. This is the confusing part for many folks. Pure discovery, which deserves its own category, is stumbling into something unexpected whether or not you are looking for a particular outcome. Great examples of discovery are penicillin, X-rays, finding America :) and so on. The keen eye of the scientist or explorer stumbles onto something that is inexplicable outside a belief cloud.

Hmmm, that's odd. It shouldn't be there ...

Most CSI shows capture the essence of discovery in an investigation and hence have a captive audience! Discovery & Finding is what makes life so interesting. In the Data Science world, it is usually about finding the early insights to a business problem.

Tools for Discovery & Finding Insights

We need tools for discovery to parse new things. Without an optical or a radio telescope, you cannot find planets. Even with the telescope, only a prepared mind would spot something that most people miss. However, a telescope is still necessary :)

Modeling and Prediction

After a scientist discovers one or more interesting outcomes, we want to take it to the next level, which is to find more in an automated way! Modeling and Prediction allow us to do that. We know what we are looking for, so let's create a system to learn from that and find more of it. Create the conditions under which we can repeatedly find things, generalize, and find more. Operationalize by automation ...

Some examples of Insights, augmented by Modeling and Prediction:

  • We use an Insight finding tool to conclude that people who arrive with a particular browsing pattern seem to click on the 'buy' button. After the first 'aha' moment, we cleanse the data, weed out the noise, and build path models to 'learn' from historical data using quantitative methods. For an unknown path, we can now predict confidently whether the visitor is likely to click on the 'buy' button or not! Solving this last mile problem can improve conversions significantly. Without the discovery tool, you may not even know where to look!
  • We use an Insight finding tool to find the first insight into customer call center comments. We find a cluster full of "similar" comments on a topic that keeps referring to some interesting phrases and words. Once we have that starting point, we can create quantitative techniques to capture more clusters or more comments, even with spelling mistakes and human errors, because we now know we want to do more of X.

Finding Insights without modeling/prediction:

It is a science project. You wowed the audience with what you found. Not quite useful after the announcement.

Modeling/prediction without a deep Insight element:

The business context is diluted. Great, you can predict things, but you cannot quite explain why that is interesting in the first place to the stakeholders.

How much time spent on each typically?

Discovery in a pure sense is often finding something by chance. Discovery in the day-to-day science world is really about finding the key drivers. Power to the data scientists who find the insights to a business problem and get to the first AHA moment. Typically it takes a few weeks with the right platform and talent. Try this link first.

Modeling and Prediction from dev to production is usually the long tail, with algorithm fitting and the other work that goes with that domain. It can take anywhere from a couple of weeks to a few months for model automation, and the move to operationalization depends on data quality and other factors. Many options are available here as well!

Aster has a ton of great text functions, but many of the text modeling functions require a large amount of data to get good results. LDA, for example, does not perform well on smaller data sets. A multigenre text approach with cosine similarity can give better results with less data.

 

In this example we are using survey results from a furniture retailer. The survey asked questions about demographics, personal style, and shopping patterns. These responses are easy to gain insights from, since the data can easily be visualized in a BI tool. The challenging part is finding insights from the free-form survey question, "Is there anything else you would like to add?" There are about 1000 responses with free-form text, and the customer wanted to cluster the responses into categories and get a nice visual representation of the clusters. 1000 responses is much too small for LDA, and LDA doesn't easily lend itself to a nice visualization, so cosine similarity was a good alternative solution. 

 

The cosine similarity function returns a similarity score for each pair of text responses. Since every pair will have some score, a minimum score threshold should be established for grouping responses into clusters. For example, the visualization below shows responses that have been grouped by a very high similarity score. 
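To make the mechanics concrete, here is a plain-SQL sketch of pairwise cosine similarity over a tokenized table such as Text_Parser output. The table name survey_tokens and its columns docid, token, and frequency are assumptions for illustration; the packaged Aster function handles this computation for you.

-- Pairwise cosine similarity between documents from a (docid, token, frequency) table.
SELECT a.docid AS doc1
     , b.docid AS doc2
     , SUM(a.frequency * b.frequency) / (na.norm * nb.norm) AS similarity
FROM survey_tokens a
JOIN survey_tokens b
  ON a.token = b.token
 AND a.docid < b.docid                 -- score each pair once
JOIN (SELECT docid, SQRT(SUM(frequency * frequency)) AS norm
      FROM survey_tokens GROUP BY docid) na
  ON na.docid = a.docid
JOIN (SELECT docid, SQRT(SUM(frequency * frequency)) AS norm
      FROM survey_tokens GROUP BY docid) nb
  ON nb.docid = b.docid
GROUP BY a.docid, b.docid, na.norm, nb.norm;

A minimum similarity threshold can then be applied to this output, which is exactly the knob discussed next.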

 

 

With a high threshold the clusters are cohesive and well separated, but the higher the similarity score, the less of the data will be included in the analysis, since the output only includes responses that have at least one other response with a similarity score above the threshold. If a response is very long, it may not have a high similarity with a shorter response even though the two cover the same topic. In that case the longer response won't be included in the output if the threshold is too high. 

 

AppCenter is very useful when deciding the ideal threshold. In the app below a user can set a similarity score threshold and get back a visualization of the clusters, the top words in each cluster, and a table of all the responses and their cluster id so the user can manually check if the groupings make sense. 

 

While the visualization of the clusters does not look as nice as the high threshold picture, in the survey results example the low threshold was the best approach. Many of the text responses were long and responses with the same topic often had 15-25% similarity. 

 

From the top words in each cluster table it's now easy to give the clusters category names, so the customer can get a good sense of the free-form text content and the feedback can go to the appropriate department. 

 

 

 

 

 

With the holiday season set to kick off, and many people hitting the ski slopes, I thought now would be a good time to demonstrate the power of Aster. In this exercise we will harness the power of Aster and Trip Advisor to mine ski resort reviews (being from the Western United States, I will mine a Colorado ski resort). Before we get started on mining the web, we need to answer a question.

What are two of Aster's biggest strengths? Though this can and will be debated by many Aster enthusiasts, the two that come to mind first are time to value and ease of use. Aster is changing the paradigm from Extract, Transform, Load (ETL) to Extract, Load, Transform (ELT). Being a discovery platform, Aster can quickly ingest multi-structured or unstructured data such as text, web logs, machine data, and social media.

In this post I will demonstrate both strengths by scraping www.tripadvisor.com for reviews about a ski resort in Colorado, creating a table in Aster, populating that table with the Trip Advisor reviews, and using the SQL-MR function Text_Parser to put the reviews in relational form. In subsequent posts I will further demonstrate the power of Aster by mining the reviews with popular, easy-to-use SQL-MR functions such as ExtractSentiment, NGrams, and TF_IDF.

The tools that I will use are: R, R Studio, ACT, Teradata Studio, WINSCP, PUTTY, and Aster Express. Each of these tools can be downloaded to your machine for free, so please follow along and post some insights from the ski resort you chose. (My assumption in this exercise is that you are familiar with each of the aforementioned tools. If you need any assistance, please post questions in the comments and I will get them answered.)

Step 1:
Scraping Trip Advisor

  1. Open your favorite browser and go to www.tripadvisor.com
  2. Search for your favorite ski resort and click on the reviews. My url looks like this: http://www.tripadvisor.com/Attraction_Review-g33676-d125452-Reviews-Vail_Mountain_Resort-Vail_Colorado.html
  3. Copy your url and open up R Studio. Load the rvest package and run the code. You will need to replace my url in the url object with yours. Running this R program we will have 5 variables in our Trip Advisor reviews data set: id, quote, rating, date, and review. Here is the R code I ran to mine 1100 comments about Vail Mountain Resort:

library(rvest)

#Each subsequent webpage differs, in the middle of the address, by the terms in the following 'looping' vector. Change number of terms depending on case
looping <- c(1:1180)
n <- length(looping)
tableout <- data.frame()

for(i in looping){

  #Change URL address here depending on attraction for review
  url <- paste("http://www.tripadvisor.com/Attraction_Review-g33676-d125452-Reviews-or",i,"0-Vail_Mountain-Vail_Colorado.html#REVIEWS",sep="")

  reviews <- url %>%
    html() %>%
    html_nodes("#REVIEWS .innerBubble")

  id <- reviews %>%
    html_node(".quote a") %>%
    html_attr("id")

  quote <- reviews %>%
    html_node(".quote span") %>%
    html_text()

  rating <- reviews %>%
    html_node(".rating .rating_s_fill") %>%
    html_attr("alt") %>%
    gsub(" of 5 stars", "", .) %>%
    as.integer()

  #rawdates <- reviews %>% html_node(".rating .ratingDate") %>% html_text("ratingDate")
  #datebits <- lapply(rawdates, function(x) unlist(strsplit(x, " |\n"))[2:4])
  #date <- unlist(lapply(datebits, function(x) paste0(paste(x[1:2], collapse=" "), paste(",", x[3]))))

  date <- reviews %>%
    html_node(".rating .ratingDate") %>%
    html_attr("title") %>%
    strptime("%b %d, %Y") %>%
    as.POSIXct()

  review <- reviews %>%
    html_node(".entry .partial_entry") %>%
    html_text()

  #get rid of \n in reviews as this stands for 'enter' and is confusing dataframe layout
  reviewnospace <- gsub("\n", "", review)

  temp.tableout <- data.frame(id, quote, rating, date, reviewnospace)

  tableout <- rbind(tableout, temp.tableout)
}

  4. Save the results with write.csv(tableout, file="VailMountaintest")

 

  5. The R output will be saved as a .csv file to our working directory. Now we need to make sure the reviews are saved in UTF-8 character encoding (otherwise we will get an error when we try to populate our table in Aster). This can be done in Notepad or at the command line. Now we are ready for Step 2.

Step 2:

Creating a table in Aster

  1. Now connect to Aster Express with Teradata Studio or ACT and create a database called SkiResort.
  2. Create a table to hold our ski resort reviews. This is how mine looks:
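In case it helps, here is one possible layout matching the columns written out by the R script; the column types are assumptions:

CREATE TABLE public.resorttext (
    id     VARCHAR,
    quote  VARCHAR,
    rating INTEGER,
    date   TIMESTAMP,
    review VARCHAR
)
DISTRIBUTE BY HASH (id);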

Step 3:

Transferring SkiResortReview.csv file to our Aster Express with WINSCP and populating the table with data.

  1. Transfer the SkiResortReview.csv to a directory on your Aster Express image using WINSCP
  2. Open up a PUTTY session and connect to Aster Express
  3. Navigate to folder where you place the SkiResortReview.csv file
  4. Now use Ncluster Loader to load the data into the table.

ncluster_loader -U db_superuser -w db_superuser -d SkiResort --skip-rows 1 --el-enabled -c public.resorttext SkiResortReview.csv

  5. Now our table should be populated with the reviews.

Step 4:

Now that the reviews are loaded in the table we are ready to use the SQL-MR function Text_Parser to give structure to our data. We will create a table with the Text_Parser function and remove stop words and punctuation. The syntax I used is:

 CREATE TABLE public.resort_tparser_out
DISTRIBUTE BY HASH(id) AS
SELECT * FROM Text_Parser( ON resorttext
TEXT_COLUMN('review')
CASE_INSENSITIVE('true')
STEMMING('true')
PUNCTUATION('\[.,?\!\]')
REMOVE_STOP_WORDS('true')
ACCUMULATE('id', 'review'))
ORDER BY id;

Now that we have our reviews in a structured format, we can start doing some fun analytics to see what others are saying about our ski resort of choice. This will be covered in a future blog post so stay tuned. 

Companies today use a variety of tools for their analytic needs. As an example, some of Teradata's customers using Aster also use Spark. They'll be happy to hear we have a new connector in Aster called RunOnSpark.

 

 

RunOnSpark

Several functions from Spark's MLlib and ML libraries can be called from RunOnSpark. An API and a set of pre-built wrappers are installed along with the RunOnSpark function, so it's easy to get started. What's especially nice for the Aster user is that the entire workflow is done in Aster. And for the Scala programmer, custom wrappers allow additional access to functionality on Spark.

 

 

Example with Spark ML's MultilayerPerceptronClassifier (MLP)

Process Overview:

 

To train a classification model using MLP, the RunOnSpark function calls MLPCTrainDF. There are a few required bits of information to include in the query parameters:

 

  • modelLocation - HDFS path where the model will be saved
  • layers - Four integer values representing:
    • Number of input variables
    • Number of elements in first hidden layer
    • Number of elements in second hidden layer
    • Number of classes as output
  • labelCol - Name of column containing class labels (this is optional if the default "label" is the column name)

 

 

There are a number of parameters that can either be specified in a configuration file or included in the query. They include details about the data transfer, Spark job management, and security. When the configuration file includes all of the mandatory information, the query syntax is short and simple:

 

MLP train query syntax

 

The training data contains the column with the class labels and the independent variables. Data from the Aster table is transferred to Spark and formatted as a data frame or RDD for input, depending on the function. MLPCTrainDF formats the data as a data frame and while running, the status can be viewed in Ambari like any regular Spark job:

 

RunOnSpark in Ambari

 

 

When the job finishes in Spark, MLPCTrainDF returns the total count of training records correctly predicted and incorrectly predicted to Aster.

 

MLPCTrainDF saves the trained classification model in the provided HDFS location. It can then be used to make predictions on new records, with the call to MLPCRunDF. In this query, only the modelLocation is required. MLPCRunDF returns the predicted class for each record in the table back to Aster.

 

Predict MLP query syntax

 

Just like other Aster functions, the query input for RunOnSpark can be parameterized in AppCenter for easier use. Here's an example with MLPCTrainDF:

 

AppCenter example

 

 

Custom wrappers

Functionality in Spark's machine learning libraries continues to grow, and there are other interesting libraries as well. The API installed with the function offers a set of classes and methods that allow a Scala programmer to write custom wrappers.

 

A simple example is shown here, where just the input data is returned back to Aster:

 

UserEcho custom wrapper scala

 

Custom wrappers need to be built and packaged using sbt or a similar program, and the resulting .jar file is referenced in RunOnSpark with the app_resource parameter.

 

Custom wrapper query syntax

 

Whether using pre-built or custom wrappers, RunOnSpark offers expanded functionality in a familiar Aster environment.

This week marks the beginning of Major League Baseball’s 2017 season. To celebrate, I thought it would be fun to explore common player movement patterns between MLB franchises using both Teradata Aster’s nPath function and a sigma chart visualization (I know…I’m weird like that). In other words, we want to visualize the most common patterns of franchises played for, for example: Franchise A followed by Franchise B followed by Franchise C.

 

The Data:

To investigate, we will look at the 2016 version of the Baseball Databank database, which can be downloaded here: http://seanlahman.com/baseball-archive/statistics/

 

Specifically, we will look at the Batting.csv, Teams.csv, and TeamFranchises.csv files.

 

Note from the accompanying README.txt file:

Baseball Databank is a compilation of historical baseball data in a convenient, tidy format, distributed under Open Data terms.

 

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.  For details see: http://creativecommons.org/licenses/by-sa/3.0/

 

Person identification and demographics data are provided by Chadwick Baseball Bureau (http://www.chadwick-bureau.com), from its Register of baseball personnel.

 

Player performance data for 1871 through 2014 is based on the Lahman Baseball Database, version 2015-01-24, which is Copyright (C) 1996-2015 by Sean Lahman.

 

The tables Parks.csv and HomeGames.csv are based on the game logs and park code table published by Retrosheet. This information is available free of charge from and is copyrighted by Retrosheet.  Interested parties may contact Retrosheet at http://www.retrosheet.org.

 

Data Loading/Prep Work:

Let’s imagine that we work for a company where the Batting.csv, Teams.csv, and TeamFranchises.csv files are loaded onto a Teradata Aster database and stored in tables named [schema].batting, [schema].teams, and [schema].teamfranchise.

 

First, we’ll isolate the records of interest in our data. To answer our question (i.e., what are the most common player movement patterns between Major League Baseball franchises), we will focus on players who played for franchises currently in existence.

 

Note: You might be wondering why we are considering franchises rather than the names of the teams that each player played for. This is because some franchises have changed their associated team’s name over the years at one time or another. Focusing on the name of the franchise prevents us from detecting false movement patterns where a player “moves” from one team to another when both teams actually represent the same franchise. For example, in 2005, the Montreal Expos relocated to Washington, D.C. and became the Washington Nationals, so a player cannot really have moved from the Expos to the Nationals, since these two teams are actually the same franchise entity.

 

The join below leaves us with just the season records of players who played for franchises that currently exist. For ease of coding later, we’ll create this join as a new table called [schema].players.

 

--Create table isolating players who played for franchises that currently exist
 
DROP TABLE IF EXISTS [schema].players;
 
CREATE TABLE [schema].players
DISTRIBUTE BY HASH (playerid)
COMPRESS LOW AS (
       SELECT a.*, b.franchid
       FROM [schema].batting a
       JOIN (
             SELECT yearid, teamid, franchid
             FROM [schema].teams
             WHERE franchid IN (
                    SELECT franchid
                    FROM [schema].teamfranchise
                    WHERE active = 'Y'
             )
       ) b
       ON a.teamid = b.teamid
       AND a.yearid = b.yearid
       ORDER BY a.yearid, a.stint ASC
);

 

Now, we will create a flag that marks the records from both the year that a player started his career and the years that he began playing for a new franchise. Note that players can move from one franchise to another and then back to the previous franchise; they can also move multiple times within the same season (as indicated by the “stint” variable). The following code sets up the flag that we need (for future coding simplicity, we will create this flag within a new table called [schema].firstyearsflagged):

 

--Set up flag for first year that a player played with a franchise
DROP TABLE IF EXISTS [schema].firstyearsflagged;
 
CREATE TABLE [schema].firstyearsflagged
DISTRIBUTE BY HASH (playerid)
COMPRESS LOW AS (
SELECT playerid, yearid, stint, franchid,
       CASE
             WHEN yearid = MIN(yearid) OVER (PARTITION BY playerid, franchid) OR
                     stint > 1 OR
                     yearid - (LAG(yearid, 1, 1) OVER (PARTITION BY playerid,
                                             franchid ORDER BY yearid)) > 1
             THEN 1
             ELSE 0
       END AS firstyear
FROM [schema].players
ORDER BY yearid, stint ASC
);

 

To see this flag in action, let’s look at the career of Greg Maddux. He began his major league career with the Chicago Cubs in 1986. He then moved to the Atlanta Braves in 1993 after being signed as a free agent, to the Cubs in 2004 after being signed as a free agent, to the Los Angeles Dodgers after being traded by the Cubs midseason 2006, to the San Diego Padres in 2007 after being signed as a free agent, and back to the Dodgers after being traded by the Padres midseason 2008 before retiring at the end of the 2008 season. The code below retrieves his records from the [schema].firstyearsflagged table and arranges them in order by yearid, then stint:

 

--Check firstyear flag 
SELECT *
FROM [schema].firstyearsflagged
WHERE playerid = 'maddugr01'
ORDER BY yearid, stint;

 

Output:

 

playerid    yearid  stint  franchid  firstyear
maddugr01   1986    1      CHC       1
maddugr01   1987    1      CHC       0
maddugr01   1988    1      CHC       0
maddugr01   1989    1      CHC       0
maddugr01   1990    1      CHC       0
maddugr01   1991    1      CHC       0
maddugr01   1992    1      CHC       0
maddugr01   1993    1      ATL       1
maddugr01   1994    1      ATL       0
maddugr01   1995    1      ATL       0
maddugr01   1996    1      ATL       0
maddugr01   1997    1      ATL       0
maddugr01   1998    1      ATL       0
maddugr01   1999    1      ATL       0
maddugr01   2000    1      ATL       0
maddugr01   2001    1      ATL       0
maddugr01   2002    1      ATL       0
maddugr01   2003    1      ATL       0
maddugr01   2004    1      CHC       1
maddugr01   2005    1      CHC       0
maddugr01   2006    1      CHC       0
maddugr01   2006    2      LAD       1
maddugr01   2007    1      SDP       1
maddugr01   2008    1      SDP       0
maddugr01   2008    2      LAD       1

 

As we can see, the records from the years 1986, 1993, 2004, 2006 (specifically, the stint with the Dodgers), 2007, and 2008 (again, the stint with the Dodgers) are all flagged with a 1.

 

Exploration:

Now, let’s find the different franchise patterns that exist. To do this, we will run Teradata Aster’s nPath function on the records from the [schema].firstyearsflagged table where firstyear = 1. Also, we only want to look at player moves that happened during or after 1998, since this was the year that the two newest franchises in Major League Baseball (the Arizona Diamondbacks and the Tampa Bay Rays) were created. This restriction controls for age of franchise; franchises that have been around longer naturally have a longer history of players moving to and from their team than newer franchises, so if we don’t control for age of franchise, it artificially inflates the “popularity” of the older franchises.

 

The code below runs nPath on our records of interest. We’ll also create our nPath results as a new table that we can feed into Teradata Aster AppCenter in order to visualize our results.

 

 

--run nPath on [schema].firstyearsflagged - are there common movement patterns 
--between franchises?
DROP TABLE IF EXISTS [schema].mlbmoves;
 
CREATE TABLE [schema].mlbmoves
DISTRIBUTE BY HASH (path)
COMPRESS LOW AS (
SELECT path, count(*) cnt FROM nPath (
       ON (SELECT * FROM [schema].firstyearsflagged WHERE firstyear = 1
             and yearid >= 1998)
       PARTITION BY playerid
       ORDER BY yearid, stint ASC
       Mode (NONOVERLAPPING)
       Pattern ('A.B+')
       Symbols ('TRUE' AS A,
                    franchid != LAG(franchid, 1) AS B)
       Result (ACCUMULATE(franchid OF ANY(A, B)) AS path)
                   
)
GROUP BY path
);

 

Here is a sample of records from our new table:
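A quick way to pull such a sample, ordered by how often each path occurs, is:

SELECT path, cnt
FROM [schema].mlbmoves
ORDER BY cnt DESC
LIMIT 10;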

 

 

For example, we can see from this output that, between 1998 and 2016, three players began their major league careers with the Chicago Cubs and then moved to the Miami Marlins (denoted here by FLA), finishing their careers with the Marlins.

 

Visualization:

At this point, let’s visualize our results. We can do this by building an app in Teradata Aster AppCenter:

  • From the AppCenter home screen, click “Build an App”.
  • Fill out the following fields:
    • App Name
    • App Version
    • Developer Name
    • Developer ID
    • Description
  • Next, go to the Logic tab and click the “Generate Visualization Code…” button.
  • Fill out the following fields as shown below:
    • Data (Table/View): [schema].mlbmoves (Note: replace [schema] with the actual schema of your table)
    • Data Format: nPath output schema
    • Visualization Type: sigma
    • Visualization Title: Add a title here if you want
  • Click the “Add Visualization Statement” button. In the code that appears, make sure that your input table appears as “[schema]”.”mlbmoves”
  • Click “Save”.
  • Configure the app. On the Database Connection tab, connect to the database where the [schema].mlbmoves table was created. Click “Save”.

 

At this point, the app is complete. On the screen that now appears, fill out the “Title” field and click “Run”. When the app finishes, click on the little sigma chart icon that appears next to the report title in order to see the visualization that was produced. Our sigma chart should look like this:

 

 

The larger a point is in our sigma chart, the more likely it is to appear in a movement pattern. According to our diagram above, from the years 1998-2016, players were more likely to move to or from the San Diego Padres than any other franchise. Also, the darker a path is in this diagram, the more often it occurs. Using the toggles that come with the sigma chart visualization, we can hide the less-frequently occurring paths and get the new sigma chart below:

 

Here, we can see that, between 1998 and 2016, some of the most common player movement patterns involved going from the New York Yankees to the Washington Nationals, from the Baltimore Orioles to the Toronto Blue Jays, from the San Diego Padres to either the Oakland Athletics or the Boston Red Sox, from the Athletics to the Colorado Rockies, or from the Seattle Mariners to the Padres.

 

As a next step, we could take a closer look at the stats of the players who moved between these franchises during their careers. For example, we could see if batting averages tend to peak while players are with some franchises and not others. For pitchers (especially American League pitchers), we might follow their earned run average (ERA) over time instead of batting average to see if there is a correlation between ERA and franchise played for.
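As a rough sketch of that follow-up, and assuming the standard Lahman Batting columns ab (at bats) and h (hits) carried through into [schema].players, a per-franchise batting average could be computed like this:

-- Career batting average per player per franchise
-- (ab and h are the standard Lahman column names; verify them against your loaded table)
SELECT franchid
     , playerid
     , SUM(h)::FLOAT / NULLIF(SUM(ab), 0) AS batting_avg
FROM [schema].players
GROUP BY 1, 2
ORDER BY 1, 3 DESC;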

 

In this post, we have seen how to use Teradata Aster’s nPath function in conjunction with Teradata Aster AppCenter in order to visualize player movement between Major League Baseball franchises. If you have further questions, feel free to contact me at kate.phillips@teradata.com.

Great news!  Free Trial Version of Aster Analytics is now available on AWS Marketplace.

 

Everything you need to experience the power of multi-genre advanced analytics is now available on AWS.   Teradata Aster Express on AWS includes machine learning, text, path, pattern, statistics and graph analytics, all within a single Amazon Machine Image (AMI).     

 

The Aster Express on AWS includes a complete Aster cluster running on a single EC2 instance.  Everything available in Aster Express on-premises is available on AWS with one exception… the long download times. Now users can quickly launch an instance to start their test drive along the multi-genre analytics autobahn.   Enjoy the drive!   

 

A FREE subscription to Aster Express on AWS includes the following:

  • Right to use Aster Analytics Portfolio, Aster Client, Aster Database 6.20 and Aster AppCenter for evaluation purposes. Note: Aster AppCenter must be launched through another AMI
  • Supports 30 GB of active data for analysis
  • Support through the Aster Community Forum
  • Access to Aster on AWS resources including the Getting Started guide

 

Although the Aster Express software is free, AWS infrastructure costs (EC2 and EBS) still apply.  Aster Express is offered on the t2.xlarge instance of EC2, a lower-cost instance that offers 4 vCPUs and 16 GB of memory.  Costs start at $0.188/hour and can go up based on location.  The EBS cost to run Aster Express is $45/month, which gives you up to 30 GB of active data.  These are just estimates based on general AWS pricing; for detailed costs for your region, please refer to AWS Marketplace.

 

This subscription does not have a time limitation, but it is limited to 30 GB of active data and uses the t2.xlarge instance of EC2.  If you need higher performance and scalability, we encourage you to subscribe to the Teradata Aster Analytics AMI, also available on AWS Marketplace.  Aster Analytics can scale up to 32 workers and leverages the m4.4xlarge instance of EC2 with more vCPU, memory and EBS bandwidth.

 

If you have any questions regarding Aster in the Cloud, please feel free to contact Arlene Zaima.

Behavioral analytics is hard.

Unlike objective classification problems such as churn/stay or fraud/not fraud, behavioral analytics relies on appreciation and judgment.

Beauty, like madness, is in the eye of the beholder, and one person’s satire is another person’s fake news.

 

A central challenge of behavioral analytics is inferring a perceptual or psychological behavior from a series of quantifiable observations[1] under two constraints:

  • The data available is not “designed for purpose”, requiring an “observation to higher order features” model type
  • The quantity and quality of data and ground truth (when available) are too low to use certain types of learning systems[2]

 

To understand behavior from data, a proven machine learning approach is Hidden Markov Models (HMMs). In essence, HMMs infer a specified number of (hidden) states from a sequence of observations.

Hidden Markov Models offer non-negligible advantages for behavioral analytics:

  • Provide classification without the need to specify higher order behaviors a priori
  • Provide evidence-based likelihood outputs that:
    • Allow for lifetime behavioral analysis (including change in behavior over time)
    • Are easily adjustable to classify behaviors when business outcomes change (e.g., prioritizing precision over recall)
    • Allow competing hypotheses to be compared by training different models and testing the likelihood of an observation sequence for each model

Example

To illustrate the potential of HMMs for behavior prediction we describe a use case of problem gambling.

Problem Gambling is a behavioral disorder recognized by the American Psychiatric Association and characterized by an addictive relationship to gambling. Problem gambling is a serious issue in some jurisdictions; Australian government sources estimate that up to 2% of the Australian population may have a gambling problem.

 

Problem gambling cannot be quickly found from transactional data: a person’s addictive behavior may look similar to another person’s pastime and this differentiation can be subjective. Also, it is likely that gamblers visit more than one gambling venue regularly, making datasets incomplete.

In this example, the data available comprises betting transactions, bet outcomes, and account transactions. Because addiction-type behaviors develop over time, we design features with respect to an individual baseline instead of using global variables such as “amount of dollars wagered”. The customer activity is then aggregated over a time period, e.g., weekly; this aggregation serves as observation.

Customer  Week    Observation
CID1234   201445  One Deposit
CID1234   201446  Base Line Profile
CID1234   201447  Small Loss
CID1234   201448  No Activity
CID1234   ...     ...
CID1234   201510  No Activity
CID1234   201511  One Deposit
CID1234   201512  Decreased bets
CID1234   201513  Small Loss
CID1234   201514  Increased money
CID1234   201515  No Activity
CID1234   201516  Increased markets
CID1234   201517  Big Loss
 

We build two HMMs from these observations: one for customers who eventually “Self-Exclude” from the platform, and another for the ones who stay active. Self-exclusion is the best available proxy for problem gambling but is far from being optimal, as only a small proportion of problem gamblers seek help or self-exclude. This has important implications, as we cannot (and should not) optimize for 100% classification rate from the training data.

 

Hidden Markov Models in Aster are easy to implement. A model is created with the HMMUnsupervisedLearner function, and new data is scored with the HMMEvaluator function. The learner function calculates the composition of states and transition probabilities while the evaluator function expresses how likely a certain observation sequence is for a specific model.

--SELF EXCLUDING MODEL
SELECT * FROM HMMUnsupervisedLearner(
       ON self_exclude_train AS vertices
       PARTITION BY customer_id ORDER BY week
       HiddenStateNum('30')   --# of hidden states, this can be optimised
       MaxIterNum('100')      --convergence criterion
       Epsilon('0.01')        --convergence criterion
       InitMethods('random')
       SeqColumn('customer_id')
       ObsColumn('keyword')
       OutputTables(
              'init_state_prob'
              ,'transition_prob'
              ,'emission_prob'
       )
);

Code for creating the Self Excluding Hidden Markov Model
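The scoring side might look roughly like the sketch below. HMMEvaluator and the three model tables come from the learner call above, but the ON-clause aliases, the observation table, and the argument names are assumptions that should be checked against the Aster Analytics function reference.

-- Hedged sketch: score weekly observation sequences against the self-excluding model.
SELECT * FROM HMMEvaluator(
       ON init_state_prob AS "InitStateProb" PARTITION BY 1
       ON transition_prob AS "TransProb" PARTITION BY 1
       ON emission_prob AS "EmissionProb" PARTITION BY 1
       ON customer_weeks AS "observation"   -- assumption: table of (customer_id, week, keyword)
              PARTITION BY customer_id ORDER BY week
       SeqColumn('customer_id')
       ObsColumn('keyword')
);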

 

 

 

 

Hidden states and transitions for “Active” customers (with manual labels)

 

 

Customers are “scored” by both the self-excluding and the active model on a weekly basis, and the label with the higher likelihood is assigned for that week. This lets us see how a customer’s behavior changes over time. In the example below there are five weeks during which the customer alternates between behaviors before consistently looking like an addictive gambler.
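Assuming HMMEvaluator has written per-customer, per-week sequence likelihoods for each model into two tables, the weekly label assignment itself is plain SQL. The table and column names below (likelihood_active, likelihood_self_exclude, likelihood) are hypothetical:

-- Hypothetical tables holding (customer_id, week_id, likelihood) per model,
-- as produced by scoring with HMMEvaluator against each trained model.
SELECT
    a.customer_id,
    a.week_id,
    CASE WHEN s.likelihood > a.likelihood
         THEN 'Self Excluding' ELSE 'Active' END AS prediction,
    GREATEST(s.likelihood, a.likelihood)         AS likelihood
FROM likelihood_active a
JOIN likelihood_self_exclude s
  ON a.customer_id = s.customer_id
 AND a.week_id     = s.week_id;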

 

 

Customer   Week ID   Prediction       Likelihood
CID1234    1         Active           0.477027359
CID1234    2         Active           0.147745976
CID1234    …         Active           …
CID1234    13        Active           1.99E-04
CID1234    14        Self Excluding   4.61E-05
CID1234    15        Active           2.60E-05
CID1234    16        Active           1.83E-05
CID1234    17        Self Excluding   3.87E-06
CID1234    18        Active           2.40E-06
CID1234    19        Self Excluding   6.26E-07
CID1234    …         Self Excluding   …
CID1234    114       Self Excluding   5.53E-82

 

We can further refine the classification by using the likelihood of a classification to implement thresholds. Specifically, we can decide not to classify individuals whose likelihood for either model lies under a specific threshold, implicitly treating them as non-problem gamblers until more activity is recorded; a sketch of this follows below. The rationale is that addictive behaviors are more constrained than general behavior.
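One reading of that rule, layered on top of the comparison query above, is to leave a week unclassified when neither model assigns a likelihood above the cut-off. The 1e-100 threshold is the one discussed below; the table names remain hypothetical:

SELECT
    a.customer_id,
    a.week_id,
    CASE
        WHEN GREATEST(s.likelihood, a.likelihood) < 1e-100 THEN 'Unclassified'
        WHEN s.likelihood > a.likelihood                   THEN 'Self Excluding'
        ELSE 'Active'
    END AS prediction
FROM likelihood_active a
JOIN likelihood_self_exclude s
  ON a.customer_id = s.customer_id
 AND a.week_id     = s.week_id;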

 

 

Overall, despite a severe class imbalance (98% of the ground truth is “Active”, only 2% is “Self Excluding”) and considerable noise in the data, the results are remarkable, with an f-score of 0.79 when a 1e-100 threshold is applied. Again, we are not trying to reach an f-score of 1.0, as many problem gamblers have likely not self-excluded and some self-excluders have done so for other reasons.

 

 

Classification rates for various thresholds

Conclusion

Hidden Markov Models offer an easy, effective method to quantify and classify behaviors from noisy transactional data. With its efficient Aster implementation, we can accurately score millions of customer interactions over years’ worth of data.

 

For more information on behavioral analytics, contact Clement Fredembach or Michelle Tanco

Originally published at /community/bso/blog/2017/03/20/share-this-new-path-analysis-interface-videos.

 

We’ve had a lot of fun lately helping companies get rapid insights into their customers’ journeys with the Path Analysis Guided Analytics Interface. But if you aren’t using the interface and haven’t seen a demo yet, you may be wondering just how easy it is to use.

 

For your viewing pleasure – and education – I submit two new animated gifs demoing different capabilities of the Path Analysis interface.

 

This first video walks through building a path analysis, then shows several visualizations that are included in the interface. Those visualizations include tree, Sankey, sigma and sunburst diagrams, as well as the ever-valuable bar chart.

 

Building an analysis and seeing visualizations with the Path Analysis Guided Analytics Interface

 

The second video shows how you can use the tree diagram within the Path Analysis interface to build and export a list of customers on a specific path. In this case, we are analyzing an online banking demo data set, looking at customers who passed through a bill pay enrollment form but did not complete enrollment.

 

Building a customer list with the Path Analysis Guided Analytics Interface

 

These videos are both animated gifs, so they should be relatively easy to download with a right-click and share. If you need higher resolution versions, or you’d like to see a live demo or get your hands on a demo for yourself, email me at ryan.garrett@teradata.com.

Uncovering relationships within data has long been a cornerstone analytic with Aster.  As many of our CFilter tutorials have shown, we've found relationships in retail market baskets, online web searching and shopping, and sensor data, and have even uncovered social networks in online gaming.  With the Aster Analytics 6.20 library, we now have the ability to uncover even deeper relationships within these data sets using the FPGrowth function.  This Frequent Pattern Growth algorithm performs deep association rule mining well beyond the 1-1 pair affinity of CFilter.  FPGrowth detects combinations of items (called Antecedents) that lead to other combinations (called Consequences).

 

To demonstrate the power of FPGrowth, here's a very simple grocery basket example visualizing the output using a pattern of '1-2 Antecedents -> 1 Consequence'.  We'll highlight the pattern parameters when we look at the function syntax.  Notice the nodes that are comprised of pairs of items; for example, 'milk,butter' or 'cereal,paper towels'.

FPGrowth example 1 - Grocery Items

 

With FPGrowth, these combination relationships, which may have been hidden before, are now easily uncovered.  In the graph above we see that milk and diapers together have an affinity score of 3.4 to bread, whereas milk alone to bread has a lower score of 2.0 and diapers alone to bread is only 2.1.  The interpretation is 'baskets with both milk and diapers are much more likely to include bread than baskets with only milk or only diapers'.  We can then go deeper into these relationships by simply changing the search pattern parameters.  For example, let's look for combinations that include up to 3 Antecedents that lead to combinations of 1 and 2 Consequences:

 

 

Here we now see more complex combinations, such as 'milk,chips,diapers' as an Antecedent, and 'milk,cereal' as a 2-item Consequence.  This is very exciting!  I hope this simple example clearly shows how Aster can dig deep into these data sets to uncover potentially very valuable insights from these more complex affinity relationships.

 

Here is the syntax for FPGrowth.  It is not much different from CFilter, with the addition of some threshold filters and the Antecedent/Consequence patterns.  It can output two result tables: rules, patterns, or both.  I find myself using the Rules table, as it splits Antecedents and Consequences into separate fields and includes the usual statistical output you would expect, very much like CFilter.  The pattern in this example is '1-2 Antecedents leading to 1 Consequence'.

 

SELECT * FROM FPGrowth
(
    ON (SELECT 1)
    PARTITION BY 1
    InputTable ('grocery_baskets')
    OutputRuleTable ('grocery_baskets_fpgrowth_out_rule')
    OutputPatternTable ('grocery_baskets_fpgrowth_out_pattern')
    compress('low')
    droptable('true')
    TranItemColumns ('item')
    TranIDColumns ('entity_id','basket_id')
    MinSupport (0.001)
    MinConfidence (0.1)
    --MaxPatternLength (5)  -- can be used as a combination 'Antecedent+Consequences' pattern size limit
    AntecedentCountRange ('1-2')  -- range size for Antecedent pattern; 'min-max'
    ConsequenceCountRange ('1-1')  -- range size for Consequence pattern; 'min-max'
   -- PatternsOrRules ('rules')  -- syntax is 'rules', 'patterns', or 'both'.  Default is 'both'
);

 

Here is the output from the Rules table:

select * from grocery_baskets_fpgrowth_out_rule limit 10;

 

Note a few new statistics columns, notably 'conviction'.  A word of warning: this field is null for the highest-affinity rules, because the formula divides by (1 - confidence), which is a divide-by-zero when confidence equals 1.0.  So don't assume that sorting by conviction descending will give you the strongest matches.
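If you still want to rank by conviction, one simple workaround (a sketch only, using the output table from the example above) is to treat the null convictions as zero so they drop to the bottom of the list:

-- conviction is NULL when confidence = 1.0; treat those rows as 0 so they sort last
SELECT antecedent_item, consequence_item, confidence, lift, conviction
FROM grocery_baskets_fpgrowth_out_rule
ORDER BY COALESCE(conviction, 0) DESC
LIMIT 10;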

 

antecedent_item | consequence_item | count_of_antecedent | count_of_consequence | cntb | cnt_antecedent | cnt_consequence | score | support | confidence | lift | conviction | leverage | coverage | chi_square | z_score
bread,butter | cereal | 2 | 1 | 2 | 2 | 9 | 0.22 | 0.08 | 1.00 | 2.67 | NULL | 0.05 | 0.08 | 3.64 | -0.03
bread,butter | milk | 2 | 1 | 2 | 2 | 12 | 0.17 | 0.08 | 1.00 | 2.00 | NULL | 0.04 | 0.08 | 2.18 | -0.03
cereal | bread | 1 | 1 | 4 | 9 | 7 | 0.25 | 0.17 | 0.44 | 1.52 | 1.28 | 0.06 | 0.38 | 1.63 | 1.27
milk,chips | bread | 2 | 1 | 1 | 6 | 7 | 0.02 | 0.04 | 0.17 | 0.57 | 0.85 | -0.03 | 0.25 | 0.61 | -0.68
milk,cheese | beer | 2 | 1 | 2 | 2 | 13 | 0.15 | 0.08 | 1.00 | 1.85 | NULL | 0.04 | 0.08 | 1.85 | -0.03
milk,cheese | chips | 2 | 1 | 2 | 2 | 10 | 0.20 | 0.08 | 1.00 | 2.40 | NULL | 0.05 | 0.08 | 3.05 | -0.03
beer,diapers | bread | 2 | 1 | 1 | 3 | 7 | 0.05 | 0.04 | 0.33 | 1.14 | 1.06 | 0.01 | 0.13 | 0.03 | -0.68
cereal | cheese | 1 | 1 | 2 | 9 | 9 | 0.05 | 0.08 | 0.22 | 0.59 | 0.80 | -0.06 | 0.38 | 1.43 | -0.03
milk,chips | cheese | 2 | 1 | 2 | 6 | 9 | 0.07 | 0.08 | 0.33 | 0.89 | 0.94 | -0.01 | 0.25 | 0.06 | -0.03
beer,diapers | chips | 2 | 1 | 2 | 3 | 10 | 0.13 | 0.08 | 0.67 | 1.60 | 1.75 | 0.03 | 0.13 | 0.88 | -0.03

 

So there it is, a much more powerful affinity detection function from Aster.  It can be used in so many scenarios beyond these shopping examples, from life-saving pattern detection in healthcare to quality analytics on manufacturing sensor data.  And all with the ease and performance of Aster.  Have fun!

Teradata Warehouse Miner (TWM) is a PC-based application that allows users to perform data profiling and data mining in-database via a user-friendly GUI. As of version 5.4.0, TWM can also run on top of the Aster Database, giving Aster users access to data profiling and ADS creation functions.

 

How To Get and Setup TWM

 

An express version of TWM is available as a free download at the Teradata Developer Exchange (you will need to register for free to get access): Teradata Warehouse Miner - Express Edition | Teradata Downloads. At the moment TWM is supported only on Windows PCs.

 

Once you have installed TWM, you will also need to install the Aster ODBC Driver. You can find the ODBC driver among the Aster Client Tools For Windows: Aster Client Tools for Windows | Teradata Downloads For Aster Express - I'd recommend using the appropriate version (6.1) to ensure compatibility. Depending on your PC you may need either the 64-bit or the 32-bit nCluster ODBC driver. The 32-bit driver is among the "i386" packages while the 64-bit one is among the "x8664" packages. 

 

Once the ODBC driver is installed, you will need to set up your Aster DB as an ODBC data source using the Windows ODBC Manager. Search for ODBC on Windows, add Aster as a source, then specify your DB credentials.

 

 

Once done, in TWM you need to activate the connection through "Tools" > "Connection Properties" and specify the schemas where TWM will store its work. I usually just use "twm_demo" as the default schema, but you might also want specific schemas for Metadata, Publish, and Advertise. You can also just specify "public", but that could clutter your DB later.

 

(Note: if you are not using "public", you need to create these schemas in the Aster DB beforehand, e.g. 'CREATE SCHEMA twm_demo;', or TWM will not be able to find them; see the example below.)
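For example, run something like the following against your Aster DB (via act or any SQL client) before pointing TWM at the schemas. Only twm_demo comes from the note above; the other schema names are illustrative:

CREATE SCHEMA twm_demo;      -- default/work schema referenced above
CREATE SCHEMA twm_metadata;  -- illustrative name for a Metadata schema
CREATE SCHEMA twm_publish;   -- illustrative name for a Publish schema
CREATE SCHEMA twm_advertise; -- illustrative name for an Advertise schema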

 

 

If this is the first time you are specifying schemas, go to the option "Tools">"Metadata Creation" as well as "Tools">"Advertise Tables Creation" so TWM can create its work tables. Moving forward you can run through the "Maintenance" options so TWM can update these. 

 

 

Finally, go to the start menu folder "Teradata Warehouse Miner" and click the option to "Install or Uninstall Aster Library" to install the Aster Profiler SQL-MR and follow the command line prompts to deploy the Profiler.Zip SQL-MR file into your Aster DB (make sure your Aster DB is up and running when you do this). There is also an option called "Load Aster Demo Data" which is a sample data set (a bank) to play with. 

 

After the above your TWM should be good to go. 

 

Using TWM with Aster


1. Data Profiling and ADS Creation

 

Warehouse Miner comes with a plethora of automated profiling and exploration tools complete with some visualizations.


 

Each of these tools is a self-contained, GUI-driven app that runs SQL behind the scenes to help you analyze and visualize your Aster tables!


2. SQL-MR Code Helper

 

If you are looking for a way to secretly be a SQL-MR expert without keeping the User Guide and Foundation Guide PDFs open on your desktop, look no further than the Variable Creation node in TWM. You can instantly get code examples for all up-to-date SQL-MR commands.

 


In addition, if you are not familiar with the PostgreSQL syntax for some of the more obscure windowing functions and regular expressions, you can build them with the (initially confusing) GUI and then inspect the SQL it generates.

 

 

3. Project Explorer

 

TWM opens projects by default, and each analysis you conduct becomes a folder in your project. Over time, especially with prolific use of the "Free Form SQL" node and some smart SQL scripting, you can build rather complex workflows, all tagged, curated, and arranged any way you want. It's not the GUI-based workflow that everyone is clamoring for, but it's definitely a step up from managing all your SQL scripts in a notepad. Best of all, each node can be executed in parallel without disturbing an existing query; each node opens a separate connection to the DB when it runs.

 

 

If you want to create a slightly different version of a query - simply "clone" a node, rename it, and edit the contents! Nice and neat.

 

4. Metadata Repository

 

Best of all - all the above activity in TWM is stored as metadata in the AsterDB itself! This allows easy retrieval of your project regardless of the workstation you use - and also allows sharing of work amongst multiple users - if everyone points to the same metadata repository in their TWM instance. This can be a great way to collaborate on your Aster usage across teams. 

 

 

All in all TWM is a great tool that complements the power of Aster with a friendly front-end and data profiling capabilities. 


Scalable data science with R

Posted by allura Mar 6, 2017

For more on this topic: Brian Kreeger and Roger Fried hosted a live webcast, Scalable Data Science with R.   Also, see related posts at the bottom of this article. 

 

You’ve got three options:  Scaling up, scaling out, or using R as an abstraction layer.


R is among the top five data science tools in use today according to O’Reilly research; the latest KDnuggets survey puts it in first place, and IEEE Spectrum ranks it as the fifth most popular programming language.

 

The latest Rexer Data Miner survey revealed that in the past eight years there has been a three-fold increase in the number of respondents using R, and a seven-fold increase in the number of analysts/scientists who say that R is their primary tool.

 

Despite its popularity, the main drawback of vanilla R is its inherently “single threaded” nature and its need to fit all the data being processed in RAM. But nowadays data sets are typically in the range of gigabytes, and they are growing quickly toward terabytes. In short, current growth in data volume and variety is demanding more efficient tools from data scientists.

 

Every data science analysis starts with preparing, cleaning, and transforming the raw input data into some tabular data that can be further used in machine learning models.

 

In the particular case of R, data size problems usually arise when the input data do not fit in the RAM of the machine and when data analysis takes a long time because parallelism does not happen automatically. Without making the data smaller (through sampling, for example) this problem can be solved in two different ways:

 

  • Scaling up (vertically), by using a machine with more available RAM. For data scientists leveraging cloud environments like AWS, this can be as easy as changing the instance type of the machine (for example, AWS recently introduced an instance with 2TB of RAM). However, most companies today analyze data on internal infrastructure built from commodity hardware, and they will have more difficulty increasing their available RAM.
  • Scaling out (horizontally): in this context, it is necessary to change the default R behaviour of loading all required data in memory and instead access the data using a distributed or parallel schema with a divide-and-conquer (or, in R terms, split-apply-combine) approach like MapReduce.

 

While the first approach is obvious and can use the same code to deal with different data sizes, it can only scale to the memory limits of the machine being used. The second approach, by contrast, is more powerful but it is also more difficult to set up and adapt to existing legacy code.

 

There is a third approach: scaling out horizontally can also be achieved by using R as an interface to the most popular distributed paradigms:

 

  • Hadoop: through the set of libraries or packages known as RHadoop. These R packages allow users to analyze data with Hadoop through R code. They consist of rhdfs to interact with HDFS; rhbase to connect with HBase; plyrmr to perform common data transformation operations over large datasets; rmr2, which provides a map-reduce API; and ravro, which reads and writes Avro files.
  • Spark: with SparkR it is possible to use Spark’s distributed computation engine to enable large-scale data analysis from the R shell. It provides a distributed data frame implementation that supports operations like selection, filtering, aggregation, etc., on large data sets.
  • Programming with Big Data in R (pbdR): based on MPI, it can be used on high-performance computing (HPC) systems, providing a true parallel programming environment in R.

 

Novel distributed platforms also combine batch and stream processing, providing a SQL-like expression language—for instance, Apache Flink. There are also higher levels of abstraction that allow you to create a data processing language, such as the recently open sourced project Apache Beam from Google. However, these novel projects are still under development, and so far do not include R support.

 

After the data preparation step, the next common data science phase consists of training machine learning models, which can also be performed on a single machine or distributed among different machines. In the case of distributed machine learning frameworks, the most popular approaches using R are the following:

 

  • Spark MLlib: through SparkR, some of the machine learning functionalities of Spark are exported in the R package. In particular, the following machine learning models are supported from R: generalized linear model (GLM), survival regression, naive Bayes and k-means.
  • H2O framework: a Java-based framework that allows building scalable machine learning models in R or Python. It can run as a standalone platform or on an existing Hadoop or Spark implementation. It provides a variety of supervised learning models, such as GLM, gradient boosting machine (GBM), deep learning, Distributed Random Forest and naive Bayes, as well as unsupervised learning implementations like PCA and k-means.

 

Sidestepping the coding and customization issues of these approaches, you can seek out a commercial solution that uses R to access data on the front-end but uses its own big-data-native processing under the hood.

 

  • Teradata Aster R is a massively parallel processing (MPP) analytic solution that facilitates the data preparation and modeling steps in a scalable way using R. It supports a variety of data sources (text, numerical, time series, graphs) and provides an R interface to Aster’s data science library that scales by using a distributed/parallel environment, hiding the technical complexities from the user. Teradata also has a partnership with Revolution Analytics (now Microsoft R) whereby users can execute R code inside Teradata’s platform.

 

Teradata has also released an open source CRAN package called toaster that allows users to compute, analyze, and visualize data with (on top of) the Teradata Aster database. It computes data in Aster by taking advantage of Aster’s distributed and parallel engines, and then creates visualizations of the results directly in R. For example, it lets users run K-Means or execute several cross-validation iterations of a linear regression model in parallel.

 

Also related is MADlib, an open source library for scalable in-database analytics currently in incubator at Apache. There are other open source CRAN packages to deal with big data, such as biglm, bigpca, biganalytics, bigmemory or pbdR—but they are focused on specific issues rather than addressing the data science pipeline in general.

 

Big data analysis presents many opportunities to extract hidden patterns when you use the right algorithms on an underlying technology that helps gather insights. Connecting new scales of data with familiar tools is a challenge, but tools like Aster R offer a way to combine the beauty and elegance of the R language with a distributed environment that can process data at scale.

 

This post is a collaboration between O'Reilly Media and Teradata.

 

Related Posts:

Scalable Data Science with R and Aster: O'Reilly Presentation 

From Excel to Aster in under 9 lines of R code 

How To Calculate Correlations on Big Data with Aster R 

Fast Track to Aster Graph Database with R and toaster 

Aster R Architecture Basics 

The Aster R Orange Book is a great resource for Aster R workflows and sample code.

The Aster R Orange Book: Best Practices for Aster R Coding