Skip navigation
All Places > Learn Aster > Blog
1 2 3 Previous Next

Learn Aster

260 posts

Aster's nPath function is a powerful tool for uncovering patterns in sequential data.  There are many examples here on the Aster Community for writing those queries and visualizing the output.  But that's not all Aster has to offer for path analysis.  There are more steps that you can take to create even more value.  I'm going to show you a few easy queries that will take the output from nPath and start to build out a model that may be used to predict the next step on a path.


I'll use the demo data set that ships with Aster Express, bank_web_clicks, though really any output from your own nPath analysis will work just as well.  Here's a simple Tree diagram that I've created from this data, showing a handful of paths that customers have taken on this web site.



These Tree diagrams are incredible useful for visualizing paths and helping to uncover patterns for deeper analysis.  Additionally, having an underlying table with metrics around parent nodes, child nodes, counts and percentages could begin to form the foundation for a model to predict likely outcomes for other customers on similar journeys, allowing you to answer such questions as "what's the next likely step my customer will take".  In a few steps, I'll show how to build out a table with rows like this showing Parent paths and Next steps with calculated probability:




Introducing the Aster path_analyzer function.  This takes the output from your nPath query, with 'group by' counts for each path, and then breaks out all the paths and counts from each parent node, working its way down the full depth of the tree.  Getting the "top paths and counts" from nPath is a pretty standard part of our normal nPath workflow and done easily with a simple query.  Let's build a new table, called 'top_bank_paths'.  Here I'll limit this to the top 1000 paths.  (Since these tables are small, I'm using 'replication' as my distribution).


create table top_bank_paths
   distribute by replication
select path,count(*) as cnt
from bank_paths
group by 1
order by 2 desc
limit 1000


With results:




We will now use these as input to path_analyzer:


SELECT * FROM Path_Analyzer
    ON (SELECT 1)
    InputTable ('top_bank_paths')
    OutputTable ('bank_path_analyzer_output')
    --SeqColumn ('path')  Aster v7.0
    --CountColumn ('cnt') Aster v7.0
    Seq('path')  --Aster v6.20
    Cnt('cnt')   --Aster v6.20


The syntax did change slightly for the Seq and Cnt parameters with Aster Analytics 7.0, so I've included both above. Simply comment/uncomment based on your library version.  Also, if you want to rerun path_analyzer, you'll need to first drop your output table.  In this example, that would be:

     drop table bank_path_analyzer_output;


The output is a bit verbose, so here are the important fields:




With this 'not too complex' SQL, we can now build a table for each parent node with corresponding 'next child node' and percentage calculation.  I'm doing a little string manipulation; skipping those first 4 characters in the parent format and likewise in the 'next_node' field, and trimming the trailing ']' character, which represents 'end of path'.  I'm also using a window function, row_number(),  to rank each child by percentage so that we can later filter by top 1 or 2 child nodes:


create table top_bank_next_path
    distribute by replication
select substring(N.parent from 4) as parent
     ,replace(substring(N.node from (3 + length(N.parent))),']','') as next_node
     ,P.subpath_cnt as parent_cnt
     ,N.subpath_cnt as next_cnt
     ,(1.0 * N.subpath_cnt / P.subpath_cnt)::numeric(3,2) as probability
     ,row_number() over (partition by N.parent order by N.subpath_cnt desc) as rank
from bank_path_analyzer_output N
    ,bank_path_analyzer_output P
where P.depth > 0
and   N.parent = P.node





And finally, drum roll, here are some answers to the question "what is my customer likely to do next based on where they are in their journey".  In this next query, I'm using some parameters like 'probability greater than 30%', 'more than 20 similar paths have occurred' and perhaps most importantly, 'where rank equals 1', which means its the mostly likely next step for each parent path.


select parent,next_node
from demo.top_bank_next_path
where rank = 1
and probability > .3
and parent_cnt > 20
order by probability desc
limit 10




There it is, a pretty straight forward, and I think a very powerful new step to add to all your Aster nPath workflows to create a very useful new table to answer many question from your pathing data.


Did you miss the live webinar?  You can listen to the replay featuring Matt Mazzarell, Data Scientist, Teradata. 



Unable to View Video here?  Use this link:  Combining Text and Graph Analytics for More Context in Your Networks featuring Matt Mazzarell - YouTube 


Video Details:

    Adding more techniques in text analytics can better inform you of the working relationships in a network. Topic analysis, sentiment extraction and predictive analytics have helped large organizations solve complex problems using large volumes of communication data from various sources. This webinar sessions explores some examples of how customers have used these techniques to add more context to their networks.


Matt Mazzarell is a Data Scientist in the Teradata Aster Solutions team with experience implementing analytic solutions for many clients in the Fortune 1000 who seek to solve business problems using large and varied data sets. He has experience collecting customer requirements and prescribing specific analytic techniques across all industry verticals. Matt is responsible for several pre-packaged solutions including Communications Compliance, Product Recommender, and Healthcare Fraud. He is dedicated to enabling the non-technical user to perform analytics at scale that drive significant value for the business.




Organization Network Analysis (ONA) In Action -  Connected Commons Webinar Series

Some of you at Partners 2016 wanted to play with the QGRAM function to do text matching besides the IdentityMatch() function. I explain here how to invoke it and you can add that to your environment and try it out. 


Algorithm Background

Read more about QGRAMs, Cosine Similarity etc., in my previous blog post to understand different text similarity techniques.


Data Science - Fuzzy Matching & Similarity Measures


In this blog post, I only explain a simple use case of matching short text, especially merchant names in transactions.



If a business has transaction data that has merchant names, it's often a common problem to 'disambiguate' those names and map that to a yellow page/registered business name.  If you have millions of business names that mutate over time, morph etc., in the data, it is almost next to impossible with traditional techniques such as string matching/regular expressions/SQL LIKEs etc.,This doc. here explains how we solved this using a technique called 'Weighted QGRAMS with Cosine similarity'. The SQL/MR function QGRAM was created by the Aster PS and combined that with the 6.10 Analytic function called VectorDistance() that computes the cosine similarity metric. It uses a probabilistic technique to break down the merchant names to Q-grams (or letter ngrams), weigh it based on the position in the string and then by applying Cosine Similarity to find the best match.


SQL/MR Functions Used:


QGRAM() must be installed in your cluster schema.  Download the SQL MR: QGRAM  and install it your schema or request your DBA to install in your cluster using:



from act. 


VectorDistance() is available in the later Analytic foundation libraries and should be used in conjunction with QGRAM.




Yellow Page Dictionary (standardized names you want to rollup to):





Ace Mortgage Inc


Ace Credit Card Inc


Ace Credit Bank


Ceasar Frozen Yogurt LLC


8/12 Convienence Store


Teradata Corp


Walmart Super Market






San Mateo City


Merchant names that need to be 'disambiguated' or fuzzy joined with above table:





Ace MG






Ceasar Frozen Yogurt


8/12 Store


Teradata Corporation


Walmart Super M




Whole Foods Market


San Mateo City

Final Output (after the Weighted QGRAM/Cosine Similarity Workflow next section):





cos_score (or fuzzy score/confidence)

Ace MG

Ace Mortgage Inc



Ace Credit Card Inc



Ace Credit Bank


Ceasar Frozen Yogurt

Ceasar Frozen Yogurt LLC


8/12 Store

8/12 Convienence Store


Teradata Corporation

Teradata Corp


Walmart Super M

Walmart Super Market





Whole Foods Market



San Mateo City

San Mateo City


SQL, SQL/MR Workflow code to try in your Teradata Studio:


drop table if exists bizname_yellow_pages;
create table bizname_yellow_pages
 id bigint,
 merchant varchar
) distribute by hash(id);


insert into bizname_yellow_pages values (1,'Ace Mortgage Inc');
insert into bizname_yellow_pages values (2,'Ace Credit Card Inc');
insert into bizname_yellow_pages values (3,'Ace Bank Inc');
insert into bizname_yellow_pages values (4,'Ceasar Frozen Yogurt LLC');
insert into bizname_yellow_pages values (5,'8/12 Convienence Store');
insert into bizname_yellow_pages values (6,'Teradata Corp');
insert into bizname_yellow_pages values (7,'Walmart Super Market');
insert into bizname_yellow_pages values (8,'Apple');
insert into bizname_yellow_pages values (9,'Wholefoods');
insert into bizname_yellow_pages values (10,'San Mateo City');


drop table if exists operational_biznames;
create table operational_biznames
 id bigint,
 bizname varchar
) distribute by hash(id);


insert into operational_biznames values(1,'Ace MG');
insert into operational_biznames values(2,'ACC');
insert into operational_biznames values(3,'ACB');
insert into operational_biznames values(4,'Ceasar Frozen Yogurt');
insert into operational_biznames values(5,'8/12 Store');
insert into operational_biznames values(6,'Teradata Corporation');
insert into operational_biznames values(7,'Walmart Super M');
insert into operational_biznames values(8,'apple');
insert into operational_biznames values(9,'Whole Foods Market');
insert into operational_biznames values(10,'San Mateo City');


drop table if exists qgram_yp;
create table qgram_yp distribute by hash(qgram)
select id, qgram, weight
   from qgram(
       on bizname_yellow_pages

drop table if exists qgram_ops;
create table qgram_ops distribute by hash(qgram)
select id, qgram, weight
 from qgram(
     on operational_biznames

drop table if exists cosine_similarity_match;
create table cosine_similarity_match distribute by hash(target_id)
select * from VectorDistance(
   on qgram_ops as target partition by id
   on qgram_yp as ref dimension

select id, b.bizname, c.merchant, 1- a.distance as cosine_score
cosine_similarity_match a
   inner join operational_biznames b on (a.target_id =
   inner join bizname_yellow_pages c on (a.ref_id =
order by id


Previously in this series

In my last post I started exploring concrete implementation of data science pipelines with Aster and R. We covered programming environment, loading data into Aster, and importance and easiness of testing every step in the pipeline. In this post we'll focus on perhaps the most prevalent part of every workflow - data manipulation. To make examples more translatable we will embrace dual approach of illustrating each technique with 2 equivalent examples: one using  TeradataAsterR and another using dplyr package. Finally, in the spirit of embracing "test everything" each example will result in comparing them for equivalency.


The Grammar of Data Manipulation

dplyr function design follows clear separation of concern of the split-apply-combine process and the grammar of data manipulation:

  • filter: keep rows with matching conditions;
  • select: select, drop or rename attributes by name;
  • arrange: order rows by attributes;
  • mutate: add new attributes;
  • summarise: reduce multiple attribute values to a single value;
  • group_by: group data by one or more variables to apply a function.

Graph demonstrating Split-Apply-Combine.


TeradataAsterR functions are the mix of 2 approaches: classic R and split-apply-combine process embraced by dplyr. So having examples for both will add clarity and understanding to how things get done in both cases.


But before jumping to the grammar of data manipulation functions we begin with complimentary operation that combines data from two data sets into one. 


Joining Tables

Aster is a relational data store (sorry if I never made this important point before) so its support for joining tables comes naturally with fully SQL-compliant SELECT statement that includes:

  • Inner join
  • Left, right, and full outer joins
  • Semi- and anti-join (same as filtering joins in dplyr)
  • Cross join (Cartesian product)

In Aster R this translates  into two  functions that both perform joins resulting in a virtual data frame of query type and both act on two virtual data frames:

  • ta.merge with R style merge parameters that will be translated to SQL
  • ta.join that offers SQL-like parameters including choice of join type

I suggest using the latter as it offers full functionality with better transparency and compatibility with actual operations taking place in Aster.


For our examples we need to combine batting stats in Batting with players attributes from Master. This requires inner join and for additional clarity we demonstrate how to accomplish this same operation using three methods: R base function merge(), dplyr function inner_join(), and Aster R ta.join():

# R
big.df = merge(Batting, Master, by="playerID")

# dplyr
big.tbl = Batting %>%
  inner_join(Master, by="playerID")

# Aster R
big.ta = ta.join(batting.ta, master.ta, by="playerID")
ta.colnames(big.ta)[c('x.row_names','x.playerID')] = c('row_names','playerID')

Aster R required some post-processing to remove prefixes in the attribute names (prefixed by ta.join() only to those that were found in both tables). Other than that all three look boringly similar  (if you ignore syntactic sugar of magrittr pipes. Yes, I had option of not using it but the goal was to have each example as clear as possible).

As always we test all 3 methods produced same results (notice that we ran 2 comparisons taking advantage of transitive property of the equivalency):

compareDataInRandAster(big.df, big.tbl, 
  key = c("playerID", "yearID", "stint"))

compareDataInRandAster(big.tbl, big.ta,
  key = c("playerID", "yearID", "stint"))

Note that playerID is both a primary key in Master and a foreign key in Batting which was sufficient in join. But resulting data set inherited the key from Batting that consists of 3 attributes: plyaerID, yearID, and stint.


We'll use data sets to illustrate the grammar of data manipulation functionality as part of the pipeline workflow (we will limit examples to dplyr and TeradataAsterR functions from now on).



Feature engineering is one of top reasons new attributes get introduced in the data science pipelines and is often a key ingredient of successful models. For this purpose the grammar of data manipulation contains mutate function that adds new or changes existing attributes: dplyr keeps calling it mutate() while Aster R has ta.transform().


The following example creates two attributes:

  • key containing concatenated elements of composite table key that uniquely references each row,
  • age with player's age calculated using year of birth and current year:
# dplyr
big.prepped.1.tbl = big.tbl %>%
  mutate(key = paste(playerID, yearID, stint, sep="-"),
         age = yearID - birthYear)

# Aster R
big.prepped.1.ta = ta.transform(big.ta,
    key = paste(playerID, yearID, stint, sep="-"),
    age = yearID - birthYear)

Again, examples contain strikingly similar code for both. Still, underneath, there are some differences to be aware of. For example, while mutate() can immediately use newly introduced variables inside the call ta.transform() can't.


As before (and after as well) we test results of both functions for equivalency:

compareDataInRandAster(big.prepped.1.tbl, big.prepped.1.ta,
    key = c("key"))

Note how new attribute key being a unique reference was put immediately to work in the test.



Arguably the most common data manipulation operation is reducing data set to smaller one which may take two forms:

  • selecting attributes (vertical subset);
  • selecting rows (horizontal subset).

While dplyr uses functions select() and filter() respectively, Aster R combines conventional R to select attributes and function ta.subset() to select rows.

 We continue our example by selecting batting metrics identified by their names and stored in a variable batting_metrics while removing all rows before year 1995 (yearID >= 1995) and players without home runs (! which Aster translates to IS NOT NULL operator):

batting_metrics = c("G", "AB", "R", "H", "HR", "BB", "SO")

# dplyr
big.prepped.2.tbl = big.prepped.1.tbl %>%
  select_(quote(c(key, playerID, yearID, stint, age)),
          .dots=batting_metrics) %>%
  filter(! & yearID >= 1995)

# Aster R
big.prepped.2.ta = ta.subset(big.prepped.1.ta[,c("key","playerID","yearID","stint","age",
    batting_metrics, "row_names")],
    ! & yearID >= 1995)

One distinct feature of dplyr is its use of non-standard evaluation (NSE, see vignette("nse") for details) which is great for translating R to SQL and interactive use but is hard to program with. For that reason dplyr offers SE versions of its functions designated with suffix _. With Aster R we use NSE inside ta.submit() but not with attribute selection that uses data frame syntax.


And our standard test comparing results in dplyr and Aster R completes this example:

compareDataInRandAster(big.prepped.2.tbl, big.prepped.2.ta, key = "key")


Data manipulation function that reduces multiple attribute values to a single value is usually referred to as summarization. With Aster R we have at least three approaches to data summarization:

  • aggregates using SQL GROUP BY, e.g. MAX, MIN, AVG, etc. with ta.summarise() or ta.summarise.each(),
  • SQL window functions available with ta.transform(),
  • and in-database R using ta.tapply() or that offers the most flexible  option of pushing arbitrary R code to run in parallel over partitioned data inside the Aster database.

Note that summarizing doesn't necessarily imply collapsing of data as reduced values could  be attached to the original rows (the case for window functions). What summarizing always includes is dividing operation into two steps:

  1. a grouping of data based on the values of certain attribute(s) (GROUP BY clause in SQL or INDEX argument in tapply() and ta.tapply())
  2. a computation step that computes or reduces data with aggregation, window or arbitrary R functions within each group.

In fact, dplyr embraces this concept fully by always invoking a group_by() function first followed by appropriate value reduction function.



SQL GROUP BY clause should always come to mind first. Aster R implements it using ta.summarise() that encompasses both grouping and computation, while dplyr as mentioned before invokes these steps explicitly by composing the group_by() and summarise():

# dplyr
summary.1.tbl = big.prepped.2.tbl %>%
     group_by(playerID, yearID, age) %>%
     summarise_each_(funs(sum), batting_metrics)

# Aster R
summary.1.ta = ta.summarise(big.prepped.2.ta, = c("playerID", "yearID", "age"),
     G=sum(G), AB=sum(AB), R=sum(R), H=sum(H), HR=sum(HR),
     BB=sum(BB), SO=sum(SO)) 

compareDataInRandAster(summary.1.tbl, summary.1.ta,
     key = c("playerID","yearID"))

In this example we consolidated the same player's records within single year into one record for simplicity. While it's not common that players change teams within the same season it does happen. And Lahman designates a record for each player's team stint, which for our purpose we consider unnecessary detail. So aggregates were used to consolidate stats to one player per year by summing up player's stats from all teams he played that season. Again, with dplyr we used SE version summarize_each_() to take advantage of existing list of all metrics in batting_metrics. With Aster R we included all aggregates explicitly in ta.summarise().


Using Window Functions

We could devote a whole blog post (or a series, or a book, easily) to window functions. But let me instead quote three sources - one from Aster, one from PostgreSQL, and lastly from dplyr vignette:

Window functions allow the calculation of aggregates and other values on a per-row basis as opposed to a per-set or per-group basis... For example, a window function may be used to compute a running sum, a moving average, a delta between values in neighboring rows, or to apply ranking and row numbering to a table.

- Teradata Aster Database User Guide

A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities. Behind the scenes, the window function is able to access more than just the current row of the query result.

- PostgreSQL 9.1 Manual

A window function is a variation on an aggregation function. Where an aggregation function, like sum() and mean(), takes n inputs and return a single value, a window function returns n values. The output of a window function depends on all its input values, so window functions don’t include functions that work element-wise, like + or round(). Window functions include variations on aggregate functions, like cumsum() and cummean(), functions for ranking and ordering, like rank(), and functions for taking offsets, like lead() and lag().

- Window functions vignette

Besides its per-row application another important feature of window functions is support for order within groups, which allows for enumerating rows, computing lags, cumulative and other order-dependent functions. Thus, the syntax supports both grouping and ordering options depending on specific window function.

For our example we will compute following five values:

  • current season number for each player with row_number();
  • current career HR value (up to current year) for each player with cummax();
  • career HR for each player with max() (we could compute same with summarise() but not without collapsing groups into single row unlike with window function when each row is preserved);
  • player's career start year with min();
  • player's career end year with max().
# dplyr
summary.2.tbl = summary.1.tbl %>% group_by(playerID) %>%
    currentSeason=with_order(order_by = yearID, fun = row_number, x = yearID),
    currentTopHR=with_order(order_by = yearID, fun = cummax, x = HR),
    topHR=max(HR), startCareerYear = min(yearID), endCareerYear = max(yearID))

# Aster R
summary.2.ta = ta.transform(summary.1.ta,
  seasonsTotal=n(partition = playerID),
  currentSeason=row_number(order = yearID, partition = playerID),
  currentTopHR=cummax(HR, order = yearID, partition = playerID),
  topHR=max(HR, partition = playerID),
  startCareerYear = min(yearID, partition=playerID),
  endCareerYear = max(yearID, partition=playerID))

compareDataInRandAster(summary.2.tbl, summary.2.ta,
  key = c("playerID", "yearID"))

Again, let's go over differences and similarities between two implementations. As noted before dplyr always divides operation into composition of group_by() and mutate() where the latter invokes window and aggregate functions inheriting grouping and retaining each row. Aster R involves ta.transform() and applies window and aggregate functions inside the call.


Using In-Database R

At last we arrived at the most powerful option available with Aster R: in-database R execution. Looking at previous examples one may think that R programming really played a limited role as there were not much R code. Indeed, our Aster R constructs closely resembled SQL or grammar of data manipulation with both SQL GROUP BY and window functions inside the grouping and aggregation (data reduction) steps. While the same constructs preside over the option of in-database R the logic transitions to R-centric paradigm based on R tapply() function and functional feature of R.


With dplyr not much changes as its functionality naturally supports functional R and doesn't leave client R environment (at least without involving extensions). On the other hand Aster R provides us with function ta.tapply() that does two important things:

  1. it expands reach of tapply() onto Aster tables by enabling parallel and distributed execution within Aster database;
  2. and it transports R code for execution from Aster R client to Aster R sandbox environments configured and available with proper installation on the Aster database.

Availability and version of R on the Aster database could be easily checked with commands:

[1] TRUE
> ta.R.Version()

Please consult with Teradata Aster R documentation for more details on how to configure and test R in-database environment.


For our example we want to answer the following question: for each player find his age when he had a career HR season (i.e. a season when player hit most HRs in his career). For simplicity we'll use topHR attribute from the previous example when calculating player's age when he hit most HRs in a season:

# dplyr
summary.3.tbl = summary.2.tbl %>%
  group_by(playerID) %>%
    seasons_total = seasonsTotal[[1]],
    max_hr = topHR[[1]],
    top_hr_age = age[HR == max_hr][[1]])

# Aster R
summary.3.ta = ta.tapply(summary.2.ta, INDEX=summary.2.ta$playerID,
  FUN=function(x){c(x$seasonsTotal[[1]], # seasons_total
                    x$topHR[[1]],        # max_hr
                    x$age[x$HR == x$topHR[[1]]][[1]])}, # top_hr_age
  out.tadf = list(columns=c("playerID", "seasons_total",
                            "max_hr", "top_hr_age")))

compareDataInRandAster(summary.3.tbl, summary.3.ta, key = "playerID")

Because Aster R version does things almost letter by letter like dplyr it is advisable to review the latter first. Our goal is to calculate player's stats so summarization will reduce number of rows for each player to 1, hence, we use group_by() with summarize() construct. Next, seasons_total and max_hr have the same values within each player's group so we always pick very first value for each. Things become R-stylish in the line 7 that assigns value to top_hr_age: given vector (array) age we pick its element per condition of corresponding HR value is equal to career maximum (i.e. equal to max_hr) . With Aster R we repeat the same logic but within tapply-style function call: each player's rows are designated by the variable x with anonymous function (line 11) assigned to argument FUN. Lastly, argument out.tadf (line 14) controls how Aster saves results in the database: in this case we explicitly rename column names using argument columns.


What really happened when we called ta.tapply() is radically different from all previous examples: the anonymous function assigned to FUN was serialized and sent over network to Aster worker nodes running R sandboxes where it was deserialized and loaded for execution. After that (you may think that was just a prep) Aster read the specification on how to run it from 2 arguments: virtual data frame (1st argument) referred to underling table and INDEX identified a column value to use to partition the data (line 10). At this point Aster has all ingredients to execute anonymous R function in-database on distributed data and in parallel. It will save results into new table and returns to Aster R that was waiting for ta.tapply(). After execution in Aster completes it returns a virtual data frame with results. Lastly see compareDataInRandAster() that tests for the same results with dplyr and Aster R.


Note that we could pass any R function (including references to 3d party packages given that they are installed on Aster as well) with FUN. This method elevates Aster R to both a framework and a platform for executing R code in parallel across distributed data in a manner consistent with R tapply() function. With proper configuration and algorithm design running parallel R should become a standard function in the daily data scientist routine with Aster. 


In the next post we shift gears to show elements of data science workflow such as exploratory analysis, principal component analysis (PCA) and predictive model.


Image sources:

Red Hat Developer Program

The Programming Historian

Originally published at Path + Text = Peanut Butter + Chocolate.


The text that your customers share when they chat online, post on social media, or call your call centers provides important context about their experiences as they traverse their so-called customer journeys.


Historically though, path (or journey) analysis and text analysis have been distinct domains. Valuable, yes, but difficult or impractical to do in concert. There has been a dearth of tools that could enable you to investigate text in the context of the journey or path from which it came.


That changes today. Yes, you can analyze text and paths together. It’s not only practical and powerful – it’s extremely easy!


The screenshots and copy below walk you through the new text functionality built into our Path Analysis Guided Analytics Interface.


Highlight a path to view text records

First, you click to “Highlight a Path” and then “View Text Records.”


Text on a path

In the Text tab, events or nodes that contain text are filled in orange in the path at the top of the page. The page also contains a table that displays customer ID’s, time stamps, and the text of the record.


Filter by account

You can filter the results by ID or term, and move between events that contain text.


Filter by term

If you filter by term, the term is highlighted in the resulting records, which should help you get through your analysis a bit quicker.


Top tokens

You may want to check out the “Top Tokens” tab to identify some keywords for filtering.


Hopefully the images and workflow above show how easy and powerful this contextual text analysis can be. As a user, you will feel like these text capabilities are seamlessly integrated into your path analytics with the guided analytics interface.


If you’d like to learn more or see a demo, please feel free to send a note to

With all the news about Amazon and Whole Foods and the adoption of online grocery shopping, I just wrote a blog over in the Analytic Solutions section of this site about product recommendations for grocers. Check it out here - Grocers – Make The Right Recommendations, Or Get Left Behind.


Amazon plans to acquire Whole Foods. Retailers and grocers have only had a couple weeks to process the news, and they still have many questions as they plan their competitive positions and strategies.


That news was still fresh (no pun intended) in my mind as I read the eMarketer Retail headline “Online Grocery Shopping Is No Longer Just a Millennial Story.” Twenty-three percent of households have purchased groceries online in the last three months, according to a survey by TrendSource cited by the article. That statistic may not surprise you. But the fact that 21 percent of baby boomers claim to have done so really opened my eyes and made me think of how rapidly the grocery space is evolving.


There is no denying it — if you as a grocer are not making personalized product recommendations, your days are numbered...


Read more.

Previously in this series

In my first post I explained driving force (value), defined goals (repeatable, consistent, testable, deployable, robust, scalable and functional process) and outlined structure (workflow) behind data science pipelines. In this post we'll begin examining data science pipelines using concrete examples with Teradata Aster by building workflow in R. Here I focus on connecting to data store and analytical engine (Aster), loading data, and test-driven development of workflow steps to achieve production grade code.


Teradata Aster R Philosophy

Before we dive into the details let me clarify general approach underlying R development with Aster R. Any aspiring data scientist practicing R can immediately take advantage of Aster powerful data store with RODBC and client-server style of programming. But having said that we should realize that such approach would require:

  1. learning and practicing SQL with its Aster extensions like SQL/MR and SQL/GR;
  2. or/and moving data from Aster to R memory to utilize R functions and programming

With Aster R Teradata aims at overcoming these by introducing R functions that look and behave similarly to R but act on and execute inside Aster data store by translating each call to native SQL and its extensions. Such functions are designated with prefixes ta. and aa. by Teradata. But it went even further by expanding R apply-functionality with functions like ta.apply() and ta.tapply() to push and execute R code inside Aster nodes taking full advantage of its capabilities but demanding additional configuration and advanced skills.


Teradata Aster R Environment

R programming with Aster made possible with TeradataAsterR package which is unlike other R packages available from CRAN is downloaded directly from Teradata here. Hardly surprising it requires ODBC driver for Teradata Aster installed (also available from Teradata here as part of Client Tools for your platform). For the record, installing TeradataAsterR will also install its dependency - RODBC package - that maintains data access layer with Aster database. The following diagram illustrates both the data science pipeline execution environment and the Teradata Aster database:

Assuming we already configured ODBC datasource Aster_ODBC_dsn that points to Aster instance with database my_Aster_db initializing and connecting to Aster draws 2 lines of code:


ta.connect(dsn="Aster_ODBC_dsn", database="my_Aster_db")

From this point on TeradataAsterR implicitly maintains connection reviving and reconnecting to the database as necessary. For that reason no explicit connection object is created even though ta.connect does return one to hold on to. Upon success Aster R reports basic information about established connection:

RODBC Connection 1

Loading Data

With Aster R data stored in the Aster Database is accessed using the concept of virtual data frames (primarily as there are other types of virtual objects in Aster R that we won't discuss here). The virtual data frame concept's goal is transparent handling of data stored in Aster just as it were in R memory using Aster R functions.


In our examples we'll use various baseball datasets from the the excellent Lahman package. Our first task will be loading the datasets into Aster database, for example Master data frame with all baseball players:


ta.create(Master, table = "master", schemaName = "baseball",
          tableType = "fact", partitionKey = "playerID")

This example uses the function ta.create() to create a table named master in the baseball schema of the Aster Database. Lahman package contains more datasets we'll use later so we added new function createBaseballData() to save us some typing:

createBaseballData <- function(data, table, schema, 
                               partitionKey, analyzeFlag) {
  t = proc.time()
  ta.dropTable(tableName = table, schemaName = schema)
  ta.create(data, table = table, schemaName = schema,
            tableType = "fact",
            partitionKey = partitionKey,
            row.names = TRUE, analyze = analyzeFlag)
  runtime = proc.time() - t

# Master table
createBaseballData(Master, "master", "public", "playerID", TRUE)

# Batting table
createBaseballData(Batting, "batting", "public", "playerID", TRUE)

# Fielding table
createBaseballData(Fielding, "fielding", "public", "playerID", TRUE)

Note, that createBaseballData() is more involved: it records time it took to create each table, it makes sure the table didn't exist before, and it runs SQL ANALYZE command at the end if desired. Finally, it creates implicit row name column which would be a critical factor to let Aster R work with virtual data frames. With R we take advantage of its rich programming features and environment to create functional, robust and testable code as part of the overall workflow.


Worth mentioning the rest of functions to load data into the Aster Database:

  • ta.push() to load data from R into an existing Aster virtual data frame
  • ta.load.csv() to load data from a .csv file to an Aster virtual data frame
  • ta.reify() to materialize virtual data frame into temporary schema


Test, Test, Test

How to assure that ta.create() performed as intended that is it created exact copy of the Lahman data set in the Aster Database? The same way any program should - by running a test. For example, by comparing dimensions of the original data frame with newly created table:

# R and Lahman package
dim(Batting); dim(Master)

[1] 101332     22


[1] 18846    26

# Aster R
batting.ta ="batting")
master.ta ="master")

ta.dim(batting.ta); ta.dim(master.ta)

[1] 101332     23


[1] 18846    27 is our principal gateway to Aster R goodies - it creates a virtual data frame from an existing table, view or a SELECT query in the Aster Database - in this case we referenced tables batting and master. ta.dim() is to Aster R virtual data frames what dim() to R data frames. Each database table contains one additional column row_name created by Aster R to maintain functionality compatible with R data frames when using other data frame-like functions: 

  • ta.head()
  • ta.tail()
  • ta.length()
  • ta.names()
  • ta.ncol()
  • ta.nrow()
  • ta.colnames()
  • ta.dimnames()
  • ta.merge()
  • ta.cbind()
  • ta.rbind()
  • ta.order()
  • and more...


Our test compares dimensions between a data source (one of Lahman data sets) and its counter part table in Aster. Such rudimentary test may work ok while working on a prototype or unimportant tasks by minimizing network traffic between R and Aster. But it is less than inadequate for the production environment we target with the pipelines. To make workflow more bullet-proof one way is to bring data back from Aster into R, transform to a data frame and compare with the original: 

compareDataInRandAster <- function(df, ta, key, columns=names(df)) {

  columns = setdiff(unique(c(key, columns)), c("lgID","teamID","bats","throws"))

  data_in_R = df[, columns] %>% arrange_(.dots=setNames(key,key))

  data_from_Aster =[ , columns],
    stringsAsFactors = FALSE) %>% arrange_(.dots=setNames(key,key))

  all.equal(data_in_R, data_from_Aster)

compareDataInRandAster(Batting, batting.ta,
  key = c("playerID", "yearID", "stint"))

The function compareDataInRandAster() is near production quality test that validates any Lahman data set against its image loaded into Aster table. It utilizes:

  • package dplyr to order data (for convenience, we could use standard order() instead, of course) 
  • Aster R to convert virtual data frame to R data frame by moving data from Aster into R memory
  • all.equal() to validate that original data set and its image brought back from the database are exactly the same.

With R programming language building tests (using package like testthat) should be an integral part of your development process and with data science pipelines we seize this opportunity. In particular, replace all.equal() with expect_equal() to make compareDataInRandAster() compatible with the rest of the testthat suite.


In the next post we will review data manipulation techniques available with Aster R as part of data science pipeline and compare them with equivalent functions from dplyr package.



Image source: Red Hat Developer Program

Machine Learning (as limited AI) is here and now.

Some useful things Machine Learning offers today:

  • Predictive Analytics (on both events and numbers)

- Churn, Adoption, Fraud detection, Propensity to buy, Survival Analysis, Next best action, Estimation of margins, sales forecast, customer spend etc., Product recommendations.

  • Identifying objects and people in videos, pictures, and speech

- Scene detection, Find associated relationships with actors of interest

  • Analyze text and find nuggets/sentiments, topics, associations, translate etc.,
  • Large-scale clustering or grouping of customers using behaviors

All of the above on a boatload of data. As you might've heard before, some of the above require the machine learning models to be trained ahead of time.

ML Models can be trained on:

  • Historical data from transactional systems (holds indicators to past results)
  • Borrowed from the ML community/providers and then improvised/tweaked in your environment (transfer learning).

Popular applications in Machine Learning

As you can imagine, it's a green field on what you can do with above new capabilities. For a lot of businesses, generating top N lists of curated customers, prospects, recommendations is priceless and is an immediate application.

Customer Satisfaction and Retention is another area of interest. If businesses can follow the journey of a customer and offer products and services at the right time, can lead to stickiness.

© Can Stock Photo / aquir

Customer Churn - if only businesses knew the Top N list of customers who are in the last mile of switching out ... Predictions derived from a myriad # of variables, contexts.

Product recommendations - which products customers will most likely to click and how to position it in product search pages for maximum margin.

How not to make good the enemy of perfect

Most machine learning models use probabilities to calculate potential outcomes and thus quantifies uncertainty automatically. Obviously, we want the highest accuracy as possible and it comes with a price. As of today's state of affairs (which is changing as I type), there exists a tradeoff between accuracy and explainability. The more accurate the models are in prediction, the harder it is to explain the ML's decision - why is prospect A ahead of prospect B on the list or why product X is being recommended instead of product Y. The good part is you can have both high accuracy and low accuracy models side by side to get decent insights.

Other tradeoffs beyond accuracy and explainability: performance, the cost of deployment. See example chart for algorithm tradeoffs on a sequence prediction algorithm use case:

Traditional businesses are used to deterministic decisions and it's sometimes hard to communicate the algorithmic decisions which have high perplexity :). This is one of the reasons for slow adoption of ML and uncertainty that surrounds it - pun :)

For new emerging businesses in the last 5 years, you can see that letting ML make decisions is the norm and it's managed by a lot of A/B testing results to justify ML approach. It's often called the Champion->Challenger model approach, where better models have selected automatically in the ML setup depending on the deployment results of an A/B testing cycle. This works really well in the absence of historical data (also known as the cold start problem).

Moving to machine learning world is almost the same as us counting on the plane autopilot (which the pilots rely on) and more so with self-driving cars.

Technologies that can help businesses get there

Both proprietary and open source solutions provide ML options that businesses can leverage. Proprietary solutions such as Teradata's Aster or Open Source solutions around Apache Spark (including Deep Learning)/Python/R are available to quickly bring the businesses to the new age of machine learning/AI.

data science is a team sport - from finance to analytics

Just recently Ron Powell, independent analyst and expert with BeyeNETWORK interviewed a Senior Data Scientists of the Aster Strategy and Adoption Team at Teradata.  They discussed the importance of IT, business and data science working together to meet the needs of the business.


In this podcast, you'll hear discussion about the transition from finance to analytics.  And, from a data scientist perspective, you'll hear the challenges the modern organization silos have in working together.  Also, you'll hear discussion about the integration of open source R with Teradata Aster and how different silos show up in this area.


If you can spare 8 minutes and listen in on the conversation:




Leave your comments and questions below and we can continue this conversation.


 - Aster Community Team


Related Links:

Podcast:  Data Science is a Team Sport - From Finance to Analytics 

Podcast Script: The Transition from Finance to Analytics by Ron Powell - BeyeNETWORK


Read more Aster Community Blog Posts:

Blogs in the Learn Aster Community 

Kristen Saad

Percolation Centrality

Posted by Kristen Saad Jun 27, 2017

Centrality as a concept is largely subjective because it answers the question of what nodes are most important in a network.  Importance is heavily dependent on what question you're trying to answer: if you want to know who has the most connections in a network, you need to only count the degrees.  If you're trying to understand information transmission, Betweenness is great.  For cases where you want to understand the ability to transmit information over time and given specific nodal properties, however, you need something like Percolation Centrality (PC) - which is today's topic of discussion!


I want to talk about Betweenness first, as it's the basis for PC.  Betweenness as a concept assumes that information will travel over the "path of least resistance" - If your mother-in-law knows a joke, you're more likely to hear it from her than you are from the Pope - who may have told his friend, who tells his gardener, who tells her cousin, who tells her accountant, who finally tells you.  Likewise, Betweenness is calculated as the proportion of shortest paths that go through any given node.  If we make the assumption discussed above regarding shortest paths, then the proportion should describe the relative control a node has over the transmission of information, in this case, the telling of a joke.


The question then becomes: how does this control change over time, and given nodal states?  If your mother-in-law doesn't know the joke, she can't tell you the joke.  Therefore, her ability to transmit information (the joke) in this case would be 0.  This ability is something I'll refer to as a nodal state.  This is the one of the key additions that PC adds over Betweenness Centrality.  At each point in time, PC is computed as the proportion of percolated paths that go through a given node (where percolated paths refer to paths that originate from a percolated source).  If you're interested in learning more about the mathematical foundation of Percolation Centrality, please refer to the PLoS One publication by Piraveenan et. al found here


As part of my M.S. thesis, I developed logic to parallelize the computation of Percolation Centrality, and extended Aster's Betweennness code to accommodate the changes necessary.  The key files are attached to this post, however, if dependencies are needed, please contact Mike Rote ( for the complete software package.


The Percolation Centrality function takes the same arguments as the Betweenness function, with the exception that the "source" input is mandatory, and should be a table of "percolated" nodes (node ID is the only required column).  More information regarding the Aster implementation can be found in my thesis document ("Bulk Synchronous Parallel Implementation of Percolation Centrality for Large Scale Graphs", which will be published through OhioLink.


Percolation Centrality can be used in applications from diverse fields.  Examples include:

  • Epidemiology (identifying high-risk contagion points in a social)
  • Marketing (identifying potential recruiters in a network)
  • Transportation (issue tracking)





Several people asked me about the best way to install python and additional packages on their Aster environment.

Aster relies heavily on Python for its internal code and processing. For that reason we must never touch the base install in the /home/beehive/toolchain directories. This post describes the correct procedure for installing your own python including the deep learning packages theano, tensorflow and keras.


We leverage the standard Anaconda distribution because it includes around 500 python packages out of the box:

Anaconda package list | Continuum Analytics: Documentation 



Installation Process

1. Download Python 3.6 Anaconda from




2. Perform these installation steps on the queen and all workers


Note: while the queen does not require a python install, it is suggested to perform the same steps on all nodes




  • we will NOT use the root account to install python
  • anaconda will be installed under a new user "pythonu" to avoid any possible conflicts.
  • the pythonu account will not have a home directory.
  • anaconda will be installed in the /opt file system because other aster/teradata tools live there
  • the shell environment variables will not be modified
  • anaconda offers an unattended installation, please refer to website for instructions since our example uses the interactive installation method
  • the python directory structure will be owned by the extensibility account because the Aster MR functions execute under that low privilege user

Installation steps, to be repeated for all workers:


<login as root>

useradd pythonu                                                                        # create new user account to install python

cd /opt

mkdir anaconda

chown pythonu:users /opt/anaconda

ncli node clonefile /tmp/               # copy anaconda distribution to all worker nodes

su - pythonu

cd /tmp

bash                                           # perform  interactive installation




It is crucial to answer the prompts correctly:


Directory: /opt/anaconda/ana3

When prompted to update .bashrc with a new PATH always answer No



Sample Installation output:


Welcome to Anaconda3 4.3.1 (by Continuum Analytics, Inc.)


In order to continue the installation process, please review the license


Please, press ENTER to continue



Anaconda License



Copyright 2016, Continuum Analytics, Inc.


All rights reserved under the 3-clause BSD License:


Redistribution and use in source and binary forms, with or without

modification, are permitted provided that the following conditions are met:


* Redistributions of source code must retain the above copyright notice,

this list of conditions and the following disclaimer.


* Redistributions in binary form must reproduce the above copyright notice,

this list of conditions and the following disclaimer in the documentation

and/or other materials provided with the distribution.


* Neither the name of Continuum Analytics, Inc. nor the names of its

contributors may be used to endorse or promote products derived from this

software without specific prior written permission.














Notice of Third Party Software Licenses



Anaconda contains open source software packages from third parties. These

are available on an "as is" basis and subject to their individual license

agreements. These licenses are available in Anaconda or at . Any binary packages of these

third party tools you obtain via Anaconda are subject to their individual

licenses as well as the Anaconda license. Continuum reserves the right to

change which third party tools are provided in Anaconda.


In particular, Anaconda contains re-distributable, run-time, shared-library

files from the Intel (TM) Math Kernel Library ("MKL binaries"). You are

specifically authorized to use the MKL binaries with your installation of

Anaconda. You are also authorized to redistribute the MKL binaries with

Anaconda or in the conda package that contains them. If needed,

instructions for removing the MKL binaries after installation of Anaconda

are available at


Cryptography Notice


This distribution includes cryptographic software. The country in which you

currently reside may have restrictions on the import, possession, use,

and/or re-export to another country, of encryption software. BEFORE using

any encryption software, please check your country's laws, regulations and

policies concerning the import, possession, or use, and re-export of

encryption software, to see if this is permitted. See the Wassenaar

Arrangement <> for more information.


Continuum Analytics has self-classified this software as Export Commodity

Control Number (ECCN) 5D002.C.1, which includes information security

software using or performing cryptographic functions with asymmetric

algorithms. The form and manner of this distribution makes it eligible for

export under the License Exception ENC Technology Software Unrestricted

(TSU) exception (see the BIS Export Administration Regulations, Section

740.13) for both object code and source code.


The following packages are included in this distribution that relate to




The OpenSSL Project is a collaborative effort to develop a robust,

commercial-grade, full-featured, and Open Source toolkit implementing the

Transport Layer Security (TLS) and Secure Sockets Layer (SSL) protocols as

well as a full-strength general purpose cryptography library.



A collection of both secure hash functions (such as SHA256 and RIPEMD160),

and various encryption algorithms (AES, DES, RSA, ElGamal, etc.).



A thin Python wrapper around (a subset of) the OpenSSL library.


kerberos (krb5, non-Windows platforms)

A network authentication protocol designed to provide strong authentication

for client/server applications by using secret-key cryptography.



A Python library which exposes cryptographic recipes and primitives.


Do you approve the license terms? [yes|no]

>>> yes


Anaconda3 will now be installed into this location:



  - Press ENTER to confirm the location

  - Press CTRL-C to abort the installation

  - Or specify a different location below


[/home/pythonu/anaconda3] >>> /opt/anaconda/ana3


installing: python-3.6.0-0 ...

installing: _license-1.1-py36_1 ...

installing: alabaster-0.7.9-py36_0 ...

installing: anaconda-client-1.6.0-py36_0 ...

installing: anaconda-navigator-1.5.0-py36_0 ...

installing: anaconda-project-0.4.1-py36_0 ...

installing: astroid-1.4.9-py36_0 ...

installing: astropy-1.3-np111py36_0 ...

installing: babel-2.3.4-py36_0 ...

installing: backports-1.0-py36_0 ...

installing: beautifulsoup4-4.5.3-py36_0 ...

installing: bitarray-0.8.1-py36_0 ...

installing: blaze-0.10.1-py36_0 ...

installing: bokeh-0.12.4-py36_0 ...

installing: boto-2.45.0-py36_0 ...

installing: bottleneck-1.2.0-np111py36_0 ...

installing: cairo-1.14.8-0 ...

installing: cffi-1.9.1-py36_0 ...

installing: chardet-2.3.0-py36_0 ...

installing: chest-0.2.3-py36_0 ...

installing: click-6.7-py36_0 ...

installing: cloudpickle-0.2.2-py36_0 ...

installing: clyent-1.2.2-py36_0 ...

installing: colorama-0.3.7-py36_0 ...

installing: configobj-5.0.6-py36_0 ...

installing: contextlib2-0.5.4-py36_0 ...

installing: cryptography-1.7.1-py36_0 ...

installing: curl-7.52.1-0 ...

installing: cycler-0.10.0-py36_0 ...

installing: cython-0.25.2-py36_0 ...

installing: cytoolz-0.8.2-py36_0 ...

installing: dask-0.13.0-py36_0 ...

installing: datashape-0.5.4-py36_0 ...

installing: dbus-1.10.10-0 ...

installing: decorator-4.0.11-py36_0 ...

installing: dill-0.2.5-py36_0 ...

installing: docutils-0.13.1-py36_0 ...

installing: entrypoints-0.2.2-py36_0 ...

installing: et_xmlfile-1.0.1-py36_0 ...

installing: expat-2.1.0-0 ...

installing: fastcache-1.0.2-py36_1 ...

installing: flask-0.12-py36_0 ...

installing: flask-cors-3.0.2-py36_0 ...

installing: fontconfig-2.12.1-2 ...

installing: freetype-2.5.5-2 ...

installing: get_terminal_size-1.0.0-py36_0 ...

installing: gevent-1.2.1-py36_0 ...

installing: glib-2.50.2-1 ...

installing: greenlet-0.4.11-py36_0 ...

installing: gst-plugins-base-1.8.0-0 ...

installing: gstreamer-1.8.0-0 ...

installing: h5py-2.6.0-np111py36_2 ...

installing: harfbuzz-0.9.39-2 ...

installing: hdf5-1.8.17-1 ...

installing: heapdict-1.0.0-py36_1 ...

installing: icu-54.1-0 ...

installing: idna-2.2-py36_0 ...

installing: imagesize-0.7.1-py36_0 ...

installing: ipykernel-4.5.2-py36_0 ...

installing: ipython-5.1.0-py36_0 ...

installing: ipython_genutils-0.1.0-py36_0 ...

installing: ipywidgets-5.2.2-py36_1 ...

installing: isort-4.2.5-py36_0 ...

installing: itsdangerous-0.24-py36_0 ...

installing: jbig-2.1-0 ...

installing: jdcal-1.3-py36_0 ...

installing: jedi-0.9.0-py36_1 ...

installing: jinja2-2.9.4-py36_0 ...

installing: jpeg-9b-0 ...

installing: jsonschema-2.5.1-py36_0 ...

installing: jupyter-1.0.0-py36_3 ...

installing: jupyter_client-4.4.0-py36_0 ...

installing: jupyter_console-5.0.0-py36_0 ...

installing: jupyter_core-4.2.1-py36_0 ...

installing: lazy-object-proxy-1.2.2-py36_0 ...

installing: libffi-3.2.1-1 ...

installing: libgcc-4.8.5-2 ...

installing: libgfortran-3.0.0-1 ...

installing: libiconv-1.14-0 ...

installing: libpng-1.6.27-0 ...

installing: libsodium-1.0.10-0 ...

installing: libtiff-4.0.6-3 ...

installing: libxcb-1.12-1 ...

installing: libxml2-2.9.4-0 ...

installing: libxslt-1.1.29-0 ...

installing: llvmlite-0.15.0-py36_0 ...

installing: locket-0.2.0-py36_1 ...

installing: lxml-3.7.2-py36_0 ...

installing: markupsafe-0.23-py36_2 ...

installing: matplotlib-2.0.0-np111py36_0 ...

installing: mistune-0.7.3-py36_0 ...

installing: mkl-2017.0.1-0 ...

installing: mkl-service-1.1.2-py36_3 ...

installing: mpmath-0.19-py36_1 ...

installing: multipledispatch-0.4.9-py36_0 ...

installing: nbconvert-4.2.0-py36_0 ...

installing: nbformat-4.2.0-py36_0 ...

installing: networkx-1.11-py36_0 ...

installing: nltk-3.2.2-py36_0 ...

installing: nose-1.3.7-py36_1 ...

installing: notebook-4.3.1-py36_0 ...

installing: numba-0.30.1-np111py36_0 ...

installing: numexpr-2.6.1-np111py36_2 ...

installing: numpy-1.11.3-py36_0 ...

installing: numpydoc-0.6.0-py36_0 ...

installing: odo-0.5.0-py36_1 ...

installing: openpyxl-2.4.1-py36_0 ...

installing: openssl-1.0.2k-1 ...

installing: pandas-0.19.2-np111py36_1 ...

installing: partd-0.3.7-py36_0 ...

installing: ...

installing: pathlib2-2.2.0-py36_0 ...

installing: patsy-0.4.1-py36_0 ...

installing: pcre-8.39-1 ...

installing: pep8-1.7.0-py36_0 ...

installing: pexpect-4.2.1-py36_0 ...

installing: pickleshare-0.7.4-py36_0 ...

installing: pillow-4.0.0-py36_0 ...

installing: pip-9.0.1-py36_1 ...

installing: pixman-0.34.0-0 ...

installing: ply-3.9-py36_0 ...

installing: prompt_toolkit-1.0.9-py36_0 ...

installing: psutil-5.0.1-py36_0 ...

installing: ptyprocess-0.5.1-py36_0 ...

installing: py-1.4.32-py36_0 ...

installing: pyasn1-0.1.9-py36_0 ...

installing: pycosat-0.6.1-py36_1 ...

installing: pycparser-2.17-py36_0 ...

installing: pycrypto-2.6.1-py36_4 ...

installing: pycurl-7.43.0-py36_2 ...

installing: pyflakes-1.5.0-py36_0 ...

installing: pygments-2.1.3-py36_0 ...

installing: pylint-1.6.4-py36_1 ...

installing: pyopenssl-16.2.0-py36_0 ...

installing: pyparsing-2.1.4-py36_0 ...

installing: pyqt-5.6.0-py36_2 ...

installing: pytables-3.3.0-np111py36_0 ...

installing: pytest-3.0.5-py36_0 ...

installing: python-dateutil-2.6.0-py36_0 ...

installing: pytz-2016.10-py36_0 ...

installing: pyyaml-3.12-py36_0 ...

installing: pyzmq-16.0.2-py36_0 ...

installing: qt-5.6.2-3 ...

installing: qtawesome-0.4.3-py36_0 ...

installing: qtconsole-4.2.1-py36_1 ...

installing: qtpy-1.2.1-py36_0 ...

installing: readline-6.2-2 ...

installing: redis-3.2.0-0 ...

installing: redis-py-2.10.5-py36_0 ...

installing: requests-2.12.4-py36_0 ...

installing: rope-0.9.4-py36_1 ...

installing: ruamel_yaml-0.11.14-py36_1 ...

installing: scikit-image-0.12.3-np111py36_1 ...

installing: scikit-learn-0.18.1-np111py36_1 ...

installing: scipy-0.18.1-np111py36_1 ...

installing: seaborn-0.7.1-py36_0 ...

installing: setuptools-27.2.0-py36_0 ...

installing: simplegeneric-0.8.1-py36_1 ...

installing: singledispatch- ...

installing: sip-4.18-py36_0 ...

installing: six-1.10.0-py36_0 ...

installing: snowballstemmer-1.2.1-py36_0 ...

installing: sockjs-tornado-1.0.3-py36_0 ...

installing: sphinx-1.5.1-py36_0 ...

installing: spyder-3.1.2-py36_0 ...

installing: sqlalchemy-1.1.5-py36_0 ...

installing: sqlite-3.13.0-0 ...

installing: statsmodels-0.6.1-np111py36_1 ...

installing: sympy-1.0-py36_0 ...

installing: terminado-0.6-py36_0 ...

installing: tk-8.5.18-0 ...

installing: toolz-0.8.2-py36_0 ...

installing: tornado-4.4.2-py36_0 ...

installing: traitlets-4.3.1-py36_0 ...

installing: unicodecsv-0.14.1-py36_0 ...

installing: wcwidth-0.1.7-py36_0 ...

installing: werkzeug-0.11.15-py36_0 ...

installing: wheel-0.29.0-py36_0 ...

installing: widgetsnbextension-1.2.6-py36_0 ...

installing: wrapt-1.10.8-py36_0 ...

installing: xlrd-1.0.0-py36_0 ...

installing: xlsxwriter-0.9.6-py36_0 ...

installing: xlwt-1.2.0-py36_0 ...

installing: xz-5.2.2-1 ...

installing: yaml-0.1.6-0 ...

installing: zeromq-4.1.5-0 ...

installing: zlib-1.2.8-3 ...

installing: anaconda-4.3.1-np111py36_0 ...

installing: conda-4.3.14-py36_0 ...

installing: conda-env-2.6.0-0 ...

Python 3.6.0 :: Continuum Analytics, Inc.

creating default environment...

installation finished.

Do you wish the installer to prepend the Anaconda3 install location

to PATH in your /home/pythonu/.bashrc ? [yes|no]

[no] >>> no


You may wish to edit your .bashrc or prepend the Anaconda3 install location:


$ export PATH=/opt/anaconda/ana3/bin:$PATH


Thank you for installing Anaconda3!


Share your notebooks and packages on Anaconda Cloud!

Sign up for free:


3. Setup virtual python environment


Anaconda easily supports switching between multiple python versions. The setup depends on the network connectivity of the Aster environment.


If the Aster system has internet access on the workers:

/opt/anaconda/ana3/bin/conda create -n python36 python=3.6 anaconda               # create new environment and install all anaconda packages


If the Aster system has no internet access on the workers:

/opt/anaconda/ana3/bin/conda create -n python36 --clone root          # clone the root environment



You can install multiple versions, for example: /opt/anaconda/ana3/bin/conda create --name python27 python=2.7.13


4. Activate virtual environment



<login as pythonu>

/opt/anaconda/ana3/bin/conda info --envs                     # review available environments

source /opt/anaconda/ana3/bin/activate python36               # activate python 3.6 environment

/opt/anaconda/ana3/bin/conda list                                # list all installed packages





5. Reset ownership of python directory structure


This step is required to allow Aster MR functions to properly access the new python environment.


su - chown -R extensibility:extensibility /opt/anaconda 



6. Perform test


To test our new installation we will install a python test script and invoke a sql script using act or Teradata Studio.




import sys

import getopt

import numpy as np

import pickle from sklearn.ensemble

import RandomForestClassifier

import pandas as pd


print ('hello' +'\t' + 'bye')





select *

from stream (

     on (select 1)

     script('source /opt/anaconda/ana3/bin/activate python36; python')

     outputs('test1 varchar', 'test2 varchar')





Steps to test python install is correct:

  1. Save the python script in /tmp on the queen.
  2. Invoke act to install the file: act -U beehive -c "\install /tmp/"
  3. Invoke act to run the sql test: act -U beehive -f python_test.sql


Expected output:

test1 | test2


hello  |  bye (1 rows)



If you get permission or module loading errors, review the previous steps and verify that you have correctly set the permissions on the python directory structure. Also verify that you are using the correct paths in the sql script.



Do not forget to repeat these steps for all the workers.


Installing additional python packages 


Activate your python environment as shown earlier and run these commands to add key deep learning packages on each worker:

conda install -c conda-forge theano=0.9.0

conda install -c conda-forge tensorflow=1.1.0

conda install -c conda-forge keras=2.0.2





Downloading Python

Managing Python

Managing Packages



Recently I reviewed the sentiment analysis data provided by a customer and wanted to get a quick idea of the positive/negative words in their customer survey notes. To accomplish this I created a python script that runs in-database and takes the output from the standard Aster sentiment analysis function to provide more information about the contents of the survey notes.


If you do not have a favorite text data set I suggest downloading "Amazon fine food reviews" from kaggle.  See reference section below for a download link.






First we create a table to stage our text data and load with ncluster_loader:



ncluster_loader -U myuser -d mydb --skip-rows 1 -c amzn_reviews Reviews.csv


Now we can run the ExtractSentiment function. For simplicity we will use the built in dictionary instead of building our own model. 



The function outputs a standard POS/NEU/NEG polarity including a strength indicator. One benefit of using a dictionary to analyze the text data is that we get an extra field in the output: "out_sentiment_words".  This field gives us more information about the words used to determine sentiment,  their frequency in the sentence or document and a total positive and negative score.



Way back in 2013 I created a perl script to parse this data and store the output in an easy to use table. The perl script is quite simple and is run using the Aster stream function. Many people dislike perl for its syntax.  For this blog post I decided to quickly convert the script to python 2.x and show you how to run python in database. Note that you can accomplish a similar result with the regexp function in combination with the pivot SQL/MR.



What do we need to run python in-database?


  1. specify the input data
  2. a python script
  3. the script has to be installed in Aster.
  4. define the output what we want


Let's review these requirements in more detail:


  1. By default the stream function expects tab delimited input. As an example we will take the id field from our fine food reviews dataset as the first field and the out_sentiment_words as the second field to pass to the stream function.
  2. Our python script will be called "":
  3. The python script is installed in Aster (in the system database table) using the act command-line tool.

    act -U myuser -d mydb -w -c "\remove"

    act -U myuser -d mydb -w -c "\install"

  4. We specify the fields we want to see in the output and their type.



Python in-database


Now that we know how to run the stream function we can dig a bit deeper into the python script.


We use the default tab delimiter to parse the input data fed to the stream function. A loop reads the data from STDIN. When we encounter an empty line we assume all data has been read and exit the script:


I will not discuss the simple data manipulation done by the script. What is important is that we output our desired results:



And our output fields have to match our definition in the OUTPUTS parameter for the stream function.


The last line in the script is also important. We need to properly flush the buffer to make sure we obtain all the output.



Note that in python 3.x you can type  print (i, end='') which will automatically flush the buffer.



Our output


Finally we review out result in the expanded format that we wanted.  For each document id we get a summary of the total positive and negative score (a simple count of the words) and a list of all the detected words.  The word_score is always 1 for positive words and -1 for negative words.



To present the results we can generate a wordcloud of all the top words based on their regular frequency.  




To get a better idea of the importance of the words we can run TF/IDF against all documents and join the result with our amzn_detail_output table to allow filtering on only the positive (where word_score = 1)  or negative words  (where word_score = -1).




Top 100 Positive words based on tf/idf scores


Top 100 Negative words based on tf/idf scores



Building a new classification model?


One option is to rely on external and third party sentiment analysis models. For example scoring the call notes for a telecom customer using publicly available movie reviews will not be that effective or relevant. Ideally a lot of time and effort is spent on manually reviewing the data and assigning the correct categories for sentiment analysis.  For those situations where time is limited this approach could be a potential alternative.


Now that we have identified many of the common positive and negative words we can manually review the remaining documents/survey notes that have few identified words.  Once those leftover entries have been manually categorized we can build and fine tune a new sentiment analysis model.  






Sample data set: 


Belgian chocolates store display:

Verdonck | Winkel 

Here is the guide for getting started with Aster in the Azure Cloud.  Have fun! 

Great News!  Everything you need to experience the power of multi-genre advanced analytics is now available on Microsoft Azure Marketplace.  


Businesses can quickly provision an advanced analytics sandbox or a production analytic environment to address their most pressing business challenge.  Teradata Aster Analytics on Azure delivers machine learning, text, path and pattern, statistics and graph analytics to analyze multi-structured data and operationalize insights at scale.   The Aster Analytics on Azure includes everything you need to launch an analytic cluster with a Queen node and up to 32 workers. 


A subscription to Aster Analytics on Azure includes the following:


  • Right to use Aster Analytics Portfolio, Aster Client, Aster Database 6.20 and Aster AppCenter for analytic development and production purposes.  Note: Aster AppCenter must be launched through a separate VM.
  • Launch 2 to 32 Worker nodes to meet your performance and scalability requirements
  • Supports 2 to 256 TB of data for analysis
  • Support through the Teradata Premier Cloud Support
  • Access to Aster Analytics on Azure resources including the Getting Started Guide


Configuring and launching Aster Analytic for Azure is easy. Just follow these steps on the Azure Marketplace and you’ll be ready to discover the power and flexibility of Aster Analytics.


  1. Log on to Azure Marketplace and search for “Aster”.  You’ll find two Aster virtual machines (VM).  The first is the “Teradata Aster Analytics” solution template.  This template will guide you through the steps to configure and launch an Aster cluster.  The second is the “Teradata Aster AppCenter Hourly” VM that will launch Aster AppCenter.  AppCenter is optional.


log on to azure marketplace and search for aster


  1. Once you select the “Teradata Aster Analytics” option, Azure marketplace will display a brief description of the Aster software.  Click on “Create” to launch the solution template that will take you through a step-by-step process to configure an Aster Analytics cluster.
  2. You must first setup the basic user name and password for your Azure VM resource.  Next select the subscription account for software, storage and VM charges.  You must also enter an existing resource groups or create your own.    If you create your own, you must also select an Azure region.  Note: prices of VMs can vary based on location.


setup user credentials for azure vm resource


  1. Next step is to configure the Aster cluster:
    1. Enter a name that will be used as a prefix for all workers launched as part of this cluster. 
    2. Enter the number of workers you want to deploy.  Number of workers must be between 2 to 32.  We recommend a minimum of 3 workers. One Queen will be automatically configured.
    3. Select the Azure VM type to run Aster software.  The number of workers and the type of Azure VM will determine performance of analytic processing.   The supported VMs and their characteristics are listed below.  Note: The VMs listed in this table has been certified with Aster Analytics software and will be supported.Azure

VM type

CPU Cores

Memory GiB

Local storage GiB

Workload use case

Aster software price





Evaluation purposes






Entry level






Development and Production






High performance Production



    1. Enter the Aster database passwords.
    2. Select 1, 4 or 8 TB disk size options.  This size refers to the amount of premium storage allocated for EACH Worker and Queen, so to the overall storage available is “Number of workers” x “Disk option”.  Disk attached to the Queen is for processing and not included in the overall storage calculation. 
    3. Enter the time zone for the Aster cluster.


  1. Establish network settings
    1. Select an existing virtual network or create a new one to deploy the Aster cluster.
    2. Select or enter the name of the VM subnet and CIDR block for the public subnet.  The Aster solution template will create a new public subnet in the selected VNet using this CIDR address.


configure cidr address


  1. Now you’re just about ready to launch your new Aster cluster on Azure.   Azure marketplace will run through the final checks then prompt you to subscribe.


launch aster cluster on azure marketplace


  1. Once the cluster is provisioned, you can follow the instructions in the ‘Getting Started Guide’ to connect to the Queen node and the Aster Management Console (AMC)


Aster Analytics on Azure enables businesses to leverage the full power of advanced analytic at scale with a minimal investment.   Software used by the leading Global 1000 is now available on-demand for experimentation and analytics on multi-structure data. Now you are ready to build mind-blowing analytic models to drive high impact business outcomes.  If you have any questions regarding Aster Analytics in the Cloud, please feel free to contact Kiran Kamreddy or Arlene Zaima.



Aster Analytics on Azure Community:

Aster Analytics on Azure 


Related Blogs:

Back Ground:

According to some recent research, shoppers still prefer to buy at a physical store rather than from an on-line retailer; furthermore studies have shown that people prefer to use their senses when making a purchase. However many customers go to a retail location to make a purchase and decide to buy on-line rather than at the store because of two main drivers, selection and price. Shoppers are going to use their senses at stores because they trust them when making a purchase, but buy on-line because the product was neither in-stock or a cheaper price was found on-line. Wouldn't it be great if brick and mortar retailers had a way to analyze foot traffic patterns much the same way that on-line retailers do with web click streams?



Wi-Fi enabled smart phones, with a market penetration of over 50% of the population, provide retailers a great opportunity to capture more sales from customers, improve the overall customer experience, and understand foot traffic patterns in their stores. As long as the Wi-Fi atena is enabled on the smart phone, phones will continually request access to the network even while the phone is sleeping. In many cases a phone will request access to a network 5 times a minute. Retail locations that have a Wi-Fi network along with strategically placed Web Access Points(WAP) or beacons, can begin to understand the paths people are taking while in the store, and find affinities between store departments. The captured location data can be streamed to Aster via Teradata Listener.


Once the data has landed in Aster, the real opportunity begins with the Analytics of Things.  Aster has a wealth  of built in functions such as nPath, JSON Parser, Sessionize, and CFilter that will accelerate the discovery process so that actionable insights can be found and acted upon. Using the aforementioned functions several discoveries can be quickly found:

  • the most common paths people take
  • department and product affinities
  • times days and departments that are the busiest

Use Cases:

The use cases that are available depend upon which type of shopper is on the network, whether they are known or unknown. Known shoppers are the ones that connect to the network and sign in with their social media account or some other means to identify them.

Known Shopper Use Cases:

  • Deliver Personalized Offers when they  enter the location or department
  • Enhanced Customer Loyalty through timely and appropriate communication
  • Customized Product Recommendations while in location
  • Mapping Web Click Streams with Retail Location Data to improve customized inventory
  • Customer Paths by Demographic Information


Unknown Shopper Use Cases:

  • Most Common Customer Paths
  • Product and Department Affinity
  • Optimized Staffing Levels
  • Optimized Floor Layout and End Caps


Privacy and Transparency:

Like any data capture and analytics program two vital components are necessary for customer support and buy-in, privacy and transparency. Being transparent with customers about foot traffic analytics, what data is being captured and stored, and what is being done with the data is crucial. Most customers are happy to give their data, but they want to understand what is being done with it, and is it secure.