


Several people have asked me about the best way to install Python and additional packages in their Aster environment.

Aster relies heavily on Python for its internal code and processing. For that reason we must never touch the base install in the /home/beehive/toolchain directories. This post describes the correct procedure for installing your own Python, including the deep learning packages Theano, TensorFlow, and Keras.


We leverage the standard Anaconda distribution because it includes around 500 Python packages out of the box:

Anaconda package list | Continuum Analytics: Documentation 



Installation Process

1. Download Python 3.6 Anaconda from




2. Perform these installation steps on the queen and all workers


Note: while the queen does not strictly require a Python install, we suggest performing the same steps on all nodes




  • We will NOT use the root account to install Python.
  • Anaconda will be installed under a new user, "pythonu", to avoid any possible conflicts.
  • The pythonu account will not have a home directory.
  • Anaconda will be installed in the /opt file system because other Aster/Teradata tools live there.
  • The shell environment variables will not be modified.
  • Anaconda offers an unattended installation; please refer to the Anaconda website for instructions, since our example uses the interactive installation method.
  • The Python directory structure will be owned by the extensibility account because the Aster MR functions execute under that low-privilege user.

Installation steps, to be repeated on the queen and all workers:


<login as root>

useradd pythonu                                                                        # create new user account to install python

cd /opt

mkdir anaconda

chown pythonu:users /opt/anaconda

ncli node clonefile /tmp/               # copy anaconda distribution to all worker nodes

su - pythonu

cd /tmp

bash                                           # perform interactive installation




It is crucial to answer the prompts correctly:


Directory: /opt/anaconda/ana3

When prompted to update .bashrc with a new PATH always answer No



Sample Installation output:


Welcome to Anaconda3 4.3.1 (by Continuum Analytics, Inc.)


In order to continue the installation process, please review the license


Please, press ENTER to continue



Anaconda License



Copyright 2016, Continuum Analytics, Inc.


All rights reserved under the 3-clause BSD License:


Redistribution and use in source and binary forms, with or without

modification, are permitted provided that the following conditions are met:


* Redistributions of source code must retain the above copyright notice,

this list of conditions and the following disclaimer.


* Redistributions in binary form must reproduce the above copyright notice,

this list of conditions and the following disclaimer in the documentation

and/or other materials provided with the distribution.


* Neither the name of Continuum Analytics, Inc. nor the names of its

contributors may be used to endorse or promote products derived from this

software without specific prior written permission.














Notice of Third Party Software Licenses



Anaconda contains open source software packages from third parties. These

are available on an "as is" basis and subject to their individual license

agreements. These licenses are available in Anaconda or at . Any binary packages of these

third party tools you obtain via Anaconda are subject to their individual

licenses as well as the Anaconda license. Continuum reserves the right to

change which third party tools are provided in Anaconda.


In particular, Anaconda contains re-distributable, run-time, shared-library

files from the Intel (TM) Math Kernel Library ("MKL binaries"). You are

specifically authorized to use the MKL binaries with your installation of

Anaconda. You are also authorized to redistribute the MKL binaries with

Anaconda or in the conda package that contains them. If needed,

instructions for removing the MKL binaries after installation of Anaconda

are available at


Cryptography Notice


This distribution includes cryptographic software. The country in which you

currently reside may have restrictions on the import, possession, use,

and/or re-export to another country, of encryption software. BEFORE using

any encryption software, please check your country's laws, regulations and

policies concerning the import, possession, or use, and re-export of

encryption software, to see if this is permitted. See the Wassenaar

Arrangement <> for more information.


Continuum Analytics has self-classified this software as Export Commodity

Control Number (ECCN) 5D002.C.1, which includes information security

software using or performing cryptographic functions with asymmetric

algorithms. The form and manner of this distribution makes it eligible for

export under the License Exception ENC Technology Software Unrestricted

(TSU) exception (see the BIS Export Administration Regulations, Section

740.13) for both object code and source code.


The following packages are included in this distribution that relate to cryptography:




The OpenSSL Project is a collaborative effort to develop a robust,

commercial-grade, full-featured, and Open Source toolkit implementing the

Transport Layer Security (TLS) and Secure Sockets Layer (SSL) protocols as

well as a full-strength general purpose cryptography library.



A collection of both secure hash functions (such as SHA256 and RIPEMD160),

and various encryption algorithms (AES, DES, RSA, ElGamal, etc.).



A thin Python wrapper around (a subset of) the OpenSSL library.


kerberos (krb5, non-Windows platforms)

A network authentication protocol designed to provide strong authentication

for client/server applications by using secret-key cryptography.



A Python library which exposes cryptographic recipes and primitives.


Do you approve the license terms? [yes|no]

>>> yes


Anaconda3 will now be installed into this location:



  - Press ENTER to confirm the location

  - Press CTRL-C to abort the installation

  - Or specify a different location below


[/home/pythonu/anaconda3] >>> /opt/anaconda/ana3


installing: python-3.6.0-0 ...

installing: _license-1.1-py36_1 ...

installing: alabaster-0.7.9-py36_0 ...

installing: anaconda-client-1.6.0-py36_0 ...

installing: anaconda-navigator-1.5.0-py36_0 ...

installing: anaconda-project-0.4.1-py36_0 ...

installing: astroid-1.4.9-py36_0 ...

installing: astropy-1.3-np111py36_0 ...

installing: babel-2.3.4-py36_0 ...

installing: backports-1.0-py36_0 ...

installing: beautifulsoup4-4.5.3-py36_0 ...

installing: bitarray-0.8.1-py36_0 ...

installing: blaze-0.10.1-py36_0 ...

installing: bokeh-0.12.4-py36_0 ...

installing: boto-2.45.0-py36_0 ...

installing: bottleneck-1.2.0-np111py36_0 ...

installing: cairo-1.14.8-0 ...

installing: cffi-1.9.1-py36_0 ...

installing: chardet-2.3.0-py36_0 ...

installing: chest-0.2.3-py36_0 ...

installing: click-6.7-py36_0 ...

installing: cloudpickle-0.2.2-py36_0 ...

installing: clyent-1.2.2-py36_0 ...

installing: colorama-0.3.7-py36_0 ...

installing: configobj-5.0.6-py36_0 ...

installing: contextlib2-0.5.4-py36_0 ...

installing: cryptography-1.7.1-py36_0 ...

installing: curl-7.52.1-0 ...

installing: cycler-0.10.0-py36_0 ...

installing: cython-0.25.2-py36_0 ...

installing: cytoolz-0.8.2-py36_0 ...

installing: dask-0.13.0-py36_0 ...

installing: datashape-0.5.4-py36_0 ...

installing: dbus-1.10.10-0 ...

installing: decorator-4.0.11-py36_0 ...

installing: dill-0.2.5-py36_0 ...

installing: docutils-0.13.1-py36_0 ...

installing: entrypoints-0.2.2-py36_0 ...

installing: et_xmlfile-1.0.1-py36_0 ...

installing: expat-2.1.0-0 ...

installing: fastcache-1.0.2-py36_1 ...

installing: flask-0.12-py36_0 ...

installing: flask-cors-3.0.2-py36_0 ...

installing: fontconfig-2.12.1-2 ...

installing: freetype-2.5.5-2 ...

installing: get_terminal_size-1.0.0-py36_0 ...

installing: gevent-1.2.1-py36_0 ...

installing: glib-2.50.2-1 ...

installing: greenlet-0.4.11-py36_0 ...

installing: gst-plugins-base-1.8.0-0 ...

installing: gstreamer-1.8.0-0 ...

installing: h5py-2.6.0-np111py36_2 ...

installing: harfbuzz-0.9.39-2 ...

installing: hdf5-1.8.17-1 ...

installing: heapdict-1.0.0-py36_1 ...

installing: icu-54.1-0 ...

installing: idna-2.2-py36_0 ...

installing: imagesize-0.7.1-py36_0 ...

installing: ipykernel-4.5.2-py36_0 ...

installing: ipython-5.1.0-py36_0 ...

installing: ipython_genutils-0.1.0-py36_0 ...

installing: ipywidgets-5.2.2-py36_1 ...

installing: isort-4.2.5-py36_0 ...

installing: itsdangerous-0.24-py36_0 ...

installing: jbig-2.1-0 ...

installing: jdcal-1.3-py36_0 ...

installing: jedi-0.9.0-py36_1 ...

installing: jinja2-2.9.4-py36_0 ...

installing: jpeg-9b-0 ...

installing: jsonschema-2.5.1-py36_0 ...

installing: jupyter-1.0.0-py36_3 ...

installing: jupyter_client-4.4.0-py36_0 ...

installing: jupyter_console-5.0.0-py36_0 ...

installing: jupyter_core-4.2.1-py36_0 ...

installing: lazy-object-proxy-1.2.2-py36_0 ...

installing: libffi-3.2.1-1 ...

installing: libgcc-4.8.5-2 ...

installing: libgfortran-3.0.0-1 ...

installing: libiconv-1.14-0 ...

installing: libpng-1.6.27-0 ...

installing: libsodium-1.0.10-0 ...

installing: libtiff-4.0.6-3 ...

installing: libxcb-1.12-1 ...

installing: libxml2-2.9.4-0 ...

installing: libxslt-1.1.29-0 ...

installing: llvmlite-0.15.0-py36_0 ...

installing: locket-0.2.0-py36_1 ...

installing: lxml-3.7.2-py36_0 ...

installing: markupsafe-0.23-py36_2 ...

installing: matplotlib-2.0.0-np111py36_0 ...

installing: mistune-0.7.3-py36_0 ...

installing: mkl-2017.0.1-0 ...

installing: mkl-service-1.1.2-py36_3 ...

installing: mpmath-0.19-py36_1 ...

installing: multipledispatch-0.4.9-py36_0 ...

installing: nbconvert-4.2.0-py36_0 ...

installing: nbformat-4.2.0-py36_0 ...

installing: networkx-1.11-py36_0 ...

installing: nltk-3.2.2-py36_0 ...

installing: nose-1.3.7-py36_1 ...

installing: notebook-4.3.1-py36_0 ...

installing: numba-0.30.1-np111py36_0 ...

installing: numexpr-2.6.1-np111py36_2 ...

installing: numpy-1.11.3-py36_0 ...

installing: numpydoc-0.6.0-py36_0 ...

installing: odo-0.5.0-py36_1 ...

installing: openpyxl-2.4.1-py36_0 ...

installing: openssl-1.0.2k-1 ...

installing: pandas-0.19.2-np111py36_1 ...

installing: partd-0.3.7-py36_0 ...

installing: ...

installing: pathlib2-2.2.0-py36_0 ...

installing: patsy-0.4.1-py36_0 ...

installing: pcre-8.39-1 ...

installing: pep8-1.7.0-py36_0 ...

installing: pexpect-4.2.1-py36_0 ...

installing: pickleshare-0.7.4-py36_0 ...

installing: pillow-4.0.0-py36_0 ...

installing: pip-9.0.1-py36_1 ...

installing: pixman-0.34.0-0 ...

installing: ply-3.9-py36_0 ...

installing: prompt_toolkit-1.0.9-py36_0 ...

installing: psutil-5.0.1-py36_0 ...

installing: ptyprocess-0.5.1-py36_0 ...

installing: py-1.4.32-py36_0 ...

installing: pyasn1-0.1.9-py36_0 ...

installing: pycosat-0.6.1-py36_1 ...

installing: pycparser-2.17-py36_0 ...

installing: pycrypto-2.6.1-py36_4 ...

installing: pycurl-7.43.0-py36_2 ...

installing: pyflakes-1.5.0-py36_0 ...

installing: pygments-2.1.3-py36_0 ...

installing: pylint-1.6.4-py36_1 ...

installing: pyopenssl-16.2.0-py36_0 ...

installing: pyparsing-2.1.4-py36_0 ...

installing: pyqt-5.6.0-py36_2 ...

installing: pytables-3.3.0-np111py36_0 ...

installing: pytest-3.0.5-py36_0 ...

installing: python-dateutil-2.6.0-py36_0 ...

installing: pytz-2016.10-py36_0 ...

installing: pyyaml-3.12-py36_0 ...

installing: pyzmq-16.0.2-py36_0 ...

installing: qt-5.6.2-3 ...

installing: qtawesome-0.4.3-py36_0 ...

installing: qtconsole-4.2.1-py36_1 ...

installing: qtpy-1.2.1-py36_0 ...

installing: readline-6.2-2 ...

installing: redis-3.2.0-0 ...

installing: redis-py-2.10.5-py36_0 ...

installing: requests-2.12.4-py36_0 ...

installing: rope-0.9.4-py36_1 ...

installing: ruamel_yaml-0.11.14-py36_1 ...

installing: scikit-image-0.12.3-np111py36_1 ...

installing: scikit-learn-0.18.1-np111py36_1 ...

installing: scipy-0.18.1-np111py36_1 ...

installing: seaborn-0.7.1-py36_0 ...

installing: setuptools-27.2.0-py36_0 ...

installing: simplegeneric-0.8.1-py36_1 ...

installing: singledispatch- ...

installing: sip-4.18-py36_0 ...

installing: six-1.10.0-py36_0 ...

installing: snowballstemmer-1.2.1-py36_0 ...

installing: sockjs-tornado-1.0.3-py36_0 ...

installing: sphinx-1.5.1-py36_0 ...

installing: spyder-3.1.2-py36_0 ...

installing: sqlalchemy-1.1.5-py36_0 ...

installing: sqlite-3.13.0-0 ...

installing: statsmodels-0.6.1-np111py36_1 ...

installing: sympy-1.0-py36_0 ...

installing: terminado-0.6-py36_0 ...

installing: tk-8.5.18-0 ...

installing: toolz-0.8.2-py36_0 ...

installing: tornado-4.4.2-py36_0 ...

installing: traitlets-4.3.1-py36_0 ...

installing: unicodecsv-0.14.1-py36_0 ...

installing: wcwidth-0.1.7-py36_0 ...

installing: werkzeug-0.11.15-py36_0 ...

installing: wheel-0.29.0-py36_0 ...

installing: widgetsnbextension-1.2.6-py36_0 ...

installing: wrapt-1.10.8-py36_0 ...

installing: xlrd-1.0.0-py36_0 ...

installing: xlsxwriter-0.9.6-py36_0 ...

installing: xlwt-1.2.0-py36_0 ...

installing: xz-5.2.2-1 ...

installing: yaml-0.1.6-0 ...

installing: zeromq-4.1.5-0 ...

installing: zlib-1.2.8-3 ...

installing: anaconda-4.3.1-np111py36_0 ...

installing: conda-4.3.14-py36_0 ...

installing: conda-env-2.6.0-0 ...

Python 3.6.0 :: Continuum Analytics, Inc.

creating default environment...

installation finished.

Do you wish the installer to prepend the Anaconda3 install location

to PATH in your /home/pythonu/.bashrc ? [yes|no]

[no] >>> no


You may wish to edit your .bashrc or prepend the Anaconda3 install location:


$ export PATH=/opt/anaconda/ana3/bin:$PATH


Thank you for installing Anaconda3!


Share your notebooks and packages on Anaconda Cloud!

Sign up for free:


3. Set up a virtual Python environment


Anaconda easily supports switching between multiple Python versions. The setup depends on the network connectivity of the Aster environment.


If the Aster system has internet access on the workers:

/opt/anaconda/ana3/bin/conda create -n python36 python=3.6 anaconda               # create new environment and install all anaconda packages


If the Aster system has no internet access on the workers:

/opt/anaconda/ana3/bin/conda create -n python36 --clone root          # clone the root environment



You can install multiple versions, for example: /opt/anaconda/ana3/bin/conda create --name python27 python=2.7.13


4. Activate virtual environment



<login as pythonu>

/opt/anaconda/ana3/bin/conda info --envs                     # review available environments

source /opt/anaconda/ana3/bin/activate python36               # activate python 3.6 environment

/opt/anaconda/ana3/bin/conda list                                # list all installed packages





5. Reset ownership of the Python directory structure


This step is required to allow Aster MR functions to properly access the new python environment.


su -                                                                                   # switch back to root

chown -R extensibility:extensibility /opt/anaconda



6. Perform test


To test our new installation we will install a Python test script and invoke a SQL script using act or Teradata Studio.




import sys
import getopt
import pickle

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

print('hello' + '\t' + 'bye')





select *
from stream (
     on (select 1)
     script('source /opt/anaconda/ana3/bin/activate python36; python')
     outputs('test1 varchar', 'test2 varchar')
);




Steps to verify that the Python install is correct:

  1. Save the python script in /tmp on the queen.
  2. Invoke act to install the file: act -U beehive -c "\install /tmp/"
  3. Invoke act to run the sql test: act -U beehive -f python_test.sql


Expected output:

test1 | test2


hello  |  bye (1 rows)



If you get permission or module loading errors, review the previous steps and verify that you have correctly set the permissions on the Python directory structure. Also verify that you are using the correct paths in the SQL script.
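When troubleshooting, it also helps to confirm which interpreter the stream function actually picks up. Here is a minimal diagnostic sketch (a hypothetical helper, installed and invoked the same way as the test script above, with OUTPUTS('version varchar', 'path varchar')):

```python
import sys

def interpreter_info():
    """Return a tab-separated row describing the active Python interpreter."""
    version = "%d.%d.%d" % sys.version_info[:3]
    return version + "\t" + sys.executable

# Emit one row for the stream function's OUTPUTS definition
print(interpreter_info())
```

If the path column does not point under /opt/anaconda/ana3, the environment activation in the script() clause is not taking effect.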



Do not forget to repeat these steps for all the workers.


Installing additional Python packages


Activate your Python environment as shown earlier and run these commands on each worker to add the key deep learning packages:

conda install -c conda-forge theano=0.9.0

conda install -c conda-forge tensorflow=1.1.0

conda install -c conda-forge keras=2.0.2
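After the installs complete, a quick sanity check from the activated environment confirms the packages are importable. This sketch uses only the standard library, so it runs even if an install failed (the package names checked are the ones installed above):

```python
import importlib.util

def check_packages(names):
    """Map each package name to True if it can be imported, else False."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

# The deep learning stack installed via conda-forge should all report OK:
status = check_packages(["theano", "tensorflow", "keras"])
for name, ok in status.items():
    print("%s: %s" % (name, "OK" if ok else "MISSING"))
```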








Recently I reviewed the sentiment analysis data provided by a customer and wanted to get a quick idea of the positive and negative words in their customer survey notes. To accomplish this I created a Python script that runs in-database and takes the output from the standard Aster sentiment analysis function to provide more information about the contents of the survey notes.


If you do not have a favorite text data set, I suggest downloading "Amazon fine food reviews" from Kaggle. See the reference section below for a download link.






First we create a table to stage our text data and load with ncluster_loader:



ncluster_loader -U myuser -d mydb --skip-rows 1 -c amzn_reviews Reviews.csv


Now we can run the ExtractSentiment function. For simplicity we will use the built-in dictionary instead of building our own model.



The function outputs a standard POS/NEU/NEG polarity, including a strength indicator. One benefit of using a dictionary to analyze the text data is that we get an extra field in the output: "out_sentiment_words". This field gives us more information about the words used to determine sentiment, their frequency in the sentence or document, and a total positive and negative score.



Way back in 2013 I created a Perl script to parse this data and store the output in an easy-to-use table. The Perl script is quite simple and is run using the Aster stream function. Many people dislike Perl for its syntax. For this blog post I decided to quickly convert the script to Python 2.x and show you how to run Python in-database. Note that you can accomplish a similar result with the regexp function in combination with the pivot SQL/MR.



What do we need to run python in-database?


  1. Specify the input data.
  2. Write a Python script.
  3. Install the script in Aster.
  4. Define the output we want.


Let's review these requirements in more detail:


  1. By default the stream function expects tab-delimited input. As an example, we pass the id field from our fine food reviews data set as the first field and out_sentiment_words as the second field to the stream function.
  2. Our python script will be called "":
  3. The Python script is installed in Aster (in the system database table) using the act command-line tool.

    act -U myuser -d mydb -w -c "\remove"

    act -U myuser -d mydb -w -c "\install"

  4. We specify the fields we want to see in the output and their type.



Python in-database


Now that we know how to run the stream function we can dig a bit deeper into the python script.


We use the default tab delimiter to parse the input data fed to the stream function. A loop reads the data from STDIN. When we encounter an empty line we assume all data has been read and exit the script:


I will not discuss the simple data manipulation done by the script. What is important is that we output our desired results:



And our output fields have to match our definition in the OUTPUTS parameter for the stream function.


The last line in the script is also important. We need to properly flush the buffer to make sure we obtain all the output.



Note that in Python 3.x you can write print(i, flush=True), which will automatically flush the buffer.



Our output


Finally we review our result in the expanded format that we wanted. For each document id we get a summary of the total positive and negative score (a simple count of the words) and a list of all the detected words. The word_score is always 1 for positive words and -1 for negative words.



To present the results we can generate a word cloud of the top words based on their raw frequency.
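The word cloud input is just a frequency count over the detected words. A minimal sketch of that counting step (the document word lists below are made up for illustration):

```python
from collections import Counter

def top_words(word_lists, n=10):
    """Count word frequency across all documents' detected sentiment words."""
    counts = Counter()
    for words in word_lists:
        counts.update(words)
    return counts.most_common(n)

# Hypothetical detected-word lists from three survey notes:
docs = [["great", "tasty", "stale"], ["great", "fresh"], ["great", "tasty"]]
print(top_words(docs, 3))
```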




To get a better idea of the importance of the words, we can run TF/IDF against all documents and join the result with our amzn_detail_output table to allow filtering on only the positive (where word_score = 1) or negative (where word_score = -1) words.
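Aster computes TF/IDF with a built-in function; for intuition, the score it ranks by is the classic tf x log(N/df) weighting, which can be sketched as:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute tf-idf per (doc, word): term frequency times log(N / document frequency)."""
    n_docs = len(docs)
    df = Counter()                      # in how many documents does each word appear?
    for doc in docs:
        df.update(set(doc))
    scores = []
    for i, doc in enumerate(docs):
        tf = Counter(doc)
        for word, count in tf.items():
            scores.append((i, word, count * math.log(n_docs / df[word])))
    return scores

# A word in every document (like "great" here) scores 0 -- it carries no signal:
docs = [["great", "stale"], ["great", "fresh"], ["great"]]
for doc_id, word, score in sorted(tf_idf(docs), key=lambda t: -t[2]):
    print(doc_id, word, round(score, 3))
```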




Top 100 Positive words based on tf/idf scores


Top 100 Negative words based on tf/idf scores



Building a new classification model?


One option is to rely on external, third-party sentiment analysis models, but these are often a poor fit: scoring the call notes of a telecom customer using a model built on publicly available movie reviews will not be very effective or relevant. Ideally a lot of time and effort is spent manually reviewing the data and assigning the correct categories for sentiment analysis. For those situations where time is limited, the approach described here is a potential alternative.


Now that we have identified many of the common positive and negative words we can manually review the remaining documents/survey notes that have few identified words.  Once those leftover entries have been manually categorized we can build and fine tune a new sentiment analysis model.  






Sample data set: 


Belgian chocolates store display:

Verdonck | Winkel 

Here is the guide for getting started with Aster in the Azure Cloud.  Have fun! 

Great News!  Everything you need to experience the power of multi-genre advanced analytics is now available on Microsoft Azure Marketplace.  


Businesses can quickly provision an advanced analytics sandbox or a production analytic environment to address their most pressing business challenge.  Teradata Aster Analytics on Azure delivers machine learning, text, path and pattern, statistics, and graph analytics to analyze multi-structured data and operationalize insights at scale.  Aster Analytics on Azure includes everything you need to launch an analytic cluster with a Queen node and up to 32 Workers.


A subscription to Aster Analytics on Azure includes the following:


  • Right to use Aster Analytics Portfolio, Aster Client, Aster Database 6.20 and Aster AppCenter for analytic development and production purposes.  Note: Aster AppCenter must be launched through a separate VM.
  • Launch 2 to 32 Worker nodes to meet your performance and scalability requirements
  • Supports 2 to 256 TB of data for analysis
  • Support through the Teradata Premier Cloud Support
  • Access to Aster Analytics on Azure resources including the Getting Started Guide


Configuring and launching Aster Analytic for Azure is easy. Just follow these steps on the Azure Marketplace and you’ll be ready to discover the power and flexibility of Aster Analytics.


  1. Log on to Azure Marketplace and search for “Aster”.  You’ll find two Aster virtual machines (VM).  The first is the “Teradata Aster Analytics” solution template.  This template will guide you through the steps to configure and launch an Aster cluster.  The second is the “Teradata Aster AppCenter Hourly” VM that will launch Aster AppCenter.  AppCenter is optional.


(Screenshot: log on to Azure Marketplace and search for “Aster”.)


  2. Once you select the “Teradata Aster Analytics” option, Azure Marketplace will display a brief description of the Aster software.  Click on “Create” to launch the solution template that will take you through a step-by-step process to configure an Aster Analytics cluster.
  3. You must first set up the basic user name and password for your Azure VM resource.  Next select the subscription account for software, storage, and VM charges.  You must also enter an existing resource group or create your own.  If you create your own, you must also select an Azure region.  Note: prices of VMs can vary based on location.


(Screenshot: set up user credentials for the Azure VM resource.)


  4. Next, configure the Aster cluster:
    1. Enter a name that will be used as a prefix for all Workers launched as part of this cluster.
    2. Enter the number of Workers you want to deploy.  The number of Workers must be between 2 and 32.  We recommend a minimum of 3 Workers.  One Queen will be automatically configured.
    3. Select the Azure VM type to run the Aster software.  The number of Workers and the type of Azure VM will determine the performance of analytic processing.  The supported VMs and their characteristics are listed below.  Note: only the VMs listed in this table have been certified with the Aster Analytics software and will be supported.

(Table: supported Azure VM types, listing for each the CPU cores, memory in GiB, local storage in GiB, and Aster software price, grouped by workload use case: Evaluation purposes, Entry level, Development and Production, and High performance Production.)
    4. Enter the Aster database passwords.
    5. Select the 1, 4, or 8 TB disk size option.  This size refers to the amount of premium storage allocated for EACH Worker and the Queen, so the overall storage available is “Number of Workers” x “Disk option”.  The disk attached to the Queen is for processing and is not included in the overall storage calculation.
    6. Enter the time zone for the Aster cluster.


  5. Establish network settings:
    1. Select an existing virtual network or create a new one to deploy the Aster cluster.
    2. Select or enter the name of the VM subnet and CIDR block for the public subnet.  The Aster solution template will create a new public subnet in the selected VNet using this CIDR address.


(Screenshot: configure the CIDR address.)


  6. Now you’re just about ready to launch your new Aster cluster on Azure.  Azure Marketplace will run through the final checks and then prompt you to subscribe.


(Screenshot: launch the Aster cluster on Azure Marketplace.)


  7. Once the cluster is provisioned, you can follow the instructions in the ‘Getting Started Guide’ to connect to the Queen node and the Aster Management Console (AMC).


Aster Analytics on Azure enables businesses to leverage the full power of advanced analytics at scale with a minimal investment.  Software used by the leading Global 1000 is now available on-demand for experimentation and analytics on multi-structured data.  Now you are ready to build mind-blowing analytic models to drive high-impact business outcomes.  If you have any questions regarding Aster Analytics in the Cloud, please feel free to contact Kiran Kamreddy or Arlene Zaima.



Aster Analytics on Azure Community:

Aster Analytics on Azure 



Background:

According to some recent research, shoppers still prefer to buy at a physical store rather than from an online retailer; furthermore, studies have shown that people prefer to use their senses when making a purchase. However, many customers go to a retail location intending to make a purchase and then decide to buy online rather than at the store, because of two main drivers: selection and price. Shoppers use their senses at stores because they trust them when making a purchase, but buy online because the product was either not in stock or a cheaper price was found online. Wouldn't it be great if brick-and-mortar retailers had a way to analyze foot traffic patterns much the same way that online retailers do with web click streams?



Wi-Fi enabled smartphones, with a market penetration of over 50% of the population, provide retailers a great opportunity to capture more sales from customers, improve the overall customer experience, and understand foot traffic patterns in their stores. As long as the Wi-Fi antenna is enabled, a smartphone will continually request access to the network, even while the phone is sleeping; in many cases a phone will probe for a network 5 times a minute. Retail locations that have a Wi-Fi network along with strategically placed wireless access points (WAPs) or beacons can begin to understand the paths people take while in the store and find affinities between store departments. The captured location data can be streamed to Aster via Teradata Listener.


Once the data has landed in Aster, the real opportunity begins with the Analytics of Things.  Aster has a wealth of built-in functions such as nPath, JSON Parser, Sessionize, and CFilter that accelerate the discovery process so that actionable insights can be found and acted upon. Using the aforementioned functions, several discoveries can quickly be made:

  • the most common paths people take
  • department and product affinities
  • the times, days, and departments that are the busiest
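In Aster this path discovery is done with Sessionize and nPath; the underlying idea, reconstructing each device's ordered sequence of departments and counting the most common sequences, can be sketched in plain Python (the department names and events below are made up for illustration):

```python
from collections import Counter, defaultdict

def most_common_paths(events, n=3):
    """events: (device_id, timestamp, department) tuples from WAP pings.
    Rebuild each device's ordered department path, then count the most common paths."""
    paths = defaultdict(list)
    for device, ts, dept in sorted(events, key=lambda e: (e[0], e[1])):
        # collapse repeated pings from the same department into one visit
        if not paths[device] or paths[device][-1] != dept:
            paths[device].append(dept)
    return Counter(tuple(p) for p in paths.values()).most_common(n)

# Hypothetical probe-request events:
events = [
    ("a", 1, "entrance"), ("a", 2, "produce"), ("a", 3, "produce"), ("a", 4, "bakery"),
    ("b", 1, "entrance"), ("b", 2, "produce"), ("b", 3, "bakery"),
    ("c", 1, "entrance"), ("c", 2, "deli"),
]
print(most_common_paths(events))
```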

Use Cases:

The use cases that are available depend upon which type of shopper is on the network, whether they are known or unknown. Known shoppers are the ones that connect to the network and sign in with their social media account or some other means to identify them.

Known Shopper Use Cases:

  • Deliver Personalized Offers when they enter the location or department
  • Enhanced Customer Loyalty through timely and appropriate communication
  • Customized Product Recommendations while in location
  • Mapping Web Click Streams with Retail Location Data to improve customized inventory
  • Customer Paths by Demographic Information


Unknown Shopper Use Cases:

  • Most Common Customer Paths
  • Product and Department Affinity
  • Optimized Staffing Levels
  • Optimized Floor Layout and End Caps


Privacy and Transparency:

Like any data capture and analytics program, two vital components are necessary for customer support and buy-in: privacy and transparency. Being transparent with customers about foot traffic analytics, what data is being captured and stored, and what is being done with the data is crucial. Most customers are happy to give their data, but they want to understand what is being done with it and whether it is secure.

At the beginning of May, Nike ran a campaign to break the two-hour marathon barrier. The runners wore Nike gear and ran on a Nike track; in this highly controlled environment the goal was to record a marathon time of less than two hours (#Breaking2). The whole event was live-streamed on the Nike Facebook page. The fastest time that came out of this effort was 2:00:25, run by Kenyan long-distance runner Eliud Kipchoge.


Many are quick to point out that this seems like a big publicity stunt for Nike, especially since the event happened a few weeks before Nike was set to release a new line of running shoes. But do these doubts take away from the incredible result that came from this experiment?


With Aster we can mine the comments that were left on the live Facebook videos from the event and look at the sentiment associated with them. A data set of 27,219 comments was retrieved for analysis.


The first step in finding sentiment from any social media data set is building a custom sentiment dictionary. For this event, traditionally negative words like, “break”, “insanity”, and “limits” don’t carry negative sentiment, and in most social media cases words like “dope”, “chill”, and “crazy” also don’t indicate negative comments.  Once a base social media dictionary is created it can be reused in other social media use cases.
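The effect of these domain overrides is easy to see in a toy example. This is only a sketch of dictionary-based scoring with a handful of made-up entries; Aster's ExtractSentiment uses a much larger dictionary file:

```python
# Tiny, made-up base dictionary and social media overrides for illustration only
BASE_DICT = {"great": 1, "love": 1, "break": -1, "insanity": -1, "crazy": -1, "fail": -1}
SOCIAL_OVERRIDES = {"break": 0, "insanity": 0, "crazy": 1}   # neutral/positive in this context

def score(comment, overrides=SOCIAL_OVERRIDES):
    """Sum per-word scores (overrides win) and map the total to a polarity label."""
    d = dict(BASE_DICT, **overrides)
    total = sum(d.get(w, 0) for w in comment.lower().split())
    return "POS" if total > 0 else "NEG" if total < 0 else "NEU"

print(score("this is crazy i love it"))   # "crazy" now counts as positive
print(score("what a fail"))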


After the sentiment dictionary is adjusted we can look at the breakdown of sentiment in this comment data set.

The results are not too surprising: most social media comment data sets have a very large amount of neutral comments. These include things like people tagging their friends in the post and directly asking Nike a question through a comment.



We can break down the sentiment further with Cosine Similarity to find out the kinds of things people talk about in positive and negative comments.
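Aster provides built-in similarity functions for this; conceptually, each comment becomes a word-count vector and the edge weight between two comments is the cosine of the angle between their vectors. A minimal sketch (the sample comments are made up):

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between two comments' word-count vectors (0.0 to 1.0)."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

print(round(cosine_similarity("go eliud go", "go go eliud"), 3))      # same bag of words
print(round(cosine_similarity("pace car cheating", "love the shoes"), 3))
```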



The visualization above shows the comments with positive sentiment in the data set: every comment is a node, and the edges indicate the similarity score between comments. The labels are summarizations of the topics of the comments; in this case the labels are pretty much verbatim what the comments in each cluster included. The top positive comments are very generic, with high similarity scores. There isn’t much information to be gleaned from looking at positive comments.



In the negative comments there are some generic negative clusters like “lame” and “fail”, but many of the top comment topics give helpful information. The labels are an overview of the topic of the comments; unlike the generic positive comments, people had different things to say about various topics. Many of the comments were about the environment in which the race was set up and how this race can’t be counted as an official time; this can be seen in comments claiming that the pacing is cheating and noting the wind resistance provided by the pace car. There are also many comments about the entertainment aspect of the race: many people did not like the announcer in the live stream and thought Kevin Hart’s inclusion in this marathon effort was an odd choice.


Though there were a few comments about this being a marketing stunt, it wasn’t an overwhelming theme when examining the negative comments. There was a bigger set of people reacting to those comments than people actually making them (the “haters gonna hate” topic). People were quicker to complain about the logistics of the race rather than the idea of it, which is good news if Nike wants to host another #Breaking2 event.

Data Science is a challenging profession. As we have heard from many blogs, it is not for the faint-hearted, but that also doesn't mean one needs a cool graduate-level abstract math degree.

Data scientists hired to solve business problems often end up coding endlessly without generating value quickly. It goes like this: IT works through a checklist and hands things off to the data scientist.

  • Python & R Version Check
  • Connectivity working
  • Spark Cluster connectivity is working with 3rd party libraries.
  • HDFS has one year of data loaded
  • Aster Database is running in the enterprise with connectors configured for DL.

The business then drops the problem like 'Can you guys find the churn problem in a week. I want to know why people are signing up and not using our website. Find me something that leads to a prescriptive action'.

The data scientist goes to work: parses through the logs, extracts and creates features, spends a month building some models, and delivers a confusion matrix/ROC/AUC plot at the end. Business glazes over it and asks: "So what are you telling me here - that your churn model works?"

Data Scientist: 'Yeah, we are getting 95%+ accuracy in churn prediction. We are ready to deploy.'

Business: 'So what's the root cause of the churn?'

Data Scientist: 'Well, the significant variables are related to average time spent on certain pages like FAQ, Terms of Service, etc.'

Business: 'That is a great insight. Thanks for finding that feature. However, why are they going to FAQ and Terms of Service in the first place?'

The data scientist spends another few weeks answering that question, only to get lost identifying the core problem/root cause. The project goes into a vortex of similar exchanges back and forth. Something is missing. Business moves on.


Thinking like a business person:

Most folks in the community would agree that it is not the tool *alone* that solves the business problem, given the choices we have today. It is understanding the business needs, and the data scientist's ability to put themselves in the business's shoes.

You could blame the data, the loading speed, the time taken to iterate on discovery, the stability of the cluster, etc., but if a data scientist does not get the business problem, weeks can go by with no useful results.

Why is this gap so common?

Businesses rely on data scientists as their eyes and ears into the underlying data. Data scientists, by definition, are also expected to speak the language of business. However, it's quite possible the data scientists are spending a lot of their time on new tools and methods (shiny GitHub objects or the algorithm of the day). Business may not know what to ask of the data, and data scientists often may not even know where to look!

If only the Data Scientist knew to ask relevant questions to tease out the use case ...

How can this be resolved?

  • Educating data scientists on business problems.
  • Investing in tools where data scientists can create portals that the business can play with, drive, and complain about gaps in. Creating a clear, value-added interface to the data will only bring better problems to the data science teams.
  • Standardizing on an interface that provides a workable abstraction for both the data scientist and the business. This is the single biggest challenge when we use technologies that require extensive coding as a means to get to the art of analytics, a.k.a. communication with the business. My personal experience has been to use less coding, more visuals, and a way to hand the firm knobs to turn and question the data. Can you socialize the analytics and insights continuously to the business?
  • Talking to vendors who focus on business outcomes and have done this many times over - all the way from managing a data lake to delivering models in real time - and who can guide you in talking to the business.


In other words, if your data scientist cannot create a continuous story, narrative or stream with the data that the business understands, the business will find a way to rationalize what is working. That is bad for data science and a losing proposition for the business.

What is a Data Science Pipeline?

Data science pipelines could be the next big big data thing, as the term is becoming a buzzword of late. At Teradata we've been engineering and refining technologies and frameworks for data science pipelines for at least the last 6 years, resulting in products like Aster, Listener, Presto, Hadoop Appliance, Teradata Cloud, Teradata QueryGrid™, Think Big, and Kylo.


So what is a data science pipeline, and why is embracing it important? Big data no longer means just four V's: Volume for scale of data, Variety for different forms of data, Velocity for streaming and evolving data, and Veracity for uncertainty of data. There is a fifth V that stands for Value - the ultimate promise of big data. Once technologies that tamed the challenges represented by the four V's are acquired and deployed in the enterprise, the problem of extracting value out of such infrastructure irreducibly presents itself. Data science pipelines are both a framework and a tool set to create, deploy, execute and maintain complete applications delivering real value by utilizing underlying big data technologies. They aim at making data science look and behave just like the rest of enterprise software: repeatable, consistent, testable, deployable, robust, scalable and functional.


Teradata's commitment to all five V's, proven big data track record, and support for open source created unique opportunity to deliver value using data science pipelines. By employing powerful big data technologies like Hadoop Appliance, Aster, Teradata Cloud, QueryGrid and others paired with thoroughly integrated open source R programming environment data scientists can now create pipelines across data sources, business channels, use cases and at scale to produce repeatable, meaningful and actionable results.

Teradata Aster R

R has a huge following for many reasons - some historical, being the first open source environment for statistical computing; some computational, being a first-class functional programming language and execution environment; and some for its unparalleled support in the statistical and machine learning community (or communities, if you like). Not incidentally, major players like Microsoft and Teradata chose R as their de facto standard for integrating open source with their big data platforms. In the case of Teradata, we have both in-database and R desktop environments for Aster developed and supported by Teradata:


Package TeradataAsterR consists of a collection of functions available to R programmers, data scientists and analysts who can use the powerful features of in-database analytics available in the Teradata Aster Database. Data scientists can use this rich set of functions to store, manipulate and explore big data sets to solve business problems. As large data is processed using Teradata Aster Massive Parallel Processing (MPP) and shared-nothing architecture, the single-node processor/memory bottlenecks are eliminated when executing R code. The processing of data using Teradata Aster R package functions occurs on each vWorker where data resides and data is not moved to the client during processing. Due to the design and architecture of the Teradata Aster Database Platform and software support for in-database analytics, applications and queries written using Teradata Aster R package are scalable and perform well even on a very large volume of data. Data transformation and subset operations using the Teradata Aster R package do not result in multiple copies of data, and client memory is preserved as most processing is done on worker nodes.


Source: Teradata Aster R: Building Scalable R Applications


Thus, Teradata Aster R (or just Aster R), Aster database for big data analytics running on Hadoop Appliance or on its own, Teradata QueryGrid connected with other data sources offer the best of breed environment for running data science pipelines.

Building Data Science Pipeline

Consider a classification problem where training data (and later unlabeled data) is available but resides across multiple data sources. The following steps will be necessary, and nearly sufficient, to build a complete classification solution using a supervised learning algorithm:

  • connect to all data sources to bring data in
  • transform data from sources to consistent format
  • create analytical data set by merging data
  • enhance existing and engineer new features on data
  • prepare training, testing, and validation data sets
  • run supervised classification that creates and evaluates model (repeat as many times as necessary to improve classification error)
  • validate final model to have realistic error estimate
  • deploy model in production to run on schedule to source data and score new results

For many data scientists these steps are familiar, but they present multiple challenges: connecting and merging separate data sources, executing unrelated processes in different execution environments, and developing and evolving multiple programs, resulting in a diffusion of technologies, knowledge, and ultimately results. Not so with Teradata Aster, where all steps can be maintained and executed within an integrated R and Aster environment connected to the necessary data sources and feeding a scoring system at the end. With logic defined and driven in R and executed in the Aster database, the data scientist controls and maintains everything with the R language while storing data in systems like Hadoop or Aster appliances.

How-To's and Examples

This introduction to data science pipelines will be followed by more posts discussing and building solutions within the Teradata eco-system, including workflows, main principles and steps, with multiple examples. Meanwhile you can find illustrations and examples of the data science pipeline with Aster R in the following RPubs presentation. The examples use the Lahman baseball database, available for Aster to download here. R source code is available as an attachment.


Image source: Red Hat Developer Program

When analyzing a time series data set we sometimes want to detect those points in time where there is a significant and abrupt change.  


Aster offers a ChangePointDetection function that does exactly that. The function looks back at the available data points and applies a binary segmentation search method. The algorithm executes these key steps:   


  1. Find the first change point in our time series.
  2. From that point, split the data into two parts.
  3. In each part find the change point with the minimum loss (as calculated by a cost function).
  4. Repeat until we have found all the change points.
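A simplified sketch of this binary segmentation search, using a sum-of-squared-errors cost and a made-up minimum-gain stopping threshold (an illustration only, not the Aster implementation):

```python
def sse(xs):
    """Cost of a segment: sum of squared deviations from the segment mean."""
    if not xs:
        return 0.0
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)

def best_split(xs, lo, hi):
    """Find the split point in xs[lo:hi] minimizing total cost, and its gain."""
    total = sse(xs[lo:hi])
    best, gain = None, 0.0
    for k in range(lo + 1, hi):
        g = total - (sse(xs[lo:k]) + sse(xs[k:hi]))
        if g > gain:
            best, gain = k, g
    return best, gain

def binary_segmentation(xs, max_changes=10, min_gain=5.0):
    """Recursively split the series at the most significant change points."""
    changes, segments = [], [(0, len(xs))]
    while segments and len(changes) < max_changes:
        # pick the segment whose best split yields the largest cost reduction
        cand = [(best_split(xs, lo, hi), lo, hi) for lo, hi in segments]
        (k, gain), lo, hi = max(cand, key=lambda c: c[0][1])
        if k is None or gain < min_gain:
            break
        changes.append(k)
        segments.remove((lo, hi))
        segments += [(lo, k), (k, hi)]
    return sorted(changes)

series = [5, 5, 6, 5, 5, 20, 21, 20, 22, 21, 5, 4, 5, 6, 5]
print(binary_segmentation(series))  # → [5, 10]
```

Here `min_gain` plays the role of the penalty discussed below: it guards against splitting segments whose improvement is too small to matter.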


Before we can learn more about this function we need a data set to explore. We can download the Online Retail Data Set from the UCI Machine Learning repository (link). 


Let's load the csv data into a new Aster table "retail_sales_cpd" and review an example.



Our data set includes 541,909 rows. We pick one sample customer and product:



In the output we see that a customer from the Netherlands tends to place very large orders for vintage spaceboy lunch boxes. The price is very static, except for one order.  



The quantity varies wildly. We see significant up and down changes (red boxes) throughout the order history.



Of course with large data sets we do not have time to manually sift through the data and create visual plots. Let's review what the ChangePointDetection function can do for us. 


Function Syntax:


Besides the normal function parameters there are a few additional parameters that we need to study more carefully:

  • We will partition the input data by customer and product and sort using the invoice date.  
  • The ValueColumn is the key field of interest where we want to detect changes. For our dataset we can pick qty, price, or qty * price. Note that we can only specify one single column.
  • Accumulate is where we specify the identifying columns that we used in the partition and order by clause (such as customer_id, product_id and invoice_date). These extra output columns will help us interpret the results.
  • SegmentationMethod allows us to choose normal_distribution or linear_regression. (default = normal_distribution) 
  • SearchMethod is always set to binary. This is the only option for the function. We do not have to explicitly specify this parameter for that reason.
  • MaxChangeNum specifies the maximum number of changepoints to detect (default = 10)
  • Penalty can be BIC, AIC or a specific static threshold. The BIC and AIC criteria are used to evaluate the differences between the chosen change points and the original data. The penalty is included in the cost function as a guard against overfitting. BIC is the default option.
  • OutputOption can be CHANGEPOINT, VERBOSE or SEGMENT. This option allows us to output only the
    • changepoints (cptid column). This is the default option.
    • changepoints and calculated differences (between the estimations for the H1 and H0 hypotheses)
    • specific segments that have been detected.


We invoke the ChangePointDetection function and use linear regression to perform the segmentation:



Note that while we can use the ACCUMULATE feature to output additional columns, I prefer to join with the source table to get a full picture.



Reviewing our basic line chart again: if we circle higher-qty change points in red and lower-qty change points in green, we get this result:



Obviously the change points do not always correspond with straightforward highs and lows. If they did, we would not need the function to do all the calculations; a simple SQL windowing approach could accomplish the same.



Change detection on retail data can highlight those customers that have unique requirements and shopping habits. Possibly this group of customers is at higher risk of churn or lower satisfaction and it is a good idea to perform further analysis using other techniques.


To quickly find those customers of interest and products with a higher number of change points we can aggregate our results. 




Since our example uses a retail data set, one question comes to mind: does seasonality impact the results? Yes - change detection algorithms do have a harder time with time series that include seasonality. It is recommended to remove the seasonal component if your results are below expectations.
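One simple way to remove an additive seasonal component, sketched here on made-up monthly data, is to subtract the per-calendar-month mean before running change detection:

```python
# Hypothetical monthly sales with a recurring summer bump (months 5-7).
months = list(range(24))                                      # two years
sales = [100 + (30 if m % 12 in (5, 6, 7) else 0) for m in months]

# Estimate the seasonal component as the mean for each calendar month,
# then subtract it from every observation.
by_month = {}
for m, s in zip(months, sales):
    by_month.setdefault(m % 12, []).append(s)
seasonal = {k: sum(v) / len(v) for k, v in by_month.items()}

deseasonalized = [s - seasonal[m % 12] for m, s in zip(months, sales)]
print(deseasonalized[:6])  # → [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

On real data the residual would of course not be zero; what matters is that the recurring seasonal swings no longer register as change points.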



So what is the value then?


In our example we reviewed retail sales using the quantity sold. We can apply the same technique to averages, counts, or standard deviations. This opens the door to various use cases such as fraud, intruder and anomaly detection, where a tangible corrective action is possible.


Another example could be a rise in call center complaints. A change detection analysis can pinpoint the time where one or more events triggered the increase in call volume.  


In manufacturing the strength of a part is affected by a change in the input materials.  


Change point detection can go back in time through the historical sensor data and highlight the time stamps where changes occurred. Those time stamps can potentially be linked to a supplier switch, a different batch of input materials, or a change in the operating environment.








Online Retail Data Set (link)

Daqing Chen, Sai Liang Sain, and Kun Guo, Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining, Journal of Database Marketing and Customer Strategy Management, Vol. 19, No. 3, pp. 197–208, 2012 (Published online before print: 27 August 2012. doi: 10.1057/dbm.2012.17).


Change Point Detection: a powerful new tool for detecting changes (link)


Change Point Detection with seasonal time series (link)

The Teradata Certified Professional Program is excited to announce the release of the first Teradata Aster Certification Exam – Teradata Aster Basics 6.10. Passing this exam will earn you the Teradata Aster Certified Professional Credential. Start preparing today and you’ll be one of the first to get Teradata Aster Certified. Click here for details, exam objectives and recommended training courses.


Introducing the New Teradata Aster Basics Certification Study Guide


The Teradata Aster Basics Certification Study Guide has been released!  Through simplified examples and explanations, this new guide helps certification candidates prepare for the Teradata Aster Basics 6.10 exam and achieve the Teradata Aster Certified Professional credential.  The Teradata Certification Study Guides are the only Teradata-authorized study publications. This guide is designed to complement Teradata training and deepen your knowledge in the defined certification objectives. All Certification Study Guides can be purchased at


Pursue Teradata Certification with Confidence

Video of slides from a recent presentation at Teradata Analytics Meet @ Linkedin, May 2017 in Sunnyvale. 



Note: Teradata solutions are not limited to just the two algorithms illustrated in the video. These are among the 10+ sequence prediction algorithms available today ...

This post outlines an approach to automatically recreate non-persistent objects in an Aster Execution Engine (AX) environment on startup.



The Aster Execution Engine (also known as Aster-on-Hadoop) exists as a collection of YARN-managed compute services without any persistent user data: user data is held in temporary or analytic tables only as long as the AX instance is running, and is persisted only by storing it back into the underlying Hadoop instance or onto some other system (e.g. load_to_teradata()).


In addition to user data, there are a number of database objects that do NOT persist between restarts of an AX instance. Per the User Guide:

Persistent Objects:
• users
• roles
• databases
• schemas
• foreign server definitions
• packaged analytics models and functions
• granted privileges on persistable objects

Non-Persistent Objects:
• tables
• views
• constraints
• indexes
• R scripts installed on the server side
• user-installed files and SQL/MR functions
• user scripts for vacuum or daily jobs


To make life easier for myself, my idea is to detect when the AX instance has been restarted, and to recreate the objects that I want to be there every time. These might be views into my data on the HDFS, custom functions and anything else that I use regularly.


An Example Approach

To make this happen, you need 2 things:

  1. A way to detect that the AX instance has restarted
  2. A way to create the objects


I believe that "Indecision is the Basis of Flexibility", so the pieces I created are full of user-customizable components.

  • To detect that the instance has restarted, AX provides a token that changes every time the instance restarts. To read this token, use the query:
         select * from nc_system.nc_instance_token;
  • Keep the token in a file, and when it changes, you've got a restart; in which case, you simply run a bunch of stuff that you can mix-and-match as needed


A sample script that does this:


#! /bin/sh
UPSTARTDIR=/root/upstart                 # the directory of this script
TOKENFILE=${UPSTARTDIR}/.currtoken       # where I store a copy of the instance token
SCANINT=10                               # how often to scan for a change (in seconds)
UPSTARTPROCDIR=${UPSTARTDIR}/upstart.d   # a dir of scripts to run when the instance restarts

source /home/beehive/config/

while true ; do
     OLDTOKEN="$(cat $TOKENFILE 2>/dev/null)"

     TOKEN="$(act -U beehive -w password -q -A -t -c "/* NOAMC */ select * from nc_system.nc_instance_token" 2>/dev/null)"

     # unable to get new token - aster instance is not available
     [ "${TOKEN}" ] || {
          sleep ${SCANINT}
          continue
     }

     # tokens do not mismatch (tokens are the same) - no restart detected
     [ "${TOKEN}" != "${OLDTOKEN}" ] || {
          sleep ${SCANINT}
          continue
     }

     echo ">>> running executables from ${UPSTARTPROCDIR}"
     for F in $(ls ${UPSTARTPROCDIR} 2>/dev/null) ; do
          [ -x ${UPSTARTPROCDIR}/${F} ] || {
               echo "--- ${F} is not executable. skipping..."
               continue
          }
          echo ">>> Running startup file ${F}"
          ${UPSTARTPROCDIR}/${F} 2>&1 | sed -e 's/^/ | /'
     done

     echo "${TOKEN}" > ${TOKENFILE}

     sleep ${SCANINT}
done

  • The script could be run in the system's inittab, as a startup process, or in any way that seems fitting. It does NOT need to run as root, but things you want it to do may need certain privileges.
  • You may want to make sure that there is only one copy of this script running.
  • The phrase '/* NOAMC */' prevents this query from showing up in the Aster Management Console process history. Use this with discretion.

When the script detects that the AX instance has restarted, it will look in the directory ${UPSTARTPROCDIR} and simply run any file there that is executable, in 'ls' order. I made it modular, so I can mix-and-match components, and made it run only the executables so I can turn off pieces just by toggling the execute bit.

These scripts may do anything you like on restart, using act, ncli or even other things in the environment. For example, I might want to maintain a set of views into Hadoop data sets; given a script that creates these views (create_sqlh_views.sql), a setup script might look like this: 

. /home/beehive/config/
act -U db_superuser -w db_superuser -d playbook \
-f /asterfs/poc_playbook/poc_scripts_kb/sqlh/create_sqlh_views.sql


There are plenty of ways to achieve this result. I hope this one proves helpful.

Some time ago I came across "Spurious Correlations", an interesting book by Tyler Vigen. This book is full of unusual and nonsensical examples where a strong (Pearson) correlation is showcased between unexpected variable combinations.


Placing blind trust in the correlation between Microsoft revenue and political action committees will tell you that Bill Gates has been controlling Congress for several years. Who knew that the number of UK citizens immigrating to the US correlates 87% with US uranium exports? The marriage rate in Wyoming apparently has a 97.6% correlation with the number of domestically produced passenger cars sold in the US. Plenty of other absurd examples can be created by performing automated comparisons between unrelated data sets. (Tip: Aster can do this with a single sql statement)


Many papers have been written about the Theory of the Stork, supported by various data sets that show significant correlation between baby births and the size of the stork population in a specific area such as Northern Europe. 




The storks typically fly south for the winter and return north in early spring. Babies born in March-April were typically conceived in June of the previous year. Midsummer ("solstice") celebrations take different forms depending on culture and location, but a common thread among them is a focus on fertility, family and new beginnings. This explains why so many weddings are scheduled in June and why there are so many kids' birthday parties to attend in spring and early summer! In this historic example the weather acts as a hidden variable and produces a non-causal correlation.


So what happened to the story behind our headline?  First we build a database table based on the number of movies that Bruce Willis starred in according to and include boiler related fatal accident data available on 


Next we check the correlation between our two variables:


Correlation is symmetric (A is correlated with B and B is correlated with A).  Causality is much more interesting and useful (A causes B and B does not cause A).  
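The symmetry of correlation is easy to verify directly. A small sketch with invented yearly counts (the real series used below come from the sources cited in the post):

```python
def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

movies  = [3, 1, 4, 2, 5, 3, 4]   # made-up yearly movie counts
boilers = [5, 2, 6, 3, 8, 4, 6]   # made-up yearly accident counts

# Swapping the arguments changes nothing - correlation carries no direction.
print(pearson(movies, boilers) == pearson(boilers, movies))  # → True
```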


We can perform a statistical test that was developed by George Sugihara of the Scripps Institution of Oceanography to review causality. Convergent Cross Mapping (CCM) tests the cause and effect relationship between two time series variables. 


Takens' theorem is the delay embedding theorem by Floris Takens. In the study of dynamical systems, a delay embedding theorem specifies how a chaotic dynamical system can be reconstructed from a sequence of observations.


CCM leverages this approach to determine causality between two time series. If variable C is the cause of variable E, then information in time series C is also present in time series E. Historical observations from time series E can therefore be used to estimate the state of time series C.


Why is the algorithm called Convergent Cross Mapping?

  • CCM uses the concept of cross mapping: the process of using the historical record from one series to predict variable states in another series. 
  • CCM uses the property of convergence. The first step of the algorithm is to choose a library of short time series from the effect variable. If the cross-mapped estimates become more accurate as the library size grows (from, say, 3 to 10 observations), we have better evidence of real-world causation.


Basic Steps:

  1. A library of short time series is constructed from time series E. This is called a "shadow manifold".
  2. The library is used to predict values of the cause variable using a k-nearest neighbors approach.
  3. The correlation between the predictions of the cause time series and actual values in time series C are computed.
  4. The size of the library is increased to check convergence and determine if there is a causal relationship.
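The four steps above can be sketched from scratch. This is a hedged illustration on an invented coupled system (C drives E), not the Aster CCM function; the embedding dimension, coupling strength and neighbor count are all arbitrary choices:

```python
import math

def shadow_manifold(series, dim):
    """Step 1: build lagged-coordinate vectors (the 'shadow manifold')."""
    return [tuple(series[t - i] for i in range(dim))
            for t in range(dim - 1, len(series))]

def corr(xs, ys):
    """Plain Pearson correlation (used in step 3)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def cross_map(effect, cause, dim, lib_size):
    """Steps 2-3: predict the cause from the effect's manifold via k-NN,
    then correlate the predictions with the actual cause values."""
    manifold = shadow_manifold(effect, dim)
    lib = manifold[:lib_size]
    offset = dim - 1
    preds, actual = [], []
    for t, point in enumerate(manifold):
        # nearest library neighbors of this point (skip exact self-matches)
        dists = sorted((math.dist(point, q), j) for j, q in enumerate(lib))
        nbrs = [j for d, j in dists if d > 0][:dim + 1]
        preds.append(sum(cause[offset + j] for j in nbrs) / len(nbrs))
        actual.append(cause[offset + t])
    return corr(preds, actual)

# Invented coupled system: E is nudged by C at every step.
C, E = [0.4], [0.2]
for _ in range(200):
    C.append(3.8 * C[-1] * (1 - C[-1]))
    E.append(3.5 * E[-1] * (1 - E[-1]) + 0.1 * C[-1])

# Step 4: grow the library and watch whether the estimate converges.
corr_small = cross_map(E, C, dim=3, lib_size=10)
corr_large = cross_map(E, C, dim=3, lib_size=150)
print(f"library 10: {corr_small:.3f}  library 150: {corr_large:.3f}")
```

Improving cross-map skill as the library grows is the "convergence" that signals a causal link from C to E.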


We can use the Aster CCM function to determine the optimal value for "EmbeddingDimensions", the number of lags or past values that we will use to predict a given value in the time series. 


Note there are a number of requirements to allow the function to correctly determine the optimal lags:

  1. the cause and effect arguments are set to the same column
  2. the SelfPredict argument is set to true
  3. the LibrarySize argument is not specified
  4. only a single cause and single effect column is allowed


We can specify one or more EmbeddingDimensions. If we omit the parameter the function will default to using two lags.



Result: the optimum number of lags is 2.


Now that we know the proper value for EmbeddingDimensions, we can execute the CCM function against our input data. It is not required to specify the LibrarySize parameter; by default the function tries libraries of size "embedding dimension + 1" and "100" (assuming we have that many observations).




In the output the two columns of interest are:

  • correlation: correlation between the values predicted by the effect attribute and the actual value of the cause attribute. 
  • effect_size: estimated effect size of increasing library value from smallest value to the largest value. An effect_size greater than approximately 0.25 indicates a causal relationship.



The effect size is greater than 0.25 for both cause-and-effect directions, which indicates causality between our two time series. The number of Bruce Willis movies is a stronger cause of exploding boilers than the reverse, due to the greater effect size (0.59 > 0.27).


Of course we have to keep in mind that intuition has to play a big role and we cannot blindly trust statistics. Our starting point was a spurious correlation!



Now that we have gone through a simple exercise to get a feel for the CCM function let us review a second data set that contains monthly sales of bathing suits and the average high temperatures for the Texas region.


First we check the correlation between sales and temperature.  


Next we determine the ideal number of EmbeddingDimensions:


And we execute CCM with 3 EmbeddingDimensions:



We find that sales do not cause higher temperatures: the effect size is only 0.011.

The temperature does cause higher sales of bathing suits: the effect size is 0.39, which indicates we have convergence and causality. 


Try it out for yourself. I always wondered about the correlation and causality between the cost of bananas and the revenue generated by ski areas in the USA. 









Spurious Correlations 


George Sugihara et al., Detecting Causality in Complex Ecosystems, Science 338, 496 (2012). DOI: 10.1126/science.1227079


Amir E. BozorgMagham and Shane D. Ross, Dynamical System Tools and Causality Analysis, Engineering Science and Mechanics (ESM), Virginia Tech

We know that understanding the customer journey is important, but how can we be sure we are telling that story to the best of our ability? It may be impossible to completely understand everything that affects your customer. BUT! Therein lies nearly unlimited room to improve how we model the customer experience. Free-form text in customer surveys and reviews, for instance, can be tricky to analyze. Even so, it represents one of the clearest pictures of the customer's perspective attainable. In this example, we will be using customer survey data, topic modeling, and sentiment analysis to create a contextual customer touchpoint.


--Create new table from surveys in Hadoop. Compress and analyze.
CREATE VIEW tb.survey_raw AS
SELECT * FROM load_from_hcatalog (
     ON public.mr_driver
     SERVER ('')
     USERNAME ('hive')
     DBNAME ('default')
     TABLENAME ('web_surveys')
);

CREATE TABLE tb.survey AS
SELECT survey_id
     , user_id
     , nps
     , survey_text
     , survey_timestamp
     FROM tb.survey_raw;
ANALYZE tb.survey;

--Tokenize surveys with Text Parser
DROP TABLE IF EXISTS tb.survey_tp;
CREATE TABLE tb.survey_tp AS
SELECT * FROM Text_Parser (
ON tb.survey
Text_Column ('survey_text')
Case_Insensitive ('true')
Stemming ('true')
Total ('true')
Punctuation ('[\\\[.,?\!:;~()\\\]]+')
Remove_Stop_Words ('true')
Accumulate ('survey_id')
List_Positions ('true')
);
ANALYZE tb.survey_tp;

--Check top words
--Words that occur too frequently may not be useful in our analysis
--It is usually a good practice to remove these
--This will depend on your own use case and data
SELECT token, SUM(frequency)
FROM tb.survey_tp
GROUP BY 1
ORDER BY 2 DESC
LIMIT 200;

--Check lowest word counts
--Words that only occur very few times should also be considered for removal
SELECT token, SUM(frequency)
FROM tb.survey_tp
GROUP BY 1
HAVING SUM(frequency) <= 2500
ORDER BY 2
LIMIT 2000;

--Removing tokens with less than 2500 appearances
--or those in the top 19 appearances.
--Before we reach this point, stop words should
--already be removed.
DELETE FROM tb.survey_tp
WHERE token IN(
     SELECT token FROM (
          SELECT token, SUM(frequency)
          FROM tb.survey_tp
          GROUP BY 1
          HAVING SUM(frequency) <= 2500) a)
OR token IN(
     SELECT token FROM (
          SELECT token, SUM(frequency)
          FROM tb.survey_tp
          GROUP BY 1
          ORDER BY 2 DESC
          LIMIT 19) a);
VACUUM tb.survey_tp;
ANALYZE tb.survey_tp;

--Create LDA Model
--Training converged after 37 iterate steps with delta 9.028843880595204E-5
--There are 1274580 documents with 23101982 words in the training set, the perplexity is 317.822466
--Elapsed Time: 03:52.518
DROP TABLE IF EXISTS tb.survey_lda_model;
SELECT * FROM LDATrainer (
ON (SELECT 1) PARTITION BY 1
InputTable ('tb.survey_tp')
ModelTable ('tb.survey_lda_model')
TopicNumber ('20')
DocIDColumn ('survey_id')
WordColumn ('token')
CountColumn ('frequency')
Seed ('3')
);
ANALYZE tb.survey_lda_model;

--Show model summary
--The model table is trained with the parameters: topicNumber:20, vocabularySize:1155, alpha:0.100000, eta:0.100000
--There are 1274580 documents with 23101982 words in the training set, the perplexity is 317.822466
SELECT * FROM LDATopicPrinter (
ON tb.survey_lda_model PARTITION BY 1
ShowSummary ('true')
);

--Show model detail
--Create table of highly weighted words
DROP TABLE IF EXISTS tb.survey_lda_topics;
CREATE TABLE tb.survey_lda_topics AS
SELECT * FROM LDATopicPrinter (
ON tb.survey_lda_model PARTITION BY 1
ShowWordWeight ('true')
ShowWordCount ('true')
)
WHERE wordweight > .025
ORDER BY topicid, wordweight DESC;

--Test Model
--There are 424487 valid documents with 7712740 recognized words in the input, the perplexity is 318.322719
--Outputtable "oap"."tb_lda_inf" is created successfully.
DROP TABLE IF EXISTS tb.survey_lda_inf;
SELECT * FROM LDAInference (
ON (SELECT 1) PARTITION BY 1
InputTable ('tb.survey_tp')
ModelTable ('tb.survey_lda_model')
OutputTable ('tb.survey_lda_inf')
DocIDColumn ('survey_id')
WordColumn ('token')
CountColumn ('frequency')
);
ANALYZE tb.survey_lda_inf;

--Find and explore strong topic matches
SELECT * FROM tb.survey_lda_inf LIMIT 500;

--Accumulate topic words for easy analysis
DROP TABLE IF EXISTS tb.survey_lda_acc;
CREATE TABLE tb.survey_lda_acc AS
SELECT * FROM nPath (
     ON tb.survey_lda_topics
     PARTITION BY topicid
     ORDER BY wordweight DESC
     MODE (NONOVERLAPPING)
     PATTERN ('T*')
     SYMBOLS (
          TRUE AS T)
     RESULT (
          FIRST (topicid OF T) AS topicid,
          ACCUMULATE (word OF ANY(T)) AS words)
);

--Sometimes even if you include a customer satisfaction score
--in your survey, the actual customer text may tell a
--different story!
--We can join sentiment scores to LDA topic scores
--to infer positive and negative elements about
--a customer's experience.
--We categorize our documents here using the highest
--ranking LDA topic score.
--Add sentiment and topic columns
DROP TABLE IF EXISTS tb.survey_se;
CREATE TABLE tb.survey_se AS
SELECT se.*, lda.topicid, lda.topicweight FROM ExtractSentiment (
ON tb.survey
Text_Column ('survey_text')
Accumulate ('survey_id', 'user_id', 'nps', 'survey_timestamp')
Level ('DOCUMENT')
) se
, (SELECT docid, topicid, topicweight
     , RANK() OVER (PARTITION BY docid ORDER BY topicweight DESC) AS topicrank
     FROM tb.survey_lda_inf) lda
WHERE se.survey_id = lda.docid
AND lda.topicrank = 1;
ANALYZE tb.survey_se;

SELECT * FROM tb.survey_se LIMIT 500;

--What relationship does text sentiment have with NPS?
CREATE TABLE tb.nps_se AS
SELECT survey_id
     , nps
     , out_polarity
     , CASE
          WHEN nps > 0
          AND out_polarity = 'POS'
          THEN 1
          ELSE 0
          END both_pos
     , CASE
          WHEN nps < 0
          AND out_polarity = 'NEG'
          THEN 1
          ELSE 0
          END both_neg
     , CASE
          WHEN nps > 0
          AND out_polarity = 'NEG'
          THEN 1
          WHEN nps < 0
          AND out_polarity = 'POS'
          THEN 1
          ELSE 0
          END conflict
FROM tb.survey_se;
ANALYZE tb.nps_se;

SELECT * FROM tb.nps_se LIMIT 200;

--How often do the metrics agree? Not taking into account neutral measures.
SELECT pos_total, (100 * pos_total / cnt)::DECIMAL(4,1) pos_pct,
     neg_total, (100 * neg_total / cnt)::DECIMAL(4,1) neg_pct,
     con_total, (100 * con_total / cnt)::DECIMAL(4,1) con_pct
FROM (SELECT SUM(both_pos) pos_total, SUM(both_neg) neg_total, SUM(conflict) con_total, COUNT(*) cnt
FROM tb.nps_se) a;
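The same agreement/conflict bookkeeping is easy to sanity-check outside the database. The sketch below is plain Python on made-up survey pairs; it assumes the NPS values have been recoded so that positive means a satisfied customer, matching the `nps > 0` / `nps < 0` tests in the SQL above:

```python
# Hypothetical (recoded_nps, text_polarity) pairs; polarity comes from a
# sentiment extractor run over the survey text
surveys = [
    (3, "POS"), (2, "POS"), (-1, "NEG"), (1, "NEG"),
    (-2, "POS"), (4, "POS"), (-3, "NEG"), (2, "NEG"),
]

# Score and text agree on a positive experience
both_pos = sum(1 for nps, pol in surveys if nps > 0 and pol == "POS")
# Score and text agree on a negative experience
both_neg = sum(1 for nps, pol in surveys if nps < 0 and pol == "NEG")
# Score and text point in opposite directions
conflict = sum(1 for nps, pol in surveys
               if (nps > 0 and pol == "NEG") or (nps < 0 and pol == "POS"))

total = len(surveys)
print(f"agree positive: {100 * both_pos / total:.1f}%")
print(f"agree negative: {100 * both_neg / total:.1f}%")
print(f"conflict:       {100 * conflict / total:.1f}%")
```

The conflict bucket is the interesting one: those are the surveys where the number the customer gave you and the words they wrote disagree.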

--Does the inferred topic of a document imply a change in sentiment?
SELECT a.*, b.words
FROM (SELECT topicid, AVG(nps), AVG(out_strength)
FROM tb.survey_se
GROUP BY 1
ORDER BY 1) a,
tb.survey_lda_acc b
WHERE a.topicid = b.topicid;

--Finally, we can create a table of contextual touchpoints.
--These records can easily be used with nPath logic
--to craft a complete customer experience.
DROP TABLE IF EXISTS tb.survey_events;
CREATE TABLE tb.survey_events AS
SELECT survey_id
     , user_id
     , topicid
     , survey_text
     , out_polarity sen_polarity
     , survey_timestamp
FROM tb.survey_se;

Very Good Bad Model

Posted by mt186048 Apr 17, 2017

I hate to say this, but we build models and discover business insights off of faulty data. We like to think that the data we work with is pristine, flawless, perfect, but often it's just plain dirty. The exceptions to this rule are often similar to the Iris Data Set: complete and probably perfect, but minuscule in comparison to the size of modern-day data spaces. 


There are three main things that make data dirty:

  1. Missing Values 
  2. Incorrect Domain Values
  3. False Data


We can often identify the first two during data exploration. Nulls and instances like a work tenure of 2107 years jump out fairly easily. Sometimes we're able to update these with a ground truth from elsewhere in the data space. Other times we do data cleansing by removing rows or replacing values as appropriate. 


It is much harder to identify or fix the third type of dirty data: data that is non-observably incorrect. This may be tolerable in the day-to-day life of our data, but it is especially bad when we think about supervised learning problems. 


Imagine that we poll 100 people, ask them "Do you think you're happier than the average person?", and get the distribution below:


We have 71 hypothetical people who see themselves as happier than average and 29 who see themselves as less happy than average. But how many of those 71 were actually just uncomfortable saying they were unhappy in a survey? And are any of our less-happy-than-average people simply humble, and uncomfortable saying they were better than average in a survey?


                         What They Felt
                         Happy    Unhappy
  What They Said   Yes     ?         ?       71
                   No      ?         ?       29


We could build a model on this data set that predicts perfectly and has a huge F-score. But depending on how deceptive our poll-takers were feeling, we may end up with a model that is bad at actually predicting happiness. Perhaps this is too existential an example, but in our businesses, do we want to predict the real answers or the answers skewed by false data?


This gives us something to think about in all of our predictive models. When we have a model with recall of 70%, is it that we are missing 30% of the category, or that 30% of the category isn't actually that category? I hypothesize that it's likely somewhere in between.
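That hypothesis is easy to sanity-check with a quick simulation. The sketch below (plain Python, entirely synthetic data) scores a model that predicts the true label perfectly against ground-truth labels of which 30% were recorded incorrectly; the measured recall lands near 70% even though the model itself is flawless:

```python
import random

random.seed(42)

def measured_recall(n, flip_rate):
    """Recall of a PERFECT model when scored against noisy ground truth."""
    true_labels = [random.random() < 0.5 for _ in range(n)]
    # Observed labels: each true label is mis-recorded with prob flip_rate
    observed = [(not t) if random.random() < flip_rate else t
                for t in true_labels]
    predictions = true_labels  # the model always predicts the true label
    true_pos = sum(1 for p, o in zip(predictions, observed) if p and o)
    observed_pos = sum(observed)
    return true_pos / observed_pos

# With 30% of labels flipped, the perfect model scores roughly 70% recall
print(measured_recall(100_000, 0.30))
```

In other words, a chunk of any recall shortfall can come from the labels rather than the model, which is exactly the "somewhere in between" intuition above.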


So what can we do?

  1. Think about how much you trust your training data set. Do you think most of the values are right? Was anything done by hand when assigning those values? Are there assumptions being made to assign the predictive variable?
  2. Cluster! Without including what you are trying to predict, cluster your data. Then look at how many of each predictive value end up in each cluster. Have a cluster with 90% churn customers and 10% non-churn? Those non-churners are probably worth looking into.
  3. Look at what is going wrong. Digging into instances where your model and the ground truth disagree is a great way to improve the model, improve the ground truth, and discover new business insights.
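Step 2 is mechanical once the clustering has run. The sketch below (plain Python, with hypothetical cluster assignments standing in for the output of whatever clustering tool you use) tallies the label mix per cluster and counts the minority members worth auditing:

```python
from collections import defaultdict

# Hypothetical output of a clustering step run WITHOUT the churn label:
# (cluster_id, churned) pairs, one per customer
assignments = [
    (0, 1), (0, 1), (0, 1), (0, 1), (0, 1), (0, 1), (0, 1), (0, 1), (0, 1), (0, 0),
    (1, 0), (1, 0), (1, 0), (1, 0), (1, 0), (1, 1), (1, 0), (1, 0), (1, 0), (1, 0),
]

by_cluster = defaultdict(list)
for cluster_id, churned in assignments:
    by_cluster[cluster_id].append(churned)

# A cluster that is 90% churn but contains a few "non-churners" is a hint
# that those minority labels deserve a manual audit
for cluster_id, labels in sorted(by_cluster.items()):
    churn_rate = sum(labels) / len(labels)
    minority = min(sum(labels), len(labels) - sum(labels))
    print(f"cluster {cluster_id}: churn rate {churn_rate:.0%}, "
          f"{minority} customer(s) to audit")
```

The customers flagged here are exactly the ones from step 3: cases where the structure of the data and the assigned label disagree.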