Learn Aster


Originally published at Code-free Visual Path Analysis: Watch Now.

 

Marketers need to visually analyze customer paths. IT professionals should be able to visually analyze server logs. Healthcare professionals want to visually analyze treatment paths.

 

There is no reason any of these tasks should require advanced coding skills.

 

Check out these demo videos we recently put together for the Teradata Path Analysis Guided Analytics Interface. You’ll see how easy it is to visually explore paths without writing any code. You can export lists of customers (or servers, or patients) who have completed paths or are on specific paths. And you can investigate text associated with events on these paths. All you need to do is specify a few parameters in the interface and click a few buttons.

 

Predictive Paths

In this demo, we use the predictive paths capabilities of the Path Analysis Interface to identify two sets of customers. One set of customers is at risk of churn. The other group is prospects we may be able to push across the line to conversion.

 

Cart Abandonment

In this video, we look at “cart abandonment” scenarios with an online banking data set and an eCommerce data set. Also, we showcase the “Add Drops” feature that makes it visually apparent where prospects and customers drop off paths within the Path Analysis Interface.

 

Leveraging Text

The text analytics capabilities of the Path Analysis Interface are unique and powerful. In this demo, we use text to provide context around complaints within a multi-channel banking data set.

 

Healthcare Billing

Here, we are looking at healthcare billing data. We want to make it apparent that path analysis use cases are about much more than marketing. Healthcare professionals may also want to look at paths to certain procedures, paths around treatment and recoveries, or paths to specific diagnoses.

 

If you’re interested in visually exploring paths and patterns, please contact your Teradata account executive or send me a note at ryan.garrett@thinkbiganalytics.com. We can have you up and running with the Teradata Path Analysis Guided Analytics Interface on Teradata, Aster, or the Teradata Analytics Platform in no time!

XGBoost with SQL

Posted by Andrea Kress, Dec 7, 2017

XGBoost has gotten a lot of attention recently as the algorithm has been very successful in machine learning competitions. We in Aster engineering have been getting a lot of requests to provide this function to our customers. In AA 7.0, we’ve released an XGBoost/Gradient Boosting function.

 

The boosting techniques behind XGBoost can be used to improve the performance of any classifier. Most often, they’re used with decision trees, which is how we’ve built the function in Aster.

 

Decision trees are a supervised learning technique that develops rules (“decisions”) to predict the outcome associated with an observation. Each rule is a binary split based on the value of a single predictor; the outcome of that split determines which rule is applied next, and so on, until a prediction can be made. The rules can be easily summarized and visualized as a tree, as shown below.

[Figure: Example of a decision tree]

In this tree, the outcome is 0, 1, 2, 3, or 4, where 0 indicates no heart disease, and 1 through 4 represent increasing severity of heart disease. The first “rule” is based on the value of the “Thal” column. If it is anything other than 6 or 7, the predicted outcome is 0. If the value in the Thal column is 6 or 7, the next step is to look at the value in the STDep column. If it is less than 0.7, the next step is to look at the value in the Ca column; if it is greater than or equal to 0.7, the next step depends on the value in the ChestPain column. To make a prediction for an observation, follow the rules down the tree until you reach a leaf node. The number at the leaf node is the predicted result for that observation.

 

A couple of techniques that can significantly improve the performance of decision trees are bagging and boosting. Bagging stands for “bootstrap aggregation”. Bootstrapping is a statistical technique in which multiple datasets are created from a single dataset by taking repeated random samples, with replacement, from the original data. In this way you create a large number of slightly different datasets. Bagging starts by bootstrapping a large number of datasets and creating a decision tree for each one. The trees are then combined by majority vote (for classification problems) or by averaging (for regression problems).
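For regression, combining by averaging is just the mean of the individual trees’ predictions (see James et al. in the references below):

$$\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x)$$

where $\hat{f}^{*b}$ is the tree fit to the $b$-th bootstrapped dataset and $B$ is the number of bootstrapped datasets.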

 

Random forest is a very popular variant of bagging. With random forests, you use bootstrapping to create new datasets just as you do with bagging, but at each split you only consider a random subset of the predictors (typically on the order of the square root of the number of available predictors). This forces the algorithm to consider a wider range of predictors, creating a more diverse set of trees and a more robust model.

 

Boosting is a different approach. With boosting, you build trees sequentially, and each tree focuses specifically on the errors made by the previous tree. The idea is to gradually build a better model by improving its performance at each step. This is different from bagging and random forest: at each stage you try to improve the model by specifically looking at the points the previous model didn’t predict correctly, instead of just creating a bunch of models and averaging them all together.

 

There are several approaches to boosting. XGBoost is based on gradient boosting.

 

The gradient boosting process starts by creating a decision tree to fit the data. You then use this tree to make a prediction for each observation and calculate the error of each prediction. Even though you’re predicting the same data that you used to build the tree, the tree is not a perfect model, so there will be some error. In the next iteration, this set of prediction errors becomes the new dataset: each data point is replaced by the delta between the actual result and the predicted result. You then build a tree that tries to fit this new dataset of deltas, make new predictions, and so on, replacing the dataset at each iteration with the errors made by the previous one. When you add these trees together, the result should be closer to the original values you were trying to fit, because you’re adding a model of the error. This process is repeated for a specified number of iterations.
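In symbols, with squared-error loss this amounts to repeatedly fitting trees to residuals (Friedman’s gradient boosting, cited in the references):

$$r_i^{(m)} = y_i - F_{m-1}(x_i), \qquad F_m(x) = F_{m-1}(x) + h_m(x)$$

where $F_{m-1}$ is the model after $m-1$ steps, $h_m$ is the tree fit to the residuals $r^{(m)}$, and $F_0$ is the initial tree.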

 

[Figure: Flow chart of the gradient boosting process]

Gradient boosting and XGBoost use a number of other optimizations to further improve performance.

 

Regularization is a common technique in machine learning. It refers to penalizing the number or the magnitude of the model parameters. It’s a way to prevent overfitting: building a model that fits the training data so closely that it becomes inflexible and doesn’t perform well on new data.

 

When working with decision trees, regularization can be used to control the complexity of the tree, either by penalizing the number of leaf nodes or the magnitude of the values assigned to each leaf node.
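The XGBoost paper (Chen & Guestrin, cited in the references) makes this explicit by adding a penalty term to the training objective. For an ensemble of trees $f_k$, where each tree has $T$ leaves and leaf values $w$:

$$\mathrm{Obj} = \sum_i l(y_i, \hat{y}_i) + \sum_k \Omega(f_k), \qquad \Omega(f) = \gamma T + \tfrac{1}{2} \lambda \lVert w \rVert^2$$

so larger trees and larger leaf values are both penalized.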

 

Typically in gradient boosting, when you add the trees together, each tree is multiplied by a number less than 1 to slow the learning process down (boosting is often described as a way to “learn slowly”). The idea is that moving gradually toward an optimal solution is better than taking large steps that might overshoot the optimal result.
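With a shrinkage factor (learning rate) $\nu$, the update from the gradient boosting step above becomes:

$$F_m(x) = F_{m-1}(x) + \nu\, h_m(x), \qquad 0 < \nu \le 1$$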

 

Subsampling is also a common technique in machine learning. It refers to building trees using only a subset of the rows or columns. The idea is to force the process to consider a more diverse set of observations (rows) or predictors (columns), so that it builds a more robust model.

 

The Aster XGBoost function also boosts trees in parallel. This is a form of row subsampling: each vworker is assigned a subset of the rows and creates its own set of boosted trees based on that data.

 

Stopping criteria are another important factor when building decision trees. In the Aster XGBoost function, you specify the exact number of boosting steps. The function also has stopping criteria that control the size of each tree; these arguments are analogous to those used in the other Aster decision tree functions Single_Tree_Drive, Forest_Drive, and AdaBoost_Drive.

 

Here’s the syntax of XGBoost_Drive. Refer to the Aster Analytics Foundation User Guide (Release 7.00.02, September 2017) for more information about the function arguments.

 

[Figure: Syntax of XGBoost_Drive]

Here’s an example. The dataset is available from the UCI Machine Learning Repository. It’s a set of fetal monitoring observations classified into 3 categories. There are 2126 observations and 21 numeric attributes. The first few rows are shown below.

[Figure: First few rows of the dataset]

As usual when training a model, we divide the dataset into training and test sets, and use the training set to build the model. Here’s a sample function call:
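A minimal sketch of such a call, assuming the training split is in a table called ctg_train and the model is written to ctg_model; the argument names and values below are assumptions for illustration, so verify them against the function reference in the user guide:

-- Hypothetical call; argument names are assumptions to be checked against
-- the Aster Analytics Foundation User Guide (Release 7.00.02).
SELECT * FROM XGBoost_Drive (
    ON (SELECT 1) PARTITION BY 1
    InputTable ('ctg_train')           -- training split of the CTG data
    OutputTable ('ctg_model')          -- table that will hold the boosted trees
    ResponseColumn ('nsp')             -- the 3-category outcome
    PredictionType ('classification')
    IterNum ('10')                     -- number of boosting steps
    MaxDepth ('5')                     -- per-tree stopping criterion
    ShrinkageFactor ('0.1')            -- learning rate
);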

 

The function displays a message when it finishes.

 

We can use the XGBoost_Predict function to try out the model on the test dataset:
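A minimal sketch of the scoring step, assuming the held-out rows are in a hypothetical ctg_test table; the ON-clause aliases and argument names are assumptions to be checked against the user guide. The predictions land in the ctg_predict table queried below:

-- Hypothetical call; aliases and argument names are assumptions.
CREATE TABLE ctg_predict DISTRIBUTE BY HASH (id) AS
SELECT * FROM XGBoost_Predict (
    ON ctg_test AS input PARTITION BY ANY        -- held-out observations to score
    ON ctg_model AS modeltable DIMENSION         -- model produced by XGBoost_Drive
    IdColumn ('id')                              -- identifier carried into the output
    Accumulate ('nsp')                           -- keep the actual class for comparison
);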

 

 

Here are the first few rows of the output:

select id, nsp, prediction from ctg_predict;

To conclude, we’re very excited to make this algorithm available to our customers. Try it out!

 

References:

James, G., Witten, D., Hastie, T., & Tibshirani R. (2013). An Introduction to Statistical Learning with Applications in R. Available at: http://www-bcf.usc.edu/~gareth/ISL/ISLR%20First%20Printing.pdf

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Available at: https://web.stanford.edu/~hastie/Papers/ESLII.pdf

Friedman, J. “Greedy Function Approximation: A Gradient Boosting Machine.” IMS 1999 Reitz Lecture. Available at: https://statweb.stanford.edu/~jhf/ftp/trebst.pdf

Chen, T., Guestrin, C. “XGBoost: A Scalable Tree Boosting System.” KDD ’16, August 13-17, 2016, San Francisco, CA. Available at: http://www.kdd.org/kdd2016/papers/files/rfp0697-chenAemb.pdf

 

Dataset used in example:

Cardiotocography Data Set. Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

The blogosphere is full of sound bites, anecdotes, and clichés about Machine Learning/AI, and it's important to discern their place in the larger picture of Advanced Analytics/Data Science - especially for those in business and IT who are estranged from the 'religious experience' of using the different tools available.

What is Advanced Analytics?


Machine Learning and AI methods like Deep Learning fall into a larger 'analytic' library of things you can do with your data, alongside access methods such as SQL and languages like Python and R. The best analogy is a bar or kitchen stocked with the most exotic stuff. As a bartender or chef, you get to make the best cocktails or entrees, drawing from an assemblage of options. You may have your own favorites, but it's also important to fine-tune what you serve to an actual need that someone might have. Here's the thing - you may have the most expensive single malt or ingredient in your bar or kitchen, and it doesn't mean everyone will want it! So variety is the key to delivering precisely what the business or end user wants!

Some Examples of Advanced Analytics Choices:

I need to find the top 10 key phrases in product or show reviews that tell me about characters and powerful emotions. There are several options:

  1. Start n-gramming with 1, 2, and 3 terms at a time. Run keyword weighting with TF-IDF. Show the top 10 unigrams, bigrams, and trigrams with the highest TF-IDF values. We can do this with some clever SQL (see the sketch after this list) :)
  2. Run a CRF model trained on a big corpus over POS (part-of-speech) output, then weight and sort the results. You get the benefit of interesting verb and noun phrases (intelligent)
  3. Run a Word2Vec (Deep Learning/GPU) pass on the data and try to construct a neural embedding model to discover the phrases.
  4. <Add your own recipe/cocktail>
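As a minimal sketch of option 1, assume a tokenizer has already produced a table ngrams(doc_id, ngram) with one row per occurrence of each 1-, 2-, or 3-gram in each review; the table and column names here are hypothetical. The TF-IDF weighting and top-10 ranking can then be written in plain SQL:

-- Hypothetical schema: ngrams(doc_id, ngram), one row per occurrence.
WITH tf AS (                      -- term frequency per document
    SELECT doc_id, ngram, COUNT(*) AS term_count
    FROM ngrams
    GROUP BY doc_id, ngram
),
df AS (                           -- document frequency per n-gram
    SELECT ngram, COUNT(DISTINCT doc_id) AS doc_count
    FROM ngrams
    GROUP BY ngram
),
total AS (                        -- corpus size
    SELECT COUNT(DISTINCT doc_id) AS n_docs FROM ngrams
)
SELECT tf.ngram,
       SUM(tf.term_count * LN(total.n_docs * 1.0 / df.doc_count)) AS tfidf_score
FROM tf
JOIN df ON tf.ngram = df.ngram
CROSS JOIN total
GROUP BY tf.ngram
ORDER BY tfidf_score DESC
LIMIT 10;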

I want to group all my call center text into clusters, so I can see what they are talking about:

  1. Do term weighting, compute a distance metric like Euclidean or cosine, and run Graph Modularity to carve out clusters; then run phrase detection on each of those. Use a percentile technique to decide the number of significant clusters.
  2. Run an LSI (Latent Semantic Indexing) dimension reduction and run K-Means. Decide the number of clusters after finding the "elbow of significance".
  3. Run an LDA model and specify how many topic clusters you want. Iterate on the number of topics until it makes sense.
  4. <add your own ingredients/mixology>

With some cleverness, the iterative and grouping techniques above avoid mainstream Machine Learning completely and get to 80% of the answer with simple sophistication. Of course, using advanced ML techniques will increasingly get us to 95% of the answer - especially when it comes to fraud and other mission-critical "fine-grained" use cases. Let's put that aside for a moment.

For a lot of simple use cases or basic hypothesis testing, an 80% answer may be directionally "good enough" for the business, and that's OK! The key is to have many options, all the way from simple to complex, with well-understood tradeoffs such as performance and tuning complexity.

How do you get everything in one place? See Sri's blog:

Does your advanced analytics platform have what it takes to access the same data, combine all of the analytics, and run in-database? Check out Sri Raghavan's Blog.

Here are some things users can do with the above:

  • Come up with creative solutions that use the best of breed. Use SQL for what it's good at and run R in-database to do scoring or modeling using, say, ARIMA.
  • Or run LSTM deep learning for time-series forecasting on TensorFlow while using SQL to curate and organize the data in the same transaction.
  • Run large-scale PCA using SQL on Aster and do a logistic regression in a Spark cluster packaged in the platform. Put the results in a table for Tableau users to see the significant variables in a dashboard.
  • Run Aster's XGBoost on a churn analytics data set to create a model and score users on their propensity to churn. Run Aster's Hidden Markov Model (graph implementation) on the data set to find the latent transition matrix and emission probabilities.

Hope you enjoyed this blog post on the postmodern definition and sample usage of Advanced Analytics ;). More to come in future blog posts on the science and art of applying advanced analytics to business problems.

Using a popular analytic technique to understand behaviors and patterns, data scientists reveal a subtle but critical network of influence and competition, giving this gaming company the ability to attract and retain gamers in this $109B industry.

 

This is insight that can only be found when you combine multiple sources of data with analytics. With Teradata Aster® Analytics, users apply cFilter, a function tailor-made for understanding behaviors and opinions.

 

Looking into the data, amazing patterns emerge.

 

Understanding the relationships that drive user behavior can help developers create better games to attract users, prevent churn, and determine how gamers influence each other.

 

Art of Analytics: The Sword - Sri Raghavan  

 

Understanding relationships and influence with a collaborative filter helps multiple industries. 

 

Media

Entertainment

Retail


These industries can understand their customers' behaviors, then influence the customers who, in turn, influence their networks.

 

 


Related Links:

________________

 

Customer Churn

cFilter

Art of Analytics

Predictive Analysis

Teradata Aster® Analytics

Originally published at /community/bso/blog/2017/11/16/what-happens-next-using-paths-to-predict-prevent-and-accelerate-behavior.

 

Do you know which customers are likely to churn? Which prospects are likely to convert?

 

Historical path analysis is a critical factor in such predictions. The problem is that path analysis is hard. And even when companies have such capabilities, they often reside in the hands of a few specialists – or vendor consultants.

 

The business analysts, marketers and customer support professionals who could ultimately act on these predictive insights to improve customers’ and prospects’ journeys are effectively left out in the cold. Even the specialists are ultimately confined to the limits of their tools.

 

Ask anyone who has used a traditional business intelligence tool to understand customer paths. It requires significant time and patience to shoehorn this type of analysis into a tool that was not designed for it. To begin with, just manipulating the data to build an event table for a BI tool is a significant hurdle. And even at the end of such a project, organizations end up with a static, inflexible report on historical data that does little to help them prevent future churn or accelerate future conversions. (This is hardly a criticism of BI tools, as their benefits and value are well documented. I’m only pointing out that path analysis has historically not been one of their strong suits.)

 

Other advanced approaches leverage statistical tools like R and programming languages like Python. They may incorporate sophisticated analysis techniques like Naïve Bayes text classification and Support Vector Machine (SVM) modeling. But, at the end of the day, these are not tools or techniques for businesspeople.

 

Ultimately, what matters is providing your business teams the opportunity to influence the customer experience in a manner that is positive for your business.

 

The solution is to bring path analysis – including predictive path analysis – to the business. For such a solution to succeed, it must be:

 

  • Visual. For marketers and business professionals, the ability to visually explore analytics results is critical. Tree diagrams are instantly understandable, as opposed to results tables that require the user to read through thousands of rows.
  • Intuitive. Most analysts and marketers are comfortable using business intelligence tools to understand their data. We use point-and-click interfaces to interact with information every day. But marketers are not comfortable directly manipulating data with SQL or applying advanced statistical models to that data for predictive results. Even predictive results must be returned with a few clicks.
  • Code-free. Your marketers are expert marketers. They shouldn’t need to be expert programmers to understand which customers are on negative paths and which prospects they can help push over the edge to convert.

 

The new Predictive Paths capability in the Teradata Path Analysis Guided Analytics Interface makes this interface a solution to consider.

 

Using the interface, marketers and analysts use a simple form to specify an event of interest – a churn event or conversion event, for example – and whether they want to see paths to or from that event. The interface returns results in the form of several visualizations, including tree, sigma, Sankey and sunburst diagrams, as well as a traditional bar chart.

 

Within the tree diagram, users can select partial paths to their event of interest and create a list of users who have completed that partial path but not yet completed the final event. For example, if you are looking at an online banking data set and see that a path of “fee complaint, to fee reversal, to funds transfer” precedes a large number of churn events, in three clicks you can generate a list of customers who have completed the path “fee complaint, to fee reversal, to funds transfer” but not yet churned. Thus, you have just used Predictive Paths to identify potential churners without writing a line of code.

 


This video demo shows how marketers and business analysts can predict next steps for customers with the Path Analysis Guided Analytics Interface.

 

Watch this short video to see how Predictive Paths works within the Path Analysis interface. If you’re interested in bringing these capabilities to your business teams, please contact your Teradata account executive today.

A slide video from the #TDPARTNERS17 presentation on "Machine Learning vs. Rules-Based Systems". Adjust the YouTube playback-speed controls (1.5x, 0.5x, etc.) to follow along easily.

 

 

The original blog post is here =>Data Science - Machine Learning vs Rules Based Systems

Thanks to everyone who attended my presentation in Teradata Partners Innovation Forum. Here's the video of the deck I presented. I will post some code soon here for you to play with ...

 

I spent last week at the Anaheim Convention Center, helping out with the Aster demo on the Expo floor, participating in the Advanced Analytics session and co-presenting a business session on Sunday, and presenting my own data science session on Tuesday. Below I have assembled all the resources available online about these events.

 

Advanced Analytics Session (Sunday Morning)

Along with my dear friends and fellow data scientists Michael.Riordan@Teradata.com, michelle.tanco@thinkbiganalytics.com, and matt.mazzarell@thinkbiganalytics.com, we presented several use cases covering various applications of analytics on Aster. Mine were a couple of demos:

  • Customer Segmentation Based on Mobile Phone Usage Data Use Case (demo and github)
    Elbow Method for Determining the Number of Clusters with K-Means
  • Churn Use Case Using Survival Analysis and Cox PH (demo and github)
    Proportional Hazard Ratio by Service Plans

Sunday Afternoon Session

John Carlile prepared an excellent session - Rumpelstiltskin Analytics: turning text documents into insight gold - on text analytics use cases I helped him with on one of our POCs, with me as a co-presenter. Please contact John at john.carlile@thinkbiganalytics.com for more details (PDF attached). I would add that the session covered analysis of user reviews of a major hotel operator across many chains, "fake" review detection, and unsupervised and supervised techniques such as LDA and logistic regression.

 

Tuesday Session

My presentation Building Big Data Analytic Pipelines with Teradata Aster and R (morning block) contained two parts:

  • an overview: the value proposition of data science pipelines, their enterprise application focus, main principles and components, and requirements;
  • a primer on building blocks and core technology, with examples on the Aster R platform: covering environment, the grammar of data manipulation, joins, exploratory analysis, PCA, and predictive models (PDF attached, R presentation slides and github).

[Figure: ROC curve and AUC for a predictive model]

 

General session catalog for PARTNERS is available here.

Harnessing an analytical technique known as text clustering, companies in multiple industries can analyze customer call center data to find key word trends and phrases that may quickly alert them to potential customer service problems, manufacturing defects or negative sentiment.

 

Video featuring Karthik.Guruswamy - Principal Consultant & Data Scientist

 


 


Safety Cloud – a transformation of multiple types of text data through analytics, and a visualization leading to significant innovation. Applying natural language processing to these analytical techniques allows for sentiment analysis, giving businesses insight without looking at every document the dots represent.

 

 

____________

Other Works by Karthik

The Star – lines thick and thin, seemingly simple but revealing critical insights and behaviors hidden amongst the data only discovered with analytics.

 

Applying an analytical technique perfect for time-series data, data scientists used Hidden Markov Models to find hidden states.

 

Michelle Tanco, Data Scientist

 

_______

Related Links:

Other Community Content by Michelle Tanco


The explosion of interest in Artificial Intelligence (AI) is triggering widespread curiosity about its importance as a driver of business value. Likewise, Deep Learning, a subset of AI, is bringing new possibilities to light. Can these technologies significantly reduce costs and drive revenue? How can enterprises use AI to enhance customer experiences, create more insight across the supply chain, and refine predictive analytics?

PARTNERS 2017 offers the curious visionary and the creative executive plenty of education on the pragmatic business value of AI.  Here are a few of the sessions on the topic:

Autonomous Decision-Making, ML, & AI: What It Means & Why Should You Care

We’ve entered a new era of analytics, with machine learning and artificial intelligence algorithms beginning to deliver on the long-promised advancement into self-learning systems. Their appetite for vast amounts of data and their ability to derive intelligence from diverse, noisy data allow us to go far beyond the previous capabilities of what used to be called advanced analytics. To succeed, we need to understand both capabilities and limitations – and develop new skills to harness the power of deep learning to create enterprise value. This session focuses on the future of AI, emerging capabilities today, relevant techniques, and the ‘Think Deep’ framework for automating the generation and deployment of models for machine learning and deep learning. Wednesday, October 25, 2:00 PM-3:00 PM.

Fighting Financial Fraud at Danske Bank with Artificial Intelligence

Fraud in banking is an arms race, with criminals using machine learning to improve the effectiveness of their attacks. Danske Bank is fighting back with deep learning – and innovating with AI – to curb fraud in banking. The session spans topics such as model effectiveness, real-time integration, TensorFlow vs. boosted decision tree predictive models, operational considerations in training and deploying models, and lessons learned. Monday, October 23, 11:30 AM-12:15 PM.

Artificial Intelligence: What’s Possible For Enterprises Today

The sci-fi notion of AI – pure AI – is still a long way off. However, pragmatic AI technology is here today, and enterprises are using AI building-block technologies such as machine learning to achieve amazing business results. In this session, Forrester Research VP & Principal Analyst Mike Gualtieri will demystify AI and explain which enterprise use cases are possible today and how to get started. Tuesday, October 24, 9:00 AM-9:45 AM. Presenter: Mike Gualtieri, Principal Analyst, Forrester Research.

Artificial Intelligence and the Teradata Unified Data Architecture (UDA)

Artificial Intelligence has entered a renaissance. Underlying this progress is Deep Learning – driven by significant improvements in Graphics Processing Units and by computational models inspired by the human brain that excel at capturing structures hidden in massive datasets. Learn how AI is impacting enterprise analytics today in applications like fraud detection, mobile personalization and predicting failures for IoT. The session focuses on ways to leverage and extend the Teradata Unified Data Architecture today – and a new AI reference architecture – to produce business benefits. Monday, October 23, 2:00 PM-3:00 PM.

Employing Deep Neural Nets for Recognition of Handwritten Check Payee Text

The handwritten check is a primary linchpin of the customer relationship at Wells Fargo, and it represents an enormous personnel cost when the bank attempts to resolve the payee field and other transaction information in handwritten form. Currently, Automated Teller Machines (ATMs) operated by Wells Fargo can recognize monetary amounts (numerical digits) on checks using neural networks trained on a standard handwritten-numeral dataset. This session details the latest image recognition and deep learning techniques for extending recognition capability to the payee field, and a new capability to deploy deep neural networks with Aster and TensorFlow through a SQL interface. Tuesday, October 24, 11:30 AM-12:15 PM. Presenters: Gary Class, Wells Fargo, and Kyle Grove, Senior Data Scientist, Teradata.

Dig in Deep into a Data Fabric Implementation Using Teradata and SAS

Banco Itau-Unibanco S.A. is one of the largest banks in Latin America and a global Top 50 bank by market cap. It operates in the retail, wholesale, private and investment banking, private equity, asset management, insurance and credit card businesses. The session will outline a new data fabric platform based on Teradata and SAS integration, which brought new capabilities to the credit risk analysts in terms of the amount and complexity of data that can be used in their models. With this platform the risk teams are able to manipulate, in a dynamic and productive way, different sources of data, higher volumes (about 30 times more) and new algorithms (e.g. neural networks) to improve model performance. The results are amazing and will be shared in detail. Wednesday, October 25, 10:30 AM-11:15 AM. Presenters: Dalmer Sella, Data Engineer, Itau, and Fabiano Yasuda, Credit Modeling Manager, Itaú-Unibanco S.A.

Please be sure to check out the Session Catalog for more, and try to register early to join the “Meet-Up” sessions!

 

Original Blog Post: AI and Deep Learning Session Highlights for PARTNERS 2017 - Data Points 

Data Science can be an adventure in every possible way - just ask any employee who has been tasked with solving data science problems. Did you know that the whole zen of data science thrives on trial and error AND a culture of failing fast?

If you talk to people in a company's analytics department, you are going to find people who approach business problems in two ways:

  1. I will try stuff that I know and can probably produce a visual. I've done it many times before, so it should work. However, I know I could try this new stuff, but I'm not sure what will come out of it. So I'd skip this crazy idea. Just by looking at the data I can tell the visual insight will suck or I will fail miserably.
  2. I wouldn't know how the visual will look when I try this. I'm willing to go for it anyway! I want to see what happens and don't want to guess. The data doesn't look that interesting to me at the outset, but I'm willing to create a visual just for the fun of it, even if it comes out boring. What if it's something useful?

Approach #2 is what makes up the trial/error and fail-fast culture. Imagine putting a super-fast iterative tool (Teradata Aster Analytics or Apache Spark) in the hands of the person who practices approach #2 above!

A trial/error and fail-fast culture doesn't mean data scientists are unwilling to try time-tested methods. It only means they are willing to take a lot of 'quickfire' risks for better results!

With just a bit of luck, it's off to the next iteration of failing fast and continuing to build!

A bit more on Trial/Error and Fail Fast. What exactly is it?

The trial/error and fail-fast approach is trying things with very little expectation about the outcome. Of course, the business outcome should be the eventual goal. We are willing to try something quickly and expect not to be rewarded immediately. We also don't give up just because we failed to get an interesting outcome the first time. 9 out of 10 times we are fumbling, but we're willing to keep at it until we get lucky once - which often proves the most valuable and actually works. Most successful data scientists will choose a fail-fast tool for their trial-and-error pursuits. The more we allow ourselves to fail quickly, the sooner we are going to stumble into something incredibly useful.


Causality Detection vs Quantitative Modeling

From a 10,000-foot point of view, most data science problems have two aspects to them:

  • Causality Detection - find the root cause of the problem or situation.
  • Quantitative Modeling - try to predict an outcome after learning from a bunch of data. You don't need to know the cause of the problem for prediction, just a model built with different variables. Done correctly, algorithms take care of mapping inputs to the outcome and will do the prediction robotically.

Both of the above require a bit of creativity. Causality Detection is by far the harder of the two - perhaps 100 times harder - as it requires a lot of domain knowledge and some cleverness. It's great to know that I can predict a part failure 8 out of 10 times, but knowing why and getting to the root cause is a completely different animal. You can get away with not being a domain expert in Quantitative Modeling. With Causality Detection, only a domain expert can say definitively that A leads to B.

Applying the trial/error and fail-fast approach to Quantitative Modeling means iterating through different algorithms, model parameters, features in the data, and new sources until we reach our accuracy goal *REALLY QUICKLY*. There is a systematic method to some of the techniques now, but it still requires TRYING many things before something works.

Causality Detection, as mentioned earlier, is a bit different. We can try and fail fast on a few A/B testing approaches, but it requires careful navigation through multiple steps, with each step taken ever so carefully and surely. Causality Detection is about eliminating uncertainty as we get really close to the root cause.

Working in an uncertain environment

When faced with unknown situations or problems, most folks want a cookie-cutter approach - unfortunately, data science brings a lot of uncertainty to the table. Even with techniques like Deep Learning, which work out of the box with a random starting configuration, getting to the next level often proves challenging and tricky. As architectures become more complex, the science often depends on the trial/error art form, relying on the creative data scientist's efforts in addition to best practices developed over time.

Hope you enjoyed the blog post.

Using an agile approach, a cross-functional team of doctors, cancer researchers, data scientists, data visualization experts, and technologists set out on a mission to understand over 1,000 genetic patterns of cancer in order to develop personalized medical treatments aligned to the genetic makeup of humans.

 

Decoding the human genome is the next frontier in science and discovery in medicine. Today, the combination of data, analytics, and visualization tools is cutting-edge innovation in life sciences. View the video below.

 

Genome World Window - Stephen Brobst and Andrew Cardno

Art of Analytics - Genome World Window - YouTube 

 

________________

Related Links:

Data Scientist

Data Visualization

Combining the collaborative expertise of data scientists, geophysicists, and data visualization experts, an integrated oil company developed new understandings of complex reservoir management with data and analytics. This business case easily extends to other industries focused on asset utilization and optimization.

 

The Sailor - Duncan Irving 

Art of Analytics: The Sailor - YouTube 

 

______

Related Links

Data Visualization

Fusing business acumen, data science, and creative visualization, the Burning Leaf of Spending enabled a major bank to detect anomalies in customer spending patterns that indicate major life events, and provided artful insights into the personalized service required to enhance the customer experience, improving lifetime value.

 

Burning Leaf of Spending - Tatiana Bokareva 

 

 

________________

Related Link:

Detecting Anomalies