Karthik Guruswamy

Data Science - Advanced Analytics with Machine Learning & AI

Blog Post created by Karthik Guruswamy Moderator on Nov 22, 2017

The blogosphere is full of sound bites, anecdotes, and cliches on Machine Learning/AI and it's important to discern its place in the larger picture of Advanced Analytics/Data Science. Especially for those in business & IT who are estranged from the 'religious experience' of using the different tools available.

What is Advanced Analytics?

© Can Stock Photo / Jag_cz

Machine Learning and AI methods like Deep Learning fall into a larger 'analytic' library of things you can do with your data. Also, the access methods such as SQL and languages like Python, R etc., The best analogy is a bar or kitchen stocked with the most exotic stuff. As a bartender or chef, you get to make the best cocktails or entree, drawing from an assemblage of options. You may have your own favorite, but it's also important to fine tune stuff to an actual need that someone might want. Here's the thing - you may have the most expensive single malt or ingredient in your bar/kitchen and it doesn't mean everyone will want it! So variety is the key to delivering precisely what the business/end user wants!

Some Examples of Advanced Analytics Choices:

I need to find the top 10 key phrases in product or show reviews that tell me characters and powerful emotions. There are several options:

  1. Start NGRAMing with 1,2,3 terms at a time. Run weighting on keywords with TF/IDF. Show the top 10 - 1,2,3 NGRAMs with the highest IDF value. We can do this with some clever SQL :)
  2. Run a CRF model trained on a big corpus from a POS (parts of speech) output, weight it and sort it. You get the benefit of interesting verb and noun phrases (intelligent)
  3. Run a Word2Vec (Deep Learning/GPU) pass on the data and try to construct a neural embedding model to discover the phrases.
  4. <Add your own recipe/cocktail>

I want to group all my call center text into clusters, so I can see what they are talking about:

  1. Do a term weighting and some distance metric like Euclidian or Cosine and run Graph Modularity to chop clusters and run Phrase detection on each of those. Use a percentile technique to decide # of significant clusters.
  2. Run an LSI (Latent Semantic Indexing) dimension reduction and run K-Means. Decide # of clusters after finding the "elbow of significance"
  3. Run an LDA model and specify how many clusters. Find the # of topic clusters iteratively until it makes sense.
  4. <add your own ingredients/mixology>

With some cleverness, some iterative and grouping techniques above avoids mainstream Machine Learning completely and gets to 80% of the answer with simple sophistication. Of course, using advanced ML techniques will increasingly get us to 95% of the answer - especially when it comes to Fraud and other mission-critical "fine-grained" use cases. Let's put that aside for a moment.

For a lot of simple use cases or basic hypothesis testing, 80% answer may directionally "good enough" stuff for business and that's ok! The key is to have many options all the way from simple to complex with well-understood tradeoffs, such as performance, tuning complexity etc.,

How to get everything in one place? See Sri's blog:

Does your advanced analytics platform has what it takes to access the same data and combine all of the analytics and run in-database ? - Check out Sri Raghavan's Blog

Here are some things users can do with above:

  • Come up with creative solutions that use best of the breed. Use SQL for what it's good at and run R in-database to do scoring or modeling using say ARIMA.
  • Or run LSTM deep learning for time series forecast on Tensorflow while using SQL to curate and organize the data in the same transaction.
  • Run large-scale PCA using SQL on Aster and do a logistic regression in a Spark cluster packaged in the platform. Put the results on a table for Tableau users to see the significant variables in a dashboard.
  • Run Aster's XGBoost on a churn analytic data set to create a model. Score users on the propensity to churn etc., Run Aster's Hidden Markov (Graph implementation) on the data set to find the latent transition matrix and emission probabilities.

Hope you enjoyed the blog post on the postmodern definition and sample usage of Advanced Analytics ;). More to come in my future blog posts on the science and art of using advanced analytics -> for business problems.

Outcomes