Steve Noynaert

Bruce Willis and exploding boilers

Blog Post created by Steve Noynaert on Apr 19, 2017

Some time ago I came across "Spurious Correlations", an interesting book by Tyler Vigen. This book is full of unusual and nonsensical examples where a strong (Pearson) correlation is showcased between unexpected variable combinations.

 

Blind trust in the correlation between Microsoft revenue and political action committees would tell you that Bill Gates has been controlling Congress for several years. Who knew that the number of UK citizens immigrating to the US has an 87% correlation with US uranium exports? The marriage rate in Wyoming apparently has a 97.6% correlation with the number of domestically produced passenger cars sold in the US. Plenty of other absurd examples can be created by running automated comparisons between unrelated data sets. (Tip: Aster can do this with a single SQL statement.)

 

Many papers have been written about the Theory of the Stork, supported by various data sets that show a significant correlation between the number of births and the size of the stork population in a specific area such as Northern Europe.

 


The storks typically fly south for the winter and return north in early spring. Babies born in March-April were typically conceived in June of the previous year. Midsummer celebrations ("solstice") take a different form depending on culture and location, but a common thread is a focus on fertility, family and a new beginning. This explains why so many weddings are scheduled in June and why there are so many kids' birthday parties to attend in spring and early summer! Storks and births therefore peak at the same time of year without one causing the other. In this historic example the seasonal weather acts as a hidden variable and results in a non-causal correlation.

 

So what happened to the story behind our headline? First we build a database table with the number of movies that Bruce Willis starred in, according to imdb.com, and add the boiler-related fatal accident data available on wonder.cdc.gov.

 

Next we check the correlation between our two variables:

 

Correlation is symmetric: if A is correlated with B, then B is equally correlated with A. Causality is directional (A causes B and B does not cause A), which makes it much more interesting and useful.
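The symmetry is easy to verify numerically. The following is a minimal sketch in Python, not the Aster query used for this post. The helper name pearson and the yearly counts are made up for illustration and are not the data from imdb.com or wonder.cdc.gov.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

# Hypothetical yearly counts, invented for illustration only.
movies = [2, 3, 1, 4, 3, 5]
accidents = [1, 2, 1, 3, 2, 4]

r_ab = pearson(movies, accidents)
r_ba = pearson(accidents, movies)
print(r_ab, r_ba)  # both directions give the same coefficient
```

Swapping the arguments changes nothing, so the coefficient alone cannot say which variable drives the other.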

 

We can test for causality with a statistical method developed by George Sugihara of the Scripps Institution of Oceanography. Convergent Cross Mapping (CCM) tests the cause-and-effect relationship between two time series variables.

 

Takens' theorem is the delay embedding theorem by Floris Takens. In the study of dynamical systems, a delay embedding theorem specifies how a chaotic dynamical system can be reconstructed from lagged (time-delayed) copies of a sequence of observations.
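To make the idea of a delay embedding concrete, here is a small Python sketch. The function name delay_embed and its parameters are hypothetical and chosen for this illustration. Each observation is turned into a vector made of the current value and its lagged predecessors.

```python
def delay_embed(series, dimensions, lag=1):
    """Return the lagged vectors (x[t], x[t-lag], ..., x[t-(dimensions-1)*lag]).

    One vector is produced for every time index t that has enough history.
    """
    start = (dimensions - 1) * lag
    return [tuple(series[t - k * lag] for k in range(dimensions))
            for t in range(start, len(series))]

observations = [1, 2, 3, 4, 5]
print(delay_embed(observations, 3))
# [(3, 2, 1), (4, 3, 2), (5, 4, 3)]
```

With three dimensions and a lag of one, the five observations yield three points in a three-dimensional space. The set of these points is the reconstruction that the theorem refers to.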

 

CCM leverages this approach to determine causality between two time series. If variable C is the cause of variable E, then information about C is also encoded in the time series E, so historical observations from E can be used to estimate the state of C.

 

Why is the algorithm called Convergent Cross Mapping?

  • CCM uses the concept of cross mapping: the process of using the historical record from one series to predict variable states in another series.
  • CCM uses the property of convergence. The first step of the algorithm is to choose a library of short time series from the effect variable. If the cross-mapped estimates become more accurate as the library size increases (from 3 to 10 observations, for example), this convergence is taken as evidence of causation rather than mere correlation.

 

Basic Steps:

  1. A library of short time series is constructed from time series E. This is called a "shadow manifold".
  2. The library is used to predict values of the cause variable using a k-nearest neighbors approach.
  3. The correlation between the predictions of the cause time series and the actual values in time series C is computed.
  4. The size of the library is increased to check convergence and determine if there is a causal relationship.
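The four steps can be sketched in plain Python. This is an illustrative toy, not the Aster implementation. All function names are hypothetical, the exponential distance weighting is one common choice for the nearest-neighbor estimate, and the test data come from a pair of coupled logistic maps whose constants were picked for this example, with x driving y.

```python
from math import exp, sqrt

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def delay_embed(series, dimensions):
    """Lagged vectors (x[t], x[t-1], ..., x[t-dimensions+1]) for every usable t."""
    return [tuple(series[t - k] for k in range(dimensions))
            for t in range(dimensions - 1, len(series))]

def cross_map_skill(cause, effect, dimensions, library_size):
    """Estimate `cause` from the shadow manifold of `effect`; return the correlation."""
    # Step 1: shadow manifold built from the effect series, cut to the library size.
    library = delay_embed(effect, dimensions)[:library_size]
    offset = dimensions - 1  # library[i] describes time index i + offset
    predicted, actual = [], []
    for i, point in enumerate(library):
        # Step 2: nearest neighbors among the other library points.
        neighbours = sorted(
            (sqrt(sum((p - q) ** 2 for p, q in zip(point, other))), j)
            for j, other in enumerate(library) if j != i
        )[:dimensions + 1]
        closest = max(neighbours[0][0], 1e-12)  # guard against a zero distance
        weights = [exp(-dist / closest) for dist, _ in neighbours]
        estimate = sum(w * cause[j + offset]
                       for w, (_, j) in zip(weights, neighbours)) / sum(weights)
        predicted.append(estimate)
        actual.append(cause[i + offset])
    # Step 3: correlation between the cross-mapped estimates and the real values.
    return pearson(predicted, actual)

def coupled_logistic(n, x=0.4, y=0.2):
    """Toy system (constants are assumptions): x evolves alone, y is driven by x."""
    xs, ys = [], []
    for _ in range(n):
        xs.append(x)
        ys.append(y)
        x, y = x * (3.8 - 3.8 * x), y * (3.5 - 3.5 * y - 0.1 * x)
    return xs, ys

xs, ys = coupled_logistic(300)
# Step 4: repeat with growing library sizes and watch how the skill develops.
for size in (10, 50, 200):
    print(size, round(cross_map_skill(xs, ys, 2, size), 3))
```

If x really drives y, the printed skill would be expected to rise as the library grows. A skill that stays flat would argue against a causal link in that direction.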

 

We can use the Aster CCM function to determine the optimal value for "EmbeddingDimensions", the number of lags or past values that we will use to predict a given value in the time series. 

 

Note there are a number of requirements to allow the function to correctly determine the optimal lags:

  1. the cause and effect columns are set to the same column
  2. the SelfPredict argument is set to true
  3. the LibrarySize argument is not specified
  4. only a single cause column and a single effect column are specified

 

We can specify one or more EmbeddingDimensions. If we omit the parameter the function will default to using two lags.
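The post does not describe how the function scores each candidate internally. As an assumption, the sketch below shows one generic way such a search can work: for every candidate number of lags, the series predicts its own next value from its nearest neighbors, and the candidate with the best forecast correlation wins. The names forecast_skill and best_embedding_dimension are hypothetical, and the logistic map is only a stand-in test series.

```python
from math import exp, sqrt

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def forecast_skill(series, dimensions):
    """Leave-one-out skill of predicting x[t+1] from (x[t], ..., x[t-dimensions+1])."""
    times = range(dimensions - 1, len(series) - 1)  # keep room for the t+1 target
    points = [tuple(series[t - k] for k in range(dimensions)) for t in times]
    targets = [series[t + 1] for t in times]
    predicted = []
    for i, point in enumerate(points):
        neighbours = sorted(
            (sqrt(sum((p - q) ** 2 for p, q in zip(point, other))), j)
            for j, other in enumerate(points) if j != i
        )[:dimensions + 1]
        closest = max(neighbours[0][0], 1e-12)  # guard against a zero distance
        weights = [exp(-dist / closest) for dist, _ in neighbours]
        predicted.append(sum(w * targets[j]
                             for w, (_, j) in zip(weights, neighbours)) / sum(weights))
    return pearson(predicted, targets)

def best_embedding_dimension(series, candidates):
    """Return the candidate number of lags with the highest self-prediction skill."""
    return max(candidates, key=lambda e: forecast_skill(series, e))

# Stand-in test series: a logistic map, chosen only for illustration.
series, x = [], 0.4
for _ in range(200):
    series.append(x)
    x = 3.8 * x * (1.0 - x)

for e in (1, 2, 3):
    print(e, round(forecast_skill(series, e), 3))
print("best:", best_embedding_dimension(series, (1, 2, 3)))
```

Passing several candidates mirrors specifying more than one value for EmbeddingDimensions and letting the result decide.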

 

 

Result: the optimum number of lags is 2.

 

Now that we know the proper value for EmbeddingDimensions we can execute our CCM function against our input data. It is not required to specify the LibrarySize parameter. By default the function will try libraries of size "embedding dimension + 1" and "100" (assuming we have that many observations).

 

Result:

 

In the output the two columns of interest are:

  • correlation: the correlation between the values predicted from the effect attribute and the actual values of the cause attribute.
  • effect_size: the estimated effect size of increasing the library size from the smallest value to the largest value. An effect_size greater than approximately 0.25 indicates a causal relationship.

 

Conclusion:

The effect size is greater than 0.25 for both cause-and-effect directions. This indicates that there is causality between our two time series. The number of Bruce Willis movies is a more important cause of exploding boilers than the other way around, given the greater effect size (0.59 > 0.27).

 

Of course we have to keep in mind that intuition has to play a big role and we cannot blindly trust statistics. Our starting point was a spurious correlation!

 

 

Now that we have gone through a simple exercise to get a feel for the CCM function let us review a second data set that contains monthly sales of bathing suits and the average high temperatures for the Texas region.

 

First we check the correlation between sales and temperature.  

 

Next we determine the ideal number of EmbeddingDimensions:

 

And we execute CCM with 3 EmbeddingDimensions:

 

 

We find that sales do not cause higher temperatures: the effect size is only 0.011.

The temperature does cause higher sales of bathing suits: the effect size is 0.39, which indicates convergence and causality.

 

Try it out for yourself.  I always wondered about correlation and causality of the cost of bananas and the revenue generated by ski areas in the USA. 

 

 

 

 

 

 

References:

 

Tyler Vigen, Spurious Correlations.

George Sugihara et al., "Detecting Causality in Complex Ecosystems", Science 338, 496 (2012). DOI: 10.1126/science.1227079

Amir E. BozorgMagham and Shane D. Ross, "Dynamical system tools and Causality analysis", Engineering Science and Mechanics (ESM), Virginia Tech.
