A path to causal inference

The aim of data analytics is to infer the relationships between variables in a system in order to predict and/or control that system. For example, we may wish to understand the relationship between a stock’s return and its volatility in order to profit from changes in these variables or to reduce risk. If we knew, for instance, that the stock price tended to rise when volatility fell, we could profit by buying the stock at the first sign of a drop in volatility.

Unfortunately, it is often not possible to infer causation directly because we are usually unable to perturb the system we are studying, which is the only way to determine causation unambiguously. Consequently, we often settle for analysing simple correlations, which act as a crude proxy for causation. The danger of this is illustrated by the chart from Spurious Correlations below, which plots the total revenue generated by arcades in the US against the number of computer science doctorates awarded. Whilst there is a 98.5% correlation, there is no plausible causal mechanism.

Source: http://www.tylervigen.com/spurious-correlations

There are alternative approaches to inferring causation, such as developing a mechanistic model, as is done in epidemiology, or building an experiment, as is done in behavioural economics. The first of these requires a good understanding of the underlying process, whilst the latter requires strict controls to exclude external influences while remaining as realistic as possible. Another approach is to consider causation in a statistical sense, as is done with Granger causality.

Granger causality is based upon the premise that a process X strictly Granger causes another process Y if future values of Y can be better predicted using the past values of both X and Y rather than the past values of Y alone. This notion was originally introduced by Wiener (1956) and later formalised in terms of linear auto-regression by Granger (1969). As stated by Barnett et al. (2009), “identifying Granger causality is not identical to identifying a physically instantiated causal interaction in a system; this can only be unambiguously identified by perturbing the system. Instead, it is a causal relation in a statistical sense.”

A major problem with Granger causality is that most real problems, such as the return–volume relation, are nonlinear (Hiemstra and Jones, 1994; Chuang et al., 2009). This led researchers such as Baek and Brock (1992) and Hiemstra and Jones (1994) to develop nonlinear extensions to Granger causality. The Hiemstra and Jones (1994) test is now the most commonly used method among practitioners in finance and economics. Unfortunately, Diks and Panchenko (2005) show that this measure may not actually test Granger causality and identify numerous situations in which the test fails. An alternative approach is to use a truly nonlinear and nonparametric method such as information theory.
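To make the linear version of this idea concrete, here is a minimal sketch of a bivariate Granger test in Python. It is not the procedure used in any of the papers cited above; the variable names, lag order and use of a simple F-test are all illustrative assumptions.

```python
import numpy as np
from scipy import stats

def linear_granger_test(x, y, lags=2):
    """F-test: do past values of y improve a linear prediction of x
    beyond x's own past? A small p-value is (statistical) evidence that
    y Granger-causes x in the linear sense."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    target = x[lags:]
    # Lagged regressors: columns are the series at t-1, ..., t-lags.
    own = np.column_stack([x[lags - k:n - k] for k in range(1, lags + 1)])
    other = np.column_stack([y[lags - k:n - k] for k in range(1, lags + 1)])
    const = np.ones((n - lags, 1))

    def rss(design):
        beta, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ beta
        return resid @ resid

    rss_restricted = rss(np.hstack([const, own]))     # x's own past only
    rss_full = rss(np.hstack([const, own, other]))    # plus y's past
    df1 = lags                                        # number of restrictions
    df2 = (n - lags) - (2 * lags + 1)                 # residual degrees of freedom
    f_stat = ((rss_restricted - rss_full) / df1) / (rss_full / df2)
    p_value = 1.0 - stats.f.cdf(f_stat, df1, df2)
    return f_stat, p_value
```

Calling `linear_granger_test(returns, volumes, lags=5)` and then swapping the arguments gives a crude look at predictive influence in each direction; `returns` and `volumes` are placeholder names for your own aligned series.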

Information theory was originally developed by Shannon (1948) to study problems in signal processing, such as data compression. However, it is now widely used in the physical sciences for problems such as statistical inference, owing to its ability to capture nonlinear statistical dependencies and higher-order moments of a distribution.

Mutual information (MI) is a popular measure in the field of information theory. MI quantifies the reduction in uncertainty about one variable gained by knowing another. For example, one can calculate the reduction in uncertainty about the daily return at time t gained by knowing the daily volume at time t. If there is no reduction in uncertainty, then the daily returns and volumes are statistically independent. Unfortunately, because this measure is symmetric under the exchange of variables, it can only determine whether two variables are related. However, if one wishes to infer a direction one can simply add a time lag to one of the variables; this assumes that causal effects cannot propagate backwards through time. For example, one can find the reduction in uncertainty about the daily return at time t+1 given the daily volume at time t, and vice versa. If there is a reduction in uncertainty in only one direction, or one direction is substantially larger than the other, then one variable must be strongly influencing, or causing, the changes in the other.
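As a rough illustration, the sketch below estimates a lagged mutual information from binned (histogram) probabilities. This is a crude estimator chosen for clarity rather than the one used in the paper, and the bin count, lag and variable names are all arbitrary assumptions.

```python
import numpy as np

def lagged_mutual_information(x, y, lag=1, bins=10):
    """Histogram estimate of I(x_{t+lag}; y_t) in nats.

    Comparing this with the value obtained after swapping x and y gives a
    crude, direction-aware look at which series reduces more uncertainty
    about the other's future.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    future_x, past_y = x[lag:], y[:-lag]
    joint, _, _ = np.histogram2d(future_x, past_y, bins=bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal of x_{t+lag}
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal of y_t
    nonzero = p_xy > 0
    return float(np.sum(p_xy[nonzero] * np.log(p_xy[nonzero] / (p_x @ p_y)[nonzero])))
```

For example, comparing `lagged_mutual_information(returns, volumes)` with `lagged_mutual_information(volumes, returns)` asks whether today's volume tells us more about tomorrow's return than today's return tells us about tomorrow's volume; the series names here are placeholders.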

One can instead use an asymmetric measure such as Transfer Entropy (TE) (Schreiber, 2000), an information-theoretic measure of time-directed information transfer between jointly dependent processes. Barnett et al. (2009) note that TE is framed not in terms of prediction but in terms of the resolution of uncertainty: the TE from Y to X is the degree to which Y disambiguates the future of X beyond the degree to which X already disambiguates its own future. This parallels the notion of Granger causality. In fact, Barnett et al. (2009) show that TE is equivalent to Granger causality for Gaussian-distributed variables, and Hlaváčková-Schindler (2011) extended this result to variables following exponential Weinman, log-normal and certain parametrisations of generalised Gaussian distributions.
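For completeness, here is a minimal binned sketch of transfer entropy at lag one: the extra information y_t carries about x_{t+1} once x_t is already known. The discretisation, bin count and single lag are simplifying assumptions made for illustration, not the estimator used in the paper.

```python
import numpy as np

def transfer_entropy(x, y, bins=8):
    """Binned estimate (in nats) of transfer entropy from y to x at lag one:
    how much y_t reduces uncertainty about x_{t+1} beyond x_t alone."""
    def discretise(z):
        # Equal-width bins labelled 0 .. bins-1.
        edges = np.histogram_bin_edges(z, bins=bins)
        return np.digitize(z, edges[1:-1])

    xd, yd = discretise(np.asarray(x, float)), discretise(np.asarray(y, float))
    x_next, x_now, y_now = xd[1:], xd[:-1], yd[:-1]

    # Joint probabilities over (x_{t+1}, x_t, y_t).
    counts = np.zeros((bins, bins, bins))
    for a, b, c in zip(x_next, x_now, y_now):
        counts[a, b, c] += 1
    p_xyz = counts / counts.sum()

    p_xx = p_xyz.sum(axis=2)        # p(x_{t+1}, x_t)
    p_xy = p_xyz.sum(axis=0)        # p(x_t, y_t)
    p_x = p_xyz.sum(axis=(0, 2))    # p(x_t)

    te = 0.0
    for a in range(bins):
        for b in range(bins):
            for c in range(bins):
                if p_xyz[a, b, c] > 0:
                    te += p_xyz[a, b, c] * np.log(
                        p_xyz[a, b, c] * p_x[b] / (p_xx[a, b] * p_xy[b, c])
                    )
    return te
```

An asymmetry between `transfer_entropy(returns, volumes)` and `transfer_entropy(volumes, returns)` is then the directional signal of interest; in practice one would also need a significance test, for example by shuffling one of the series.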

This article is based upon a paper I recently published in Applied Economics entitled “An information theoretic analysis of stock returns, volatility and trading volumes” (Ong, 2015). In that paper I used information theory to show that the observed negative correlation between a stock’s returns and its volatility (known as the Leverage Effect; Black, 1976) is driven by trading volumes, which supports previous research by Avramov et al. (2006). This is important for trading and risk management purposes and supports a behavioural explanation of the Leverage Effect.

References

Avramov, D., Chordia, T. and Goyal, A. (2006) The impact of trades on daily volatility, Review of Financial Studies, 19, 1241–77

Baek, E. and Brock, W. (1992) A general test for nonlinear Granger causality: bivariate model, Working Paper, Iowa State University and University of Wisconsin at Madison

Barnett, L., Barrett, A. B. and Seth, A. K. (2009) Granger causality and transfer entropy are equivalent for Gaussian variables, Physical Review Letters, 103, 238701

Black, F. (1976) Studies in stock market volatility changes, in Proceedings of the 1976 Meeting of the Business and Economics Statistics Section, American Statistical Association, Alexandria, VA, pp. 177–81

Chuang, C.-C., Kuan, C.-M. and Lin, H.-Y. (2009) Causality in quantiles and dynamic stock return-volume relations, Journal of Banking & Finance, 33, 1351–60

Diks, C. and Panchenko, V. (2005) A note on the Hiemstra-Jones test for Granger non-causality, Studies in Nonlinear Dynamics and Econometrics, 9, 1558–3708

Hiemstra, C. and Jones, J. D. (1994) Testing for linear and nonlinear Granger causality in the stock price-volume relation, The Journal of Finance, 49, 1639–64

Hlaváčková-Schindler, K. (2011) Equivalence of Granger causality and transfer entropy: a generalization, Applied Mathematical Sciences, 5, 3637–48

Ong, M. (2015) An information theoretic analysis of stock returns, volatility and trading volumes, Applied Economics, 47, 3891–906

Schreiber, T. (2000) Measuring information transfer, Physical Review Letters, 85, 461–4

Shannon, C. E. (1948) A mathematical theory of communication, Bell System Technical Journal, 27, 379–423

Complexity Science – the hard thing about data science

From my perspective there are two main challenges in data science: the Big Data problem and the complex system problem. The Big Data problem refers to how we store and process large quantities of data. At its simplest, this requires developing faster and more efficient methods of storage and computation. This has been the main focus of industry and one in which we have made great strides; for example, we can now sequence a human genome in just a few hours. In this article we will focus on the complex system problem, whose solutions are still in their infancy and which provides the greatest challenge to data science. (I say this because, whilst more information is generally better, it is the quality of information rather than the quantity that matters to a data scientist.)

The complex system problem refers to how we analyse data from a system with non-linear interactions and feedback loops, which characterises most real-world systems. It is important to distinguish between a complex system and a complicated system: just because a system is complicated does not mean it is complex. A complicated system may have many elements, but if they interact in a linear manner (without feedback loops) it is, in general, straightforward to break down and analyse.

Many naive approaches to data science ignore the distinction between complicated and complex systems. They simply apply linear techniques, such as regression analysis, to problems that are inherently non-linear. This is extremely dangerous as it can lead to erroneous conclusions. For example, we can accurately measure all of the variables that drive the weather, such as temperature, humidity, rainfall, air pressure and wind speed, yet we find it very difficult to predict the weather beyond a few days. This is because the weather is a chaotic system: a small change in one part of the system can grow to have a significant impact on the whole. This is the “butterfly effect”, in which a butterfly flapping its wings in South America can cause a hurricane in the USA. Another example is the unpredictability of the economy and financial systems. We understand all of the different parts of the system, but as it evolves they often interact in unpredictable ways, and the problem is made even harder because we cannot accurately measure the state of the system at any one time.
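The butterfly effect is easy to demonstrate on a toy chaotic system. The sketch below iterates the logistic map (a standard textbook example, not a weather model) from two starting points that differ by one part in ten billion; within a few dozen steps the trajectories bear no resemblance to one another.

```python
# Logistic map x_{n+1} = r * x_n * (1 - x_n) in its chaotic regime (r = 4).
# Two trajectories that start almost identically diverge completely.
r = 4.0
a, b = 0.3, 0.3 + 1e-10

for step in range(1, 61):
    a = r * a * (1 - a)
    b = r * b * (1 - b)
    if step % 10 == 0:
        print(f"step {step:2d}: |difference| = {abs(a - b):.2e}")
```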

So what can we do about this? There is in fact an active area of academic research focused on complex systems, known as Complexity Science. The history of Complexity Science is shown in Figure 1. We can see that the field has existed in formal terms since the mid-1960s, largely evolving from Systems Theory and Cybernetics. The figure also shows many of the related fields of research and how they have evolved.

Figure 1: History of Complexity Science, by Brian Castellani

The field is dominated by mathematicians and physicists but has lately begun to attract economists and social scientists as they search for more sophisticated analytical tools. Some of the common approaches associated with complex systems analysis include:

1) Statistical Mechanics – this is a branch of physics that uses probability theory to study the average behaviour of a mechanical system whose exact state is unknown. This is important in many real-world applications where we are unable to accurately record or observe every part of a system.

2) Information Theory – this was originally developed by Shannon (1948) to study problems in signal processing, such as data compression. However, it is now widely used in the physical sciences for problems such as statistical inference due to its ability to analyse non-linear statistical dependencies.

3) Non-linear Dynamics – this analyses how a system changes under varying conditions and over time. This is important as it can help determine the stability of a system and ascertain the critical limits which would lead to instability.

4) Network Theory (aka Graph Theory) – this examines how different objects are connected to one another and allows us to examine how these relationships evolve over time and the effect on information travelling across the network.

5) Agent Based Modelling – this is a computational method used to analyse how autonomous agents interact with one another and the effect of these interactions on the overall system. By using simple interaction rules it is possible to observe very complex behaviour, which can provide powerful insights into a system. ABMs are related to Multi-Agent Systems (MAS).

6) Stochastic Processes (or random processes) – these describe how a collection of random variables evolves over time and are often used for prediction and to quantify potential risks; a short simulation sketch of this last approach follows the list.
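As a small, concrete illustration of the last item, the sketch below simulates geometric Brownian motion, a standard stochastic-process model for asset prices, and reads off a simple risk measure from the simulated distribution. The drift, volatility, horizon and path count are arbitrary illustrative values.

```python
import numpy as np

def simulate_gbm(s0=100.0, mu=0.05, sigma=0.2, days=252, n_paths=10_000, seed=0):
    """Monte Carlo paths of geometric Brownian motion (daily steps over one year)."""
    rng = np.random.default_rng(seed)
    dt = 1.0 / days
    shocks = rng.normal(0.0, 1.0, size=(n_paths, days))
    log_returns = (mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * shocks
    return s0 * np.exp(np.cumsum(log_returns, axis=1))

paths = simulate_gbm()
final = paths[:, -1]
# A simple risk summary: the 5th percentile of the simulated final prices.
print(f"median final price: {np.median(final):.2f}")
print(f"5% quantile:        {np.percentile(final, 5):.2f}")
```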

It is also important to note that all of these methods must be used in conjunction with sophisticated statistics in order to determine their efficacy and statistical significance.

The above methods generally require some understanding of the system. Another approach is to use machine learning techniques such as Genetic Algorithms, Deep Learning Networks, Support Vector Machines and Gaussian Processes. These provide powerful ways of uncovering hidden relationships within data, including non-linear relationships. However, one must be very careful when using such ‘black box’ methods because they do not provide an understanding of why things happen or of whether they may change in the future.

Determining which method to use depends upon the problem you are analysing. Often it can be useful to apply several of them, as each can provide different information about a complex system. Hence it is important that any good data scientist has at least some familiarity with all of these approaches, and many others; this is not an exhaustive list.

Some readers may say “my problem is very simple so I don’t need complexity science”. Whilst it is true that some non-linear systems can be approximated by linear systems for small perturbations (linearisation), it is important to know what constitutes a small perturbation. When something significant occurs, such approximations can easily break down and the consequences can be disastrous, e.g. the financial crisis.

There are other challenges in data science, such as how to capture data, how to correct for missing and incorrect data, and natural language processing. Whilst these are significant challenges, I feel they are minor in comparison.

5 Challenges to Standardised Data Analytics Platforms

The data analytics industry is growing fast, with large established players vying for market share with exciting new start-up companies. With such a range of services available it is very difficult for businesses to know the best approach. Should they use a standardised data analytics platform? If so, which one? Or should they build their own “in-house” analytics team, or hire a consultant and develop a bespoke system?

These are difficult questions to answer and will generally depend upon the business. In this post I will outline a few potential risks of using standardised data analytics platforms and what, if anything, can be done to avoid them. Standardised platforms are arguably the largest area of growth in this sector: many are now being bundled with cloud computing services from providers such as Amazon and Microsoft. IBM’s ‘Watson’ is becoming the most renowned of these platforms, having won the quiz show ‘Jeopardy!’ (IBM’s earlier Deep Blue famously defeated chess grandmaster Garry Kasparov); it has since successfully turned its attention to medical services and is now being applied to legal services.


As an analytics consultant, I am clearly biased towards the benefits of consultancy. However, I don’t believe that any of the following points are particularly controversial.

1) Data Entry

Most standardised systems will struggle to cope with missing or incorrect data. Unless pre-specified, many will simply ignore missing data and assume all other data is correct. This is an important issue because it can bias your results, leading to erroneous conclusions. Dealing with missing data is a common challenge in statistics and there are sophisticated methods available; however, choosing the correct approach depends upon the problem and cannot be offered in a standardised platform. Beyond a statistical approach one could also look to enrich and validate the data using external data sources, but this too would require a tailored solution.
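As a toy illustration of why the choice matters, the sketch below contrasts two of the simplest strategies, listwise deletion and mean imputation, on a made-up table; the column names and values are invented, and real applications would usually call for more sophisticated, problem-specific methods (e.g. model-based or multiple imputation).

```python
import numpy as np
import pandas as pd

# Toy data with missing values; the columns and numbers are illustrative only.
df = pd.DataFrame({
    "daily_return": [0.01, -0.02, np.nan, 0.03, -0.01],
    "daily_volume": [1.2e6, 0.9e6, 1.1e6, np.nan, 1.0e6],
})

# Option 1: listwise deletion -- silently shrinks (and can bias) the sample.
dropped = df.dropna()

# Option 2: simple mean imputation -- keeps every row but understates variance.
imputed = df.fillna(df.mean())

print(f"rows: original {len(df)}, after deletion {len(dropped)}, after imputation {len(imputed)}")
```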

2) Statistical Inference

In order to conduct statistical inference it is imperative that you understand the problem and the data. For example, if your sample data are not representative of the entire population or the target group, then your results will be biased. This can also occur if you have only a small sample, or one that does not include significant events. In this situation one may wish to use Bayesian statistics, which can incorporate expert knowledge of the problem; unfortunately, it is not possible to employ expert knowledge in a standardised approach. Another risk arises when using machine learning for prediction: these methods are excellent at modelling what has happened but are often very poor at predicting regime change. These problems can only really be solved on a case-by-case basis by statisticians and/or data scientists.
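To see how expert knowledge can stabilise inference from a small sample, here is a minimal Bayesian sketch using a beta-binomial model; the trade counts and the prior parameters are invented purely for illustration.

```python
from scipy import stats

# Suppose only 10 trades have been observed, of which 7 were profitable (made-up numbers).
successes, trials = 7, 10

# Point estimate from the small sample alone.
sample_estimate = successes / trials

# Informative Beta(20, 20) prior encoding an expert belief that the true rate is near 0.5.
posterior = stats.beta(20 + successes, 20 + trials - successes)

print(f"sample estimate:       {sample_estimate:.2f}")
print(f"posterior mean:        {posterior.mean():.2f}")
print(f"95% credible interval: {posterior.interval(0.95)}")
```

The posterior mean here is 0.54, sitting much closer to the prior belief than the raw 0.70, which reflects how little ten observations should move a well-informed expert.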

3) Interpretation

This leads back to the previous point: it is important to understand the problem, the data and the method of statistical inference, because this determines what questions you can ask of the data and under what conditions your inferences are valid, and it prevents the business from acting on erroneous results. Again, this can only be solved on a case-by-case basis by statisticians and/or data scientists.

4) Functionality

Data analytics can be applied to a wide range of business functions, and if you wish to develop a data-driven organisation it is vital to do this and to integrate your approach across them. However, it is unlikely that standardised platforms will have all the required functionality. To add functionality, the best approach is to select a platform that allows third-party add-ons. Unfortunately, this requires users to pay additional fees, and the add-ons may still not be suitable or ideal.

5) Competitor differentiation

It is natural that the more common standardised platforms become, the fewer opportunities there will be for competitor differentiation. Once the benefits of standardised techniques – of which there are many – are exhausted, businesses will have to start tailoring their systems to remain competitive.

Standardised platforms can certainly provide big benefits to businesses in a short period of time: reducing costs, improving efficiency and increasing sales. They may also be enhanced with third-party add-ons to tailor a system and improve functionality. However, it is not clear that they provide a cheap alternative to hiring an “in-house” analytics team or employing a data analytics consultant, because they are simply tools and, as such, require qualified data scientists to mitigate the risks outlined above. Consequently, you must build a de facto “in-house” analytics team, which is expensive.

Proponents of standardised platforms would likely argue that these systems are: i) a great low-cost way for businesses to start developing data analytics capabilities; ii) robust and well tested, with great support; iii) quick to implement; and iv) likely to improve over time. I agree with all of these points. Finally, I am genuinely excited to have the opportunity to experiment with IBM’s Watson, as it is likely to have the most sophisticated natural language processing available, and it would be great for people to be able to embed this in their own applications.

Science is Vital

Please support the Science is Vital campaign to raise the science budget from less than 0.5% of GDP to 0.8% of GDP, the G8 average. Science and engineering form a vital part of our economy and provide huge societal benefits. If we are to remain world leaders in these areas, it is vital that we develop a long-term, stable policy framework and fund our research institutions to a level that allows them to remain internationally competitive.

Please write to your local politician and parliamentary candidates asking them to support the Science is Vital campaign.

You can find out more details and information on the Science is Vital website: