One of our data scientists, Rob Eyre, recently published a paper in Emerging Themes in Epidemiology on modelling fertility in a poor rural region of South Africa using an innovative non-linear approach (the full paper can be found here).

A common issue throughout much of quantitative Public Health research is the application of a range of standardised statistical methods even when such methods are not appropriate. Such standard methods often assume the relationships being modelled to be linear, despite this assumption often being unjustified. One such area where this is the case is in the modelling of how fertility changes over different socio-economic characteristics such as age, education, and social status.

A core aspect of the work we do here at Spectra Analytics involves using more modern, sophisticated, and well-thought-out methods that provide better results to our clients. In line with this, Rob’s research used an innovative combination of a non-linear parametric model of fertility over age, with the use of the highly flexible semi-parametric machine learning method of Gaussian process regression to bring in further variables such as socio-economic status for which no established fertility pattern model exists.

Rob and his research colleagues – Thomas House of the University of Manchester, F. Xavier Gómez-Olivé of the Agincourt research unit in South Africa, and Frances Griffiths of the University of Warwick – successfully applied this method to data from the Agincourt Health and Socio-Demographic Surveillance System (HDSS), run by the Medical Research Council/University of the Witwatersrand Rural Public Health and Health Transitions (Agincourt) Research Unit. This is an annual census performed in a poor rural region of South Africa, collecting information on births, deaths, migration, and many different health aspects. The results of this analysis provided more robust and reliable estimates of the fertility patterns within the Agincourt study area that are free from unjustified assumptions of linearity.

The researchers hope this work will encourage others working in fertility modelling to look beyond standard methodology and be more thoughtful about what methods they use and the assumptions they make when using these methods.

We were very excited to win a place on the British delegation to Cyber Tech Tel Aviv 2018. This was part of the InnovateUK Global Business Accelerator Programme organised by the Enterprise Europe Network and Business West. The aim was to develop closer relationships between the UK and Israel in the field of Cyber Security.

The UK and Israel are two of the leading countries in the field of Cyber Security, so this was a fantastic opportunity for the countries to share ideas and learn from one another. Cyber Tech Tel Aviv, the largest Cyber Security event outside the USA, showcased some of the top technology from exhibitors ranging from global providers such as Microsoft and Dell to more than 100 start-ups.

This was a hugely valuable experience for Spectra as we developed some great connections both in Israel and within the UK delegation. This culminated in a fabulous dinner with the British Ambassador which was a great opportunity to network with some of the more influential members of the community.

We are excited to announce that Spectra has won an InnovateUK grant to develop AI triage for primary care with the University of Manchester. The aim is to reduce GP workload and alleviate the mounting pressures on NHS primary care.

Primary care is the foundation of the NHS, accounting for 90% of all NHS contacts: over 340m consultations per year. Over recent years it has come under significant strain due to increasing demand, a hiring/retention crisis, and budget constraints. The pressure is exacerbated by avoidable GP consultations where patients should more appropriately self-care or consult other healthcare professionals such as a nurse or pharmacist. This puts GP practices under unnecessary strain, and endangers patients who cannot see a GP when required. This problem is caused by a lack of effective triage processes – deciding where to send patients – which often rely on non-clinical staff such as receptionists. An estimated 27% of consultations are avoidable, with 6% of patients seen by another professional within the practice and 4% seeing pharmacists or using self-care [1].

The implementation of effective triage processes could dramatically reduce the workload on GP practices. In one approach, ‘Telephone First Triage’, GPs triage patients during a callback phone call when an appointment is requested. However, new research found that whilst face-to-face appointments fell 38% on average, telephone consultations increased 12-fold, increasing average GP workload by 8% [2]. In addition to its questionable effectiveness, telephone triage is not scalable as it requires constant support from GPs.

An alternative approach is to use Artificial Intelligence (AI) based triage. AI triage is well-suited to primary care due to the large amount of patient data, which allows the algorithm to identify both common and more unusual ailments. In this feasibility study, we intend to evaluate the efficacy and impact of AI triage in primary care.

[1] Primary Care Foundation & NHS Alliance, 2015, Making Time in General Practice
[2] BMJ, 2017, Evaluation of telephone first approach to demand management in English general practice

Spectra is excited to announce that it has recently won a European Commission grant to study how mobility-impaired users utilise the London transport network. We are going to help identify ‘mobility black spots’ by analysing crowdsourced tracking data. Check out the MobiliCity London site to find out more and see our team.

The aim of data analytics is to infer the relationships between variables in a system in order to predict and/or control said system. For example, we may wish to understand the relationship between a stock’s return and its volatility in order to profit from changes in these variables or to reduce risk. For instance, if we knew that when volatility went down the stock price would go up, we could profit by buying the stock when we first saw signs of a drop in stock volatility.

Unfortunately, it is often not possible to directly infer causation because we are (usually) unable to directly perturb a given system, which is the only way to unambiguously determine causation. Consequently, we often settle for analysing simple correlations which act as a crude proxy for causation. This can be seen in the chart from Spurious Correlations which shows the total revenue generated by arcades in the US versus the number of computer science doctorates. Whilst there is a 98.5% correlation, there is no plausible causal mechanism.

There are alternative approaches to inferring causation such as developing a mechanistic model as is done in epidemiology or building an experiment as is done in behavioural economics. The former requires a good understanding of the underlying process whilst the latter requires strict controls to prevent external influences and to make it as realistic as possible. Another approach is to consider causation in a statistical sense as is done with Granger causality.

Granger causality is based upon the premise that the process X strictly Granger causes another process Y if future values of Y can be better predicted using the past values of X rather than only the past values of Y. This notion was originally introduced by Wiener (1956) and later formalized in terms of linear auto-regression by Granger (1969). As stated by Barnett et al. (2009), “identifying Granger causality is not identical to identifying a physically instantiated causal interaction in a system; this can only be unambiguously identified by perturbing the system. Instead, it is a causal relation in a statistical sense.” A major problem with Granger causality is that most real problems, such as the return–volume relation, are nonlinear (Hiemstra and Jones, 1994; Chuang et al., 2009). This led researchers such as Baek and Brock (1992) and Hiemstra and Jones (1994) to develop nonlinear extensions to Granger causality. The Hiemstra and Jones (1994) test is now the most commonly used method among practitioners in finance and economics. Unfortunately, Diks and Panchenko (2005) show that this measure may not actually test Granger causality and identify numerous situations in which the test actually fails. An alternative approach is to use a truly nonlinear and nonparametric method such as information theory.
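To make the linear (Granger, 1969) idea concrete, here is a minimal sketch using only the Python standard library. It is not the method from any of the papers cited above: the single lag, the synthetic data in which X drives Y, and the helper names (solve, rss, granger_f) are all assumptions made for illustration. It compares how well Y is predicted from its own past with how well it is predicted once the past of X is added.

```python
import random


def solve(a, b):
    """Solve the small linear system a.x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def rss(rows, target):
    """Residual sum of squares of an ordinary least-squares fit."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, target)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(bi * ri for bi, ri in zip(beta, r)) for r in rows]
    return sum((t - f) ** 2 for t, f in zip(target, fitted))


def granger_f(cause, effect):
    """F-statistic for 'cause Granger-causes effect' using one lag."""
    target = effect[1:]
    steps = range(len(effect) - 1)
    restricted = [[1.0, effect[t]] for t in steps]              # own past only
    unrestricted = [[1.0, effect[t], cause[t]] for t in steps]  # plus past of cause
    rss_r, rss_u = rss(restricted, target), rss(unrestricted, target)
    return (rss_r - rss_u) / (rss_u / (len(target) - 3))


# Synthetic example (an assumption): X drives Y with a one-step delay.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(500)]
y = [0.0]
for t in range(1, 500):
    y.append(0.5 * y[t - 1] + 0.8 * x[t - 1] + rng.gauss(0, 1))

f_xy = granger_f(x, y)  # does the past of X help predict Y?
f_yx = granger_f(y, x)  # does the past of Y help predict X?
print(f"F(X -> Y) = {f_xy:.1f}, F(Y -> X) = {f_yx:.1f}")
```

A large F-statistic in one direction only is what the linear test reads as Granger causality. In practice one would use several lags and a proper significance test from an established statistics package.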

Information theory was originally developed to examine the properties in signal processing, such as data compression, by Shannon (1948). However, it is now widely used in the physical sciences for problems such as statistical inference due to its ability to analyse nonlinear statistical dependencies and higher order moments of the distribution.

Mutual information (MI) is a popular measure in the field of information theory. MI gives the mutual reduction in uncertainty of one variable given another. For example, one can calculate the reduction in uncertainty of the daily return at time t by knowing the daily volume at time t. If there is no reduction in uncertainty, then the daily returns and volumes are statistically independent. Unfortunately, since this measure is symmetric under the exchange of variables, it is only able to determine if two variables are related. However, if one wishes to imply causation one can simply add a time lag to one variable; this assumes that the causal effect cannot back propagate through time. For example, one can find the reduction in uncertainty in the daily return at time t+1 given the daily volume at time t, and vice versa. If there is only a reduction in uncertainty in one direction or one is substantially larger, then one variable must be strongly influencing or causing the changes in the other variable.
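The time-lagged comparison described above can be sketched as follows. This is an illustration under stated assumptions rather than the estimator used in any of the cited papers: the data are synthetic (Y at time t+1 depends on the square of X at time t, so the dependence is non-linear), the variables are discretised into four equal-frequency bins, and the function names are hypothetical.

```python
import math
import random
from collections import Counter


def quantile_bins(values, n_bins=4):
    """Replace each value by the index of its equal-frequency bin."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


def mutual_information(a, b):
    """Plug-in estimate of the mutual information (in bits) of two symbol lists."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        c / n * math.log2(c * n / (count_a[i] * count_b[j]))
        for (i, j), c in count_ab.items()
    )


def lagged_mi(source, target, lag=1):
    """MI between source at time t and target at time t + lag."""
    s, t = quantile_bins(source), quantile_bins(target)
    return mutual_information(s[:-lag], t[lag:])


# Synthetic data (an assumption): Y at t+1 is a non-linear function of X at t.
rng = random.Random(1)
n = 5000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [rng.gauss(0, 0.1)] + [x[t] ** 2 + rng.gauss(0, 0.1) for t in range(n - 1)]

forward = lagged_mi(x, y)  # uncertainty in Y(t+1) removed by knowing X(t)
reverse = lagged_mi(y, x)  # uncertainty in X(t+1) removed by knowing Y(t)
print(f"MI X(t) -> Y(t+1): {forward:.3f} bits")
print(f"MI Y(t) -> X(t+1): {reverse:.3f} bits")
```

Because Y depends on the square of X, the linear correlation between the two series is close to zero even though the lagged MI in the forward direction is large. Binned estimates like this one carry a small positive bias, so a value slightly above zero in the reverse direction should not be over-interpreted.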

One can use an asymmetric measure such as Transfer Entropy (TE) (Schreiber, 2000), an information theoretic measure of time-directed information transfer between jointly dependent processes. Barnett et al. (2009) state that TE is not framed in terms of prediction but in terms of resolution of uncertainty. The TE from Y to X is the degree to which Y disambiguates the future of X beyond the degree to which X already disambiguates its own future. This parallels the notion of Granger causality. In fact, Barnett et al. (2009) show that TE is equivalent to Granger causality for Gaussian distributed variables and Hlaváčková-Schindler (2011) extended this to variables following exponential Weinman, log-normal and certain parametrizations of generalized Gaussian distributions.

This article is based upon a recent paper I published in Applied Economics entitled “An information theoretic analysis of stock returns, volatility and trading volumes”; Ong (2015). In this paper I used information theory to show that the observed negative correlation between a stock’s returns and its volatility (known as the Leverage Effect, Black (1976)) is driven by trading volumes; this is supportive of previous research by Avramov et al. (2006). This is important for trading and risk management purposes and supports the idea of a behavioural based explanation for the Leverage Effect.
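The definition of TE from Y to X can be written down directly for discretised data. The sketch below is a simplified illustration, not the estimator from Schreiber (2000) or Ong (2015): it assumes a history length of one step, reduces each series to a binary symbol (above or below its median), and uses synthetic data in which Y drives X. The function names are hypothetical.

```python
import math
import random
from collections import Counter


def binarise(values):
    """Map each value to 1 if it lies above the series median, else 0."""
    median = sorted(values)[len(values) // 2]
    return [1 if v > median else 0 for v in values]


def transfer_entropy(source, target):
    """Plug-in transfer entropy (bits) from source to target, history length 1."""
    s, t = binarise(source), binarise(target)
    triples = list(zip(t[1:], t[:-1], s[:-1]))  # (target next, target now, source now)
    n = len(triples)
    c_full = Counter(triples)
    c_hist = Counter((past, src) for _, past, src in triples)
    c_pair = Counter((nxt, past) for nxt, past, _ in triples)
    c_past = Counter(past for _, past, _ in triples)
    te = 0.0
    for (nxt, past, src), c in c_full.items():
        with_source = c / c_hist[(past, src)]               # p(next | past, source)
        without_source = c_pair[(nxt, past)] / c_past[past]  # p(next | past)
        te += c / n * math.log2(with_source / without_source)
    return te


# Synthetic data (an assumption): Y drives X with a one-step delay.
rng = random.Random(2)
n = 5000
y = [rng.gauss(0, 1) for _ in range(n)]
x = [0.0]
for t in range(1, n):
    x.append(0.5 * x[t - 1] + y[t - 1] + rng.gauss(0, 0.5))

te_y_to_x = transfer_entropy(y, x)
te_x_to_y = transfer_entropy(x, y)
print(f"TE Y -> X: {te_y_to_x:.3f} bits, TE X -> Y: {te_x_to_y:.3f} bits")
```

Unlike lagged MI, the measure conditions on the target's own past, so it only credits the source with information that the target's history does not already provide.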

References

Avramov, D., Chordia, T. and Goyal, A. (2006) The impact of trades on daily volatility, Review of Financial Studies, 19, 1241–77

Baek, E. and Brock, W. (1992) A general test for nonlinear Granger causality: bivariate model, Working Paper, Iowa State University and University of Wisconsin at Madison

Barnett, L., Barrett, A. B. and Seth, A. K. (2009) Granger causality and transfer entropy are equivalent for Gaussian variables, Physical Review Letters, 103, 238701

Black, F. (1976) Studies in stock market volatility changes, in Proceedings of the 1976 Meeting of the Business and Economics Statistics Section, American Statistical Association, Alexandria, VA, pp. 177–81

Chuang, C.-C., Kuan, C.-M. and Lin, H.-Y. (2009) Causality in quantiles and dynamic stock return-volume relations, Journal of Banking & Finance, 33, 1351–60

Diks, C. and Panchenko, V. (2005) A note on the Hiemstra-Jones test for Granger non-causality, Studies in Nonlinear Dynamics and Econometrics, 9, 1558–3708

Hiemstra, C. and Jones, J. D. (1994) Testing for linear and nonlinear Granger causality in the stock price-volume relation, The Journal of Finance, 49, 1639–64

Hlaváčková-Schindler, K. (2011) Equivalence of Granger causality and transfer entropy: a generalization, Applied Mathematical Sciences, 5, 3637–48

Ong, M. (2015) An information theoretic analysis of stock returns, volatility and trading volumes, Applied Economics, 47, 36, 3891–3906

Schreiber, T. (2000) Measuring information transfer, Physical Review Letters, 85, 461–4

Shannon, C. E. (1948) A mathematical theory of communication, Bell System Technical Journal, 27, 379–423

From my perspective there are two main challenges in data science. The first is the Big Data problem and the second is the complex system problem. The Big Data problem refers to how we store and process large quantities of data. At its simplest, this requires developing faster and more efficient methods of storage and computation. This has been the main focus of industry and one in which we have made great strides. For example, we can now sequence the human genome in just a few hours. In this article we will focus on the complex system problem, whose solutions are still in their infancy and which provides the greatest challenge to data science. (I say this because whilst more information is generally better, it is the quality of information that is important for a data scientist rather than the quantity.)

The complex system problem refers to how we analyse data from a system which has non-linear interactions and feedback loops; this characterises most real world systems. It is important to distinguish between a complex system and a complicated system. Just because a system is complicated it does not necessarily mean it is complex. A complicated system may have many elements but if they interact in a linear manner (without feedback loops) they are, in general, very simple to break down and analyse.

Many naive approaches to data science ignore the distinction between complicated and complex systems. They simply apply linear techniques, such as regression analysis, to problems that are inherently non-linear. This is extremely dangerous as it can lead to erroneous conclusions. For example, we can accurately measure all of the variables which drive the weather such as temperature, humidity, rainfall, air pressure and wind speed but we find it very difficult to predict the weather beyond a few days. This is because the weather is a chaotic system which means that just a small change in one part of the system can grow to have a significant impact on the whole system. This is the “butterfly effect” where a butterfly flapping its wings in South America can cause a hurricane in the USA. Another example is the unpredictability of the economy and financial systems. We understand all of the different parts of the system but as it evolves they often interact in unpredictable ways. It’s made even more difficult because we cannot accurately measure the state of the system at any one time.
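The sensitivity to small changes described above is easy to reproduce numerically. The sketch below uses the logistic map, a standard textbook example of chaos, as a stand-in for a real system such as the weather. The map, the parameter r = 4 and the starting values are assumptions chosen purely for illustration.

```python
def logistic_orbit(x0, steps, r=4.0):
    """Iterate the logistic map x -> r * x * (1 - x) starting from x0."""
    orbit = [x0]
    for _ in range(steps):
        orbit.append(r * orbit[-1] * (1.0 - orbit[-1]))
    return orbit


# Two starting points that differ by one part in ten billion.
a = logistic_orbit(0.2, 100)
b = logistic_orbit(0.2 + 1e-10, 100)
gaps = [abs(p - q) for p, q in zip(a, b)]

for step in (0, 5, 20, 40, 60):
    print(f"step {step:3d}: gap between the two trajectories = {gaps[step]:.3e}")
```

The two trajectories are indistinguishable for the first few steps and then separate completely, which is why a tiny measurement error limits how far ahead such a system can be forecast.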

So what can we do about this? There is in fact an active area of academic study focused on Complex Systems, which is known as Complexity Science. The history of Complexity Science is shown in Figure 1. We can see from this that it has been around in formal terms since the mid-1960s, largely evolving from Systems Theory and Cybernetics. It also shows many of the related fields of research and how they have evolved.

The field is dominated by mathematicians and physicists but has lately begun to include economists and social scientists as they search for more sophisticated analytical tools. Some of the common approaches associated with complex systems analysis include:

1) Statistical Mechanics – this is a branch of physics that uses probability theory to study the average behaviour of a mechanical system where the state of the system is unknown. This is important in many real world applications where we are unable to accurately record/observe every part of a system.

2) Information Theory – this was originally developed to examine the properties in signal processing, such as data compression, by Shannon (1948). However, it is now widely used in the physical sciences for problems such as statistical inference due to its ability to analyse non-linear statistical dependencies.

3) Non-linear Dynamics – this analyses how a system changes under varying conditions and over time. This is important as it can help determine the stability of a system and ascertain the critical limits which would lead to instability.

4) Network Theory (aka Graph Theory) – this examines how different objects are connected to one another and allows us to examine how these relationships evolve over time and the effect on information travelling across the network.

5) Agent Based Modelling – this is a computational method used to analyse how autonomous agents interact with one another and the effect of these interactions on the overall system. By using simple interaction rules it is possible to observe very complex behaviour which can provide powerful insights into a system. ABMs are related to Multi-Agent Systems (MAS).

6) Stochastic Processes (or random processes) – these describe how a collection of random variables evolves over time and are often used for prediction and to quantify potential risks.

It is also important to note that all of these methods must be used in conjunction with sophisticated statistics in order to determine their efficacy and statistical significance.
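As a small illustration of the last item in the list above, the sketch below uses a simple stochastic process, a symmetric random walk, to quantify a risk by simulation: the chance that the walk falls to a given loss level at any point within a fixed number of steps. The walk, the barrier of -10, the 100 steps and the 2,000 trials are hypothetical choices, not taken from any real application.

```python
import random


def hits_barrier(rng, steps, barrier):
    """Return True if a +/-1 random walk from 0 ever reaches the barrier."""
    position = 0
    for _ in range(steps):
        position += rng.choice((-1, 1))
        if position <= barrier:
            return True
    return False


rng = random.Random(4)
trials = 2000
p_hit = sum(hits_barrier(rng, 100, -10) for _ in range(trials)) / trials
print(f"Estimated probability of touching -10 within 100 steps: {p_hit:.2f}")
```

The same pattern (simulate many paths, count how often the bad event occurs) carries over to more realistic processes, although the estimate is only as good as the assumed model.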

The above methods generally require some understanding of the system. Another approach is to use machine learning techniques such as Genetic Algorithms, Deep-learning Networks, Support Vector Machines and Gaussian Processes. These provide powerful approaches to uncovering hidden relationships within data; even non-linear relationships. However, one must be very careful when using such a ‘black box’ because they do not provide an understanding of why things happen or if they may change in the future.

Determining which method to use really depends upon the problem that you are analysing. Often it can be useful to apply several of them as they can all provide different information about a complex system. Hence it is important that any good data scientist has at least some familiarity with all of these different approaches and many others; this is not an exhaustive list.

Some readers may say “my problem is very simple so I don’t need complexity science”. Whilst it is true that some non-linear systems can be approximated by linear systems for small perturbations (linearisation) it is important to know what constitutes a small perturbation. When something significant occurs it is very easy for systems to fail and the consequences can often be disastrous e.g. the financial crisis.
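To give a feel for what "small" means, the sketch below takes a familiar textbook case that is not part of the original discussion: the restoring force on a pendulum is proportional to sin(θ), and the usual linearisation replaces sin(θ) with θ. The function name and the two angles are assumptions chosen for illustration.

```python
import math


def linearisation_error(theta):
    """Relative error of approximating sin(theta) by theta (theta in radians)."""
    return abs(theta - math.sin(theta)) / abs(math.sin(theta))


small = linearisation_error(0.1)  # a swing of roughly 6 degrees
large = linearisation_error(1.5)  # a swing of roughly 86 degrees
print(f"relative error at 0.1 rad: {small:.2%}")
print(f"relative error at 1.5 rad: {large:.2%}")
```

The linear model is accurate to a fraction of a percent for the small swing and badly wrong for the large one, yet nothing in the linear model itself warns you when you have left its range of validity.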

There are other challenges to data science such as how to capture data, correct for missing and incorrect data and natural language processing. However, despite these being significant challenges I feel they are minor in comparison.

The data analytics industry is growing fast with large established players vying for market share with exciting new start-up companies. With such a range of services available it is very difficult for businesses to know what is the best approach. Should they use a standardised data analytics platform? If so, which one? Or should they build their own “in-house” analytics team or hire a consultant and develop a bespoke system?

These are difficult questions to answer and will generally depend upon the business. In this post I will outline a few potential risks of using standardised data analytics platforms and what, if anything, can be done to avoid them. Standardised platforms are arguably the largest area of growth in this sector: many platforms are now being bundled with cloud computing services such as those offered by Amazon and Microsoft. IBM’s ‘Watson’ is becoming the most renowned of these platforms, having won the quiz show ‘Jeopardy!’ (an earlier IBM system, Deep Blue, defeated chess grandmaster Garry Kasparov); it has since successfully turned its attention to medical services and is now analysing legal services.

As an analytics consultant, I am clearly biased towards the benefits of consultancy. However, I don’t believe that any of the following points are particularly controversial.

1) Data Entry

Most standardised systems will struggle to cope with missing or incorrect data. Unless pre-specified many will just ignore missing data and assume all other data is correct. This is an important issue because it can bias your results leading to erroneous conclusions. Dealing with missing data is a common challenge in statistics and there are sophisticated methods available. However, choosing the correct approach depends upon the problem; it cannot be offered in a standardised platform. Beyond a statistical approach one could also look to enrich and validate the data by using external data sources, but this would also require a tailored solution.
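The bias from simply ignoring missing data can be shown with a short simulation. Everything in the sketch below is hypothetical: the values are drawn from a normal distribution with mean 50, and two missingness mechanisms are compared, one where values go missing completely at random and one where high values are much more likely to be missing.

```python
import random


def mean(values):
    return sum(values) / len(values)


rng = random.Random(3)
values = [rng.gauss(50, 10) for _ in range(10000)]

# Mechanism 1: every value has the same 50% chance of being recorded.
recorded_at_random = [v for v in values if rng.random() < 0.5]

# Mechanism 2: high values are recorded only 20% of the time, low values 80%.
recorded_selectively = [
    v for v in values if rng.random() < (0.2 if v > 50 else 0.8)
]

true_mean = mean(values)
random_mean = mean(recorded_at_random)
selective_mean = mean(recorded_selectively)
print(f"true mean:                        {true_mean:.2f}")
print(f"mean, values missing at random:   {random_mean:.2f}")
print(f"mean, high values mostly missing: {selective_mean:.2f}")
```

Dropping the missing rows is harmless in the first case and clearly biased in the second, and the data alone cannot tell you which case you are in. That judgement requires knowledge of how the data were collected.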

2) Statistical Inference

In order to conduct statistical inference it is imperative that you have an understanding of the problem and the data. For example, if your sample data is not representative of the entire population or the target group, then you will have biased results. This can also occur if you only have a small sample set or one that does not include significant events. In this situation one may wish to use Bayesian statistics, which incorporate expert knowledge of the problem. Unfortunately, it is not possible to employ expert knowledge in a standardised approach. Another risk is when using Machine Learning for prediction. These methods are excellent at modelling what has happened but they are often very poor at predicting regime change. These problems can only really be solved on a case by case basis by statisticians and/or data scientists.
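As a minimal sketch of how expert knowledge enters a Bayesian estimate when the sample is small, consider estimating a conversion rate from three observations. The numbers are hypothetical: the expert's belief that the rate is around 20% is encoded as a Beta(2, 8) prior, and all three observations happen to be successes.

```python
def posterior_mean(prior_a, prior_b, successes, failures):
    """Mean of the Beta posterior for a rate, given a Beta(prior_a, prior_b) prior."""
    return (prior_a + successes) / (prior_a + prior_b + successes + failures)


successes, failures = 3, 0

naive_estimate = successes / (successes + failures)                # data only
bayes_estimate = posterior_mean(2, 8, successes, failures)         # prior + data

print(f"estimate from the data alone:  {naive_estimate:.2f}")
print(f"estimate with an expert prior: {bayes_estimate:.2f}")
```

The data-only estimate claims a 100% rate from three observations, whereas the Bayesian estimate moves only part of the way from the expert's 20% towards the data. Choosing the prior is exactly the kind of problem-specific judgement that a standardised platform cannot make for you.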

3) Interpretation

This leads back to the previous point that it is important to understand the problem, the data and the method of statistical inference. This is important as it restricts what questions you can ask of the data and under what conditions your inferences are valid. It prevents the business from acting on erroneous results. Again, this can only be solved on a case by case basis by statisticians and/or data scientists.

4) Functionality

Data analytics can be applied to a wide range of business functions and if you wish to develop a data-driven organisation it is vital to do this and integrate your approach. However, it is unlikely that standardised platforms will have all the required functionality. To add functionality the best approach will be to select a platform that allows third-party add-ons. Unfortunately, this will require users to pay additional fees and the add-ons may still not be suitable/ideal.

5) Competitor differentiation

It is natural that the more common standardised platforms become the fewer opportunities there will be for competitor differentiation. Once the benefits from standardised techniques – of which there are many – are exhausted, businesses will have to start tailoring their systems to remain competitive.

The standardised platforms can certainly provide big benefits to businesses in a short period of time, reducing costs, improving efficiency and improving sales. They may also be enhanced by the use of third-party add-ons to tailor a system and improve functionality. However, it is not clear that they provide a cheap alternative to hiring an “in-house” analytics team or employing a data analytics consultant. This is because they are simply tools and as such require qualified data scientists to ameliorate the inherent risks. Consequently, you must build a de facto “in-house” analytics team, which is expensive.

Proponents of standardised platforms would likely argue that the systems i) are a great low-cost way for businesses to start developing data analytics capabilities, ii) are robust and well tested with great support, iii) can be implemented very quickly and iv) will improve over time. I agree with these points. Finally, I’m actually excited to have the opportunity to play around with IBM’s Watson as it is likely to have the most sophisticated natural language processing available. It would be great for people to be able to embed this in their own applications.

Please support the Science is Vital campaign to raise the science budget from less than 0.5% of GDP to 0.8% of GDP, the G8 average. Science and engineering form a vital part of our economy and provide huge societal benefits. If we are to remain world leaders in these areas it is vital that we develop a long term and stable policy framework. We need to fund our research institutions to a level that will allow them to remain internationally competitive.

Please write to your local politician and parliamentary candidates asking them to support the Science is Vital campaign.

You can find out more details and information on the Science is Vital website: