Complexity Science – the hard thing about data science

From my perspective there are two main challenges in data science. The first is the Big Data problem and the second is the complex system problem. The Big Data problem refers to how we store and process large quantities of data. At its simplest, this requires developing faster and more efficient methods of storage and computation. This has been the main focus of industry, and one in which we have made great strides: for example, we can now sequence the human genome in just a few hours. In this article we will focus on the complex system problem, whose solutions are still in their infancy and which provides the greatest challenge to data science. (I say this because, whilst more information is generally better, it is the quality of information rather than the quantity that matters to a data scientist.)

The complex system problem refers to how we analyse data from a system with non-linear interactions and feedback loops; this characterises most real-world systems. It is important to distinguish between a complex system and a complicated system. Just because a system is complicated does not necessarily mean it is complex. A complicated system may have many elements, but if they interact in a linear manner (without feedback loops) it is, in general, very simple to break down and analyse.

Many naive approaches to data science ignore the distinction between complicated and complex systems. They simply apply linear techniques, such as regression analysis, to problems that are inherently non-linear. This is extremely dangerous as it can lead to erroneous conclusions. For example, we can accurately measure all of the variables which drive the weather, such as temperature, humidity, rainfall, air pressure and wind speed, but we find it very difficult to predict the weather beyond a few days. This is because the weather is a chaotic system, which means that a small change in one part of the system can grow to have a significant impact on the whole system. This is the “butterfly effect”, where a butterfly flapping its wings in South America can cause a hurricane in the USA. Another example is the unpredictability of the economy and financial systems. We understand all of the different parts of the system, but as it evolves they often interact in unpredictable ways. It is made even more difficult because we cannot accurately measure the state of the system at any one time.
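
To make the idea of sensitivity to initial conditions concrete, here is a minimal sketch in Python using the logistic map, a standard toy model of chaotic dynamics (it is not a weather model, and the starting values are purely illustrative):

```python
# Minimal sketch: sensitivity to initial conditions in the logistic map,
# a standard toy model of chaotic dynamics (not a weather model).
def logistic_map(x0, r=4.0, steps=50):
    """Iterate x -> r * x * (1 - x) and return the trajectory."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1 - xs[-1]))
    return xs

a = logistic_map(0.2)          # initial condition
b = logistic_map(0.2 + 1e-9)   # perturbed by one part in a billion

# The two trajectories diverge completely within a few dozen iterations.
for step in (0, 10, 30, 50):
    print(f"step {step:2d}: |difference| = {abs(a[step] - b[step]):.6f}")
```

Two starting points that differ by one part in a billion end up on completely different trajectories within a few dozen steps, which is exactly why long-range forecasts of chaotic systems fail.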

So what can we do about this? There is in fact an active area of academic research devoted to complex systems, known as Complexity Science. The history of Complexity Science is shown in Figure 1. We can see from this that it has existed in formal terms since the mid-1960s, largely evolving from Systems Theory and Cybernetics. The figure also shows many of the related fields of research and how they have evolved.

Figure 1: History of Complexity Science, by Brian Castellani

The field is dominated by mathematicians and physicists but has lately begun to include economists and social scientists as they search for more sophisticated analytical tools. Some of the common approaches associated with complex systems analysis include:

1) Statistical Mechanics – this is a branch of physics that uses probability theory to study the average behaviour of a mechanical system where the state of the system is unknown. This is important in many real world applications where we are unable to accurately record/observe every part of a system.

2) Information Theory – this was originally developed by Shannon (1948) to study problems in signal processing, such as data compression. However, it is now widely used in the physical sciences for problems such as statistical inference, due to its ability to capture non-linear statistical dependencies.

3) Non-linear Dynamics – this analyses how a system changes under varying conditions and over time. This is important as it can help determine the stability of a system and ascertain the critical limits which would lead to instability.

4) Network Theory (aka Graph Theory) – this examines how different objects are connected to one another and allows us to study how these relationships evolve over time and how they affect information travelling across the network.

5) Agent Based Modelling – this is a computational method used to analyse how autonomous agents interact with one another and the effect of these interactions on the overall system. By using simple interaction rules it is possible to observe very complex behaviour, which can provide powerful insights into a system. ABMs are related to Multi-Agent Systems (MAS).

6) Stochastic Processes (or random processes) – these describe how a collection of random variables evolves over time and are often used for prediction and to quantify potential risks (a minimal simulation of such a process is sketched after this list).
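
As a concrete illustration of the last item, the following sketch simulates an ensemble of random walks with drift and reads off an empirical downside quantile, a Value-at-Risk style summary. It assumes NumPy is available, and the drift and volatility figures are invented for illustration:

```python
import numpy as np

# Minimal sketch of a stochastic process: an ensemble of random walks with
# drift, used here to quantify downside risk via an empirical quantile.
# Parameter values are illustrative assumptions only.
rng = np.random.default_rng(seed=0)

n_paths, n_steps = 10_000, 250
drift, volatility = 0.0002, 0.01

# Each path is a cumulative sum of small random increments.
increments = drift + volatility * rng.standard_normal((n_paths, n_steps))
paths = increments.cumsum(axis=1)

final_values = paths[:, -1]
var_95 = np.quantile(final_values, 0.05)  # 5th percentile of outcomes

print(f"Mean final value: {final_values.mean():.4f}")
print(f"5% worst-case (Value-at-Risk style) threshold: {var_95:.4f}")
```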

It is also important to note that all of these methods must be used in conjunction with sophisticated statistics in order to determine their efficacy and statistical significance.

The above methods generally require some understanding of the system. Another approach is to use machine learning techniques such as Genetic Algorithms, Deep Learning Networks, Support Vector Machines and Gaussian Processes. These provide powerful ways of uncovering hidden relationships within data, even non-linear ones. However, one must be very careful when using such ‘black box’ methods because they do not provide an understanding of why things happen or whether they may change in the future.
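
As a small illustration, here is a sketch, assuming scikit-learn and a synthetic dataset, of a Gaussian Process recovering a non-linear relationship from noisy samples:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Minimal sketch: a Gaussian Process recovering a non-linear relationship
# from noisy samples. The data here is synthetic and purely illustrative.
rng = np.random.default_rng(seed=1)
X = np.sort(rng.uniform(0, 10, size=60)).reshape(-1, 1)
y = np.sin(X).ravel() + 0.2 * rng.standard_normal(60)

kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel).fit(X, y)

X_new = np.array([[2.5], [7.5]])
mean, std = gp.predict(X_new, return_std=True)
for x, m, s in zip(X_new.ravel(), mean, std):
    print(f"x = {x:.1f}: predicted {m:.2f} +/- {s:.2f} (true {np.sin(x):.2f})")
```

The fit can be very accurate within the range of the training data, yet it tells us nothing about why the relationship holds, and its reliability degrades sharply outside the observed range – the ‘black box’ caveat above.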

Determining which method to use really depends upon the problem that you are analysing. Often it can be useful to apply several of them as they can all provide different information about a complex system. Hence it is important that any good data scientist has at least some familiarity with all of these different approaches and many others; this is not an exhaustive list.

Some readers may say “my problem is very simple so I don’t need complexity science”. Whilst it is true that some non-linear systems can be approximated by linear systems for small perturbations (linearisation), it is important to know what constitutes a small perturbation. When something significant occurs it is very easy for such systems to fail, and the consequences can often be disastrous, e.g. the financial crisis.
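
A minimal numerical sketch of this point, using the textbook pendulum example rather than anything from the article: linearisation replaces sin(θ) with θ, which is fine for small angles but breaks down quickly as the perturbation grows.

```python
import numpy as np

# Minimal sketch: where a linear approximation stops being "small".
# For a pendulum the restoring force is proportional to sin(theta);
# linearisation replaces sin(theta) with theta, which only holds for
# small angles. The angles below are illustrative.
for theta_deg in (1, 5, 20, 45, 90):
    theta = np.radians(theta_deg)
    relative_error = abs(theta - np.sin(theta)) / np.sin(theta)
    print(f"{theta_deg:3d} deg: linearisation error ~ {relative_error:.1%}")
```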

There are other challenges to data science such as how to capture data, correct for missing and incorrect data and natural language processing. However, despite these being significant challenges I feel they are minor in comparison.

5 Challenges to Standardised Data Analytics Platforms

The data analytics industry is growing fast, with large established players vying for market share with exciting new start-up companies. With such a range of services available it is very difficult for businesses to know which approach is best. Should they use a standardised data analytics platform? If so, which one? Or should they build their own “in-house” analytics team, or hire a consultant and develop a bespoke system?

These are difficult questions to answer and will generally depend upon the business. In this post I will outline a few potential risks of using standardised data analytics platforms and what, if anything, can be done to avoid them. Standardised platforms are arguably the largest area of growth in this sector: many are now being bundled with cloud computing services such as those from Amazon and Microsoft. IBM’s ‘Watson’ is perhaps the most renowned of these platforms, having won the quiz show ‘Jeopardy!’ (its chess-playing predecessor, Deep Blue, famously defeated grandmaster Garry Kasparov); it has since successfully turned its attention to medical services and is now analysing legal services.

As an analytics consultant, I am clearly biased towards the benefits of consultancy. However, I don’t believe that any of the following points are particularly controversial.

1) Data Entry

Most standardised systems will struggle to cope with missing or incorrect data. Unless pre-specified, many will simply ignore missing data and assume all other data is correct. This is an important issue because it can bias your results, leading to erroneous conclusions. Dealing with missing data is a common challenge in statistics and there are sophisticated methods available. However, choosing the correct approach depends upon the problem; it cannot be offered in a standardised platform. Beyond a statistical approach one could also look to enrich and validate the data by using external data sources, but this would also require a tailored solution.
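
To see how silently ignoring missing data can bias results, here is a small sketch using synthetic data in which the missingness depends on the value itself, so simply dropping the gaps does not recover the truth (the scenario and numbers are invented for illustration):

```python
import numpy as np
import pandas as pd

# Minimal sketch: silently dropping missing values biases an estimate
# when the missingness is related to the value itself.
rng = np.random.default_rng(seed=2)
income = rng.lognormal(mean=10, sigma=0.5, size=5_000)

# Suppose the highest earners are much less likely to report their income.
report_prob = 1.0 - (income > np.quantile(income, 0.8)) * 0.7
observed = np.where(rng.uniform(size=income.size) < report_prob, income, np.nan)

df = pd.DataFrame({"income": observed})
print(f"True mean:                {income.mean():,.0f}")
# Dropping the NaNs (what many platforms do by default) understates the mean;
# naive mean imputation would leave exactly the same bias in place.
print(f"Mean after dropping NaNs: {df['income'].dropna().mean():,.0f}")
```

A principled treatment (for example modelling the missingness mechanism, or multiple imputation) requires knowing why the data is missing, which is precisely the context a generic platform cannot supply.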

2) Statistical Inference

In order to conduct statistical inference it is imperative that you understand the problem and the data. For example, if your sample data is not representative of the entire population or the target group, then you will obtain biased results. This can also occur if you only have a small sample, or one that does not include significant events. In this situation one may wish to use Bayesian statistics, which incorporate expert knowledge of the problem. Unfortunately, it is not possible to employ expert knowledge in a standardised approach. Another risk arises when using Machine Learning for prediction: these methods are excellent at modelling what has happened but are often very poor at predicting regime change. These problems can only really be solved on a case-by-case basis by statisticians and/or data scientists.
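
As a toy illustration of how expert knowledge enters a Bayesian analysis, here is a sketch (assuming SciPy, with invented numbers) of a Beta-Binomial update under a flat prior versus an informative expert prior, given a very small sample:

```python
from scipy.stats import beta

# Minimal sketch: a Beta-Binomial update showing how an informative prior
# (expert knowledge) stabilises inference from a small sample.
successes, failures = 3, 1          # a very small observed sample

# Flat prior: the data alone drives the estimate.
flat_posterior = beta(1 + successes, 1 + failures)

# Informative prior: an expert believes the true rate is around 50%.
expert_posterior = beta(20 + successes, 20 + failures)

for name, dist in [("flat prior", flat_posterior), ("expert prior", expert_posterior)]:
    lo, hi = dist.ppf([0.05, 0.95])
    print(f"{name:12s}: mean = {dist.mean():.2f}, 90% interval = ({lo:.2f}, {hi:.2f})")
```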

3) Interpretation

This leads back to the previous point: it is important to understand the problem, the data and the method of statistical inference, as this determines what questions you can ask of the data and under what conditions your inferences are valid. It also prevents the business from acting on erroneous results. Again, this can only be solved on a case-by-case basis by statisticians and/or data scientists.

4) Functionality

Data analytics can be applied to a wide range of business functions, and if you wish to develop a data-driven organisation it is vital to integrate your approach across them. However, it is unlikely that standardised platforms will have all the required functionality. To add functionality the best approach will be to select a platform that allows third-party add-ons. Unfortunately, this will require users to pay additional fees, and the add-ons may still not be suitable or ideal.

5) Competitor differentiation

It is natural that the more common standardised platforms become, the fewer opportunities there will be for competitor differentiation. Once the benefits from standardised techniques – of which there are many – are exhausted, businesses will have to start tailoring their systems to remain competitive.

Standardised platforms can certainly provide big benefits to businesses in a short period of time, reducing costs, improving efficiency and boosting sales. They may also be enhanced by third-party add-ons to tailor a system and improve functionality. However, it is not clear that they provide a cheap alternative to hiring an “in-house” analytics team or employing a data analytics consultant. This is because they are simply tools, and as such require qualified data scientists to mitigate the inherent risks. Consequently, you must build a de facto “in-house” analytics team, which is expensive.

Proponents of standardised platforms would likely argue that the systems i) offer a great low-cost way for businesses to start developing data analytics capabilities, ii) are robust and well tested with great support, iii) can be implemented very quickly, and iv) will improve over time. I agree with these points. Finally, I’m actually excited to have the opportunity to play around with IBM’s Watson, as it is likely to have the most sophisticated natural language processing available. It would be great for people to be able to embed this in their own applications.