From my perspective there are two main challenges in data science. The first is the Big Data problem and the second is the complex system problem. The Big Data problem refers to how we store and process large quantities of data. At its simplest, this requires developing faster and more efficient methods of storage and computation. This has been the main focus of industry and is an area in which we have made great strides. For example, we can now sequence the human genome in just a few hours. In this article we will focus on the complex system problem, whose solutions are still in their infancy and which provides the greatest challenge to data science. (I say this because, whilst more information is generally better, it is the quality of information rather than the quantity that matters to a data scientist.)

The complex system problem refers to how we analyse data from a system which has non-linear interactions and feedback loops; this characterises most real world systems. It is important to distinguish between a complex system and a complicated system. A system being complicated does not necessarily mean it is complex. A complicated system may have many elements, but if they interact in a linear manner (without feedback loops) the system is, in general, very simple to break down and analyse.

Many naive approaches to data science ignore the distinction between complicated and complex systems. They simply apply linear techniques, such as regression analysis, to problems that are inherently non-linear. This is extremely dangerous as it can lead to erroneous conclusions. For example, we can accurately measure the variables which drive the weather, such as temperature, humidity, rainfall, air pressure and wind speed, but we find it very difficult to predict the weather beyond a few days. This is because the weather is a chaotic system, which means that a small change in one part of the system can grow to have a significant impact on the whole system. This is the “butterfly effect”, the idea that a butterfly flapping its wings in South America could ultimately contribute to a hurricane in the USA. Another example is the unpredictability of the economy and financial systems. We understand the different parts of the system, but as it evolves they often interact in unpredictable ways. Prediction is made even more difficult because we cannot accurately measure the state of the system at any one time.
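To make the sensitivity to initial conditions concrete, here is a minimal sketch using the logistic map, a standard textbook example of chaos. It is not a weather model and is not taken from the discussion above; the function name `logistic_trajectory`, the parameter `r = 4.0` and the starting values are assumptions chosen purely for illustration.

```python
def logistic_trajectory(x0, r=4.0, steps=50):
    """Iterate the logistic map x -> r * x * (1 - x) and return every state visited."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


# Two runs whose starting points differ by one part in ten billion (illustrative values).
a = logistic_trajectory(0.2)
b = logistic_trajectory(0.2 + 1e-10)

# Absolute gap between the two trajectories at each step.
gaps = [abs(x - y) for x, y in zip(a, b)]

for step in (0, 10, 20, 30, 40, 50):
    print(f"step {step:>2}: gap = {gaps[step]:.3e}")
```

The gap starts out negligible and grows roughly exponentially until the two runs are effectively unrelated, which is the behaviour that limits how far ahead a chaotic system can be forecast, however good the measurements are.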

So what can we do about this? There is in fact an active academic field devoted to Complex Systems, known as Complexity Science. The history of Complexity Science is shown in Figure 1. We can see from this that it has existed in formal terms since the mid-1960s, largely evolving from Systems Theory and Cybernetics. The figure also shows many of the related fields of research and how they have evolved.

The field is dominated by mathematicians and physicists but has lately begun to include economists and social scientists as they search for more sophisticated analytical tools. Some of the common approaches associated with complex systems analysis include:

1) **Statistical Mechanics** – this is a branch of physics that uses probability theory to study the average behaviour of a mechanical system where the state of the system is unknown. This is important in many real world applications where we are unable to accurately record or observe every part of a system.

2) **Information Theory** – this was originally developed by Shannon (1948) to study problems in signal processing, such as data compression. However, it is now widely used in the physical sciences for problems such as statistical inference, due to its ability to analyse non-linear statistical dependencies.

3) **Non-linear Dynamics** – this analyses how a system changes under varying conditions and over time. This is important as it can help determine the stability of a system and ascertain the critical limits which would lead to instability.

4) **Network Theory** (aka Graph Theory) – this examines how different objects are connected to one another and allows us to examine how these relationships evolve over time and the effect on information travelling across the network.

5) **Agent Based Modelling** – this is a computational method used to analyse how autonomous agents interact with one another and the effect of these interactions on the overall system. By using simple interaction rules it is possible to observe very complex behaviour, which can provide powerful insights into a system. ABMs are related to Multi-Agent Systems (MAS).

6) **Stochastic Processes** (or random processes) – these describe how a collection of random variables evolves over time and are often used for prediction and to quantify potential risks.
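As a small illustration of the last item, the sketch below uses a simple random walk, one of the most basic stochastic processes, to estimate by Monte Carlo simulation how likely the process is to reach a given level within a fixed number of steps. The function name `breach_probability`, the ±1 step rule, the thresholds and the trial counts are all hypothetical choices made for this example, not a model of any real system.

```python
import random


def breach_probability(threshold, steps=250, trials=2000, seed=0):
    """Monte Carlo estimate of the chance that a +/-1 random walk
    starting at 0 reaches `threshold` within `steps` steps."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    breaches = 0
    for _ in range(trials):
        position = 0
        for _ in range(steps):
            position += rng.choice((-1, 1))
            if position >= threshold:
                breaches += 1
                break  # this path has breached; move on to the next trial
    return breaches / trials


p_near = breach_probability(threshold=10)
p_far = breach_probability(threshold=30)
print(f"P(reach 10 within 250 steps) ~ {p_near:.3f}")
print(f"P(reach 30 within 250 steps) ~ {p_far:.3f}")
```

Even this toy process shows the typical use: rather than predicting a single path, we simulate many and read off the probability of an adverse event.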

It is also important to note that all of these methods must be used in conjunction with sophisticated statistics in order to determine their efficacy and statistical significance.
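To illustrate why an information-theoretic measure can reveal what a linear one misses, the sketch below compares the Pearson correlation with a simple histogram estimate of mutual information on synthetic data where `y` is exactly `x` squared. Everything here is an assumption made for the demonstration: the helper names (`pearson`, `binned`, `mutual_information`), the choice of 8 equal-width bins and the uniform synthetic data. The plug-in estimator is crude and slightly biased, so treat it as a sketch rather than a recommended method.

```python
import math
import random
from collections import Counter


def pearson(xs, ys):
    """Pearson linear correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def binned(values, bins):
    """Assign each value to one of `bins` equal-width bins (assumes values are not all equal)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]


def mutual_information(xs, ys, bins=8):
    """Plug-in estimate of I(X;Y) in bits from a 2-D histogram."""
    bx, by = binned(xs, bins), binned(ys, bins)
    n = len(xs)
    joint, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    return sum(
        (c / n) * math.log2(c * n / (px[i] * py[j]))
        for (i, j), c in joint.items()
    )


rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(5000)]
ys = [x * x for x in xs]                          # fully determined by x, but not linearly
noise = [rng.uniform(-1, 1) for _ in range(5000)]  # independent of x

r_lin = pearson(xs, ys)
mi_dep = mutual_information(xs, ys)
mi_indep = mutual_information(xs, noise)
print(f"Pearson r(x, x^2)        = {r_lin:.3f}")
print(f"MI(x, x^2)   [bits]      = {mi_dep:.3f}")
print(f"MI(x, noise) [bits]      = {mi_indep:.3f}")
```

The correlation comes out close to zero even though `y` depends entirely on `x`, while the mutual information is clearly larger for the dependent pair than for the independent one.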

The above methods generally require some understanding of the system. Another approach is to use machine learning techniques such as Genetic Algorithms, Deep-learning Networks, Support Vector Machines and Gaussian Processes. These provide powerful approaches to uncovering hidden relationships within data, even non-linear relationships. However, one must be very careful when using such ‘black box’ methods because they do not provide an understanding of why things happen or whether they may change in the future.

Determining which method to use really depends upon the problem that you are analysing. Often it can be useful to apply several of them as they can all provide different information about a complex system. Hence it is important that any good data scientist has at least some familiarity with all of these different approaches and many others; this is not an exhaustive list.

Some readers may say “my problem is very simple so I don’t need complexity science”. Whilst it is true that some non-linear systems can be approximated by linear systems for small perturbations (linearisation), it is important to know what constitutes a small perturbation. When something significant occurs, the linear approximation can break down, systems can fail very easily, and the consequences can be disastrous, as in the financial crisis.
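A classic way to see what "small" means is the small-angle approximation sin(θ) ≈ θ, which is the linearisation used for a pendulum. The pendulum is an assumed example chosen for illustration, as are the function name `linearisation_error` and the angles tested; the sketch simply measures how the relative error of the linear approximation grows with the size of the perturbation.

```python
import math


def linearisation_error(theta):
    """Relative error of approximating sin(theta) by theta (theta in radians, sin(theta) != 0)."""
    return abs(theta - math.sin(theta)) / abs(math.sin(theta))


for degrees in (1, 5, 20, 60, 120):
    theta = math.radians(degrees)
    print(f"{degrees:>4} deg: relative error = {linearisation_error(theta):.4f}")
```

For a few degrees the linear model is an excellent approximation, but the error grows quickly with the angle and the approximation becomes badly wrong for large swings, which is the regime where a linear analysis gives false confidence.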

There are other challenges in data science, such as how to capture data, how to correct for missing and incorrect data, and how to process natural language. However, despite these being significant challenges, I feel they are minor in comparison.

// … one must be very careful when using such a ‘black box’ because they do not provide an understanding of why things happen or if they may change in the future. //

The most understated truth in Data Science. Indeed, designing one's own algorithm from the raw building blocks of uncertain reasoning, that is, probability theory, logic and information theory, will not only give one a better understanding of the domain but will in general also result in more accurate models.