Article of the week 23 – 2023

How data can be deceiving

Our modern life is shaped by numbers, graphics and data of all kinds. Whether in the media, in business, in health care or in sport, data is used to evaluate situations and to support decisions. Today, a sports watch provides the hobby athlete with easy-to-read data such as running speed, pulse, recovery and performance zones. Until a few years ago, this data was only available to professional athletes; today it is also meant to help amateurs plan their training as efficiently and successfully as possible. The same approach found its way into business much earlier. A graph showing trends and percentage distributions belongs in almost every PowerPoint presentation, to enable a decision based on data and facts. Because data doesn’t lie, does it? Well, it depends. In the following, we use small examples to show how the way data is prepared can mislead and thus make decision-making harder instead of easier.

The visualisation of data is one of the most important tools analysts have for presenting large and complex data sets in a customer-friendly way. However, some stumbling blocks lurk here which, knowingly or unknowingly, quite literally paint a wrong picture. One of them is the technical representation, such as the choice of scaling. A narrow axis section covering a single day makes the price of a share appear very volatile, whereas a long-term view over weeks, months and years might show a more stable and fairly continuous curve. Another frequent sight in processed data is the use of different scales on the Y-axis. This becomes particularly problematic when two data sets are plotted in one graph and the false impression of a correlation is created. Such a representation should be viewed even more critically if the correlation is meant to prove causality. Because one thing is clear: correlation alone does not prove causality. Nowadays it is very easy to find a correlation somewhere in the mass of data available to us. As humans, we are very good at recognising patterns and similarities around us. We intuitively rely on visual effects, so we quickly spot a correlation and draw conclusions from it. But is it coincidence, a true correlation or even causality that the number of doctorates awarded in civil engineering in the US between 2000 and 2009 has a 95.86% correlation with the per capita consumption of mozzarella in the same period? Despite this impressive correlation between the two data sets, there is, at least so far, no proven causal relationship (for more examples see "Spurious Correlations" by Tyler Vigen). Common sense is the name of the game when analysing data and turning it into suitable graphs, but also when reading such graphs as a user. The chance that a coincidence, or an unfortunate choice of scaling in Excel, fools us into believing in causality is very high.
The same applies when a graph carries labels for the percentage values it contains. In our visually primed perception, the bars win out over the printed numbers and can thus convey a false impression.
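
How easily two unrelated data sets can end up with a high correlation can be tried out in a few lines of Python. The following is only a minimal sketch: the two series are invented for illustration (they are not the doctorate or mozzarella figures), and the helper names pearson and yearly_changes are our own. Both series simply rise over ten years, and that shared upward trend alone is enough to produce a strong correlation, even though their year-to-year movements run in opposite directions.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def yearly_changes(values):
    """Difference between each value and its predecessor."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


# Invented numbers for ten consecutive years (assumption, not real statistics).
series_a = [10, 12, 11, 14, 13, 16, 15, 18, 17, 20]
series_b = [5, 5.5, 7, 6.5, 8, 7.5, 9, 8.5, 10, 9.5]

# Correlation of the raw values: dominated by the shared upward trend.
level_corr = pearson(series_a, series_b)

# Correlation of the year-to-year changes: the trend is removed.
change_corr = pearson(yearly_changes(series_a), yearly_changes(series_b))

print(f"correlation of the raw series:     {level_corr:.2f}")
print(f"correlation of the yearly changes: {change_corr:.2f}")
```

With these invented numbers, the raw series correlate clearly positively, while the yearly changes correlate clearly negatively. Comparing changes instead of levels is one simple plausibility check among several; it does not prove or disprove causality either.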

A supposed solution to the problem of data representation and selection is offered by the sheer amount of data collected and the use of learning algorithms. Big Data and Machine Learning are terms that have been creeping into our daily lives for several years now. Complicated algorithms are fed with large amounts of data, trained on it and then draw valuable insights from it. Critically questioning the statements of such algorithms can be difficult, especially if you are not a data scientist. Yet even with algorithms and machine learning, the quality of the results depends heavily on us humans: the higher the quality of the data we feed in, the better the results. For example, there was an attempt to have an algorithm decide, based on pictures, whether or not people had committed a crime. Passport photos of offenders and random images from the internet were used as sources. This questionable selection of sources meant that the algorithm ended up assigning the majority of all serious-looking people to the category of possible offenders. The reason was the serious facial expression in the offenders' passport photos, which contrasted with the smiling people from social media and ultimately became the algorithm's decision criterion. No matter how sophisticated the algorithm is, if the wrong data is fed in, the result remains: garbage in, garbage out.
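
The effect can be imitated with a deliberately tiny toy model. The sketch below is not the algorithm from the attempt described above; the feature names (serious_expression, wears_glasses), the data and the function train_stump are all invented for illustration. The "model" simply picks the one yes/no feature that best matches the labels in the training data. Because every offender photo in the training set is a serious passport photo and every other photo shows a smiling person, it learns the facial expression and nothing else.

```python
def train_stump(samples, labels):
    """Pick the single boolean feature that best reproduces the labels."""
    best_feature, best_accuracy = None, -1.0
    for feature in samples[0]:
        hits = sum(sample[feature] == label
                   for sample, label in zip(samples, labels))
        accuracy = hits / len(samples)
        if accuracy > best_accuracy:
            best_feature, best_accuracy = feature, accuracy
    return best_feature, best_accuracy


# Invented training data: offenders come from passport photos (serious),
# everyone else from social media (smiling). Label True means "offender".
training = [
    ({"serious_expression": True, "wears_glasses": False}, True),
    ({"serious_expression": True, "wears_glasses": True}, True),
    ({"serious_expression": True, "wears_glasses": False}, True),
    ({"serious_expression": False, "wears_glasses": True}, False),
    ({"serious_expression": False, "wears_glasses": False}, False),
    ({"serious_expression": False, "wears_glasses": True}, False),
]
samples = [sample for sample, _ in training]
labels = [label for _, label in training]

feature, accuracy = train_stump(samples, labels)

# Invented new cases: passport photos of people who are not offenders.
new_people = [
    {"serious_expression": True, "wears_glasses": True},
    {"serious_expression": True, "wears_glasses": False},
    {"serious_expression": True, "wears_glasses": True},
]
flagged = sum(person[feature] for person in new_people)

print(f"learned feature: {feature} (training accuracy {accuracy:.0%})")
print(f"wrongly flagged: {flagged} of {len(new_people)}")
```

The training accuracy looks perfect, yet every serious-looking person in the new data is flagged. The fault lies in how the training photos were selected, which is why a look at the data matters more than the sophistication of the model.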

Data offers incredible added value and, at the same time, a supposed certainty that does not exist. Even, or perhaps especially, when the graphics and diagrams shown seem so convincing, it is essential to look at them very closely and to question them critically. In the area of machine learning and its complex learning algorithms, critical questioning is much more difficult, because it is not always obvious how the computer makes its decisions. This makes it all the more important to scrutinise the data used to train the algorithm. This step deserves special attention whenever far-reaching decisions depend on the results.

Here is a selection of questions that can be asked by anyone who is confronted with evaluations:

Is the selected data actually suitable for answering the question?
How is the scaling of the graphs chosen? Does the Y-axis start at 0 or does it show only a certain section, and if so, why?
Does a visually impressive graph merely disguise a deficient data base?
How accurate are the labels on the graphs, and do they match the bars and slopes shown?
Is an attempt made to prove causality and if so, is it rationally comprehensible?
If complex systems such as machine learning are used for the evaluation: how carefully was the training data selected? Can the author explain how the algorithm works and how the computer makes decisions?
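
The question about the Y-axis can be made tangible with a little arithmetic. The sketch below uses invented values and a helper of our own, drawn_height_ratio, to compare how tall two bars are actually drawn when the axis starts at 0 and when it starts just below the smaller value.

```python
def drawn_height_ratio(small, large, axis_start=0.0):
    """Ratio of the drawn bar heights when the Y-axis begins at axis_start."""
    if axis_start >= small:
        raise ValueError("axis_start must lie below the smaller value")
    return (large - axis_start) / (small - axis_start)


# Invented values: a rise from 50 to 52, i.e. an increase of 4 percent.
last_year, this_year = 50.0, 52.0

# Axis starts at 0: the second bar is drawn only slightly taller.
honest = drawn_height_ratio(last_year, this_year)

# Axis starts at 49: the second bar is drawn three times as tall.
truncated = drawn_height_ratio(last_year, this_year, axis_start=49.0)

print(f"axis from 0:  second bar is {honest:.2f} times as tall")
print(f"axis from 49: second bar is {truncated:.2f} times as tall")
```

The numbers behind both charts are identical; only the chosen axis section turns a 4 percent increase into a bar that looks three times as tall.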

Dominikus Leicht – Consultant

junokai