Learn the various obstacles you will face when analyzing data. Biases, focus points, and pattern conflation are among the pitfalls you should learn about.
The mere fact that we already had four articles about data analysis before this should remind us that data analysis is a serious business. The decisions made using it often have far-reaching consequences.
In this article, we list common pitfalls to watch for when doing data analysis. These pitfalls negatively affect our decision-making. Because it is now possible to conduct sophisticated analysis without learning the inner details of the specific algorithms and methods, we need to focus more on applying them correctly.
Let us now start learning how to avoid these pitfalls!
Confirmation bias is one of the most common human biases, and everyone falls for it from time to time. Fs.blog defines confirmation bias as our tendency to cherry-pick information that confirms our existing beliefs or ideas. While it’s a human trait, confirmation bias clouds our judgment and distorts our perception of reality.
Towards Data Science lists three ways confirmation bias sneaks into our process of data analysis:
Confirmation bias not only sneaks its way into the process of applying data analysis, but it can also influence the entire process:
How can we keep confirmation bias from tainting our data analysis and decision-making? Here are some tips, which are equally applicable to everything you do for your business:
Taking these into account is a good start in combating confirmation bias in our data analysis.
You have probably read the story of WW2 bombers that is often circulated on social media as an example of thinking outside the box. I include my own version here:
During WW2, the US military was losing a lot of bombers to enemy fire. A statistician was asked to look at the data and recommend where to add armor to protect the planes from German anti-aircraft guns. The military wanted to add armor to the areas that had sustained the heaviest gunfire. The statistician thought otherwise: the armor should go on the engines. He argued that the returning planes showed little to no damage on their engines precisely because the planes that were hit in the engines had been shot down, and so were not included in the initial data.
This is a good example of survivorship bias, which is very prevalent. The Data School presents a few ways survivorship bias manifests itself in data analysis:
The Decision Lab lists two ways of preventing survivorship bias:
Combating survivorship bias essentially means asking what is missing from the picture. Once you have included the missing data in your analysis, you have minimized or eliminated that bias.
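To make the idea concrete, here is a minimal sketch of how leaving out the "failures" shifts a simple average. The product names, returns, and the `still_on_sale` flag are all made up for illustration; they do not come from any real dataset.

```python
from statistics import mean

# Hypothetical records: (product, monthly_return, still_on_sale).
# The numbers are invented purely to illustrate the effect.
products = [
    ("A", 0.12, True),
    ("B", 0.08, True),
    ("C", 0.15, True),
    ("D", -0.30, False),  # discontinued, so missing from a "current products" report
    ("E", -0.45, False),  # discontinued
    ("F", -0.10, False),  # discontinued
]


def average_return(records, survivors_only):
    """Average return, optionally ignoring products that did not survive."""
    values = [ret for _, ret, alive in records if alive or not survivors_only]
    return mean(values)


survivor_avg = average_return(products, survivors_only=True)
full_avg = average_return(products, survivors_only=False)

print(f"Survivors only: {survivor_avg:+.3f}")
print(f"All products:   {full_avg:+.3f}")
```

Looking only at the survivors suggests every launch pays off, while the full record shows an average loss. The fix is not a cleverer formula; it is bringing the missing rows back into the data.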
In the process of data analysis, it is easy to get lost in the numbers and data. This gets even worse when you “catch” the so-called metric fixation. Metric fixation is defined by Jerry Z. Muller in The Narrative:
“The key components of metric fixation are the belief that it is possible–and desirable–to replace professional judgment (acquired through personal experience and talent) with numerical indicators of comparative performance based upon standardized data (metrics); and that the best way to motivate people within these organizations is by attaching rewards and penalties to their measured performance.”
As business data analysis is designed to summarize data into a set of metrics for interpretation, we need to make sure that we do not “catch” metric fixation. Stacey Barr lists three beliefs that drive the idea of metric fixation:
When we construct our problem statement, we include a set of questions that may require us to calculate a set of metrics to help us make decisions. These metrics, however, are not enough to give the whole picture. There are still many details that the metrics don’t capture, and those details are important in shifting decision-making away from fixating on metrics and toward a holistic approach.
To keep yourself from fixating on the data and metrics, you should look at the bigger picture. Remember: data analysis and metrics are tools for decision-making, not substitutes for it. As we have seen in the example of survivorship bias, a good look at the context leads decision-makers to the right actions.
One trait of humans is to find patterns everywhere. Often, finding patterns is useful. Consider this example used by Psych Central:
This is a simple example of why pattern-finding is important to humans.
However, we often find patterns where there are none. Seeing dogs and cats in the shapes of the clouds? Check. (I am guilty of this one; it’s fun!) You get lucky because a constellation is up tonight and you are Gemini? Check. You hear Satanic verses when you play a famous song in reverse? Check, check, and check.
And the machines we use today can also find patterns in the sea of data. Machines are now improving their pattern-recognition capabilities. Some believe that AI has already surpassed us in pattern-finding.
However, just like humans, machines can also find patterns that do not exist in reality. It is also fairly common for people to take advantage of this to justify their actions. As we have mentioned in the section on confirmation bias, we can actually make the data yield a favorable result. Similarly, an unrelated set of data can be “dredged” to make a pattern appear at will. This is called data dredging. You can apply several types of analyses and algorithms until a pattern appears. That pattern, however, does not exist in reality.
Data dredging is often committed due to a lack of awareness on the part of the analyst. The solution is to specify the methods to use at the time the problem is defined, rather than after seeing the results. For certain topics, there is a prescribed set of best practices and protocols for processing data, which makes the choice of analysis method easier.
While previous data analyses conducted to tackle the same problem or a related problem can help outline the analysis method to apply, one should ensure that they did not also suffer from data dredging.
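Data dredging is easy to reproduce with pure noise. The sketch below is only an illustration under assumed settings: it generates one random "target" series and 200 unrelated random "candidate" series (the counts, the seed, and the `pearson` helper are all choices made for this example), then picks the candidate that happens to correlate best with the target.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


rng = random.Random(42)  # fixed seed so the run is repeatable
n_points, n_candidates = 10, 200

# Every series is independent random noise: no real relationship exists.
target = [rng.gauss(0, 1) for _ in range(n_points)]
candidates = [
    [rng.gauss(0, 1) for _ in range(n_points)] for _ in range(n_candidates)
]

correlations = [pearson(target, series) for series in candidates]
best = max(correlations, key=abs)
typical = sorted(abs(r) for r in correlations)[n_candidates // 2]

print(f"Typical |r| among candidates: {typical:.2f}")
print(f"Best |r| after trying all {n_candidates}: {abs(best):.2f}")
```

With so few data points and so many attempts, the "best" candidate tends to look strongly related to the target even though every series is noise. Reporting only that winner, without mentioning the other attempts, is exactly the trap described above.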
One important type of pattern is correlation. Correlation describes the tendency of two quantities to vary together. For example, we can say that quantities A and B are correlated when A tends to increase (or decrease) as B increases.
But does having a correlation mean that the two quantities or events are linked, that is, that one event causes the other? Not necessarily. It is easy to find correlated quantities or events that have zero relation to each other in reality. One example is the picture at the beginning of this section. Does the marriage rate in Kentucky affect the number of people drowning from fishing boats? Nope. But the correlation exists. Another question: is Facebook driving the Greek debt crisis? What do you think?
These are a few examples of spurious correlations: two events that are correlated, statistically speaking, but not related in terms of causation. Causation means that one event is caused by another. Spurious correlations are another side effect of humans’ ability to find patterns.
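A common way to put a number on correlation is the Pearson correlation coefficient, which ranges from -1 (the quantities move in opposite directions) to +1 (they move together). The sketch below is a minimal implementation for illustration; the function name `pearson` and the sample figures for ad spend and sales are assumptions made for this example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical weekly figures, invented for illustration.
ad_spend = [1, 2, 3, 4, 5]
sales = [2, 4, 5, 4, 5]
stock_left = [10, 8, 6, 4, 2]

r_sales = pearson(ad_spend, sales)       # positive: they tend to rise together
r_stock = pearson(ad_spend, stock_left)  # negative: one falls as the other rises

print(f"ad spend vs sales: r = {r_sales:.2f}")
print(f"ad spend vs stock: r = {r_stock:.2f}")
```

A value near zero would mean there is no linear tendency for the two quantities to move together. Note that the coefficient only measures how the numbers move, not why they move.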
While spurious correlations can be easy to spot by looking at the context of the two correlated quantities, it may be difficult to filter out spurious correlations if they seem to make sense. How can we check for the existence of causation? Fortunately, as Amplitude explains, causal relationships don’t happen by accident. There are two ways of testing for causation, in the context of business data analysis: hypothesis testing and A/B testing. We already talked about hypothesis testing in one of our previous articles. You can check it out here.
Let us talk about A/B testing. A/B testing is a way of testing changes to your e-Commerce site, be it your landing page, your product page, or even your checkout page. According to Hubspot, you need to create two different versions of one piece of content with changes to a single variable. Then, you'll show these two versions to two similarly sized audiences and analyze which one performed better over a specific period of time. A/B testing helps marketers observe how one version of a piece of marketing content performs alongside another.
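Once an A/B test has run, you still need to decide whether the difference between the two versions is larger than what chance alone would produce. One common approach (an assumption here, not something the sources above prescribe) is a two-proportion z-test on the conversion rates. The visitor and conversion counts below are hypothetical.

```python
from math import erf, sqrt


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the assumption that both versions convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value


# Hypothetical results: version A and version B of a landing page.
z, p_value = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=260, n_b=5000)

print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value (a threshold of 0.05 is a widespread convention) suggests the difference is unlikely to be a fluke. Because the two versions differ in only one variable and visitors are split between them, a significant result supports a causal reading in a way that a plain correlation cannot.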
These two methods can be used together to analyze the relationship between two variables better.
In the process of data preparation, you may notice a set of outlier values. Outliers are data points that stray from the existing patterns in the data. Outliers can affect the results of data analysis if not cleaned out.
Identifying outliers is not difficult. Statistics by Jim lists five ways of identifying outliers. The easiest ones rely on simple plots:
Scatter plots consist of points representing different data points in an x-y plane. Scatter plots are used to visualize the correlation between two variables. Outliers manifest as a dot or two outside the pattern.
Histograms plot the distribution of values of a certain variable in the form of bars, where each bar counts the values that fall within a certain range. Outliers manifest as small, isolated bars at either end of the range.
Box-and-whisker plots visualize the range of values of a certain variable as a vertical box bounded by thin vertical lines, each ending in a short horizontal line. Outliers manifest in the plot as a point or asterisk outside the range.
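The points that a box plot draws outside the whiskers can also be found without plotting. A widely used convention, assumed here since the article does not spell out a rule, is to flag values more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile. The function name `iqr_outliers` and the sample data are made up for this sketch.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # first quartile, median, third quartile
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical order values with one suspicious entry.
order_values = [10, 12, 11, 13, 12, 11, 10, 95]
flagged = iqr_outliers(order_values)

print(flagged)
```

Different tools compute quartiles slightly differently, so the exact cutoffs may vary a little between this sketch and your plotting library.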
One sophisticated method for identifying outliers is the z-score. The z-score quantifies how far a value is from the mean of its set. To understand this, you could imagine your dataset having a so-called normal distribution, with most values close to the mean.
To describe the spread of the distribution of your data, you need to calculate the standard deviation. Because datasets vary in how wide the distribution of values is, the standard deviation is used as a “ruler” to measure how far a certain value is from the mean. The z-score is essentially a measurement of how far a value is from the mean in terms of standard deviations. A positive z-score means that the value is higher than the mean, while a negative z-score means that the value is lower than the mean. The higher the absolute value of a z-score, the farther the value is from the mean. Outliers have z-scores with a high absolute value.
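In code, the z-score of a value is its distance from the mean divided by the standard deviation. The sketch below flags values whose absolute z-score exceeds a threshold. The threshold of 3 is a common rule of thumb rather than a fixed rule, and the function name `zscore_outliers` and the daily order counts are assumptions for this example.

```python
from statistics import mean, stdev


def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    return [v for v in values if abs((v - mu) / sigma) > threshold]


# Hypothetical daily order counts; one day looks unusual.
daily_orders = [
    48, 52, 50, 47, 53, 49, 51, 50, 46, 54,
    50, 52, 48, 49, 51, 50, 47, 53, 50, 120,
]
flagged_days = zscore_outliers(daily_orders)

print(flagged_days)
```

One caveat: with very small datasets, a single extreme value inflates the standard deviation so much that its own z-score can never reach 3, so this method works best when you have a reasonable number of data points.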
However, the big question regarding outliers is whether to clean them or not. Why? Depending on the question at hand, you may actually need to zoom in on those outliers to understand the problem and solve it. At this point, you will need the power of diagnostic analysis.
To summarize, we discussed six common pitfalls to watch for:
These are just some of the pitfalls in data analysis to watch for. If they sound like statistics to you, it’s because they are often discussed in statistics.
I hope you learned a lot from our five-article series on data analysis. We covered a lot of ground, and by now you should be able to make good calls when analyzing data. To further enhance your data analysis skills, I recommend that you check our upcoming app Lido. As it automates data gathering and analysis, you won’t fall victim to the pitfalls listed here. Instead, you go straight to the metrics and make the right decisions for your business. Get started for free.
Confirmation Bias And the Power of Disconfirming Evidence
Examples and Observations of a Confirmation Bias
Confirmation Bias - Definition & Examples
Confirmation Bias: How It Affects Your Organization | HBS Online
5 Types of Bias in Data & Analytics
Business analytics is ridden with confirmation bias | by Keith McNulty
Previous Data Analytics and the Confirmation Bias
Survivorship Bias: The Tale of Forgotten Failures
How Survivorship Bias Affects your Analysis
Missing data can be the best data | by Paul May
Missing data? Survive Survivorship Bias with Qlik
Survivorship bias - Biases & Heuristics
How ‘survivorship bias’ can cause you to make mistakes
What Every Founder Needs to Know About Survivorship Bias
Focusing Too Much on Data Is Bad For Performance
Is Too Much Focus a Problem? - HBS Working Knowledge
Are CEO's Missing out on Big Data's Big Picture?
Fixing Metric Fixation: A Review of The Tyranny of Metrics
Data-dredging bias - Catalog of Bias
Data mining or data dredging?
Data Dredging - Causition When Snooping After Data Patterns
Humans Are the World's Best Pattern-Recognition Machines, But for How Long?
Interpreting Correlation Coefficients
Why correlation does not imply causation? | by Seema Singh
Correlation is not causation | Mathematics
Correlation vs Causation | Introduction to Statistics
Correlation vs Causation: Understand the Difference for Your Product
Correlation vs Causation: What's the Difference? | Astute
Spurious correlations: Margarine linked to divorce?
Clearing up confusion between correlation and causation
How to Do A/B Testing: A Checklist You'll Want to Bookmark
5 Ways to Find Outliers in Your Data
What are outliers and how to treat them in Data Analytics? - Aquarela
What are data outliers and how can they eliminate business latency?
Outlier Analysis: Definition, Techniques, How-To, and More
What is Outlier Analysis and How Can It Improve Analysis?