Capacilon | Pitfalls of Data Analysis (Part 2)

Pitfalls of Data Analysis (Part 2)

2018-10-29
By M. Mukhamadieva

In the previous article, I covered the first group of mistakes, collectively called “data mining” (see the previous blog post). Today I am going to talk about three other common problems that arise during the data analysis process: correlation vs. causation, endogeneity, and data quality.

Probably the biggest mistake that researchers make, or simply overlook, when interpreting model results is inferring causation from correlation. These two concepts are quite different. In most cases, summary statistics and regressions only point out that two variables are related (correlated): they move together, either in the same or in opposite directions. Causation is a stronger concept, which implies that a change in one variable causes the other variable to change as well. In some cases, logical reasoning can be enough to make causation plausible. For example, reducing calorie intake and increasing jogging intensity are quite plausible causes of weight loss.

However, the relation is often not so obvious. The models may show us that certain events are related, but it is still too early to jump to any conclusions. The good thing is that there are statistical tests that can help indicate which direction the relationship goes. But one should not forget to ask the question “Does my model make sense at all?” An extreme example (really found in data) is the following: the divorce rate in Maine correlates with per capita consumption of margarine, with a correlation of more than 99%! (More examples can be found on TylerVigen.com.)

Next, in order to trust one’s results, one should investigate the potential sources of endogeneity. Endogeneity arises when some variables that were assumed to be exogenous (given from the outside) are in fact correlated with the error term of the model. It can arise for several reasons, and it leaves the model misspecified and its estimates biased. So before going deeper into your analysis, you should first ask yourself the following questions.

1) Omitted Variable Bias
Are all the relevant variables included? Is there some factor that can influence both the dependent and the independent variable? This is called confounding, or omitted variable bias. For example, we observe that both ice cream sales and shark attacks reach their highest levels in the summer months. What is missing from this specification is the seasonality effect, or the temperature, since people tend to go swimming and eat ice cream during hot months. Therefore, it’s not ice cream consumption that causes shark attacks, nor the other way round.
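The ice cream example can be reproduced with a small simulation. The sketch below uses synthetic data only: the variable names, coefficients and noise levels are assumptions chosen for illustration, not real measurements. Temperature drives both series, and neither series affects the other. The raw correlation is compared with the partial correlation, i.e. the correlation that remains after controlling for temperature.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def partial_corr(xs, ys, zs):
    """Correlation between xs and ys after controlling for zs."""
    rxy, rxz, ryz = pearson(xs, ys), pearson(xs, zs), pearson(ys, zs)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000

# Synthetic data (assumed values): temperature is the common cause.
temperature = [rng.gauss(20, 8) for _ in range(n)]
ice_cream = [2.0 * t + rng.gauss(0, 8) for t in temperature]
shark_attacks = [0.5 * t + rng.gauss(0, 2) for t in temperature]

raw = pearson(ice_cream, shark_attacks)
controlled = partial_corr(ice_cream, shark_attacks, temperature)

print(f"raw correlation:                 {raw:.2f}")
print(f"controlling for temperature:     {controlled:.2f}")
```

With these assumed parameters the raw correlation is strong (around 0.8 by construction), while the partial correlation should be close to zero: once the omitted variable is included, the apparent link disappears.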

2) Systematic Errors
How accurate are my variables? Could it be that they are measured with some error? Unfortunately, data does not always arrive clean. Sometimes it is measured with a systematic error, and with such errors the estimates can be biased in a way that collecting more data does not fix.
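A minimal sketch of how measurement error distorts a regression slope, again on synthetic data with assumed parameters. Two cases are shown: a systematic scale error (a hypothetical instrument that reads 20% too high) and random noise added to the explanatory variable (the classical errors-in-variables setting, which is an assumption added here for comparison).

```python
import random


def ols_slope(xs, ys):
    """Slope of a simple least-squares regression of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


rng = random.Random(1)  # fixed seed so the run is repeatable
n = 5000
true_slope = 3.0

# Synthetic "true" data: y depends on x with slope 3.
x_true = [rng.gauss(0, 1) for _ in range(n)]
y = [true_slope * x + rng.gauss(0, 1) for x in x_true]

# Case 1: systematic error, every reading is 20% too high.
x_scaled = [1.2 * x for x in x_true]
# Case 2: random noise with the same variance as x itself.
x_noisy = [x + rng.gauss(0, 1) for x in x_true]

slope_clean = ols_slope(x_true, y)
slope_scaled = ols_slope(x_scaled, y)
slope_noisy = ols_slope(x_noisy, y)

print(f"clean x:          {slope_clean:.2f}")
print(f"x read 20% high:  {slope_scaled:.2f}")
print(f"x with noise:     {slope_noisy:.2f}")
```

Under these assumptions the clean slope should come out close to 3.0, the scaled case close to 3.0 / 1.2 = 2.5, and the noisy case close to 1.5 (half the true value). In neither distorted case does a larger sample bring the estimate back to the true slope.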

3) Directional Effect
Are you sure that one of the variables HAS to cause the other? Can it be that they both affect each other? For example, does the crime rate decrease if there are more police officers in the district? Or are more police officers assigned to the district because of the high crime rate?

Ideally, once all three alternative explanations are ruled out, one can conclude that there is indeed a causal relation between the two variables. If they cannot be ruled out, the so-called “instrumental variable approach”, which probably deserves a blog post of its own, can be used.
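To give a flavour of the instrumental variable idea, here is a rough sketch on synthetic data. All names and coefficients are assumptions for illustration: `u` is an unobserved confounder, `z` is a hypothetical instrument that influences `x` but affects `y` only through `x`, and the true effect of `x` on `y` is set to 2. With a single instrument, the estimate reduces to the ratio cov(z, y) / cov(z, x).

```python
import random


def cov(xs, ys):
    """Sample covariance of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)


rng = random.Random(2)  # fixed seed so the run is repeatable
n = 20000
true_effect = 2.0

# Synthetic data (assumed structure):
# u confounds x and y and is never observed by the analyst.
u = [rng.gauss(0, 1) for _ in range(n)]
# z is the instrument: it shifts x but has no direct effect on y.
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + ui + rng.gauss(0, 1) for zi, ui in zip(z, u)]
y = [true_effect * xi + 3.0 * ui + rng.gauss(0, 1) for xi, ui in zip(x, u)]

# Naive regression of y on x picks up the confounder as well.
ols_estimate = cov(x, y) / cov(x, x)
# Instrumental variable estimate with a single instrument.
iv_estimate = cov(z, y) / cov(z, x)

print(f"true effect:   {true_effect:.2f}")
print(f"OLS estimate:  {ols_estimate:.2f}")
print(f"IV estimate:   {iv_estimate:.2f}")
```

In this setup the naive estimate is pulled towards 3 by the confounder, while the instrumental variable estimate should land near the true value of 2. The result is only as good as the instrument: if `z` affected `y` directly, or were correlated with `u`, the ratio would be biased too.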

Last but not least, data quality raises some issues of its own:
- Very often the data that would be useful is not available;
- Mistakes and errors in the data – the whole field of data cleansing;
- Data protection and sensitivity – is it lawful / ethical to use it?

In spite of careful analysis and taking the entire process step by step, we must be prepared for the most striking of all outcomes – some questions are simply never meant to be answered!