2018-08-22
By M. Mukhamadieva
"Essentially, all models are wrong, but some are useful" is a remarkably insightful statement from George Box (Wikipedia article). Data analytics has been developing at an exponential pace in recent years, and there are several reasons for that. A large part of it may be attributed to technological advances, which allow sophisticated models to perform calculations on large amounts of data. Statistical analysis is now applied in nearly every industry and field of science. While data analytics can provide evidence for a fact or, more precisely, for a dependency between variables, sometimes people seem to blindly believe whatever the models tell them, as long as the mystical 't-statistic' is greater than 2!
While truly big data requires more sophisticated machine learning techniques, most daily tasks and analyses can be done using well-established traditional models. But even simple models can behave like black boxes, and they should be treated accordingly. So what could potentially go wrong or spoil the results of data analysis? Mistakes can arise at every stage of the process, from data collection through modelling to the interpretation of results. Here are some common groups of problems:
1) Data Mining
2) Correlation vs Causation
3) Endogeneity
4) Data quality
The first point, ‘Data Mining’, which this blog covers in detail, relates to how we deal with the data and what we choose to look at. Prudence is necessary in identifying and dealing with various flaws in the data, from fat-finger errors to missing data. And one should interpret the results with care instead of relying solely on statistical significance, or in other words, the ever so friendly and forever smiling 't-statistic'.
Data mining per se is not a bad thing when applied appropriately. However, the desire to perfectly explain the nature of a phenomenon may tempt some analysts into data snooping instead, which can lead to an excessive number of variables or unnecessarily sophisticated models. Such a model may, in fact, suit your sample data. But is this the goal? Well, generally not! The real aim is to uncover patterns in order to be able to apply them to other datasets: the ones that will arise in the future.
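To see how data snooping manufactures 'significance', here is a minimal sketch in pure Python. It generates an outcome that is nothing but noise, then tests 200 equally meaningless candidate predictors against it. The variable names and counts are illustrative assumptions, not anything from a real dataset; the |t| > 2 cutoff is converted into the equivalent cutoff on the correlation coefficient.

```python
import random
import math

random.seed(42)

n_obs = 30          # observations per variable
n_predictors = 200  # candidate predictors, all pure noise

def pearson_r(x, y):
    """Sample Pearson correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# The "outcome" we want to explain -- just random noise.
y = [random.gauss(0, 1) for _ in range(n_obs)]

# The |t| > 2 rule, rewritten as a cutoff on the correlation r:
# t = r * sqrt(n - 2) / sqrt(1 - r^2)  =>  r = t / sqrt(n - 2 + t^2)
t_crit = 2.0
r_crit = t_crit / math.sqrt(n_obs - 2 + t_crit ** 2)

significant = 0
for _ in range(n_predictors):
    x = [random.gauss(0, 1) for _ in range(n_obs)]  # noise predictor
    if abs(pearson_r(x, y)) > r_crit:
        significant += 1

print(f"{significant} of {n_predictors} noise variables look 'significant'")
```

By construction none of these predictors has anything to do with the outcome, yet with a 5%-style cutoff you should expect roughly 5% of them to clear the bar by luck alone. Screen enough variables and the smiling t-statistic will always find you something.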
No model is perfect, and one has to keep in mind that a model simply cannot be expected to be accurate in 100% of cases. The question is, where is the trade-off between complexity and an acceptable error margin? Statisticians, like practitioners in many other fields, often follow the KISS principle: Keep It Sensibly Simple. Generally speaking, the more sophisticated the model and the more closely it fits the sample data, the harder it becomes for the model to predict out-of-sample (i.e. future) observations. Indeed, such over-fitted models are often more sensitive to mistakes and inconsistencies in the sample data, such as outliers or a lack of observations. These hidden errors or deviations have the capacity to taint, if not spoil, the outcomes of the analysis.
There is a (potentially incorrect) belief that just throwing more and more variables into the analysis improves the quality of the model. This is probably as far from reality as you could get. For example, the age of the car, its mileage and its repair records could be valid predictors of the resale price, but is it really important to know the shoe size of the driver? In fact, if you do include the shoe size and play around with the parameters of the model long enough, you might just find the perfect shoe size to increase the resale price of your car. Well then, it's time to go shopping and/or get a new foot!
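Why does the extra variable look like it helps? Because in-sample fit, measured by R-squared, can never get worse when you add a regressor: least squares can always ignore it, and usually exploits its chance correlation with the noise instead. The sketch below illustrates this with made-up car data (the `age` and `shoe` variables are hypothetical), fitting ordinary least squares via the normal equations in pure Python.

```python
import random

random.seed(1)

n = 30
# A genuine predictor (car age) and an irrelevant one (driver's shoe size).
age = [random.uniform(1, 15) for _ in range(n)]
shoe = [random.uniform(36, 47) for _ in range(n)]
# Resale price depends only on age, plus noise.
price = [20000 - 1000 * a + random.gauss(0, 1500) for a in age]

def fit_r2(columns, y):
    """OLS via normal equations (Gaussian elimination); returns R^2."""
    X = [[1.0] + [c[i] for c in columns] for i in range(len(y))]
    k = len(X[0])
    # Build X'X and X'y.
    A = [[sum(X[i][p] * X[i][q] for i in range(len(y))) for q in range(k)]
         for p in range(k)]
    b = [sum(X[i][p] * y[i] for i in range(len(y))) for p in range(k)]
    # Solve A beta = b with partial pivoting and back-substitution.
    for p in range(k):
        piv = max(range(p, k), key=lambda r: abs(A[r][p]))
        A[p], A[piv] = A[piv], A[p]
        b[p], b[piv] = b[piv], b[p]
        for r in range(p + 1, k):
            f = A[r][p] / A[p][p]
            for q in range(p, k):
                A[r][q] -= f * A[p][q]
            b[r] -= f * b[p]
    beta = [0.0] * k
    for p in reversed(range(k)):
        beta[p] = (b[p] - sum(A[p][q] * beta[q]
                              for q in range(p + 1, k))) / A[p][p]
    fitted = [sum(beta[q] * X[i][q] for q in range(k)) for i in range(len(y))]
    ybar = sum(y) / len(y)
    ss_res = sum((y[i] - fitted[i]) ** 2 for i in range(len(y)))
    ss_tot = sum((y[i] - ybar) ** 2 for i in range(len(y)))
    return 1 - ss_res / ss_tot

r2_age = fit_r2([age], price)
r2_both = fit_r2([age, shoe], price)
print(f"R^2, age only      : {r2_age:.4f}")
print(f"R^2, age + shoe size: {r2_both:.4f}")  # never lower in-sample
```

The shoe-size model always posts an R-squared at least as high as the honest one, which is exactly why in-sample fit alone is such a seductive and misleading yardstick. Out of sample, of course, the perfect shoe size stops working.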