What Is Overfitting?

Overfitting is a modeling error in statistics that occurs when a function is too closely aligned to a limited set of data points. As a result, the model is useful in reference only to its initial data set, and not to any other data sets.

Overfitting generally takes the form of building an overly complex model to explain idiosyncrasies in the data under study. In reality, the data being studied often contains some degree of error or random noise. Thus, attempting to make the model conform too closely to slightly inaccurate data can infect the model with substantial errors and reduce its predictive power.

Key Takeaways

  • Overfitting is an error that occurs in data modeling as a result of a particular function aligning too closely to a minimal set of data points.
  • Financial professionals are at risk of overfitting a model based on limited data and ending up with results that are flawed.
  • When a model has been compromised by overfitting, the model may lose its value as a predictive tool for investing.
  • A data model can also be underfitted, meaning it is too simple, with too few features or too little data to be effective.
  • Overfitting is a more frequent problem than underfitting and typically occurs as a result of trying to avoid underfitting.

Understanding Overfitting

A common source of overfitting is using computer algorithms to search extensive databases of historical market data for patterns. Given enough searching, it is often possible to develop elaborate models that appear to predict stock market returns with close accuracy.

However, when applied to data outside of the sample, such models often prove to be nothing more than a fit to what were in reality just chance occurrences. In all cases, it is important to test a model against data that is outside of the sample used to develop it.
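The effect of searching for patterns in historical data can be illustrated with a small simulation. The sketch below is an illustration only, under stated assumptions: the "returns" and the candidate "signals" are pure random noise generated on the spot, and the function names and sample sizes are hypothetical. It keeps whichever signal correlates best with returns in the first half of the data, then checks that same signal on the second half.

```python
import random


def correlation(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def best_chance_signal(n_signals=500, n_in=60, n_out=60, seed=0):
    """Search random signals for the best in-sample fit to random returns.

    Everything here is noise by construction (an assumption made for this
    sketch), so any relationship that is found is a chance occurrence.
    """
    rng = random.Random(seed)
    n = n_in + n_out
    returns = [rng.gauss(0, 1) for _ in range(n)]
    best_in, best_out = 0.0, 0.0
    for _ in range(n_signals):
        signal = [rng.gauss(0, 1) for _ in range(n)]
        r_in = correlation(signal[:n_in], returns[:n_in])
        if abs(r_in) > abs(best_in):
            # Keep the strongest in-sample pattern, and record how the
            # same signal behaves on data it was not selected on.
            best_in = r_in
            best_out = correlation(signal[n_in:], returns[n_in:])
    return best_in, best_out


in_sample, out_of_sample = best_chance_signal()
print(f"in-sample correlation:     {in_sample:+.2f}")
print(f"out-of-sample correlation: {out_of_sample:+.2f}")
```

Because the winning signal is picked from hundreds of candidates, its in-sample correlation tends to look meaningful even though none exists, while its out-of-sample correlation is just one more random draw.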

How to Prevent Overfitting

Ways to prevent overfitting include cross-validation, in which the data used to train the model is split into folds, or partitions. The model is trained and evaluated once per fold, each time with that fold held out, and the resulting error estimates are averaged. Other methods include ensembling, in which predictions from at least two separate models are combined; data augmentation, in which the available data set is made more diverse; and data simplification, in which the model is streamlined to avoid overfitting.
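Cross-validation as described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the straight-line model, the synthetic data, and all function names are assumptions chosen to keep the example self-contained.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(model, xs, ys):
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def make_folds(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validated_mse(xs, ys, k=5, seed=0):
    """Train on k-1 folds, score on the held-out fold, average the k scores."""
    fold_errors = []
    for held_out in make_folds(len(xs), k, seed):
        held_out_set = set(held_out)
        train = [i for i in range(len(xs)) if i not in held_out_set]
        model = fit_line([xs[i] for i in train], [ys[i] for i in train])
        fold_errors.append(
            mean_squared_error(
                model, [xs[i] for i in held_out], [ys[i] for i in held_out]
            )
        )
    return sum(fold_errors) / k


# Synthetic data (an assumption for this sketch): a line plus random noise.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [2 * x + 1 + rng.gauss(0, 0.5) for x in xs]

cv_error = cross_validated_mse(xs, ys, k=5)
print(f"5-fold cross-validated MSE: {cv_error:.3f}")
```

Every observation is scored exactly once by a model that never saw it during fitting, which is what makes the averaged error a fairer estimate than the error measured on the training data itself.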

Financial professionals must always be aware of the dangers of overfitting or underfitting a model based on limited data. The ideal model should be balanced.

Overfitting in Machine Learning

Overfitting is also a factor in machine learning. It can emerge when a model has been trained to recognize patterns in one data set but produces incorrect results when the same process is applied to a new data set. This happens because the model has fit the noise in its training data; such a model typically shows low bias and high variance. The model may also have had redundant or overlapping features, making it needlessly complicated and therefore ineffective.

Overfitting vs. Underfitting

A model that is overfitted may be too complicated, making it ineffective. But a model can also be underfitted, meaning it is too simple, with too few features and too little data to build an effective model. An overfit model has low bias and high variance, while an underfit model is the opposite: it has high bias and low variance. Adding more features to a too-simple model can help limit bias.
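The contrast between overfitting and underfitting can be seen by comparing training error with error on fresh data. The sketch below is an illustration under assumptions that are not part of the discussion above: it uses a nearest-neighbor averaging model on synthetic data, and flexibility is controlled by the number of neighbors k rather than by the number of features. A k of 1 memorizes the training data (low bias, high variance), while a k equal to the size of the training set always predicts the overall average (high bias, low variance).

```python
import math
import random


def make_data(n, rng):
    """Synthetic data (assumed for this sketch): a sine curve plus noise."""
    xs = [rng.uniform(0, 2 * math.pi) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0, 0.3) for x in xs]
    return xs, ys


def knn_predict(train_x, train_y, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k


def mse(train_x, train_y, xs, ys, k):
    """Mean squared error of the k-neighbor model on the points (xs, ys)."""
    total = sum(
        (knn_predict(train_x, train_y, x, k) - y) ** 2 for x, y in zip(xs, ys)
    )
    return total / len(xs)


rng = random.Random(0)
train_x, train_y = make_data(200, rng)
test_x, test_y = make_data(200, rng)

results = {}
for k in (1, 15, 200):
    train_error = mse(train_x, train_y, train_x, train_y, k)
    test_error = mse(train_x, train_y, test_x, test_y, k)
    results[k] = (train_error, test_error)
    print(f"k={k:>3}  train MSE={train_error:.3f}  test MSE={test_error:.3f}")
```

The overfit setting shows no training error at all but does worse on new data than the intermediate setting, and the underfit setting does poorly on both, which is the pattern a balanced model is meant to avoid.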

Overfitting Example

Suppose a university with a dropout rate that is higher than it would like decides to build a model that predicts the likelihood that an applicant will make it all the way through to graduation.

To do this, the university trains a model on a dataset of 5,000 applicants and their outcomes. It then runs the model on the original dataset, the group of 5,000 applicants, and the model predicts the outcome with 98% accuracy. But to test its accuracy, the university also runs the model on a second dataset of 5,000 more applicants. This time, the model is only 50% accurate, because it was fit too closely to a narrow subset of the data, in this case the first 5,000 applications.
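A gap of this kind between training accuracy and accuracy on new data can be reproduced with a deliberately extreme sketch. Everything in it is an assumption made for illustration and none of it is the university's data: the applicant records are random numbers, the outcomes are coin flips unrelated to those records, the sample is 500 rather than 5,000 applicants to keep it quick, and the model simply copies the outcome of the most similar applicant it has already seen.

```python
import random


def make_applicants(n, rng):
    """Hypothetical applicants: three random features, coin-flip outcomes."""
    features = [(rng.random(), rng.random(), rng.random()) for _ in range(n)]
    graduated = [rng.random() < 0.5 for _ in range(n)]
    return features, graduated


def nearest_neighbor_predict(train_x, train_y, x):
    """Return the outcome of the single most similar training applicant."""
    closest = min(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)),
    )
    return train_y[closest]


def accuracy(train_x, train_y, xs, ys):
    """Share of applicants in (xs, ys) whose outcome is predicted correctly."""
    hits = sum(
        nearest_neighbor_predict(train_x, train_y, x) == y for x, y in zip(xs, ys)
    )
    return hits / len(xs)


rng = random.Random(0)
first_x, first_y = make_applicants(500, rng)    # data used to build the model
second_x, second_y = make_applicants(500, rng)  # new applicants

train_accuracy = accuracy(first_x, first_y, first_x, first_y)
new_accuracy = accuracy(first_x, first_y, second_x, second_y)
print(f"accuracy on original applicants: {train_accuracy:.0%}")
print(f"accuracy on new applicants:      {new_accuracy:.0%}")
```

The model scores perfectly on the applicants it memorized and roughly at coin-flip level on the new group, because there was never a real relationship to learn. Measuring accuracy only on the original dataset would have hidden that.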