Increasing sample size benefits a research study by narrowing the confidence interval and, as a result, improving the precision with which the population parameter can be estimated. Other choices also affect how wide or how narrow a confidence interval will be: the choice of statistic, with t being wider/more conservative than z, and the degree of confidence, with higher degrees such as 99% resulting in wider/more conservative intervals than 90%. An increase in sample size tends to have an even more meaningful effect because of the formula for standard error (i.e. the ratio 'sample standard deviation / (sample size)^(1/2)'), which means that standard error varies inversely with the square root of sample size. As a result, more observations in the sample (all other factors equal) improve the quality of a research study.
At the same time, two other factors tend to make larger sample sizes less desirable. The first consideration, which primarily affects time-series data, is that population parameters have a tendency to change over time. For example, suppose we are studying a mutual fund and using five years of quarterly returns in our analysis (i.e. a sample size of 20: 5 years x 4 quarters a year). The resulting confidence interval appears too wide, so in an effort to increase precision, we use 20 years of data (80 observations). However, when we reach back into the 1980s to study this fund, it had a different fund manager, and it was buying more small-cap value companies, whereas today it is a blend of growth and value, with mid to large market caps. In addition, the factors affecting today's stock market (and mutual fund returns) are much different from those of the 1980s. In short, the population parameters have changed over time, and data from 20 years ago shouldn't be mixed with data from the most recent five years.
The other consideration is that increasing sample size can involve additional expenses. Take the example of researching hiring plans at S&P 500 firms (cross-sectional research). A sample size of 25 was suggested, which would involve contacting the human resources department of 25 firms. By increasing the sample size to 100, or 200 or higher, we do achieve greater precision in our conclusions, but at what cost? In many cross-sectional studies, particularly in the real world, where each observation takes time and costs money, it's sufficient to leave sample size at a certain lower level, as the additional precision isn't worth the additional cost.
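To make the precision-versus-cost trade-off concrete, the following is a minimal Python sketch of how the half-width of a confidence interval for the mean shrinks as sample size grows. It assumes a z-based interval (for small samples a t-based interval would be wider), and the function name `ci_half_width` and the 8% standard deviation are illustrative assumptions, not figures from the text.

```python
import math
from statistics import NormalDist


def ci_half_width(sample_std, n, confidence=0.95):
    """Half-width of a z-based confidence interval for a sample mean."""
    # Two-sided reliability factor, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Standard error = sample standard deviation / sqrt(sample size).
    standard_error = sample_std / math.sqrt(n)
    return z * standard_error


# Hypothetical standard deviation of quarterly returns, in percent.
sample_std = 8.0

for n in (20, 80, 320):
    for confidence in (0.90, 0.99):
        width = ci_half_width(sample_std, n, confidence)
        print(f"n={n:>3}  confidence={confidence:.0%}  half-width=+/-{width:.2f}%")
```

Because standard error depends on the square root of sample size, quadrupling the number of observations only halves the interval's width, which is why the extra precision from a much larger sample may not justify its cost.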
Data Mining Bias
Data mining is the practice of searching through historical data in an effort to find significant patterns, with which researchers can build a model and draw conclusions on how the population will behave in the future. For example, the so-called January effect, where stock market returns tend to be stronger in the month of January, is a product of data mining: monthly returns on indexes going back 50 to 70 years were sorted and compared against one another, and the patterns for the month of January were noted. Another well-known conclusion from data mining is the 'Dogs of the Dow' strategy: each January, among the 30 companies in the Dow industrials, buy the 10 with the highest dividend yields. In historical tests, such a strategy appeared to outperform the market over the long run.
Bookshelves are filled with hundreds of such models that "guarantee" a winning investment strategy. Of course, to borrow a common industry phrase, "past performance does not guarantee future results". Data-mining bias refers to the errors that result from relying too heavily on data-mining practices. In other words, while some patterns discovered in data mining are potentially useful, many others might just be coincidental and are not likely to be repeated in the future - particularly in an "efficient" market. For example, we may not be able to continue to profit from the January effect going forward, given that this phenomenon is so widely recognized: stocks are bid higher in November and December by market participants anticipating the January effect, so that by the start of January, the effect is priced into stocks and one can no longer take advantage of the model. Intergenerational data mining refers to the continued use of information already put forth in prior financial research as a guide for testing the same patterns and overstating the same conclusions.
Distinguishing valid models and conclusions from ideas that are purely coincidental products of data mining presents a significant challenge, as data mining is often not easy to detect. A good first step in investigating its presence is to conduct an out-of-sample test - in other words, to research whether the model actually works for periods that do not overlap the time frame of the study. A valid model should continue to be statistically significant even when out-of-sample tests are conducted. For research that is the product of data mining, a test outside of the model's time frame can often reveal its true nature. Other warning signs involve the number of patterns or variables examined in the research - that is, did this study simply search enough variables until something (anything) was finally discovered? Most academic research won't disclose the number of variables or patterns tested in the study, but oftentimes there are verbal hints that can reveal the presence of excessive data mining.
Above all, it helps when there is an economic rationale to explain why a pattern exists, as opposed to simply pointing out that a pattern is there. For example, years ago a research study discovered that the market tended to have positive returns in years that the NFC wins the Super Bowl, yet it would perform relatively poorly when the AFC representative triumphs. However, there's no economic rationale for explaining why this pattern exists - do people spend more, or companies build more, or investors invest more, based on the winner of a football game? Yet the story is out there every Super Bowl week. Patterns discovered as a result of data mining may make for interesting reading, but in the process of making decisions, care must be taken to ensure that mined patterns not be blindly overused.
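The out-of-sample test described in this section can be illustrated with a small simulation. The sketch below is an assumption-laden illustration, not a method from any particular study: it generates many hypothetical trading "rules" whose returns are pure noise (true mean of zero), counts how many look significant in-sample at |t| > 2, and then checks how many remain significant with the same sign in a non-overlapping period. The names `t_stat` and `mine_noise` and all parameter values are hypothetical.

```python
import random
import statistics


def t_stat(values):
    """t-statistic for the hypothesis that the mean of `values` is zero."""
    n = len(values)
    return statistics.mean(values) / (statistics.stdev(values) / n ** 0.5)


def mine_noise(n_rules=200, n_periods=120, threshold=2.0, seed=42):
    """Count noise-only rules that look significant in and out of sample."""
    rng = random.Random(seed)
    found = 0      # rules that look significant in-sample
    survived = 0   # of those, rules that also hold up out-of-sample
    for _ in range(n_rules):
        # Each rule is pure noise: its true mean excess return is zero.
        returns = [rng.gauss(0.0, 1.0) for _ in range(2 * n_periods)]
        t_in = t_stat(returns[:n_periods])    # period used to "discover" the rule
        t_out = t_stat(returns[n_periods:])   # non-overlapping test period
        if abs(t_in) > threshold:
            found += 1
            # Require the same sign and significance outside the sample.
            if abs(t_out) > threshold and (t_in > 0) == (t_out > 0):
                survived += 1
    return found, survived


found, survived = mine_noise()
print(f"significant in-sample: {found} of 200 noise rules")
print(f"still significant out-of-sample: {survived}")
```

With a threshold of |t| > 2, roughly 1 in 20 noise-only rules would be expected to look significant by chance alone, which is why the number of patterns examined matters as much as the significance of any single one.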
Sample Selection Bias
Many additional biases can adversely affect the quality and the usefulness of financial research. Sample-selection bias refers to the tendency to exclude a certain part of a population simply because the data is not available. As a result, we cannot state that the sample we've drawn is completely random - it is random only within the subset on which historic data could be obtained.
A common form of sample-selection bias in financial databases is survivorship bias, or the tendency for financial and accounting databases to exclude information on companies, mutual funds, etc. that are no longer in existence. As a result, certain conclusions drawn from such data may be overstated relative to what one would find by removing this bias and including all members of the population. For example, many studies have pointed out the tendency of companies with low price-to-book-value ratios to outperform those firms with higher P/BVs. However, these studies most likely aren't going to include those firms that have failed; thus data is not available and there is sample-selection bias. In the case of low and high P/BV, it stands to reason that companies in the midst of declining and failing will probably be relatively low on the P/BV scale, yet, based on the research, we would be guided to buy these very same firms due to the historical pattern. It's likely that the gap between returns on low-priced (value) stocks and high-priced (growth) stocks has been systematically overestimated as a result of survivorship bias. Indeed, the investment industry has developed a number of growth and value indexes. However, in terms of defining for certain which strategy (growth or value) is superior, the actual evidence is mixed.
Sample selection bias extends to newer asset classes such as hedge funds, a heterogeneous group that is somewhat more removed from regulation, and where public disclosure of performance is much more discretionary compared to that of mutual funds or registered advisors of separately managed accounts. One suspects that hedge funds will disclose only the data that makes the fund look good (self-selection bias), compared to a more developed industry of mutual funds where the underperformers are still bound by certain disclosure requirements.
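The effect of survivorship bias on a database average can be sketched with a short simulation. The example below is a rough illustration under stated assumptions: the fund returns are randomly generated, the closure rule (funds with a negative average return are shut down and dropped from the database) is hypothetical, and `simulate_database` is an illustrative name.

```python
import random
import statistics


def simulate_database(n_funds=1000, closure_threshold=0.0, seed=7):
    """Return (all fund returns, returns of funds still in the database)."""
    rng = random.Random(seed)
    # Hypothetical average annual returns, in percent, for every fund launched.
    all_returns = [rng.gauss(6.0, 5.0) for _ in range(n_funds)]
    # Assumption: funds below the threshold close and vanish from the database.
    survivors = [r for r in all_returns if r >= closure_threshold]
    return all_returns, survivors


all_returns, survivors = simulate_database()
full_mean = statistics.mean(all_returns)
survivor_mean = statistics.mean(survivors)

print(f"funds launched: {len(all_returns)}, funds still reported: {len(survivors)}")
print(f"mean return, all funds:      {full_mean:.2f}%")
print(f"mean return, survivors only: {survivor_mean:.2f}%")
```

Because only the weakest performers are removed, the survivors-only average can never fall below the average for all funds, so a study relying on the surviving database overstates typical performance.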
Look-Ahead Bias
Research is guilty of look-ahead bias if it makes use of information that was not actually available on a particular day, yet the researchers assume it was. Let's return to the example of buying low price-to-book-value companies; the research may assume that we buy our low P/BV portfolio on Jan 1 of a given year, and then (compared to a high P/BV portfolio) hold it throughout the year. Unfortunately, while a firm's current stock price is immediately available, the book value of the firm is generally not available until months after the start of the year, when the firm files its official 10-K. To overcome this bias, one could construct P/BV ratios using current price divided by the previous year's book value, or (as is done by Russell's indexes) wait until midyear to rebalance after data is reported.
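One way to guard against look-ahead bias in a backtest is to use only figures that had been filed on or before the trade date. The sketch below illustrates that idea; the helper `book_value_known_on`, the filing dates, the book values and the price are all hypothetical.

```python
from datetime import date


def book_value_known_on(trade_date, filings):
    """Latest book value per share filed on or before `trade_date`.

    `filings` is a list of (filing_date, book_value_per_share) tuples.
    Returns None if nothing had been filed yet.
    """
    available = [f for f in filings if f[0] <= trade_date]
    if not available:
        return None
    # Pick the most recently filed figure among those already public.
    return max(available, key=lambda f: f[0])[1]


# Hypothetical filings: each year's book value is published months after year-end.
filings = [
    (date(2023, 3, 1), 40.0),   # prior fiscal year, filed March 2023
    (date(2024, 2, 28), 44.0),  # latest fiscal year, filed February 2024
]

trade_date = date(2024, 1, 2)
price = 66.0  # hypothetical share price on the trade date

known_bv = book_value_known_on(trade_date, filings)
print("P/BV using data available on the trade date:", price / known_bv)

# Look-ahead version: uses a figure that was not filed until weeks later.
print("P/BV with look-ahead bias:", price / filings[-1][1])
```

The two ratios differ because the second one relies on a book value that investors could not have known on the trade date.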
Time-Period Bias
Time-period bias refers to an investment study whose results may appear to hold over a specific time frame but may not last in future time periods. For example, any research done in 1999 or 2000 that covered a trailing five-year period may have touted the outperformance of high-risk growth strategies, while pointing to the mediocre results of more conservative approaches. When these same studies are conducted today for a trailing 10-year period, the conclusions might be quite different. Certain anomalies can persist for a period of several quarters or even years, but research should ideally be tested in a number of different business cycles and market environments in order to ensure that the conclusions aren't specific to one unique period or environment.