Monte Carlo vs Historical Backtesting: Which One Should You Trust?

Historical backtesting tests your plan against the past. Monte Carlo tests it against thousands of futures that could plausibly happen. You need both.

There are two main ways to stress-test a retirement plan: run it through history, or simulate thousands of possible futures. Both have their place. But they answer different questions, and confusing the two can lead to bad decisions.

Historical backtesting takes your plan and runs it through every past period in the dataset. If you have data going back to 1926, you test your plan against a 1926 start, a 1927 start, a 1928 start, and so on. Each test uses the actual sequence of returns that followed that starting year.
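The rolling-window procedure can be sketched in a few lines. This is a minimal illustration, not a full planning model: it assumes a constant inflation-adjusted withdrawal taken at the start of each year and a list of real (inflation-adjusted) annual returns. The function names, the 4% withdrawal rate, and the flat placeholder series are assumptions made for the example; the placeholder is not market data.

```python
def survives(real_returns, withdrawal_rate=0.04):
    """True if a portfolio of 1.0 funds every annual withdrawal in the window."""
    balance = 1.0
    for r in real_returns:
        balance -= withdrawal_rate   # withdraw at the start of the year
        if balance < 0:
            return False             # could not fund this year's spending
        balance *= 1.0 + r           # then apply that year's real return
    return True


def backtest(annual_returns, first_year, horizon=30):
    """Test the plan from every start year that has `horizon` years of data."""
    results = {}
    for i in range(len(annual_returns) - horizon + 1):
        window = annual_returns[i:i + horizon]
        results[first_year + i] = survives(window)
    return results


# Placeholder series (NOT real data): 100 years of a flat 5% real return.
placeholder = [0.05] * 100
results = backtest(placeholder, first_year=1926)
print(len(results), "start years tested,", sum(results.values()), "survived")
```

With a real return series in place of the placeholder, each entry in `results` answers one concrete question: did the plan survive the sequence that actually followed that start year?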

Monte Carlo simulation generates thousands of synthetic return sequences by drawing randomly from a probability distribution. Each sequence is a possible future that never happened but could.
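A bare-bones version of that idea looks like the sketch below. It assumes independent, normally distributed real annual returns, which is the simplest possible choice rather than a recommended one. The 5% mean, 17% standard deviation, and 4% withdrawal rate are illustrative assumptions, and `simulate_success_rate` is a hypothetical name.

```python
import random


def simulate_success_rate(n_paths=5_000, horizon=30, mean=0.05, stdev=0.17,
                          withdrawal_rate=0.04, seed=0):
    """Share of simulated paths in which every annual withdrawal is funded."""
    rng = random.Random(seed)        # seeded so the run is reproducible
    successes = 0
    for _ in range(n_paths):
        balance = 1.0
        for _ in range(horizon):
            balance -= withdrawal_rate            # withdraw at start of year
            if balance < 0:
                break                             # plan failed on this path
            annual_return = max(rng.gauss(mean, stdev), -1.0)  # floor at -100%
            balance *= 1.0 + annual_return
        else:
            successes += 1           # loop finished without a failed year
    return successes / n_paths


print(f"success rate: {simulate_success_rate():.1%}")
```

Every number this prints depends on the assumed mean, volatility, and distribution, which is why the choice of inputs matters as much as the number of paths.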

The question is not which one is "right." Both are models. The question is which one tells you more about the risks that matter.

What Historical Backtesting Does Well

History is real. Every data point in a backtest actually happened. There is no assumption about what distribution returns follow, no debate about fat tails versus bell curves, no parameter estimation error. The 1973-74 bear market, the 2000 crash, the 2008 crisis: they are in the data exactly as they occurred.

This makes backtesting intuitive and hard to argue with. When someone says "your plan would have survived every 30-year period since 1926," that is a concrete, verifiable claim.

Backtesting also captures complex features that are hard to model: inflation regimes, interest rate cycles, the correlation between stock and bond returns during crises, and structural changes in the economy. All of this is embedded in the historical data without you needing to specify it.

Where Historical Backtesting Fails

Sample size

U.S. market data going back to 1926 gives you about 100 years. For a 30-year retirement horizon, that is roughly 70 overlapping starting points. Seventy scenarios sounds like a lot until you realize that Monte Carlo can generate 50,000.

Worse, those 70 scenarios are not independent. The 1966-1995 period overlaps heavily with the 1967-1996 period. They share 29 out of 30 years. The effective number of independent observations is much smaller than 70.
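The arithmetic behind the overlap is easy to check. The sketch below counts rolling windows, counts windows that share no years at all, and measures how many years two windows have in common. The non-overlapping count is only a crude lower bound on the effective sample size, not a formal estimate, and the helper names are made up for this example.

```python
def window_counts(n_years, horizon):
    """Number of rolling windows and of fully non-overlapping windows."""
    rolling = n_years - horizon + 1
    non_overlapping = n_years // horizon
    return rolling, non_overlapping


def shared_years(start_a, start_b, horizon):
    """Years that two windows of equal length have in common."""
    return max(0, horizon - abs(start_a - start_b))


print(window_counts(100, 30))        # (71, 3)
print(shared_years(1966, 1967, 30))  # 29
```

A century of data yields about 70 rolling 30-year windows but only three that share no years with each other.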

With so few data points, you cannot estimate tail probabilities with any confidence. If your plan failed in 2 out of 70 periods, is the failure rate 3%? Or is it 5%? Or 8%? You do not have enough samples to tell.
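One way to put numbers on that uncertainty is a Wilson score interval for a proportion. The sketch below uses it as an assumed choice of method; the article does not prescribe one. It also treats the 70 periods as independent, which they are not, so the real uncertainty is wider than what it reports.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(failures, trials, confidence=0.95):
    """Wilson score interval for a failure rate, assuming independent trials."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    p = failures / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials ** 2)) / denom
    return center - half, center + half


low, high = wilson_interval(2, 70)
print(f"observed {2 / 70:.1%}, 95% interval {low:.1%} to {high:.1%}")
```

For 2 failures in 70 periods the interval runs from a little under 1% to nearly 10%, so the data cannot distinguish a 3% failure rate from an 8% one.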

Survivorship and selection bias

Using U.S. market data implicitly assumes that the U.S. experience is representative. But the United States had one of the best-performing stock markets of the 20th century. If you were a Japanese investor who retired in 1989, or a German investor in 1913, or an Argentine investor at almost any point, the "4% rule" (withdraw 4% of your starting balance in the first year, then adjust that amount for inflation) would have destroyed you.

Testing against U.S. history is like testing a ship design against the calmest ocean. It tells you the ship floats in good conditions. It does not tell you what happens in a storm.

It cannot generate scenarios worse than history

This is the fundamental limitation. Historical backtesting can only test you against things that already happened. It cannot generate a 1930s-style depression that lasts 15 years instead of 10. It cannot produce a scenario where inflation stays above 8% for two decades. It cannot simulate a sequence of returns worse than anything in the record.

But those scenarios are possible. They are not even implausible. They just have not happened yet in the specific dataset you are using.

The 2008 crisis is often described as roughly a 5-sigma event under normal distribution assumptions. If it was the worst thing in your historical sample, your backtest says your plan survives it. But what about a 6-sigma event? A 7-sigma event? History says nothing about those, because history is a single realized path.
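Sigma labels only mean something relative to an assumed distribution. As a quick reference, the sketch below computes the probability that a single draw from a normal distribution lands at least k standard deviations below the mean. What counts as a draw (a day, a month, a year) is left open here, and the function name is an illustrative choice.

```python
from math import erfc, sqrt


def normal_lower_tail(k):
    """P(Z <= -k) for a standard normal variable Z."""
    return 0.5 * erfc(k / sqrt(2.0))


for k in (5, 6, 7):
    p = normal_lower_tail(k)
    print(f"{k}-sigma or worse: probability {p:.2e} per draw")
```

Under the normal model a 5-sigma loss has a probability of about 2.9e-7 per draw, roughly 1 in 3.5 million. If events that extreme show up in real samples, that is evidence against the normal assumption as much as it is a statement about the event.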

What Monte Carlo Does Better

Monte Carlo's core advantage is combinatorial freedom. It can generate return sequences that never happened but are statistically plausible. This includes:

Worse sequences than history. By drawing from a probability distribution, Monte Carlo can produce scenarios where the first decade is worse than any historical decade. These scenarios may be unlikely (maybe 2-3% probability), but they are exactly the ones you need to plan for.

More extreme tail events. If you use fat-tailed distributions like the Student's t, the simulation generates crashes larger than anything in the historical record, at frequencies that match empirical observations better than the normal distribution.

Sensitivity analysis. With 50,000 scenarios, you can see how sensitive your plan is to specific variables. What happens if you shift the expected return down by 1 percentage point? How much does the spending strategy matter? These questions require large samples to answer reliably.

Precise tail estimates. With enough iterations, you can estimate the probability of ruin, and the outcomes at the 1st, 5th, and 10th percentiles, with very little sampling error given the model's assumptions. Historical backtesting gives you only a binary outcome for each starting year: survived or did not.
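To make the fat-tail point concrete, the sketch below counts how often a draw falls more than 4 standard deviations below the mean under a normal distribution and under a Student's t distribution rescaled to the same variance. The 4 degrees of freedom, the threshold, and the function names are assumptions chosen for illustration, not fitted values. The t draws are built from the standard library (a normal draw divided by the square root of a scaled chi-square draw).

```python
import math
import random


def student_t_unit_variance(rng, df):
    """One Student's t draw, rescaled to variance 1 (requires df > 2)."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(df / 2.0, 2.0)   # chi-square with df degrees
    t = z / math.sqrt(chi2 / df)
    return t / math.sqrt(df / (df - 2.0))    # raw t has variance df / (df - 2)


def count_crashes(n_draws=200_000, threshold=-4.0, df=4, seed=0):
    """Count draws below `threshold` standard deviations for each model."""
    rng = random.Random(seed)
    normal_hits = sum(rng.gauss(0.0, 1.0) < threshold for _ in range(n_draws))
    t_hits = sum(student_t_unit_variance(rng, df) < threshold
                 for _ in range(n_draws))
    return normal_hits, t_hits


normal_hits, t_hits = count_crashes()
print(f"normal: {normal_hits} extreme draws, Student's t: {t_hits}")
```

Both models have the same mean and variance, yet under these assumptions the t distribution produces extreme losses tens of times more often. That difference is invisible in average outcomes and decisive in the tail.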

Where Monte Carlo Falls Short

Monte Carlo is not without weaknesses:

Parameter dependence. The simulation is only as good as the distribution parameters you feed it: mean returns, volatility, correlations, tail shape. Get those wrong and you get precise but misleading results. History at least does not require you to specify these.

No regime structure. Simple Monte Carlo models draw each year's return independently from a single distribution for the entire time horizon. Real markets go through regimes: high-growth periods, stagflation, secular bear markets. A Monte Carlo sequence of -5%, +22%, -8%, +15%, +30% might be statistically valid but economically unrealistic, because independent draws ignore the persistence of market regimes.

False precision. A Monte Carlo success rate of 87.3% looks very precise. But if your mean return assumption is off by 0.5 percentage points, the true rate might be 82% or 93%. The precision of the simulation can mask the imprecision of the inputs.
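The effect of a small input error can be checked directly by rerunning the same simulation with the mean shifted by half a percentage point in each direction. The sketch below reuses one set of random shocks for all three runs (common random numbers), so differences between the results come from the changed assumption and not from sampling noise. The normal model, the 17% volatility, the 4% withdrawal rate, and the function names are illustrative assumptions.

```python
import random


def make_shocks(n_paths=5_000, horizon=30, seed=0):
    """Standard normal shocks, generated once and shared across scenarios."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(horizon)]
            for _ in range(n_paths)]


def success_rate(shocks, mean, stdev=0.17, withdrawal_rate=0.04):
    """Share of paths that fund every withdrawal for the given mean return."""
    successes = 0
    for path in shocks:
        balance = 1.0
        for z in path:
            balance -= withdrawal_rate            # withdraw at start of year
            if balance < 0:
                break
            balance *= 1.0 + max(mean + stdev * z, -1.0)  # floor at -100%
        else:
            successes += 1
    return successes / len(shocks)


shocks = make_shocks()
for mean in (0.045, 0.050, 0.055):
    rate = success_rate(shocks, mean)
    print(f"mean real return {mean:.1%}: success rate {rate:.1%}")
```

Reporting the spread across plausible inputs, rather than a single figure with a decimal place, is a more honest summary of what the simulation knows.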

The Best Approach: Use Both

These methods are complementary, not competing.

Use historical backtesting to sanity-check your plan against known scenarios. "Would my plan have survived 1966-1995?" is a useful question. If the answer is no, your plan is too aggressive regardless of what Monte Carlo says.

Use Monte Carlo to understand the range of possibilities beyond history. "What is the probability my plan survives a worse sequence than 1966?" requires simulation. "How much does switching to guardrails improve my odds?" requires thousands of scenarios to answer precisely.

If your plan survives 100% of historical periods but only 82% of Monte Carlo scenarios, that gap tells you something important: history has been kind, and you are relying on that kindness continuing.
