TL;DR

I run an LLM-driven trading hypothesis engine. For a while, every result that came back looked too good: Sharpe ratios above 5, win rates above 70%, all on out-of-sample windows. They were lies. The model was reading dates, headlines, and tickers in the prompt and pattern-matching against its training data, which extends well past my “out-of-sample” cutoff. The fix was a masking layer I now call Blind Oracle: strip every identifying detail from the eval prompt, compute the trigger deterministically before the LLM sees anything, and gate promotion on out-of-sample Sharpe with masking enforced. After it shipped, the inflated numbers collapsed back to honest reality. Some hypotheses survived; most didn’t. That’s exactly what I needed to know.

The problem

The engine takes a hypothesis (“when X happens, expect Y”), runs it against a market data window, and asks an LLM to score whether it would have triggered and what the P&L would have been. Two windows: in-sample (training-era data, used for tuning) and out-of-sample (post-2025-01-01, supposed to be untouched by tuning).
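
Concretely, a hypothesis and a window are small records. A minimal sketch of the shapes involved (field and class names are my reconstruction, not the engine's actual schema):

from dataclasses import dataclass
from datetime import date

OOS_CUTOFF = date(2025, 1, 1)  # out-of-sample starts here and never moves

@dataclass
class Hypothesis:
    text: str          # "when X happens, expect Y"
    trigger_rule: str  # deterministic condition on price data

@dataclass
class EvalWindow:
    start: date
    end: date

    @property
    def is_oos(self) -> bool:
        return self.start >= OOS_CUTOFF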

The OOS numbers were beautiful. Too beautiful. A coin-flip strategy I added as a sanity-check baseline came back with Sharpe 4.8.

That is not how coin flips work.

What was leaking

I stared at the prompt template for an hour before I saw it. The eval prompt looked like this:

You are a trading evaluator. Below is a hypothesis and a market event window.
Hypothesis: <hypothesis text>
Window: 2025-03-14 to 2025-03-21
Headlines that occurred during the window:
  - 2025-03-15: Fed signals pause on rate hikes amid banking concerns
  - 2025-03-17: NVDA reports strong guidance, futures up 3%
  - ...
Decide: did the hypothesis trigger? What was the P&L?

The model has every one of those headlines in its training data. It knows what happened next. It is not “evaluating” — it is recalling. The whole exercise is an open-book test where the answer key is the prompt.

I had built a cheating machine and convinced myself it was an oracle.

The Blind Oracle patch

The fix is conceptually simple and operationally fiddly:

  1. Strip dates from the prompt. The window becomes “Day 1 through Day 7” — relative time only.
  2. Mask tickers. NVDA becomes “Stock A”, AAPL becomes “Stock B”, etc. The mapping is consistent within an evaluation but the model never sees the real symbol.
  3. Mask headlines to events without proper nouns. “Fed signals pause on rate hikes amid banking concerns” becomes “Central bank signals dovish shift amid sector stress.” Enough information to evaluate the mechanism of the hypothesis, not enough to recall the outcome.
  4. Run the trigger BEFORE the eval. Whether the hypothesis fired is a deterministic computation against price data — the LLM should never decide that. It only decides what happens next given that it fired.
  5. Bifurcate IS/OOS strictly. Out-of-sample starts 2025-01-01 and never moves. No “let me re-tune on slightly more data.”
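
Applied to the leaky prompt above, those five steps produce an eval prompt that reads roughly like this (an illustrative reconstruction, not verbatim engine output):

You are a trading evaluator. Below is a hypothesis and a market event window.
Hypothesis: <hypothesis text, tickers masked>
Window: Day 1 to Day 8
Events that occurred during the window:
  - Day 2: Central bank signals dovish shift amid sector stress
  - Day 4: Stock A reports strong guidance, futures up 3%
  - ...
The hypothesis triggered. Estimate the P&L.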

The architecture became:

hypothesis ─┐
            ├─► trigger evaluator (deterministic, no LLM)
window ─────┘                 │
                              ▼
                  triggered = True / False
                              │
            ┌── if triggered ─┴── if not ──┐
            ▼                              ▼
      masking layer                skip (no eval needed)
      ───────────────
      - dates → relative
      - tickers → "Stock A/B/..."
      - headlines → de-nounified
            │
            ▼
      LLM evaluator
      (sees masked window, returns P&L)
            │
       ┌────┴────┐
       ▼         ▼
   IS metric   OOS metric
   (training   (post-
    data)       2025-01-01)
       │         │
       └────┬────┘
            ▼
     promotion gate:
     pass_validated AND oos_sharpe > 1.0
The masking layer in practice

Ticker masking is a stateful per-evaluation map:

import re

def mask_tickers(text: str, mentioned: list[str]) -> tuple[str, dict]:
    # Sorted order makes the Stock A/B/... assignment deterministic per evaluation.
    mapping = {t: f"Stock {chr(65 + i)}" for i, t in enumerate(sorted(mentioned))}
    masked = text
    for real, fake in mapping.items():
        # \b guards against partial matches; re.escape guards symbols like "BRK.B".
        masked = re.sub(rf"\b{re.escape(real)}\b", fake, masked)
    return masked, mapping
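
As a quick check (sorting is what keeps the mapping stable for a given set of tickers):

>>> mask_tickers("NVDA beat; AAPL lagged.", ["NVDA", "AAPL"])
('Stock B beat; Stock A lagged.', {'AAPL': 'Stock A', 'NVDA': 'Stock B'})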

Headline de-nounification is harder. I run each headline through a small local model (Gemma 4 26B, fast, no rate limits) with a prompt that asks it to rewrite the headline keeping the causal mechanism but removing identifying detail. Then I cache it by headline hash so I’m not rewriting the same headline twice.
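
A minimal sketch of that cache, assuming a rewrite callable that wraps the local model call (the function name and file layout are mine, not the engine's):

import hashlib
import json
import pathlib

CACHE_PATH = pathlib.Path("headline_cache.json")
PROMPT = ("Rewrite this headline. Keep the causal mechanism; remove all "
          "proper nouns, tickers, and dates: ")

def denounify(headline: str, rewrite) -> str:
    # Cache keyed by headline hash: identical headlines are rewritten once.
    key = hashlib.sha256(headline.encode()).hexdigest()
    cache = json.loads(CACHE_PATH.read_text()) if CACHE_PATH.exists() else {}
    if key not in cache:
        cache[key] = rewrite(PROMPT + headline)  # one call to the local model
        CACHE_PATH.write_text(json.dumps(cache))
    return cache[key]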

Date stripping is trivial:

import re
from datetime import date

MONTHS = r"January|February|March|April|May|June|July|August|September|October|November|December"

def strip_dates(text: str, window_start: date) -> str:
    # ISO dates become day numbers relative to the window start (Day 1 = window_start).
    text = re.sub(r"\b\d{4}-\d{2}-\d{2}\b",
                  lambda m: f"Day {(date.fromisoformat(m.group()) - window_start).days + 1}",
                  text)
    # Long-form dates ("March 15, 2025") can't be mapped reliably, so blank them.
    return re.sub(rf"\b(?:{MONTHS}) \d{{1,2}},?\s*\d{{4}}\b", "Day ?", text)

(The second regex is a fallback for headlines that include long-form dates.)
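
For example:

>>> strip_dates("On 2025-03-15 the central bank paused.", date(2025, 3, 14))
'On Day 2 the central bank paused.'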

What the numbers looked like before vs. after

The coin-flip baseline went from Sharpe 4.8 to Sharpe 0.04. That alone validated the patch — the system finally couldn’t tell the future.

Of the 13 alpha hypotheses I had queued, after the masking patch:

  • 2 survived with OOS Sharpe > 1.0. Both were mechanism-driven (vol-regime triggers), not headline-driven. Made sense in retrospect.
  • 3 collapsed to negative OOS Sharpe — they were hidden lookahead exploiters.
  • 8 came back as roughly zero — neither alpha nor anti-alpha, just noise. Honest noise.

Two real hypotheses out of thirteen is a much worse hit rate than I had believed. It is also the actual hit rate, which is the only one that matters.

Lessons

  • Every historical alpha number from before the patch is suspect. I marked the older results as “INVALIDATED — pre-Blind Oracle” in the database. Don’t reference them. Don’t fight to preserve them. They were lies you told yourself.
  • LLMs are training-data oracles, not reasoning engines. If the prompt contains a fact the model has memorized, it will use that fact. The only defense is to remove the fact.
  • Determinism beats inference for things that are deterministic. The trigger is a price-data computation. Asking the LLM to evaluate whether the trigger fired was always wrong, even before the leak fix — it was just wrong in a way that didn’t show up in the output until I masked everything else.
  • Your OOS cutoff is sacred. The moment you say “let me just include Q1 2025 in training because it makes the curve look smoother,” you’ve destroyed your ability to know if anything works.

What’s next

Walk-forward optimization is the next step. Static OOS is a good first defense; rolling OOS is better. The hypothesis engine is also moving toward correlated-hypothesis penalties — if 7 of my “different” hypotheses are all variations of the same volatility signal, that’s N=1 wearing seven hats, not N=7. Triangulation requires actually-different priors.
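
For reference, walk-forward is just a rolling sequence of train/test splits. A sketch of the window generator I have in mind (parameter values are placeholders, not tuned choices):

from datetime import date, timedelta

def walk_forward(start: date, end: date, train_days: int = 180, test_days: int = 30):
    # Yield rolling (train_start, train_end, test_end) triples.
    # Each test slice is evaluated once, then rolls forward — never re-tuned on.
    cursor = start
    while cursor + timedelta(days=train_days + test_days) <= end:
        train_end = cursor + timedelta(days=train_days)
        yield cursor, train_end, train_end + timedelta(days=test_days)
        cursor += timedelta(days=test_days)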