Project

Simon

An autonomous trading system that separates prediction from strategy, fails loudly when it doesn't know, and learns from the things it gets wrong.


Simon is an autonomous trading system. He researches a small universe of equities, generates price predictions, builds an executable playbook with entry and exit points, executes it, scores his own performance, and retrains his models based on what he got wrong. Every parameter that matters — position sizes, stop losses, take profits, risk allocation, signal thresholds — is learned from experience, not hardcoded.

He is not an algorithmic trading bot following static rules. He is also not an LLM agent making free-form trading decisions on raw market data. He is a system designed around a specific architectural thesis about which problems are ML problems and which are LLM problems, and the project is, more than anything else, an extended argument for that thesis.

The thesis, in four pillars

The design decisions worth defending are these:

Two engines, one system. Simon separates prediction (what will happen) from strategy (what to do about it). These are distinct problems with distinct shapes. Prediction is a regression problem on numerical data. Strategy is a sequencing and allocation problem under constraints. Most ML trading systems fuse the two — they ask one model to predict and decide in the same breath. Simon does not. The prediction engine is pure ML, no LLM cost, no API calls, just math on data. The trading engine uses ML to identify candidate setups, and then an LLM to sequence them into a capital-efficient playbook.

ML does the heavy lifting, the LLM fills gaps. The ML processes large volumes of numerical data (price history, technical indicators, macro context, fundamentals, sentiment scores) and distills it into structured conclusions. The LLM receives those conclusions, never raw data. It never sees 150 tickers of raw prices. It sees "AMD: long setup, enter $116 on June 13, exit $127 on June 17, expected +9.4%" and reasons about sequencing, not about whether AMD is a good buy. This division is intentional. LLMs are bad at numerical reasoning on raw data and good at structured reasoning over symbolic inputs. The system is built to play to that.

No silent fallbacks. If a component fails — a missing sentiment score, a model that won't load, an empty feature set — Simon fails loudly. He does not inject neutral defaults. He does not paper over a missing input with a zero or a median. A prediction that cannot load its model does not produce a prediction. The system knows what it does not know. This is in deliberate opposition to the dominant pattern in ML trading systems, which silently substitute weaker signals when the strong ones disappear and then quietly underperform without anyone noticing why.
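As a rough illustration of the invariant, here is a minimal sketch of a feature check that refuses to substitute defaults. It assumes features arrive as a plain dict; `MissingInputError`, `require_features`, and the feature names in the usage note are hypothetical and not taken from Simon's code.

```python
import math


class MissingInputError(RuntimeError):
    """Raised when a required input is absent; never replaced with a default."""


def require_features(row: dict, required: list[str]) -> dict:
    """Return only the required features, or raise naming every missing one."""
    missing = [
        name for name in required
        if name not in row
        or row[name] is None
        or (isinstance(row[name], float) and math.isnan(row[name]))
    ]
    if missing:
        # Fail loudly: no zero, no median, no neutral sentiment.
        raise MissingInputError(f"missing required features: {missing}")
    return {name: row[name] for name in required}
```

Calling `require_features({"rsi_14": 55.0}, ["rsi_14", "sentiment"])` raises instead of returning a row with a neutral sentiment value, so no prediction is produced from an incomplete input.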

Everything is learned. The only hardcoded constraints are physics: you cannot spend money you do not have, you cannot allocate more than 100% to concurrent positions. Everything else — quality thresholds, risk tolerance, position sizing, retraining triggers, regime preferences, per-ticker trust scores — lives in Simon's memory and evolves through experience. The first version of Simon shipped with a lot of "reasonable defaults." Most of them turned out to be wrong, and the ones that turned out to be right would not have stayed right.

The architecture

        ┌──────────────────────────────────────────┐
        │  PREDICTION ENGINE (pure ML)             │
        │  ──────────────────────────────          │
        │  XGBoost per-ticker models               │
        │  Eighty-seven features per prediction    │
        │  Rolling forward on predicted prices     │
        │  No data leakage. No LLM cost.           │
        └──────────────────────────────────────────┘
                          │
                  Prediction curves
                          │
                          ▼
        ┌──────────────────────────────────────────┐
        │  TRADING ENGINE (ML + LLM)               │
        │  ──────────────────────────────          │
        │  Stage 1 — ML:  identifies swings        │
        │                 (peaks, troughs, runs)   │
        │  Stage 2 — LLM: sequences trades,        │
        │                 sizes positions,         │
        │                 respects capital limits  │
        └──────────────────────────────────────────┘
                          │
                  Executable playbook

The prediction engine

Per-ticker XGBoost models with a LightGBM residual correction layer and a brain-learned bias correction. Each prediction takes roughly eighty-seven features spanning eight categories: technical indicators, statistical features, relative performance, macro context, market relative strength, fundamentals, sector breadth, and LLM-scored sentiment.

The sentiment features are worth naming as a specific choice. The LLM does not write prose about a stock. The LLM reads news, press releases, analyst ratings, and SEC filings, and outputs structured numerical scores — sentiment, confidence, short-term and medium-term outlook, catalyst strength, risk level, insider signal, analyst signal. Those scores flow directly into the XGBoost feature matrix alongside everything else. The ML decides how much weight to give sentiment. It is learned, not prescribed.
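The structured-output pattern can be sketched roughly as follows. The eight field names mirror the scores listed above, but the JSON shape, the `sent_` column prefix, and the function names are assumptions for illustration, not Simon's actual schema.

```python
import json
import math
from dataclasses import dataclass, fields, asdict


@dataclass(frozen=True)
class SentimentScores:
    sentiment: float
    confidence: float
    short_term_outlook: float
    medium_term_outlook: float
    catalyst_strength: float
    risk_level: float
    insider_signal: float
    analyst_signal: float


def parse_scores(llm_json: str) -> SentimentScores:
    """Parse the LLM's JSON reply; raise if any score is missing or non-numeric."""
    payload = json.loads(llm_json)
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object of scores")
    values = {}
    for f in fields(SentimentScores):
        value = payload.get(f.name)
        if (isinstance(value, bool) or not isinstance(value, (int, float))
                or not math.isfinite(value)):
            raise ValueError(f"sentiment field {f.name!r} missing or not numeric: {value!r}")
        values[f.name] = float(value)
    return SentimentScores(**values)


def to_features(scores: SentimentScores) -> dict[str, float]:
    """Flatten into prefixed columns that can sit next to the other features."""
    return {f"sent_{name}": value for name, value in asdict(scores).items()}
```

The parser rejects prose, partial replies, and missing fields outright, which keeps the no-silent-fallback rule intact at the LLM boundary.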

No data leakage. Predictions for a window use only data available before the window starts. Each day's prediction builds on the previous day's predicted price, not actuals. This means the prediction quality is honest: it represents what Simon would actually know in a live environment.

Rolling predictions, not single shots. Simon predicts five to seven days ahead, executes, then re-predicts from real data. Compounding error resets every segment. In live operation, he re-predicts daily as the previous day's close lands.
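A minimal sketch of the rolling scheme, assuming a one-step predictor that maps a price history to the next price. The `Predictor` type and both function names are hypothetical; in Simon the one-step model would be the per-ticker XGBoost stack working on the full feature set, not a bare price list.

```python
from typing import Callable, Sequence

Predictor = Callable[[Sequence[float]], float]


def roll_forward(history: Sequence[float], predict_next: Predictor,
                 horizon: int) -> list[float]:
    """Predict `horizon` days ahead; each step sees earlier predictions, never future actuals."""
    path = list(history)
    for _ in range(horizon):
        path.append(predict_next(path))
    return path[len(history):]


def rolling_segments(actuals: Sequence[float], predict_next: Predictor,
                     warmup: int, horizon: int) -> list[list[float]]:
    """Re-anchor on real closes at the start of every segment so error cannot compound across segments."""
    segments = []
    for start in range(warmup, len(actuals), horizon):
        # Only actuals strictly before `start` are visible to this segment.
        segments.append(roll_forward(actuals[:start], predict_next, horizon))
    return segments
```

Within a segment the predictor consumes its own output, which is where error compounds. Across segments it restarts from real closes, which is what resets it.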

The trading engine

Two stages.

The first stage is ML swing analysis. The prediction engine produces curves; the swing-analysis layer walks those curves and identifies the profitable swings within them — peaks for shorts, troughs for longs, the runs between. Each identified swing is described as a setup: ticker, direction, entry date and price, exit date and price, expected return, duration.
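To make the shape of a setup concrete, here is a simplified sketch that cuts a predicted curve at its turning points and emits one setup per run. It is a plain turning-point scan standing in for the swing-analysis layer, not the layer itself. `Setup`, `find_setups`, and the `min_return` placeholder are hypothetical names, and in Simon any such threshold would be learned and not fixed.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Setup:
    ticker: str
    direction: str          # "long" or "short"
    entry_date: date
    entry_price: float
    exit_date: date
    exit_price: float
    expected_return: float  # fraction, e.g. 0.10 for +10%
    duration_days: int


def find_setups(ticker: str, dates: list[date], curve: list[float],
                min_return: float = 0.0) -> list[Setup]:
    """Cut the curve at turning points and emit one setup per run."""
    setups = []
    start = 0
    for i in range(1, len(curve)):
        last = i == len(curve) - 1
        # The slope changes sign at i, so i is a peak or a trough.
        # Simplification: flat stretches are not treated as turning points.
        turning = not last and (curve[i] - curve[i - 1]) * (curve[i + 1] - curve[i]) < 0
        if not (turning or last):
            continue
        entry, exit_ = curve[start], curve[i]
        direction = "long" if exit_ > entry else "short"
        ret = (exit_ - entry) / entry if direction == "long" else (entry - exit_) / entry
        if ret > min_return:
            setups.append(Setup(ticker, direction, dates[start], entry, dates[i],
                                exit_, ret, (dates[i] - dates[start]).days))
        start = i
    return setups
```

Each `Setup` carries exactly the fields the playbook stage needs, so the LLM downstream never has to look at the curve itself.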

The second stage is LLM playbook construction. The LLM receives the ML-identified setups (not raw prices, not raw predictions) along with capital, regime, and market context, and sequences them into a playbook: when to enter, when to exit, how much to allocate, in what order. The capital constraints — no more than 100% allocated to concurrent positions, freed capital redeployed when trades close — are enforced by the system after the LLM responds. The LLM does not have to remember to follow them, because the system will not let it not follow them.
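A sketch of what the post-response capital check might look like. It assumes the playbook is a list of trades with day indices and fractional allocations, and that a violating playbook is rejected outright. Whether Simon rejects or repairs a violating playbook is not stated here, and `PlannedTrade`, `CapitalViolation`, and `check_capital` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlannedTrade:
    ticker: str
    entry_day: int     # trading-day index within the window
    exit_day: int      # capital is treated as free again on this day
    allocation: float  # fraction of total capital, 0 < allocation <= 1


class CapitalViolation(ValueError):
    pass


def check_capital(trades: list[PlannedTrade], limit: float = 1.0) -> float:
    """Return peak concurrent allocation; raise if it ever exceeds `limit`."""
    events = []
    for t in trades:
        if not (0 < t.allocation <= limit) or t.exit_day <= t.entry_day:
            raise CapitalViolation(f"malformed trade: {t}")
        events.append((t.entry_day, 1, t.allocation))   # capital committed
        events.append((t.exit_day, 0, -t.allocation))   # capital released
    in_use = peak = 0.0
    # Sorting puts releases (0) before commitments (1) on the same day,
    # so freed capital can be redeployed immediately.
    for day, _, delta in sorted(events):
        in_use += delta
        peak = max(peak, in_use)
        if in_use > limit + 1e-9:
            raise CapitalViolation(f"{in_use:.0%} of capital allocated on day {day}")
    return peak
```

Because the check is deterministic and runs after the LLM responds, the prompt does not need to carry the constraint at all.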

The three-tier testing framework

The most useful intellectual move I have made on this project is the decomposition of performance into three tests on the same window:

Optimal. A theoretical ceiling. Given perfect knowledge of every future price, what is the maximum achievable return? An optimal-calculator walks the actual price history and greedily allocates capital across the best non-overlapping trades. This is not a realistic target — it is what a time traveler with perfect execution could do. It exists to contextualize the other two.
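For intuition, a perfect-foresight ceiling can be sketched in a few lines under much stronger simplifications than the optimal-calculator described above: close-to-close moves only, all capital in the single best ticker each day, shorts allowed, no costs or slippage. The function name and input shape are hypothetical, and this is not the calculator that produced the reported Optimal figure.

```python
def foresight_ceiling(prices: dict[str, list[float]]) -> float:
    """Total return (as a fraction) from holding, each day, the single best
    close-to-close move across all tickers, long or short."""
    lengths = {len(series) for series in prices.values()}
    if len(lengths) != 1:
        raise ValueError("all tickers must cover the same days")
    n_days = lengths.pop()
    equity = 1.0
    for day in range(1, n_days):
        # With perfect knowledge, direction is free: take the absolute move.
        best = max(abs(series[day] - series[day - 1]) / series[day - 1]
                   for series in prices.values())
        equity *= 1.0 + best
    return equity - 1.0
```

Even this toy version shows why the ceiling is so far above anything achievable: it compounds every favorable move and never pays for being wrong.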

Answer Key. Simon's trading engine with perfect predictions. The prediction curves are replaced with actual prices, so every "prediction" is 100% accurate. This isolates the quality of the trading engine: how well does Simon convert perfect information into returns?

Blind. Simon's complete system, end to end. His own predictions feed his own trading engine. This is the honest measure of operational performance.

The gap between Optimal and Answer Key tells me how much the LLM is leaving on the table — strategy weakness, conservative sizing, missed sequencing.

The gap between Answer Key and Blind tells me how much my prediction error is costing me — direction errors, price drift, swing misidentification.

Together they give me a diagnostic. I do not just know that Simon underperformed; I know whether the failure was a prediction failure or a strategy failure, which means I know which engine to work on next.

Current performance — honest version

Tested on June–August 2025, ten tickers, $15k starting capital:

Test             Return    Trades   Win rate
Optimal          +236.4%   177      100%
Answer Key       +104.5%   17       100%
Blind (rolling)  +22.1%    61       67%

The numbers I want to draw your attention to are not the +22% — they are the gaps.

Trading gap (Optimal → Answer Key): 131.9 points. With perfect predictions, the LLM sequences seventeen trades against the optimal calculator's one hundred seventy-seven. That gap is capital rotation: the LLM is too conservative about freeing capital and redeploying it into the next setup. There are dozens of profitable trades it is simply not taking, not because it disagrees with them, but because it does not cycle out of completed positions aggressively enough. This is the largest single source of underperformance, and it is the easiest one to address: it is a strategy problem, not a prediction problem.

Prediction gap (Answer Key → Blind): 82.4 points. This is the prediction engine. Direction accuracy hovers around 60% on average and 70% on the best windows, and price error sits at three to eight percent on fourteen-day horizons. This is the harder problem to close. The fix is not a single change to a single model. It is more features, better sentiment, better residual correction, and more disciplined training on the windows where the model is most wrong.
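The two gaps are plain subtraction on the three returns from the same window. A small sketch, with a hypothetical `decompose` helper, using the figures reported above:

```python
def decompose(optimal: float, answer_key: float, blind: float) -> dict[str, float]:
    """Split the shortfall from the ceiling into trading and prediction gaps,
    all in percentage points."""
    return {
        "trading_gap": round(optimal - answer_key, 1),     # strategy cost
        "prediction_gap": round(answer_key - blind, 1),    # forecast cost
        "total_gap": round(optimal - blind, 1),
    }


gaps = decompose(optimal=236.4, answer_key=104.5, blind=22.1)
print(gaps)
```

With the reported returns this gives a trading gap of 131.9 points and a prediction gap of 82.4 points, which together account for the full 214.3 points between Optimal and Blind.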

I publish these numbers because the alternative — picking the best window and the best test and reporting that one — is what most public ML-trading writing does, and I think that practice is dishonest. The blind number is the honest one. It is what I would be making if I had been live for that window.

What I have learned, in passing

A few things I did not know when I started.

The hardest part of an ML-and-LLM hybrid is not getting either component to work. It is being disciplined about which one does what. The temptation to let the LLM "also predict" or "also weigh in on the sentiment" is constant, and every time I have given in to it the system got worse. The clean separation is fragile and worth defending.

Silent fallbacks are the single most common quiet failure mode of ML systems shipped in production. Every system I have shipped — Simon and otherwise — has been better the day after I removed the last silent fallback. It is an unsexy invariant that pays back consistently.

Honest performance reporting is its own kind of edge. Most systems that look good in public reporting are looking at a cherry-picked window. If you build the honest-reporting infrastructure first, you spend the rest of the project actually improving the system, rather than choosing which numbers to show.

There is more cross-pollination than I expected between this work and the day job. The principles I find myself returning to when designing AI deployments inside legal teams — separate the deterministic from the probabilistic; let the model do what it is good at and constrain it elsewhere; refuse silent fallbacks; report honest numbers, not cherry-picked ones — are the same principles I am defending in Simon. I am not sure how much of that is convergent evolution and how much of it is that I have been doing both at once for too long now to keep them in separate boxes.

What's next

Three things, in order of leverage:

  1. Tighten capital rotation. This is the largest open gap and the most tractable. The LLM is allowed to sit on capital for too long. The fix is some combination of prompt redesign, a deterministic post-validation layer that flags under-rotation, and possibly a small reinforcement signal during retraining.
  2. Close the prediction gap. Slower. The training loop is built; the work is the work — more features, more diverse training windows, better sentiment harmonization. Direction accuracy from 60% to 70% would meaningfully change the blind number.
  3. Public write-ups of specific design decisions. This page is the start. The two-engines split, the no-silent-fallbacks invariant, the structured-LLM-output pattern, and the three-tier testing framework each deserve their own treatment. Other ML systems builders are arriving at adjacent versions of these decisions, and I think the field is better when those decisions get argued about in public rather than rediscovered in private.
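The deterministic post-validation layer from item 1 does not exist yet, so the following is only one possible shape for it: measure how much capital a playbook leaves idle per day and flag the playbook when the average is too high. The trade tuple layout, the function names, and the `max_mean_idle` threshold are all assumptions, and the threshold would presumably be learned like every other parameter.

```python
def idle_capital_by_day(trades: list[tuple[int, int, float]], n_days: int) -> list[float]:
    """`trades` are (entry_day, exit_day, allocation) tuples with allocation as a
    fraction of capital; returns the idle fraction for each day in the window."""
    in_use = [0.0] * n_days
    for entry_day, exit_day, allocation in trades:
        # Capital counts as committed from entry_day up to, not including, exit_day.
        for day in range(max(entry_day, 0), min(exit_day, n_days)):
            in_use[day] += allocation
    return [max(0.0, 1.0 - used) for used in in_use]


def flag_under_rotation(trades: list[tuple[int, int, float]], n_days: int,
                        max_mean_idle: float) -> bool:
    """True when the playbook leaves more capital idle, on average, than allowed."""
    if n_days <= 0:
        raise ValueError("n_days must be positive")
    idle = idle_capital_by_day(trades, n_days)
    return sum(idle) / n_days > max_mean_idle
```

A flag from a check like this could be fed back to the LLM as a revision request or logged as a signal for retraining; which of those works better is an open question.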