The Shape of the Room.
How 100 AI agents, given only evidence from before July 2021, debated Peloton and converged on what later happened. The same setup re-run on Lululemon stayed split, which is the part that matters.
If you have thirty seconds.
We sat 100 imaginary people in a room and asked them to evaluate Peloton at a specific moment in time. Different roles, different motivations, same evidence. We let them debate. The room converged on "this company is in trouble." Then we ran the same setup on Lululemon, and that room stayed split. The interesting result is not the verdict. It is that the same machinery produced two different shapes.
Build a room, run a debate, compare two shapes.
No technical knowledge required. The same picture works for a 12-year-old and a portfolio manager.
If the room was just trained to be pessimistic, both companies should look the same. They don't. The same instrument produces two different shapes. That asymmetry is the actual experiment, not the headline number.
What does it look like when 100 people argue the same thesis at once?
Most diligence ends in a static artifact: a memo, a deck, a model, a call summary. Those artifacts can be rigorous. But they all do the same thing in the end. They compress disagreement into a single authored view.
This experiment points at a different primitive. Instead of asking one analyst to summarize a narrative, you build a small synthetic market of perspectives and let the narrative take damage in public. The question stops being "what is the right answer" and becomes something sharper: does the business story survive contact with motivated disagreement?
Before any of that, though, here is what the room actually looks like. One hundred dots. Each one is an agent with a role, an incentive, and a piece of the evidence pack. Right now, none of them have argued with each other. They are about to.
The room, before it argues
100 agents, eight stakeholder archetypes, one shared evidence pack from before June 30, 2021. Each dot drifts through possibility-space; nothing is decided yet.
It looks unremarkable. That is the point. The interesting part is not the starting state. It is how the dots move once five rounds of structured argument get applied to them, and whether the same machinery does the same thing to a different company.
So we added a control: Lululemon. Adjacent consumer exposure, similar demographic, similar premium-brand vulnerability, run through the exact same eight-archetype panel with the exact same five-round structure. If the system simply amplified generic skepticism, both companies should collapse. If only one collapses, the collapse is information.
That is the question this essay walks through. Not "did AI predict Peloton." But: what does fragility look like, in real time, when you can finally watch it happen?
Eight archetypes, one evidence pack, five rounds of pressure.
The composition of the panel matters. The point is not to get a representative sample of investors. The point is to put different incentives into the same room and force them to argue in front of each other. A short seller has career risk on one side; a sell-side analyst has career risk on both. A pandemic convert is testing whether their own behavior survives reopened life. A gym operator watches Peloton as a threat and is paid, in some sense, to find weak seams.
Click any archetype to see who is in the room and what they brought with them.
Building the room
Same composition for both Peloton and Lululemon. The separation between the two cannot be explained by changing who is in the panel. It comes only from what the evidence pack does to them once they engage.
Dedicated subscriber
Click any archetype above to switch the highlighted dots. Each role carries a different incentive into the debate.
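For readers who want the composition as data rather than prose, here is one way it could be represented. This is a sketch, not the authors' schema: the `Archetype` dataclass and the incentive strings are our paraphrase, and only the five roles named in the essay appear.

```python
# Hypothetical representation of the panel. The five archetype names come
# from the essay; everything else here is an illustrative assumption.
from dataclasses import dataclass

@dataclass(frozen=True)
class Archetype:
    name: str       # stakeholder role
    incentive: str  # what this role is, in some sense, paid to notice

PANEL = [
    Archetype("short seller", "career risk on one side"),
    Archetype("sell-side analyst", "career risk on both sides"),
    Archetype("pandemic convert", "test whether their own behavior survives reopening"),
    Archetype("gym operator", "find the competitor's weak seams"),
    Archetype("dedicated subscriber", "defend the product they actually use"),
    # ...three further archetypes are not named in the essay and are elided here.
]
```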
The mechanics are deliberately simple. Each agent receives the same evidence pack: financials, transcripts, analyst notes, product reviews, news coverage, all dated June 30, 2021 or earlier. Each agent reads through their archetype's lens. Then five rounds happen, in order (a toy sketch of the loop follows the list):
- R1: Independent orientation. Each agent forms a position from the evidence pack alone. No room dynamics yet. This is the cleanest read of what the documents actually say.
- R2: Forced engagement with disagreement. Agents see an anonymized cross-section of opposing arguments and are required to update their confidence.
- R3: Stress-testing. The strongest counter-arguments from the prior round are explicitly directed back at each position. Bull cases have to defend; bear cases have to defend.
- R4: Room-level signal. Agents see an anonymized summary of where the broader panel is sitting. If the system is just inducing herding, both companies should accelerate together.
- R5: Final commitment. Each agent locks in a position and a confidence score. The shape of the final distribution is the output.
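To make the protocol concrete, here is a runnable toy of the five rounds. Everything in it is an illustrative assumption: the essay specifies the rounds, not their implementation, and the update rules below are deliberately crude stand-ins.

```python
# A runnable toy of the five-round structure described above. The Agent
# fields, thresholds, and update rules are illustrative assumptions, not
# the authors' implementation.
import random
from dataclasses import dataclass

@dataclass
class Agent:
    archetype: str
    stance: str = "undecided"
    confidence: float = 5.0  # 1-10 scale, matching the essay's scores

    def read(self, pack: dict) -> None:
        """R1: independent orientation from the evidence pack alone."""
        self.stance = "bearish" if random.random() < pack["bearish_lean"] else "bullish"

    def engage(self, opposing: list[str]) -> None:
        """R2: forced engagement with dissent; a confidence update is required."""
        self.confidence = max(1.0, self.confidence - 0.1 * len(opposing))

    def defend(self, counter_stance: str) -> None:
        """R3: face the strongest counter-argument; fold if too shaken."""
        if self.confidence < 4.0:
            self.stance, self.confidence = counter_stance, 5.0

    def observe(self, tally: dict) -> None:
        """R4: anonymized room-level signal; the majority gains conviction."""
        if self.stance == max(tally, key=tally.get):
            self.confidence = min(10.0, self.confidence + 1.0)

def run_debate(agents: list[Agent], pack: dict) -> list[tuple[str, float]]:
    for a in agents:
        a.read(pack)                                                       # R1
    for a in agents:
        a.engage([b.stance for b in agents if b.stance != a.stance][:5])   # R2
    for a in agents:
        a.defend("bearish" if a.stance == "bullish" else "bullish")        # R3
    tally: dict = {}
    for a in agents:
        tally[a.stance] = tally.get(a.stance, 0) + 1
    for a in agents:
        a.observe(tally)                                                   # R4
    return [(a.stance, a.confidence) for a in agents]                      # R5
```

Run it with 100 `Agent` instances and a pack whose `bearish_lean` reflects how the documents read on their own; the output is the final (stance, confidence) distribution, which is the "shape" the rest of the essay talks about.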
That is the whole instrument. It is not subtle. The reason it is interesting is what happens inside it.
Peloton converges. Lululemon refuses to.
Press play. Watch the dots migrate. Two panels, side by side, the same machinery applied to both. The Peloton panel is on the left; the Lululemon control is on the right. Round by round, the agents revise their positions in front of each other.
The only meaningful difference between the two panels is the evidence pack. Same archetypes. Same number of agents. Same five rounds. Same confidence-update protocol.
Five rounds, two companies, one instrument
Watch the dots. By round five the Peloton room is almost unanimous; the Lululemon room is still genuinely split. That asymmetry, produced by identical machinery, is the actual finding.
By round five Peloton is at 91 bearish out of 100, average confidence 8.9 / 10, with 36 agents at maximum conviction. The control ends at 47 bearish, 46 bullish, 7 conflicted, average confidence 8.12 / 10, with only 2 agents at maximum conviction. Same instrument, same evidence period, two different shapes.
The trajectory matters as much as the endpoint. Peloton's bearish line climbs monotonically through every round: 62, 73, 82, 88, 91. Lululemon oscillates inside a narrow corridor, never converging. A second Peloton run, with round-four framing tightened, still ended in roughly the same place. Whatever the system was doing to the Peloton evidence, it was doing it consistently.
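One way to put a number on "shape", using only the tallies reported above. The metric (top-stance share plus normalized entropy) is our framing, not part of the experiment, and Peloton's 9 non-bearish agents are assumed bullish for the demo.

```python
# Quantifying "shape" from the reported final tallies. The metric is ours;
# only the numbers given in the essay go in.
from math import log2

def shape(tally: dict[str, int]) -> tuple[float, float]:
    n = sum(tally.values())
    shares = [c / n for c in tally.values() if c > 0]
    consensus = max(shares)                       # share held by the top stance
    entropy = -sum(p * log2(p) for p in shares)   # 0 bits = unanimous room
    return consensus, entropy / log2(3)           # normalized over 3 stances

print(shape({"bearish": 91, "bullish": 9}))                    # ~ (0.91, 0.28)
print(shape({"bearish": 47, "bullish": 46, "conflicted": 7}))  # ~ (0.47, 0.82)

# The trajectory claim is checkable too: Peloton's bearish count climbs
# strictly through all five rounds.
peloton_bearish = [62, 73, 82, 88, 91]
assert all(a < b for a, b in zip(peloton_bearish, peloton_bearish[1:]))
```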
The shape is the signal
Peloton climbs. The control oscillates. A replication run, re-framed at round four, lands within a single agent of the original.
The interesting output is not the verdict. It is that the same machinery, applied to two adjacent companies, produced two structurally different shapes. One narrative carried load. The other did not.
The room did not just turn bearish. It found where the story carried load.
The convergence number is the headline. The structure underneath is the actual product. After five rounds, the panel did not simply pile onto a single bearish theme. It identified seven distinct load-bearing weaknesses and showed that they were not independent. They reinforced each other. Demand softness fed unit-economics fragility, which fed management-credibility erosion, which fed customer-sentiment drift, which fed back into demand softness.
That coupling is what made the narrative fragile. Not any single weakness. The fact that all of them had to remain non-critical at the same time for the bull case to hold.
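The coupling the panel described is easy to state as a graph. Only the four dimensions named above appear; the other three of the seven are elided here, and the cycle check is just our way of formalizing "not independent."

```python
# The reinforcement loop as a directed graph. Edge direction follows the
# sentence above; the remaining three dimensions are not named in the essay.
EDGES: dict[str, list[str]] = {
    "demand softness": ["unit-economics fragility"],
    "unit-economics fragility": ["management-credibility erosion"],
    "management-credibility erosion": ["customer-sentiment drift"],
    "customer-sentiment drift": ["demand softness"],  # the loop closes
}

def has_cycle(edges: dict[str, list[str]]) -> bool:
    # Depth-first search with a recursion stack; a back-edge means the
    # weaknesses reinforce each other instead of failing independently.
    visiting: set[str] = set()
    done: set[str] = set()

    def visit(node: str) -> bool:
        if node in visiting:
            return True
        if node in done:
            return False
        visiting.add(node)
        if any(visit(nxt) for nxt in edges.get(node, [])):
            return True
        visiting.discard(node)
        done.add(node)
        return False

    return any(visit(node) for node in edges)

print(has_cycle(EDGES))  # True: every node has to stay non-critical at once
```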
Tap any dimension to see what the agents kept circling back to.
Seven load-bearing weaknesses
The same seven dimensions ran on both companies. The Peloton column is mostly red. The Lululemon column is mostly empty. The two columns side by side are the actual diagnostic.
A useful agent system does not replace judgment. It changes where judgment starts.
Traditional diligence is expensive before it is adversarial. You collect the mosaic. You write the memo. You schedule the IC debate. Then you find out where the story is brittle, usually after the team has already formed a house view, and usually after the deadline has compressed everyone into "let's move forward unless something is obviously wrong."
A synthetic stakeholder panel moves that pressure earlier. The first-pass adversarial test stops being the most expensive part of the process and becomes the cheapest. Two days into a deal, you can already see where the narrative is fragile. Not because the model told you so, but because every archetype in the room found the same failure modes from different starting points.
Disagreement is compressed late, often after the team has already formed a house view. The IC meeting is when you find out the deal is fragile.
The first-pass stress test gets cheap. Humans spend their scarce attention on the load-bearing weaknesses surfaced by the room, not on producing the room from scratch.
The win is not "believe the agents." The win is: before your team spends two weeks deep in a deal, you have a sharper map of the questions that deserve human attention.
That is the actual product. Not a verdict. A map. The map says: spend time here, not there. Demand cohorts deserve real work. Hardware unit economics deserve real work. Management credibility deserves real work. The synthetic panel did the cheap part of finding the dimensions, so the human team could do the expensive part of pressure-testing them with primary research.
And on the control side: the panel not converging is also useful. It tells the team that Lululemon's narrative is genuinely contested, which is itself a signal. Genuinely contested usually means there is a real seam to investigate; manufactured consensus usually means there isn't.
What this is, and what it is not.
The cutoff was set at June 30, 2021, and the evidence pack was bounded accordingly. But the underlying model weights were trained well after that. Latent knowledge of Peloton's later decline almost certainly lives in the weights even when the prompt context is time-gated. That vulnerability has to be stated plainly.
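For concreteness, the time gate itself is trivial to express; the hard problem is the one this paragraph concedes. The `Document` fields below are assumptions, not the authors' schema.

```python
# A minimal sketch of the evidence-pack cutoff. Field names are assumptions.
from dataclasses import dataclass
from datetime import date

CUTOFF = date(2021, 6, 30)

@dataclass
class Document:
    source: str      # filing, transcript, analyst note, review, news item
    published: date
    text: str

def gate(pack: list[Document], cutoff: date = CUTOFF) -> list[Document]:
    # Bounds the prompt context; it cannot bound what the model's weights
    # already know about events after the cutoff.
    return [d for d in pack if d.published <= cutoff]
```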
If the method simply amplified generic skepticism, the Lululemon control should have collapsed into the same bearish consensus. It did not. The control stayed split, which is what makes the Peloton pattern informative rather than merely theatrical. The control is the load-bearing piece of the design.
A second Peloton run, after tightening the round-four framing, still landed at 91 bearish. That is replication, not validation. Validation needs prospective tests on names whose outcomes are not yet known.
What the next phase needs:
- A broader historical basket of winners, losers, and muddled middle cases.
- Round-by-round convergence tracked more explicitly.
- Cost and speed measured against a realistic human baseline.
- Prospective tests where the answer is not in the training data.
- And most importantly: agents treated as triage infrastructure, not as autonomous decision-makers.

The point of this experiment is to make the next experiment cheaper.
Not as oracles. As instruments.
The interesting part is not that 91 synthetic stakeholders ended bearish on Peloton. The interesting part is that, under a fixed historical cutoff and with a live control, the panel repeatedly focused on the same structural weaknesses that later mattered, while not producing the same collapse dynamic on the adjacent name.
That is enough to keep going. Not enough to declare victory. But enough to justify the next phase: broader baskets, prospective tests, cleaner controls, sharper measurement against human baselines.
The agents do not replace the analyst. They change where the analyst's attention starts. That is the whole game.