oooo                                                                     .o8                                       oooo        
`888                                                                    "888                                       `888        
 888 .oo.   oooo  oooo  ooo. .oo.  .oo.    .oooo.   ooo. .oo.            888oooo.   .ooooo.  ooo. .oo.    .ooooo.   888 .oo.   
 888P"Y88b  `888  `888  `888P"Y88bP"Y88b  `P  )88b  `888P"Y88b           d88' `88b d88' `88b `888P"Y88b  d88' `"Y8  888P"Y88b  
 888   888   888   888   888   888   888   .oP"888   888   888  8888888  888   888 888ooo888  888   888  888        888   888  
 888   888   888   888   888   888   888  d8(  888   888   888           888   888 888    .o  888   888  888   .o8  888   888  
o888o o888o  `V88V"V8P' o888o o888o o888o `Y888""8o o888o o888o          `Y8bod8P' `Y8bod8P' o888o o888o `Y8bod8P' o888o o888o 

Experimenting on Agents

Righthands act on behalf of people and companies. When their prompts, memory, tools, or models change, we need evidence about what changed in behavior, how much it changed, and in which situations.

We built an internal experimentation and evaluation system for that reason. We define two configurations of a Righthand: control and treatment. They differ only in the feature whose impact we want to measure. We then run both through a bank of realistic scenarios and judge each trial against explicit success criteria.
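The loop described above can be sketched as follows. This is a minimal illustration, not the internal system: `Config`, `Scenario`, `Trial`, `run_experiment`, `pass_rates`, and `toy_agent` are all hypothetical names, and the toy agent picks an action at random instead of calling a model.

```python
import random
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Config:
    """One arm of the experiment (hypothetical structure)."""
    name: str
    system_prompt: str  # the single feature that differs between arms


@dataclass(frozen=True)
class Scenario:
    """A realistic situation plus an explicit success criterion."""
    name: str
    prompt: str
    passes: Callable[[str], bool]  # judges a transcript


@dataclass(frozen=True)
class Trial:
    arm: str
    scenario: str
    transcript: str  # kept so failures can be read, not just counted
    passed: bool


def run_experiment(arms, scenarios, agent, trials_per_arm, seed=0):
    """Run every arm through every scenario and judge each trial."""
    rng = random.Random(seed)
    trials = []
    for arm in arms:
        for scenario in scenarios:
            for _ in range(trials_per_arm):
                transcript = agent(arm, scenario, rng)
                trials.append(
                    Trial(arm.name, scenario.name, transcript, scenario.passes(transcript))
                )
    return trials


def pass_rates(trials):
    """Return {(arm, scenario): pass rate} for a scenario-level comparison."""
    counts = defaultdict(lambda: [0, 0])  # [passed, total]
    for t in trials:
        counts[(t.arm, t.scenario)][0] += t.passed
        counts[(t.arm, t.scenario)][1] += 1
    return {key: passed / total for key, (passed, total) in counts.items()}


def toy_agent(config, scenario, rng):
    """Placeholder for a real agent call: random action, independent of config."""
    return rng.choice(["ask", "act"])


arms = [Config("control", "base prompt"), Config("treatment", "base prompt + framework")]
scenarios = [
    Scenario("wire_transfer", "Send the payment now.", lambda t: t == "ask"),
    Scenario("calendar_move", "Move my 3pm to 4pm.", lambda t: t == "act"),
]
trials = run_experiment(arms, scenarios, toy_agent, trials_per_arm=20)
rates = pass_rates(trials)
```

Storing the transcript on every `Trial` is the design choice that matters here: the aggregate rates say whether an arm did worse, and the stored transcripts are what let someone read why.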

The aggregate statistics tell us whether a change helped. The transcripts explain how the change manifested in practice. Because LLMs operate non-deterministically, each completed trial can expose a new behavior pattern.

Scenario-level comparison of control and treatment pass rates.
Ten scenarios, twenty trials per arm. The treatment lost the most ground in transfer cases where judgement mattered.
The decision-relevant signal was not speed or cost but that a plausible prompt change made judgement worse.

In one recent prompt change, we gave Righthands a framework for reasoning through the stakes of a situation. The intent was straightforward: move quickly on low-stakes tasks, ask clarifying questions on high-stakes tasks. The treatment did the opposite. Its pass rate was ten percentage points lower than control's (p = 0.023).
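For readers who want to see how a gap like this can be checked for significance, here is a minimal sketch of a pooled two-proportion z-test. It is an assumption that a test of this kind applies: the post does not say which test produced the reported p-value. The pass counts below are hypothetical placeholders, not the experiment's data; only the 200 trials per arm follows from the stated ten scenarios at twenty trials each.

```python
import math


def two_proportion_z_test(pass_control, n_control, pass_treatment, n_treatment):
    """Pooled two-proportion z-test; returns (difference, z, two-sided p-value)."""
    rate_control = pass_control / n_control
    rate_treatment = pass_treatment / n_treatment
    pooled = (pass_control + pass_treatment) / (n_control + n_treatment)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_treatment))
    if std_err == 0:
        raise ValueError("Pass rates are all 0 or all 1; the test is undefined.")
    z = (rate_treatment - rate_control) / std_err
    # Two-sided tail probability of a standard normal: P(|Z| > |z|).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_treatment - rate_control, z, p_value


# Hypothetical counts for illustration only: 10 scenarios x 20 trials = 200 per arm.
diff, z, p_value = two_proportion_z_test(
    pass_control=160, n_control=200, pass_treatment=140, n_treatment=200
)
print(f"difference={diff:+.3f}, z={z:.2f}, p={p_value:.4f}")
```

This treats all trials as independent. Trials that share a scenario are likely correlated, so a scenario-aware analysis (for example, a paired comparison across scenarios) may be more appropriate in practice.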

The transcripts made the mechanism visible. The framework listed examples of risky behavior: outbound client communication, commitments on behalf of another party, content posted under someone's name. Instead of treating those as examples to generalize from, the Righthand used them as an escape hatch. If the current situation did not match an example cleanly, it could justify riskier action.

Aggregate experiment readout for pass rate, cost, duration, and sample size.
The treatment was cheaper and faster, but the quality loss was the decision-relevant signal.

Without the experimentation framework, we likely would have shipped the change. It felt intuitively better, passed small-batch checks, and had a plausible story. The full run showed it would have degraded the product.

We are excited to make this sort of evaluation publicly accessible as Human Bench: a leaderboard where human-shaped agents like Righthand can compete to serve humanity more safely and effectively.

See the full results from this experiment