Experimenting on Agents
Righthands act on behalf of people and companies. When their prompts, memory, tools, or models change, we need evidence about what changed in behavior, how much it changed, and in which situations.
We built an internal experimentation and evaluation system for that reason. We define two configurations of a Righthand: control and treatment. They differ only in the feature whose impact we want to measure. We then run both through a bank of realistic scenarios and judge each trial against explicit success criteria.
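As a rough illustration of that setup, here is a minimal sketch in Python. It is not our internal system: the names `Config`, `Scenario`, `Trial`, `run_experiment`, `pass_rate`, and the `toy_agent` stub are all hypothetical, and the "judge" is reduced to a simple predicate on the transcript. The one idea it tries to show faithfully is that the two arms must differ by exactly one feature before any trial runs.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Config:
    """One arm of the experiment (hypothetical shape)."""
    name: str
    features: frozenset


@dataclass(frozen=True)
class Scenario:
    """A realistic task plus an explicit success criterion."""
    scenario_id: str
    task: str
    is_success: Callable[[str], bool]  # stand-in for a real judge


@dataclass(frozen=True)
class Trial:
    arm: str
    scenario_id: str
    transcript: str  # kept so behavior can be read, not just counted
    passed: bool


def run_experiment(agent, control: Config, treatment: Config,
                   scenarios: List[Scenario],
                   trials_per_scenario: int = 1) -> List[Trial]:
    # Refuse to run unless the arms differ by exactly one feature.
    diff = control.features ^ treatment.features
    if len(diff) != 1:
        raise ValueError(
            f"arms must differ by exactly one feature, got {sorted(diff)}")
    trials = []
    for config in (control, treatment):
        for scenario in scenarios:
            # Repeat each scenario: agent output is non-deterministic.
            for _ in range(trials_per_scenario):
                transcript = agent(config, scenario)
                trials.append(Trial(config.name, scenario.scenario_id,
                                    transcript,
                                    scenario.is_success(transcript)))
    return trials


def pass_rate(trials: List[Trial], arm: str) -> float:
    arm_trials = [t for t in trials if t.arm == arm]
    return sum(t.passed for t in arm_trials) / len(arm_trials)


def toy_agent(config: Config, scenario: Scenario) -> str:
    # Deterministic stub for illustration; a real agent would call a model.
    if "ask_first" in config.features:
        return f"clarify: {scenario.task}"
    return f"act: {scenario.task}"
```

In a real run the agent call and the judge would both be far richer, but keeping the transcript on every `Trial` is the design choice that matters: the pass rate says whether something changed, and the stored transcripts are what let us see how.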
The aggregate statistics tell us whether a change helped. The transcripts explain how the change manifested in practice. Because LLMs operate non-deterministically, each completed trial can expose a new behavior pattern.
The decision-relevant signal is not always speed or cost. In the case below, it was that a plausible prompt change made judgement worse.
In one recent prompt change, we gave Righthands a framework for reasoning through the stakes of a situation. The intent was straightforward: move quickly on low-stakes tasks, ask clarifying questions on high-stakes tasks. The treatment did the opposite. It scored ten percentage points below control on our success criteria (p = 0.023).
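For readers who want to see how a gap in pass rates turns into a p-value, here is a minimal sketch of a two-sided two-proportion z-test using only the Python standard library. This is an assumption for illustration: the post does not say which statistical test we used, and the counts in the usage example are made up, not the data from the run described above. The function name `two_proportion_z_test` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(pass_control: int, n_control: int,
                          pass_treatment: int, n_treatment: int):
    """Two-sided z-test for a difference in pass rates (pooled variance).

    Returns (difference, z, p_value), where difference is
    treatment rate minus control rate.
    """
    rate_c = pass_control / n_control
    rate_t = pass_treatment / n_treatment
    # Pooled pass rate under the null hypothesis of no difference.
    pooled = (pass_control + pass_treatment) / (n_control + n_treatment)
    se = sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_treatment))
    if se == 0:
        raise ValueError("standard error is zero; all trials had the same outcome")
    z = (rate_t - rate_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return rate_t - rate_c, z, p_value


if __name__ == "__main__":
    # Hypothetical counts chosen only to give a ten-point gap.
    diff, z, p = two_proportion_z_test(160, 200, 140, 200)
    print(f"difference={diff:+.3f}  z={z:.3f}  p={p:.4f}")
```

A z-test like this treats every trial as independent. When the same scenarios are run in both arms, a paired analysis may be more appropriate, so treat this as a starting point and not as a description of our method.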
The transcripts made the mechanism visible. The framework listed examples of risky behavior: outbound client communication, commitments on behalf of another party, content posted under someone's name. Instead of treating those as examples to generalize from, the Righthand used them as an escape hatch. If the current situation did not match an example cleanly, it could justify riskier action.
Without the experimentation framework, we likely would have shipped the change. It felt intuitively better, passed small-batch checks, and had a plausible story. The full run showed it would have degraded the product.
We are excited to make this sort of evaluation publicly accessible as Human Bench: a leaderboard where human-shaped agents like Righthand can compete to serve humanity more safely and effectively.