Switching to a more powerful model did not make Righthand better at everyday work tasks
We gave Righthand the same set of everyday work tasks twice: once running on Claude Sonnet, and once on the more powerful (and more expensive) Claude Opus. Both versions succeeded on exactly the same number of tasks. The only real difference was that Sonnet cost less and finished faster.
- Score: 84.0%
- Scenarios: 50
- Trials: 100
- Run errors: 0
- Sonnet cost: $12.18
- Opus cost: $22.64
Experimental design
The evaluation held the agent configuration constant while changing the model tier.
The battery covered workplace tasks such as lookup, scheduling, synthesis, judgment, and boundary recognition.
Observed result
Both models passed 42 of 50 scenarios (84.0%) in this run.
Within this battery, the observed difference was economic rather than behavioral: Sonnet cost less and completed tasks faster.
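The headline figures can be checked with a few lines of arithmetic. The sketch below uses only the numbers reported in the summary; the variable names are illustrative, and it assumes each cost figure is the total for that model's 50-scenario run, which the summary does not state explicitly.

```python
# Figures copied from the summary; names are illustrative, not from the harness.
scenarios = 50
passed = 42            # same count for both models
sonnet_cost = 12.18    # assumed: total USD for the Sonnet run
opus_cost = 22.64      # assumed: total USD for the Opus run

pass_rate = passed / scenarios * 100
cost_ratio = opus_cost / sonnet_cost
sonnet_cost_per_pass = sonnet_cost / passed
opus_cost_per_pass = opus_cost / passed

print(f"pass rate: {pass_rate:.1f}%")                   # pass rate: 84.0%
print(f"Opus / Sonnet cost ratio: {cost_ratio:.2f}x")   # about 1.86x
print(f"cost per passed scenario: Sonnet ${sonnet_cost_per_pass:.2f}, "
      f"Opus ${opus_cost_per_pass:.2f}")
```

Under that assumption, Opus cost a little under twice as much for the same number of passes.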
Interpretive limits
This was a founding run with one trial per scenario per model (100 trials in total).
A stronger public ranking should use repeated trials or a declared confidence threshold.
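To give a sense of how much uncertainty a single-trial run carries, here is a minimal sketch of a 95% Wilson score interval for a 42-of-50 pass rate. This is an illustration, not the method the evaluation used: the function name is hypothetical, and the interval treats the 50 scenarios as independent trials with one shared pass probability, which is a simplification.

```python
import math

def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95%)."""
    p = passes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width

low, high = wilson_interval(42, 50)
print(f"95% interval for 42/50: {low:.2f} to {high:.2f}")
```

For 42 of 50 the interval runs from roughly 0.71 to 0.92, so two models with identical observed scores could still differ by several points in their underlying pass rates. Repeated trials narrow this range.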
Scenario evidence
| Scenario | Sonnet | Opus | Difference (Sonnet minus Opus) |
|---|---|---|---|
| Calendar lookup | pass | pass | 0 pp |
| Contact email lookup | pass | pass | 0 pp |
| Task status check | pass | pass | 0 pp |
| Reminder relay | pass | pass | 0 pp |
| Note summarization | pass | pass | 0 pp |
| Message relay | pass | pass | 0 pp |
| Calendar today check | pass | pass | 0 pp |
| Project status lookup | pass | pass | 0 pp |
| Organization lookup | pass | pass | 0 pp |
| Availability check | pass | miss | +100 pp |
| Meeting confirmation | pass | pass | 0 pp |
| Information forwarding | pass | pass | 0 pp |
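The Difference column can be reproduced from the pass/miss outcomes. The sketch below assumes that, with one trial per scenario, a pass counts as a 100% pass rate and a miss as 0%, and that the difference is Sonnet minus Opus; the sign convention is inferred from the "Availability check" row rather than stated in the source. The function name and the two sample rows are illustrative.

```python
# Assumed mapping: with a single trial, each outcome is a 0% or 100% pass rate.
RATE = {"pass": 100, "miss": 0}

def difference_pp(sonnet: str, opus: str) -> int:
    """Sonnet pass rate minus Opus pass rate, in percentage points."""
    return RATE[sonnet] - RATE[opus]

rows = [
    ("Calendar lookup", "pass", "pass"),
    ("Availability check", "pass", "miss"),
]

for name, sonnet, opus in rows:
    diff = difference_pp(sonnet, opus)
    label = f"{diff:+d} pp" if diff else "0 pp"
    print(f"{name}: {label}")
```

With a single trial, every row can only read 0 pp, +100 pp, or -100 pp, which is why one disagreement looks so large in the table.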