Switching to a more powerful model did not make Righthand better at everyday work tasks

We gave Righthand the same set of everyday work tasks twice: once running on Claude Sonnet, and once on the more powerful (and more expensive) Claude Opus. Both versions succeeded on exactly the same number of tasks. The only real difference was that Sonnet finished cheaper and faster.

score: 84.0%
scenarios: 50
trials: 100
run errors: 0
sonnet cost: $12.18
opus cost: $22.64

Experimental design

The evaluation held the agent configuration constant while changing the model tier.

The battery covered workplace tasks such as lookup, scheduling, synthesis, judgment, and boundary recognition.

Observed result

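Both models passed 42 of 50 scenarios in this run, a score of 84.0% each. The totals match, but the individual outcomes do not line up on every scenario: Availability check, for instance, passed on Sonnet and missed on Opus.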

Within this battery, the observed aggregate difference was economic rather than behavioral: Sonnet cost $12.18 against $22.64 for Opus, and it completed tasks faster.
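The arithmetic behind the summary figures can be checked with a short Python sketch. It assumes the two cost figures are totals for the whole 50-scenario run, which the summary does not state outright; the variable names are illustrative.

```python
# Figures copied from the summary block; names are illustrative only.
passed, scenarios = 42, 50
sonnet_cost, opus_cost = 12.18, 22.64  # assumed to be total run cost in USD

pass_rate = passed / scenarios              # identical for both models
cost_ratio = opus_cost / sonnet_cost        # how many times more Opus cost
savings = 1 - sonnet_cost / opus_cost       # fraction saved by running Sonnet

# Cost per passed scenario, under the same total-cost assumption.
sonnet_per_pass = sonnet_cost / passed
opus_per_pass = opus_cost / passed

print(f"pass rate: {pass_rate:.1%}")
print(f"Opus / Sonnet cost ratio: {cost_ratio:.2f}x")
print(f"saving from Sonnet: {savings:.1%}")
print(f"cost per pass: Sonnet ${sonnet_per_pass:.2f}, Opus ${opus_per_pass:.2f}")
```

Under that assumption, Opus cost a little under twice as much as Sonnet for the same number of passes.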

Interpretive limits

This was a founding run with one trial per scenario.

A stronger public ranking should use repeated trials or a declared confidence threshold.
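To give a sense of how much uncertainty one trial per scenario leaves, the sketch below computes a 95% Wilson score interval for a pass rate of 42 out of 50. The Wilson interval is one common choice picked here for illustration, not a method this report declares, and the function name is hypothetical.

```python
import math

def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = passes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width

# 42 of 50 scenarios passed, one trial each (figures from this run).
low, high = wilson_interval(42, 50)
print(f"95% interval for the pass rate: {low:.1%} to {high:.1%}")
```

The interval spans roughly twenty percentage points, so a gap of a few scenarios between two models would not be distinguishable from noise at this sample size. Repeated trials per scenario would narrow it.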

Scenario evidence

Scenario	Sonnet	Opus	Difference (Sonnet minus Opus)
Calendar lookup	pass	pass	0 pp
Contact email lookup	pass	pass	0 pp
Task status check	pass	pass	0 pp
Reminder relay	pass	pass	0 pp
Note summarization	pass	pass	0 pp
Message relay	pass	pass	0 pp
Calendar today check	pass	pass	0 pp
Project status lookup	pass	pass	0 pp
Organization lookup	pass	pass	0 pp
Availability check	pass	miss	+100 pp
Meeting confirmation	pass	pass	0 pp
Information forwarding	pass	pass	0 pp