oooo                                                                     .o8                                       oooo        
`888                                                                    "888                                       `888        
 888 .oo.   oooo  oooo  ooo. .oo.  .oo.    .oooo.   ooo. .oo.            888oooo.   .ooooo.  ooo. .oo.    .ooooo.   888 .oo.   
 888P"Y88b  `888  `888  `888P"Y88bP"Y88b  `P  )88b  `888P"Y88b           d88' `88b d88' `88b `888P"Y88b  d88' `"Y8  888P"Y88b  
 888   888   888   888   888   888   888   .oP"888   888   888  8888888  888   888 888ooo888  888   888  888        888   888  
 888   888   888   888   888   888   888  d8(  888   888   888           888   888 888    .o  888   888  888   .o8  888   888  
o888o o888o  `V88V"V8P' o888o o888o o888o `Y888""8o o888o o888o          `Y8bod8P' `Y8bod8P' o888o o888o `Y8bod8P' o888o o888o 

Switching to a more powerful model did not make Righthand better at everyday work tasks

We gave Righthand the same set of everyday work tasks twice: once running on Claude Sonnet, and once on the more powerful (and more expensive) Claude Opus. Both versions succeeded on exactly the same number of tasks. The only real difference was cost and speed: Sonnet was cheaper and finished faster.

score: 84.0%
scenarios: 50
trials: 100
run errors: 0
sonnet cost: $12.18
opus cost: $22.64

Experimental design

The evaluation held the agent configuration constant while changing the model tier.

The battery covered workplace tasks such as lookup, scheduling, synthesis, judgment, and boundary recognition.

Observed result

Both models passed 42 of 50 scenarios in this run.

Within this battery, the observed difference was economic rather than behavioral: Sonnet cost less and completed tasks faster.
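
The arithmetic behind the headline numbers can be checked with a short Python sketch. It uses only the figures reported above (42 of 50 passes, $12.18 and $22.64 total cost). The function and constant names are illustrative and are not part of the benchmark's own tooling, and the per-scenario cost assumes the reported totals are spread evenly across the 50 scenarios.

```python
# Figures taken from the report; names are illustrative only.
SCENARIOS = 50
PASSED = 42            # same count for both models in this run
SONNET_COST_USD = 12.18
OPUS_COST_USD = 22.64


def pass_rate(passed: int, total: int) -> float:
    """Fraction of scenarios passed."""
    return passed / total


def cost_per_scenario(total_cost: float, scenarios: int) -> float:
    """Average cost per scenario, assuming an even spread of total cost."""
    return total_cost / scenarios


rate = pass_rate(PASSED, SCENARIOS)
sonnet_each = cost_per_scenario(SONNET_COST_USD, SCENARIOS)
opus_each = cost_per_scenario(OPUS_COST_USD, SCENARIOS)
cost_ratio = OPUS_COST_USD / SONNET_COST_USD

print(f"pass rate: {rate:.1%}")
print(f"Sonnet per scenario: ${sonnet_each:.4f}")
print(f"Opus per scenario:   ${opus_each:.4f}")
print(f"Opus / Sonnet cost ratio: {cost_ratio:.2f}x")
```

On these figures Opus cost a little under twice as much as Sonnet for the same pass count.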

Interpretive limits

This was a founding run with one trial per scenario.

A stronger public ranking should use repeated trials or a declared confidence threshold.
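
To illustrate why a single trial per scenario limits the conclusion, the sketch below computes a Wilson score interval for 42 passes out of 50. The choice of the Wilson interval and the 95% level (z = 1.96) are assumptions made here for illustration; the report does not state which confidence method or threshold a future ranking would use.

```python
import math


def wilson_interval(passes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial pass rate.

    z = 1.96 corresponds to roughly 95% confidence (an assumed level).
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = passes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    margin = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials ** 2)) / denom
    return center - margin, center + margin


# 42 of 50 scenarios passed, as reported for both models.
low, high = wilson_interval(42, 50)
print(f"pass rate interval: {low:.1%} to {high:.1%}")
```

Under these assumptions the interval runs from roughly 71% to 92%, wide enough that small differences between models could not be distinguished from one trial per scenario.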

Scenario evidence

Scenario                Sonnet  Opus  Difference
Calendar lookup         pass    pass  0 pp
Contact email lookup    pass    pass  0 pp
Task status check       pass    pass  0 pp
Reminder relay          pass    pass  0 pp
Note summarization      pass    pass  0 pp
Message relay           pass    pass  0 pp
Calendar Today Check    pass    pass  0 pp
Project status lookup   pass    pass  0 pp
Organization lookup     pass    pass  0 pp
Availability check      pass    miss  +100 pp
Meeting confirmation    pass    pass  0 pp
Information forwarding  pass    pass  0 pp