oooo                                                                     .o8                                       oooo        
`888                                                                    "888                                       `888        
 888 .oo.   oooo  oooo  ooo. .oo.  .oo.    .oooo.   ooo. .oo.            888oooo.   .ooooo.  ooo. .oo.    .ooooo.   888 .oo.   
 888P"Y88b  `888  `888  `888P"Y88bP"Y88b  `P  )88b  `888P"Y88b           d88' `88b d88' `88b `888P"Y88b  d88' `"Y8  888P"Y88b  
 888   888   888   888   888   888   888   .oP"888   888   888  8888888  888   888 888ooo888  888   888  888        888   888  
 888   888   888   888   888   888   888  d8(  888   888   888           888   888 888    .o  888   888  888   .o8  888   888  
o888o o888o  `V88V"V8P' o888o o888o o888o `Y888""8o o888o o888o          `Y8bod8P' `Y8bod8P' o888o o888o `Y8bod8P' o888o o888o 

A more powerful model was slightly better at remembering what it had already been told

We checked whether Righthand could act on things it had already learned: standing instructions, regular routines, past feedback, and open to-dos. The Opus version handled all six situations correctly. The Sonnet version got five of six, missing one about when to bundle routine updates into a single daily check-in.

Score: 83.3%
Scenarios: 6
Trials: 12
Run errors: 0
Sonnet cost: $2.23
Opus cost: $4.37

Experimental design

Each trial began with prior context already in place rather than from a blank synthetic setup.

The scenarios asked the agent to apply durable operating knowledge: a communication preference, recurring routines, feedback synthesis, a specification review cadence, and an unresolved follow-up.

Observed result

Sonnet passed five of six scenarios. The failed scenario concerned a standing preference for batching routine updates into a daily 10:45 AM PT standup.

Opus passed all six scenarios. The observed difference is directional only: the run used one trial per scenario, so the gap between the two models rests on a single scenario outcome.
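To make "directional" concrete, here is a minimal sketch that puts a Wilson score interval around each pass rate. It is not part of the benchmark itself. It assumes the six scenarios can be treated as independent pass/fail trials, which is a simplification, and the function name `wilson_interval` is ours.

```python
import math


def wilson_interval(passes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z = 1.96 is roughly 95% coverage).

    Assumption: each trial is an independent pass/fail outcome.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = passes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


# Pass counts reported above: Sonnet 5 of 6, Opus 6 of 6.
sonnet_low, sonnet_high = wilson_interval(5, 6)
opus_low, opus_high = wilson_interval(6, 6)

# If the intervals overlap, the run cannot separate the two models.
intervals_overlap = sonnet_high >= opus_low

print(f"Sonnet: {sonnet_low:.2f} to {sonnet_high:.2f}")
print(f"Opus:   {opus_low:.2f} to {opus_high:.2f}")
print(f"Intervals overlap: {intervals_overlap}")
```

With only six outcomes per model, both intervals are wide and they overlap, which is why a one-scenario gap should be read as a hint rather than a measured difference.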

Interpretive limits

This evaluation measures whether an agent uses prior context that is already available at trial start.

It does not yet measure whether an agent can acquire and retain new information over days or weeks.

Scenario evidence

Scenario                                 Sonnet  Opus  Difference (Sonnet - Opus)
Daily standup communication preference   miss    pass  -100 pp
Silent handling of nightly review        pass    pass  0 pp
Weekly project summary format            pass    pass  0 pp
Synthesis of recurring feedback          pass    pass  0 pp
Specification review cadence             pass    pass  0 pp
Open cost follow-up                      pass    pass  0 pp
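The summary figures can be recomputed from the scenario table. The sketch below is illustrative only: the dictionary layout and helper names are ours, and it assumes the Difference column is Sonnet minus Opus in percentage points, which is the reading that matches the -100 pp entry.

```python
# Outcomes transcribed from the scenario table: True = pass, False = miss.
results = {
    "Daily standup communication preference": {"sonnet": False, "opus": True},
    "Silent handling of nightly review": {"sonnet": True, "opus": True},
    "Weekly project summary format": {"sonnet": True, "opus": True},
    "Synthesis of recurring feedback": {"sonnet": True, "opus": True},
    "Specification review cadence": {"sonnet": True, "opus": True},
    "Open cost follow-up": {"sonnet": True, "opus": True},
}


def pass_rate(model: str) -> float:
    """Percentage of scenarios the given model passed."""
    passed = sum(row[model] for row in results.values())
    return 100 * passed / len(results)


def difference_pp(row: dict) -> int:
    """Per-scenario difference in percentage points (assumed: Sonnet minus Opus)."""
    return 100 * (int(row["sonnet"]) - int(row["opus"]))


sonnet_rate = pass_rate("sonnet")
opus_rate = pass_rate("opus")
differences = {name: difference_pp(row) for name, row in results.items()}

print(f"Sonnet pass rate: {sonnet_rate:.1f}%")
print(f"Opus pass rate:   {opus_rate:.1f}%")
for name, diff in differences.items():
    print(f"{name}: {diff} pp")
```

Sonnet's five passes out of six give the 83.3% figure shown in the summary, and the only nonzero difference is the daily standup scenario.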