A more powerful model was slightly better at remembering what it had already been told

We checked whether Righthand could act on things it had already learned: standing instructions, regular routines, past feedback, and open to-dos. The Opus version handled all six situations correctly. The Sonnet version got five of six, missing one about when to bundle routine updates into a single daily check-in.

sonnet score: 83.3% (5 of 6)
opus score: 100% (6 of 6)
scenarios: 6
trials: 12 (6 scenarios x 2 models, 1 trial each)
run errors: 0
sonnet cost: $2.23
opus cost: $4.37

Experimental design

Each trial began with prior context rather than a blank synthetic setup.

The scenarios asked the agent to apply durable operating knowledge: a communication preference, recurring routines, feedback synthesis, a specification review cadence, and an unresolved follow-up.

Observed result

Sonnet passed five of six scenarios. The failed scenario concerned a standing preference for batching routine updates into a daily 10:45 AM PT standup.

Opus passed all six scenarios. The observed difference is directional rather than conclusive because the run used one trial per scenario for each model, so the whole gap rests on a single outcome.
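
One way to see why a single trial per scenario supports only a directional reading is to put an uncertainty interval around each pass rate. The sketch below uses a Wilson score interval, which is not part of the original analysis. It also assumes the six scenarios can be treated as independent pass/fail draws, which is a simplification. The function name wilson_interval is hypothetical.

```python
import math


def wilson_interval(passes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z = 1.96 gives roughly 95%).

    Assumption: each scenario is an independent pass/fail draw.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = passes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Pass counts taken from the result above: Sonnet 5 of 6, Opus 6 of 6.
sonnet_low, sonnet_high = wilson_interval(5, 6)
opus_low, opus_high = wilson_interval(6, 6)

print(f"Sonnet: {sonnet_low:.2f} to {sonnet_high:.2f}")
print(f"Opus:   {opus_low:.2f} to {opus_high:.2f}")
```

Under these assumptions the intervals come out to roughly 0.44 to 0.97 for Sonnet and 0.61 to 1.00 for Opus. They overlap heavily, which is consistent with treating the gap as a direction to check with more trials rather than a settled difference.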

Interpretive limits

This run measures whether the agent uses prior context that is already available at trial start.

It does not yet measure whether an agent can acquire and retain new information over days or weeks.

Scenario evidence

Scenario	Sonnet	Opus	Difference (Sonnet minus Opus)
Daily standup communication preference	miss	pass	-100 pp
Silent handling of nightly review	pass	pass	0 pp
Weekly project summary format	pass	pass	0 pp
Synthesis of recurring feedback	pass	pass	0 pp
Specification review cadence	pass	pass	0 pp
Open cost follow-up	pass	pass	0 pp
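
The pass rates and the per-scenario differences in the table can be recomputed from the pass/miss outcomes alone. The following is a minimal sketch, not the evaluation harness itself. The names results, pass_rate, and diffs are hypothetical, and the difference is taken as Sonnet minus Opus to match the sign used in the table.

```python
# Outcomes copied from the table: (Sonnet, Opus) per scenario.
results = {
    "Daily standup communication preference": ("miss", "pass"),
    "Silent handling of nightly review": ("pass", "pass"),
    "Weekly project summary format": ("pass", "pass"),
    "Synthesis of recurring feedback": ("pass", "pass"),
    "Specification review cadence": ("pass", "pass"),
    "Open cost follow-up": ("pass", "pass"),
}


def pass_rate(outcomes: list[str]) -> float:
    """Percentage of outcomes equal to 'pass'."""
    return 100 * sum(o == "pass" for o in outcomes) / len(outcomes)


sonnet_rate = pass_rate([s for s, _ in results.values()])
opus_rate = pass_rate([o for _, o in results.values()])

# With one trial per scenario, each per-scenario rate is either 0 or 100,
# so the difference in percentage points can only be -100, 0, or +100.
diffs = {
    name: (100 if s == "pass" else 0) - (100 if o == "pass" else 0)
    for name, (s, o) in results.items()
}

print(f"Sonnet: {sonnet_rate:.1f}%  Opus: {opus_rate:.1f}%")
for name, diff in diffs.items():
    print(f"{name}: {diff:+d} pp")
```

Because each cell holds a single trial, the -100 pp in the first row reflects one miss rather than a measured rate. Across all six scenarios, the overall gap is about 16.7 percentage points.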