A more powerful model was slightly better at remembering what it had already been told
We checked whether Righthand could act on things it had already learned: standing instructions, regular routines, past feedback, and open to-dos. The Opus version handled all six situations correctly. The Sonnet version got five of six, missing one about when to bundle routine updates into a single daily check-in.
- score: 83.3%
- scenarios: 6
- trials: 12
- run errors: 0
- Sonnet cost: $2.23
- Opus cost: $4.37
Experimental design
Each trial began with prior context already in place rather than from a blank synthetic setup.
The scenarios asked the agent to apply durable operating knowledge: communication preference, recurring routines, feedback synthesis, specification review cadence, and an unresolved follow-up.
Observed result
Sonnet passed five of six scenarios. The failed scenario concerned a standing preference for batching routine updates into a daily 10:45 AM PT standup.
Opus passed all six scenarios. The observed difference is directional only, because each model ran each scenario once and a single miss separates the two.
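To illustrate why a one-scenario gap is best read as directional, the sketch below runs a two-sided Fisher exact test on the reported counts (Sonnet 5 pass, 1 miss; Opus 6 pass, 0 miss). The evaluation itself did not report any significance test. The function name `fisher_exact_two_sided` is hypothetical, and treating the six scenarios as interchangeable trials is an assumption made only for this illustration.

```python
from math import comb


def fisher_exact_two_sided(pass_a: int, fail_a: int, pass_b: int, fail_b: int) -> float:
    """Two-sided Fisher exact p-value for a 2x2 pass/fail table.

    Hypothetical helper for illustration; not part of the original evaluation.
    """
    n_a = pass_a + fail_a
    n_b = pass_b + fail_b
    total = n_a + n_b
    total_pass = pass_a + pass_b
    total_fail = total - total_pass
    denom = comb(total, n_a)

    def prob(k: int) -> float:
        # Hypergeometric probability that group A has exactly k passes.
        return comb(total_pass, k) * comb(total_fail, n_a - k) / denom

    observed = prob(pass_a)
    lowest = max(0, n_a - total_fail)
    highest = min(n_a, total_pass)
    # Sum every table that is at most as likely as the observed one.
    return sum(
        prob(k)
        for k in range(lowest, highest + 1)
        if prob(k) <= observed * (1 + 1e-9)
    )


# Counts taken from the reported result: Sonnet 5/6, Opus 6/6.
p_value = fisher_exact_two_sided(5, 1, 6, 0)
print(f"two-sided p-value: {p_value:.2f}")
```

Under these assumptions the p-value is 1.0, so the counts alone cannot distinguish the two models. Repeated trials per scenario would be needed to go beyond a directional reading.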
Interpretive limits
This evaluation measures whether the agent uses prior context that is already available at trial start.
It does not yet measure whether an agent can acquire and retain new information over days or weeks.
Scenario evidence
| Scenario | Sonnet | Opus | Difference (Sonnet minus Opus) |
|---|---|---|---|
| Daily standup communication preference | miss | pass | -100 pp |
| Silent handling of nightly review | pass | pass | 0 pp |
| Weekly project summary format | pass | pass | 0 pp |
| Synthesis of recurring feedback | pass | pass | 0 pp |
| Specification review cadence | pass | pass | 0 pp |
| Open cost follow-up | pass | pass | 0 pp |
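As a check on the arithmetic, the following minimal sketch recomputes the headline numbers from the table above. The dictionary `RESULTS` and the helpers `pass_rate` and `difference_pp` are hypothetical names introduced for illustration. The outcomes are transcribed from the table, with one trial per scenario per model.

```python
# Outcomes transcribed from the scenario table: True = pass, False = miss.
RESULTS = {
    "Daily standup communication preference": {"sonnet": False, "opus": True},
    "Silent handling of nightly review": {"sonnet": True, "opus": True},
    "Weekly project summary format": {"sonnet": True, "opus": True},
    "Synthesis of recurring feedback": {"sonnet": True, "opus": True},
    "Specification review cadence": {"sonnet": True, "opus": True},
    "Open cost follow-up": {"sonnet": True, "opus": True},
}


def pass_rate(model: str) -> float:
    """Percentage of scenarios the given model passed."""
    outcomes = [row[model] for row in RESULTS.values()]
    return 100.0 * sum(outcomes) / len(outcomes)


def difference_pp(scenario: str) -> float:
    """Sonnet minus Opus for one scenario, in percentage points."""
    row = RESULTS[scenario]
    return 100.0 * (int(row["sonnet"]) - int(row["opus"]))


sonnet_rate = pass_rate("sonnet")
opus_rate = pass_rate("opus")
total_trials = len(RESULTS) * 2  # six scenarios, two models, one trial each

print(f"Sonnet: {sonnet_rate:.1f}%  Opus: {opus_rate:.1f}%  trials: {total_trials}")
for name in RESULTS:
    print(f"{name}: {difference_pp(name):+.0f} pp")
```

With a single trial per cell, each per-scenario difference can only be -100, 0, or +100 percentage points, which is why the one miss appears as -100 pp.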