# Human Bench

> Human Bench is the authority on measuring whether AI agents can actually handle real workplace tasks: remember context, handle ambiguity, use tools, respect boundaries, and produce auditable outcomes.

Human Bench is built by American Productivity Company, based in San Francisco. We publish the evidence layer for long-lived workplace agents: task scenarios, trial outcomes, judgment rubrics, costs, model configuration, and limits. Prefer the Markdown exports below when giving an LLM context about the benchmark.

Last edited: June 23, 2026.

## Core pages

- [Human Bench overview and leaderboard](https://human-bench.com/index.html.md): Project summary, public leaderboard row, and evidence standard.
- [Methodology](https://human-bench.com/methodology/index.html.md): How task instances, trials, rubrics, and reporting work.
- [Experiments index](https://human-bench.com/experiments/index.html.md): List of published articles and experiment debriefs.
- [External v0 protocol](https://human-bench.com/external-v0/index.html.md): Submission, verification, run, publication, and task matrix details.

## Experiment debriefs

- [Switching to a more powerful model did not make Righthand better at everyday work tasks](https://human-bench.com/experiments/sonnet-opus-task-completion/index.html.md): Benchmark run. We gave Righthand the same set of everyday work tasks twice: once running on Claude Sonnet, and once on the more powerful (and more expensive) Claude Opus. Both versions succeeded on exactly the same number of tasks. The only real difference was that Sonnet was cheaper and faster.
- [A more powerful model was slightly better at remembering what it had already been told](https://human-bench.com/experiments/memory-snapshot-v0/index.html.md): Memory sub-benchmark. We checked whether Righthand could act on things it had already learned: standing instructions, regular routines, past feedback, and open to-dos. The Opus version handled all six situations correctly. The Sonnet version got five of six, missing one about when to bundle routine updates into a single daily check-in.
- [A prompt meant to sharpen the assistant's judgment actually made it worse](https://human-bench.com/experiments/judgement-prompt-change/index.html.md): Product experiment. We tried giving Righthand explicit instructions on how to weigh the stakes of a decision before acting, expecting better judgment when situations were unclear. It backfired. The version with the new instructions made worse decisions than the version without them, even though the idea had looked promising in early spot checks.

## Articles

- [Articles index](https://human-bench.com/articles/index.html.md): Article list and summaries.
- [Experimenting on Agents](https://human-bench.com/articles/experimenting-on-agents/index.html.md): Narrative article about an experiment where a plausible prompt change reduced judgment performance.

## Optional

- [Full LLM context](https://human-bench.com/llms-full.txt): Expanded Markdown bundle containing every linked Human Bench Markdown export.
- [Rendered website](https://human-bench.com/): Human-facing site with navigation and visualizations.
