Human Bench

What we measure, and how

We measure whether an agent takes appropriate action when it has context, ambiguity, memory, and tools.

Human Bench turns realistic workplace requests into scored tasks. An agent earns credit when it uses the right context, chooses the right level of action, and leaves evidence that a human can inspect.

Benchmark construction

01
Real workplace tasks
Each suite is built from concrete workplace requests: the agent gets context, receives a task, and must decide what action is appropriate.
02
Controlled comparisons
Agents are tested against the same task set and evidence standard, so score differences reflect behavior rather than hand-picked demos.
03
Published sample size
Every score is interpreted with the number of completed tasks behind it. Repeated runs are used when measuring variability or product changes.
04
Evidence-backed ranks
A leaderboard row must be backed by a run record: task mix, pass/fail counts, judge model, cost, latency, and known limits.

Each scenario

Situation

The agent starts with relevant workplace context: people, prior messages, files, calendar facts, or memory it may need to use.

Request

The task arrives through a realistic channel such as SMS, email, or a call, with the ambiguity a person would actually create.

Expected behavior

The scenario defines what a good assistant should do, including when it should act, ask, refuse, or avoid unnecessary outreach.

Evidence

The run must leave observable proof: a reply, message, email, call record, safe non-action, or other task-specific outcome.

How evaluation works

Start the agent from a clean environment.
Give it the relevant context and the task request.
Let it work with the same tools and constraints it would have in the product.
Check the final outcome against the scenario rubric.
Record the result, evidence, time, cost, and any errors or invalid runs.

How results are reported

Primary score

Pass rate over valid completed tasks.

Task coverage

The report shows what kinds of tasks were included and how many were completed.

Uncertainty

Sample size and intervals are reported where they help explain how much confidence to place in a score.

Cost and speed

Runs include cost and duration when those numbers are available, because quality is not the only product constraint.

Known limits

Each public snapshot names what it does not prove yet, including task coverage, export limits, and comparison scope.

External v0 task suite

60 scored tasks across SMS, email, and calls.

Task	Family	Evidence
SMS / Requester resolution / 01Evaluate whether the agent resolves the request without unnecessary outreach.	coordinationsingle-turn	Same-channel reply or safe non-action5 min
SMS / Requester resolution / 02Evaluate whether the agent uses context taught earlier in the same compressed session.	memorycontext	Same-channel reply or safe non-action5 min
SMS / Requester resolution / 03Evaluate whether the agent resolves the request without unnecessary outreach.	authorizationmulti-step	Same-channel reply or safe non-action5 min
SMS / Requester resolution / 04Evaluate whether the agent avoids a prohibited action while preserving an auditable response.	professional boundarysafety, context, multi-step	Same-channel reply or safe non-action5 min
SMS / Requester resolution / 05Evaluate whether the agent resolves the request without unnecessary outreach.	authorizationsingle-turn	Same-channel reply or safe non-action5 min
SMS / Send SMS / 01Evaluate whether the agent sends the requested SMS to a benchmark-owned contact.	coordinationsingle-turn	SMS receipt5 min
SMS / Send SMS / 02Evaluate whether the agent avoids a prohibited action while preserving an auditable response.	memorysafety, context	SMS receipt5 min
SMS / Send SMS / 03Evaluate whether the agent sends the requested SMS to a benchmark-owned contact.	authorizationmulti-step	SMS receipt5 min
SMS / Send SMS / 04Evaluate whether the agent sends the requested SMS to a benchmark-owned contact.	authorizationmulti-step, third-party response	SMS receipt5 min
SMS / Send SMS / 05Evaluate whether the agent avoids a prohibited action while preserving an auditable response.	prompt integritysafety	SMS receipt5 min
SMS / Send email / 01Evaluate whether the agent sends the requested email to a benchmark-owned contact.	coordinationsingle-turn	Email receipt5 min
SMS / Send email / 02Evaluate whether the agent uses context taught earlier in the same compressed session.	memorycontext	Email receipt5 min
SMS / Send email / 03Evaluate whether the agent sends the requested email to a benchmark-owned contact.	authorizationmulti-step	Email receipt5 min
SMS / Send email / 04Evaluate whether the agent sends the requested email to a benchmark-owned contact.	authorizationmulti-step, third-party response	Email receipt5 min
SMS / Send email / 05Evaluate whether the agent sends the requested email to a benchmark-owned contact.	authorizationsingle-turn	Email receipt5 min
SMS / Place call / 01Evaluate whether the agent places the requested call to a benchmark-owned contact.	coordinationsingle-turn	Call record and transcript5 min
SMS / Place call / 02Evaluate whether the agent uses context taught earlier in the same compressed session.	memorycontext	Call record and transcript5 min
SMS / Place call / 03Evaluate whether the agent avoids a prohibited action while preserving an auditable response.	authorizationsafety, multi-step	Call record and transcript5 min
SMS / Place call / 04Evaluate whether the agent places the requested call to a benchmark-owned contact.	authorizationmulti-step, third-party response	Call record and transcript5 min
SMS / Place call / 05Evaluate whether the agent places the requested call to a benchmark-owned contact.	authorizationsingle-turn	Call record and transcript5 min
EMAIL / Requester resolution / 01Evaluate whether the agent resolves the request without unnecessary outreach.	coordinationsingle-turn	Same-channel reply or safe non-action5 min
EMAIL / Requester resolution / 02Evaluate whether the agent avoids a prohibited action while preserving an auditable response.	memorysafety, context	Same-channel reply or safe non-action5 min
EMAIL / Requester resolution / 03Evaluate whether the agent resolves the request without unnecessary outreach.	authorizationmulti-step	Same-channel reply or safe non-action5 min
EMAIL / Requester resolution / 04Evaluate whether the agent resolves the request without unnecessary outreach.	authorizationsingle-turn	Same-channel reply or safe non-action5 min
EMAIL / Requester resolution / 05Evaluate whether the agent resolves the request without unnecessary outreach.	authorizationsingle-turn	Same-channel reply or safe non-action5 min
EMAIL / Send SMS / 01Evaluate whether the agent sends the requested SMS to a benchmark-owned contact.	coordinationsingle-turn	SMS receipt5 min
EMAIL / Send SMS / 02Evaluate whether the agent uses context taught earlier in the same compressed session.	memorycontext	SMS receipt5 min
EMAIL / Send SMS / 03Evaluate whether the agent sends the requested SMS to a benchmark-owned contact.	authorizationmulti-step	SMS receipt5 min
EMAIL / Send SMS / 04Evaluate whether the agent sends the requested SMS to a benchmark-owned contact.	authorizationmulti-step, third-party response	SMS receipt5 min
EMAIL / Send SMS / 05Evaluate whether the agent sends the requested SMS to a benchmark-owned contact.	authorizationsingle-turn	SMS receipt5 min
EMAIL / Send email / 01Evaluate whether the agent sends the requested email to a benchmark-owned contact.	coordinationsingle-turn	Email receipt5 min
EMAIL / Send email / 02Evaluate whether the agent uses context taught earlier in the same compressed session.	memorycontext	Email receipt5 min
EMAIL / Send email / 03Evaluate whether the agent avoids a prohibited action while preserving an auditable response.	authorizationsafety, multi-step	Email receipt5 min
EMAIL / Send email / 04Evaluate whether the agent uses context taught earlier in the same compressed session.	memorycontext, multi-step, third-party response	Email receipt5 min
EMAIL / Send email / 05Evaluate whether the agent avoids a prohibited action while preserving an auditable response.	prompt integritysafety	Email receipt5 min
EMAIL / Place call / 01Evaluate whether the agent avoids a prohibited action while preserving an auditable response.	prompt integritysafety	Call record and transcript5 min
EMAIL / Place call / 02Evaluate whether the agent uses context taught earlier in the same compressed session.	memorycontext	Call record and transcript5 min
EMAIL / Place call / 03Evaluate whether the agent places the requested call to a benchmark-owned contact.	authorizationmulti-step	Call record and transcript5 min
EMAIL / Place call / 04Evaluate whether the agent places the requested call to a benchmark-owned contact.	authorizationmulti-step, third-party response	Call record and transcript5 min
EMAIL / Place call / 05Evaluate whether the agent places the requested call to a benchmark-owned contact.	authorizationsingle-turn	Call record and transcript5 min
CALL / Requester resolution / 01Evaluate whether the agent resolves the request without unnecessary outreach.	coordinationsingle-turn	Same-channel reply or safe non-actioncall + 5 min
CALL / Requester resolution / 02Evaluate whether the agent uses context taught earlier in the same compressed session.	memorycontext	Same-channel reply or safe non-actioncall + 5 min
CALL / Requester resolution / 03Evaluate whether the agent resolves the request without unnecessary outreach.	authorizationmulti-step	Same-channel reply or safe non-actioncall + 5 min
CALL / Requester resolution / 04Evaluate whether the agent resolves the request without unnecessary outreach.	authorizationsingle-turn	Same-channel reply or safe non-actioncall + 5 min
CALL / Requester resolution / 05Evaluate whether the agent avoids a prohibited action while preserving an auditable response.	prompt integritysafety	Same-channel reply or safe non-actioncall + 5 min
CALL / Send SMS / 01Evaluate whether the agent sends the requested SMS to a benchmark-owned contact.	coordinationsingle-turn	SMS receiptcall + 5 min
CALL / Send SMS / 02Evaluate whether the agent uses context taught earlier in the same compressed session.	memorycontext	SMS receiptcall + 5 min
CALL / Send SMS / 03Evaluate whether the agent avoids a prohibited action while preserving an auditable response.	authorizationsafety, multi-step	SMS receiptcall + 5 min
CALL / Send SMS / 04Evaluate whether the agent sends the requested SMS to a benchmark-owned contact.	authorizationsingle-turn	SMS receiptcall + 5 min
CALL / Send SMS / 05Evaluate whether the agent sends the requested SMS to a benchmark-owned contact.	authorizationsingle-turn	SMS receiptcall + 5 min
CALL / Send email / 01Evaluate whether the agent avoids a prohibited action while preserving an auditable response.	prompt integritysafety	Email receiptcall + 5 min
CALL / Send email / 02Evaluate whether the agent uses context taught earlier in the same compressed session.	memorycontext	Email receiptcall + 5 min
CALL / Send email / 03Evaluate whether the agent sends the requested email to a benchmark-owned contact.	authorizationmulti-step	Email receiptcall + 5 min
CALL / Send email / 04Evaluate whether the agent sends the requested email to a benchmark-owned contact.	authorizationmulti-step, third-party response	Email receiptcall + 5 min
CALL / Send email / 05Evaluate whether the agent sends the requested email to a benchmark-owned contact.	authorizationsingle-turn	Email receiptcall + 5 min
CALL / Place call / 01Evaluate whether the agent places the requested call to a benchmark-owned contact.	coordinationsingle-turn	Call record and transcriptcall + 5 min
CALL / Place call / 02Evaluate whether the agent uses context taught earlier in the same compressed session.	memorycontext	Call record and transcriptcall + 5 min
CALL / Place call / 03Evaluate whether the agent places the requested call to a benchmark-owned contact.	authorizationmulti-step	Call record and transcriptcall + 5 min
CALL / Place call / 04Evaluate whether the agent avoids a prohibited action while preserving an auditable response.	professional boundarysafety, context, third-party response	Call record and transcriptcall + 5 min
CALL / Place call / 05Evaluate whether the agent places the requested call to a benchmark-owned contact.	authorizationsingle-turn	Call record and transcriptcall + 5 min

What we measure, and how

Benchmark construction

Real workplace tasks

Controlled comparisons

Published sample size

Evidence-backed ranks

Each scenario

How evaluation works

How results are reported

External v0 task suite