- 01
Real workplace tasks
Each suite is built from concrete workplace requests: the agent gets context, receives a task, and must decide what action is appropriate.
- 02
Controlled comparisons
Agents are tested against the same task set and evidence standard, so score differences reflect behavior rather than hand-picked demos.
- 03
Published sample size
Every score is interpreted with the number of completed tasks behind it. Repeated runs are used when measuring variability or product changes.
- 04
Evidence-backed ranks
A leaderboard row must be backed by a run record: task mix, pass/fail counts, judge model, cost, latency, and known limits.
The agent starts with relevant workplace context: people, prior messages, files, calendar facts, or memory it may need to use.
The task arrives through a realistic channel such as SMS, email, or a call, with the ambiguity a person would actually create.
The scenario defines what a good assistant should do, including when it should act, ask, refuse, or avoid unnecessary outreach.
The run must leave observable proof: a reply, message, email, call record, safe non-action, or other task-specific outcome.
- Start the agent from a clean environment.
- Give it the relevant context and the task request.
- Let it work with the same tools and constraints it would have in the product.
- Check the final outcome against the scenario rubric.
- Record the result, evidence, time, cost, and any errors or invalid runs.
Pass rate over valid completed tasks.
The report shows what kinds of tasks were included and how many were completed.
Sample size and intervals are reported where they help explain how much confidence to place in a score.
Runs include cost and duration when those numbers are available, because quality is not the only product constraint.
Each public snapshot names what it does not prove yet, including task coverage, export limits, and comparison scope.
| Task | Family | Evidence |
|---|---|---|
| SMS / Requester resolution / 01Evaluate whether the agent resolves the request without unnecessary outreach. | coordinationsingle-turn | Same-channel reply or safe non-action5 min |
| SMS / Requester resolution / 02Evaluate whether the agent uses context taught earlier in the same compressed session. | memorycontext | Same-channel reply or safe non-action5 min |
| SMS / Requester resolution / 03Evaluate whether the agent resolves the request without unnecessary outreach. | authorizationmulti-step | Same-channel reply or safe non-action5 min |
| SMS / Requester resolution / 04Evaluate whether the agent avoids a prohibited action while preserving an auditable response. | professional boundarysafety, context, multi-step | Same-channel reply or safe non-action5 min |
| SMS / Requester resolution / 05Evaluate whether the agent resolves the request without unnecessary outreach. | authorizationsingle-turn | Same-channel reply or safe non-action5 min |
| SMS / Send SMS / 01Evaluate whether the agent sends the requested SMS to a benchmark-owned contact. | coordinationsingle-turn | SMS receipt5 min |
| SMS / Send SMS / 02Evaluate whether the agent avoids a prohibited action while preserving an auditable response. | memorysafety, context | SMS receipt5 min |
| SMS / Send SMS / 03Evaluate whether the agent sends the requested SMS to a benchmark-owned contact. | authorizationmulti-step | SMS receipt5 min |
| SMS / Send SMS / 04Evaluate whether the agent sends the requested SMS to a benchmark-owned contact. | authorizationmulti-step, third-party response | SMS receipt5 min |
| SMS / Send SMS / 05Evaluate whether the agent avoids a prohibited action while preserving an auditable response. | prompt integritysafety | SMS receipt5 min |
| SMS / Send email / 01Evaluate whether the agent sends the requested email to a benchmark-owned contact. | coordinationsingle-turn | Email receipt5 min |
| SMS / Send email / 02Evaluate whether the agent uses context taught earlier in the same compressed session. | memorycontext | Email receipt5 min |
| SMS / Send email / 03Evaluate whether the agent sends the requested email to a benchmark-owned contact. | authorizationmulti-step | Email receipt5 min |
| SMS / Send email / 04Evaluate whether the agent sends the requested email to a benchmark-owned contact. | authorizationmulti-step, third-party response | Email receipt5 min |
| SMS / Send email / 05Evaluate whether the agent sends the requested email to a benchmark-owned contact. | authorizationsingle-turn | Email receipt5 min |
| SMS / Place call / 01Evaluate whether the agent places the requested call to a benchmark-owned contact. | coordinationsingle-turn | Call record and transcript5 min |
| SMS / Place call / 02Evaluate whether the agent uses context taught earlier in the same compressed session. | memorycontext | Call record and transcript5 min |
| SMS / Place call / 03Evaluate whether the agent avoids a prohibited action while preserving an auditable response. | authorizationsafety, multi-step | Call record and transcript5 min |
| SMS / Place call / 04Evaluate whether the agent places the requested call to a benchmark-owned contact. | authorizationmulti-step, third-party response | Call record and transcript5 min |
| SMS / Place call / 05Evaluate whether the agent places the requested call to a benchmark-owned contact. | authorizationsingle-turn | Call record and transcript5 min |
| EMAIL / Requester resolution / 01Evaluate whether the agent resolves the request without unnecessary outreach. | coordinationsingle-turn | Same-channel reply or safe non-action5 min |
| EMAIL / Requester resolution / 02Evaluate whether the agent avoids a prohibited action while preserving an auditable response. | memorysafety, context | Same-channel reply or safe non-action5 min |
| EMAIL / Requester resolution / 03Evaluate whether the agent resolves the request without unnecessary outreach. | authorizationmulti-step | Same-channel reply or safe non-action5 min |
| EMAIL / Requester resolution / 04Evaluate whether the agent resolves the request without unnecessary outreach. | authorizationsingle-turn | Same-channel reply or safe non-action5 min |
| EMAIL / Requester resolution / 05Evaluate whether the agent resolves the request without unnecessary outreach. | authorizationsingle-turn | Same-channel reply or safe non-action5 min |
| EMAIL / Send SMS / 01Evaluate whether the agent sends the requested SMS to a benchmark-owned contact. | coordinationsingle-turn | SMS receipt5 min |
| EMAIL / Send SMS / 02Evaluate whether the agent uses context taught earlier in the same compressed session. | memorycontext | SMS receipt5 min |
| EMAIL / Send SMS / 03Evaluate whether the agent sends the requested SMS to a benchmark-owned contact. | authorizationmulti-step | SMS receipt5 min |
| EMAIL / Send SMS / 04Evaluate whether the agent sends the requested SMS to a benchmark-owned contact. | authorizationmulti-step, third-party response | SMS receipt5 min |
| EMAIL / Send SMS / 05Evaluate whether the agent sends the requested SMS to a benchmark-owned contact. | authorizationsingle-turn | SMS receipt5 min |
| EMAIL / Send email / 01Evaluate whether the agent sends the requested email to a benchmark-owned contact. | coordinationsingle-turn | Email receipt5 min |
| EMAIL / Send email / 02Evaluate whether the agent uses context taught earlier in the same compressed session. | memorycontext | Email receipt5 min |
| EMAIL / Send email / 03Evaluate whether the agent avoids a prohibited action while preserving an auditable response. | authorizationsafety, multi-step | Email receipt5 min |
| EMAIL / Send email / 04Evaluate whether the agent uses context taught earlier in the same compressed session. | memorycontext, multi-step, third-party response | Email receipt5 min |
| EMAIL / Send email / 05Evaluate whether the agent avoids a prohibited action while preserving an auditable response. | prompt integritysafety | Email receipt5 min |
| EMAIL / Place call / 01Evaluate whether the agent avoids a prohibited action while preserving an auditable response. | prompt integritysafety | Call record and transcript5 min |
| EMAIL / Place call / 02Evaluate whether the agent uses context taught earlier in the same compressed session. | memorycontext | Call record and transcript5 min |
| EMAIL / Place call / 03Evaluate whether the agent places the requested call to a benchmark-owned contact. | authorizationmulti-step | Call record and transcript5 min |
| EMAIL / Place call / 04Evaluate whether the agent places the requested call to a benchmark-owned contact. | authorizationmulti-step, third-party response | Call record and transcript5 min |
| EMAIL / Place call / 05Evaluate whether the agent places the requested call to a benchmark-owned contact. | authorizationsingle-turn | Call record and transcript5 min |
| CALL / Requester resolution / 01Evaluate whether the agent resolves the request without unnecessary outreach. | coordinationsingle-turn | Same-channel reply or safe non-actioncall + 5 min |
| CALL / Requester resolution / 02Evaluate whether the agent uses context taught earlier in the same compressed session. | memorycontext | Same-channel reply or safe non-actioncall + 5 min |
| CALL / Requester resolution / 03Evaluate whether the agent resolves the request without unnecessary outreach. | authorizationmulti-step | Same-channel reply or safe non-actioncall + 5 min |
| CALL / Requester resolution / 04Evaluate whether the agent resolves the request without unnecessary outreach. | authorizationsingle-turn | Same-channel reply or safe non-actioncall + 5 min |
| CALL / Requester resolution / 05Evaluate whether the agent avoids a prohibited action while preserving an auditable response. | prompt integritysafety | Same-channel reply or safe non-actioncall + 5 min |
| CALL / Send SMS / 01Evaluate whether the agent sends the requested SMS to a benchmark-owned contact. | coordinationsingle-turn | SMS receiptcall + 5 min |
| CALL / Send SMS / 02Evaluate whether the agent uses context taught earlier in the same compressed session. | memorycontext | SMS receiptcall + 5 min |
| CALL / Send SMS / 03Evaluate whether the agent avoids a prohibited action while preserving an auditable response. | authorizationsafety, multi-step | SMS receiptcall + 5 min |
| CALL / Send SMS / 04Evaluate whether the agent sends the requested SMS to a benchmark-owned contact. | authorizationsingle-turn | SMS receiptcall + 5 min |
| CALL / Send SMS / 05Evaluate whether the agent sends the requested SMS to a benchmark-owned contact. | authorizationsingle-turn | SMS receiptcall + 5 min |
| CALL / Send email / 01Evaluate whether the agent avoids a prohibited action while preserving an auditable response. | prompt integritysafety | Email receiptcall + 5 min |
| CALL / Send email / 02Evaluate whether the agent uses context taught earlier in the same compressed session. | memorycontext | Email receiptcall + 5 min |
| CALL / Send email / 03Evaluate whether the agent sends the requested email to a benchmark-owned contact. | authorizationmulti-step | Email receiptcall + 5 min |
| CALL / Send email / 04Evaluate whether the agent sends the requested email to a benchmark-owned contact. | authorizationmulti-step, third-party response | Email receiptcall + 5 min |
| CALL / Send email / 05Evaluate whether the agent sends the requested email to a benchmark-owned contact. | authorizationsingle-turn | Email receiptcall + 5 min |
| CALL / Place call / 01Evaluate whether the agent places the requested call to a benchmark-owned contact. | coordinationsingle-turn | Call record and transcriptcall + 5 min |
| CALL / Place call / 02Evaluate whether the agent uses context taught earlier in the same compressed session. | memorycontext | Call record and transcriptcall + 5 min |
| CALL / Place call / 03Evaluate whether the agent places the requested call to a benchmark-owned contact. | authorizationmulti-step | Call record and transcriptcall + 5 min |
| CALL / Place call / 04Evaluate whether the agent avoids a prohibited action while preserving an auditable response. | professional boundarysafety, context, third-party response | Call record and transcriptcall + 5 min |
| CALL / Place call / 05Evaluate whether the agent places the requested call to a benchmark-owned contact. | authorizationsingle-turn | Call record and transcriptcall + 5 min |