| metric | value | detail |
|---|---|---|
| runs recorded | 154 | 6 environments |
| pass rate | 80.5% | programmatic verifier |
| spend | $5.23 | 3.44M tokens |
| cache hit rate | 92.5% | input served from cache |
| rate-limit retries | 41 | backed off and recovered |
| p50 latency | 10.6s | mean wall-clock per run |

This is the shared, historical benchmark corpus. To run agents yourself and watch live results accumulate, open the demo runs tab.
model behavior
policy quality, cost, and caching per model and task
by model
same environments, different policies
| model | runs | pass rate | reward | cache | $/run |
|---|---|---|---|---|---|
| claude-opus-4-8 | 87 | 89% | 0.930 | 93% | $0.042 |
| claude-sonnet-4-6 | 50 | 72% | 0.807 | 92% | $0.028 |
| claude-haiku-4-5 | 17 | 65% | 0.778 | 93% | $0.00926 |
reward distribution
all runs, 0.0 to 1.0
Bars at 0.0 are hard failures and the spike at 1.0 is clean passes. The middle is partial credit: proximity scoring on answer tasks and weighted checks on state-mutation tasks.
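The two partial-credit schemes named above could look roughly like the sketch below. The dashboard does not show the grader, so the relative-error formula, both function names, and the check weights are assumptions made for illustration only.

```python
from typing import Mapping


def proximity_reward(expected: float, submitted: float) -> float:
    """Hypothetical proximity score: 1.0 when exact, falling off with relative error."""
    if expected == submitted:
        return 1.0
    scale = max(abs(expected), 1.0)  # avoid dividing by a tiny expected value
    return max(0.0, 1.0 - abs(expected - submitted) / scale)


def weighted_state_reward(checks: Mapping[str, bool], weights: Mapping[str, float]) -> float:
    """Hypothetical weighted-check score: share of total weight held by passing checks."""
    total = sum(weights.values())
    earned = sum(w for name, w in weights.items() if checks.get(name, False))
    return earned / total if total else 0.0


# Two equally weighted checks; "others_untouched" is a made-up check name.
example = weighted_state_reward(
    {"planted_zeroed": False, "others_untouched": True},
    {"planted_zeroed": 0.5, "others_untouched": 0.5},
)
```

Under these assumed weights, a run that fails only `planted_zeroed` scores 0.5, which lands in the middle of the distribution rather than at either spike.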
spend over time
USD per day, all models
by environment
pass rate and average steps
| environment | difficulty | runs | pass rate | avg steps |
|---|---|---|---|---|
| top-spender-email | easy | 25 | 88% | 3.1 |
| count-stale-processing | easy | 36 | 97% | 2.8 |
| dedup-customers | hard | 22 | 64% | 7.1 |
| fix-oversell | hard | 28 | 64% | 7.5 |
| refund-order | medium | 20 | 70% | 4.8 |
| revenue-by-category | medium | 23 | 91% | 4.3 |
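As a quick consistency check, the per-environment rows can be combined into a run-weighted overall pass rate. This is only an approximation, since the per-row rates in the table are already rounded to whole percentages.

```python
# (environment, runs, pass rate) copied from the table above; rates are rounded.
rows = [
    ("top-spender-email", 25, 0.88),
    ("count-stale-processing", 36, 0.97),
    ("dedup-customers", 22, 0.64),
    ("fix-oversell", 28, 0.64),
    ("refund-order", 20, 0.70),
    ("revenue-by-category", 23, 0.91),
]

total_runs = sum(runs for _, runs, _ in rows)
# Weight each environment's pass rate by how many runs it contributed.
weighted_pass_rate = sum(runs * rate for _, runs, rate in rows) / total_runs

print(total_runs, f"{weighted_pass_rate:.1%}")  # 154 80.4%
```

The run count matches the 154 recorded runs, and the weighted rate sits within rounding distance of the 80.5% headline figure.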
pod health
live cluster window, last 60 minutes, generated 36h ago

| metric | value | detail |
|---|---|---|
| replicas | 7/8 | current / desired (HPA) |
| pods running | 5/8 | ready in the deployment |
| queue peak | 142 | max backlog in window |
| inflight now | 14 | episodes executing |

queue depth
Redis stream backlog; HPA scales workers off this signal
A burst arrives, the autoscaler adds workers, the backlog drains. The two scale events are in the event feed below.
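The scaling behaviour described above can be approximated with the usual average-value rule: divide the backlog by the per-replica target and round up. The target of 20 per replica comes from the event feed; the replica bounds and the function name are assumptions, and a real HPA also applies a tolerance band and stabilization windows that this sketch leaves out.

```python
import math


def desired_replicas(queue_depth: int, target_per_replica: int = 20,
                     min_replicas: int = 1, max_replicas: int = 12) -> int:
    """Hypothetical queue-based sizing: ceil(backlog / target), clamped to bounds."""
    raw = math.ceil(queue_depth / target_per_replica)
    return max(min_replicas, min(max_replicas, raw))


print(desired_replicas(100))   # 5
print(desired_replicas(0))     # 1  (never below the assumed minimum)
print(desired_replicas(1000))  # 12 (capped at the assumed maximum)
```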
pods
cpu and memory per worker
| pod | node | phase | restarts | cpu | mem | runs |
|---|---|---|---|---|---|---|
| podbench-worker-152a6 | ip-10-2-1-37 | Running | 1 | 481/1000m | 696/1024Mi | 8 |
| podbench-worker-80287 | ip-10-2-2-104 | Running | 1 | 566/1000m | 510/1024Mi | 28 |
| podbench-worker-e1872 | ip-10-2-3-58 | Running | 0 | 525/1000m | 597/1024Mi | 27 |
| podbench-worker-f4b72 | ip-10-2-1-37 | Running | 0 | 353/1000m | 556/1024Mi | 20 |
| podbench-worker-fca89 | ip-10-2-2-104 | Running | 0 | 476/1000m | 592/1024Mi | 23 |
| podbench-worker-6851e | unscheduled | Pending | 0 | 0/1000m | 0/1024Mi | 18 |
| podbench-worker-9dba3 | ip-10-2-1-37 | CrashLoopBackOff | 8 | 40/1000m | 980/1024Mi | 15 |
| podbench-worker-9a28c | ip-10-2-2-104 | Completed | 0 | 120/1000m | 300/1024Mi | 15 |

cluster events
scheduling, scaling, OOM, and app-level signals
| time | event | message |
|---|---|---|
| 16:58 | BackOff | Back-off restarting failed container worker |
| 16:54 | TaskCompleted | run_top-spender-email passed reward=1.000 in 3 steps |
| 16:48 | TaskFailed | run_fix-oversell failed reward=0.500: planted_zeroed=false |
| 16:41 | SuccessfulRescale | New size: 8; reason: queue depth below target, scaling in |
| 16:36 | TaskCompleted | run_revenue-by-category passed reward=1.000 in 4 steps |
| 16:29 | TaskCompleted | run_dedup-customers passed reward=1.000 in 6 steps |
| 16:27 | RateLimited | anthropic 429 on 7 in-flight requests; honoring retry-after, backing off |
| 16:15 | BackOff | Back-off restarting failed container worker |
| 16:14 | Pulled | Container image podbench/worker:0.4.2 already present on machine |
| 16:14 | Created | Created container worker |
| 16:13 | OOMKilling | Container worker exceeded memory limit (1024Mi); killed |
| 16:06 | FailedScheduling | 0/3 nodes are available: 3 Insufficient cpu. preemption not helpful |
| 16:05 | ScalingReplicaSet | Scaled up replica set podbench-worker to 12 from 6 |
| 16:02 | SuccessfulRescale | New size: 12; reason: queue depth above target of 20 per replica |

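The RateLimited event in the feed mentions honoring retry-after and backing off. A minimal sketch of that pattern follows; the worker's real client code is not shown here, so `call_with_backoff`, the `send` callable, and the `(status, headers, body)` tuple shape are all hypothetical, and the header is assumed to carry a number of seconds under a lowercase key.

```python
import random
import time
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str], str]  # (status, headers, body), an assumed shape


def call_with_backoff(send: Callable[[], Response], max_attempts: int = 5,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> Response:
    """Retry on HTTP 429, preferring the server's retry-after hint when present."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, headers, body
        if attempt == max_attempts - 1:
            break  # out of attempts: hand the last 429 back to the caller
        retry_after = headers.get("retry-after")
        if retry_after is not None:
            delay = float(retry_after)  # assumes the delta-seconds form of the header
        else:
            # Fallback: exponential backoff with a little jitter.
            delay = base_delay * 2 ** attempt + random.uniform(0, base_delay)
        sleep(delay)
    return status, headers, body
```

Passing `sleep` as a parameter keeps the function testable without real waiting.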
recent runs
from the reference corpus · newest first
| when | environment | difficulty | model | result | reward | steps | cost | cache | retries | pod |
|---|---|---|---|---|---|---|---|---|---|---|
| 2d ago | refund-order | medium | claude-sonnet-4-6 | failed | 0.330 | 8 | $0.032 | 92% | 0 | podbench-worker-80287 |
| 2d ago | dedup-customers | hard | claude-sonnet-4-6 | failed | 0.700 | 9 | $0.053 | 89% | 0 | podbench-worker-e1872 |
| 2d ago | revenue-by-category | medium | claude-sonnet-4-6 | passed | 1.000 | 6 | $0.026 | 93% | 0 | podbench-worker-80287 |
| 3d ago | revenue-by-category | medium | claude-sonnet-4-6 | failed | 0.000 | 5 | $0.037 | 92% | 2 | podbench-worker-fca89 |
| 3d ago | top-spender-email | easy | claude-sonnet-4-6 | failed | 0.000 | 5 | $0.022 | 94% | 0 | podbench-worker-6851e |
| 3d ago | top-spender-email | easy | claude-opus-4-8 | passed | 1.000 | 2 | $0.013 | 95% | 0 | podbench-worker-e1872 |
| 3d ago | fix-oversell | hard | claude-opus-4-8 | passed | 1.000 | 8 | $0.086 | 90% | 0 | podbench-worker-e1872 |
| 4d ago | top-spender-email | easy | claude-opus-4-8 | passed | 1.000 | 3 | $0.022 | 94% | 1 | podbench-worker-152a6 |
| 4d ago | revenue-by-category | medium | claude-opus-4-8 | passed | 1.000 | 3 | $0.024 | 93% | 0 | podbench-worker-80287 |
| 4d ago | fix-oversell | hard | claude-haiku-4-5 | passed | 1.000 | 7 | $0.00957 | 94% | 0 | podbench-worker-80287 |
| 4d ago | fix-oversell | hard | claude-opus-4-8 | passed | 1.000 | 6 | $0.045 | 93% | 0 | podbench-worker-f4b72 |
| 4d ago | count-stale-processing | easy | claude-opus-4-8 | passed | 1.000 | 3 | $0.048 | 92% | 0 | podbench-worker-80287 |

environments
deterministic SQL tasks with programmatic reward
top-spender-email · easy
Highest lifetime spend by email
Which email address has the highest total lifetime spend? Sum the total of all orders that are not cancelled, grouped by the customer email (note that the same email can appear under more than one customer id). Call submit with the winning email address as the answer.
kind: answer / reward: proximity-graded
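One way to approach the top-spender-email task is sketched below against an in-memory SQLite database. The table and column names (`customers`, `orders`, `total`, `status`) and the `'cancelled'` literal are assumptions, since the benchmark schema is not shown. The sample data is built so that grouping by customer id instead of email would pick the wrong winner.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and data; customers 1 and 3 share an email.
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL, status TEXT);
INSERT INTO customers VALUES (1, 'a@example.com'), (2, 'b@example.com'), (3, 'a@example.com');
INSERT INTO orders VALUES
  (1, 1, 40.0, 'shipped'),
  (2, 2, 70.0, 'shipped'),
  (3, 3, 50.0, 'delivered'),
  (4, 2, 90.0, 'cancelled');
""")

# Group by email (not customer id) and skip cancelled orders.
top_email = conn.execute("""
SELECT c.email
FROM orders AS o
JOIN customers AS c ON c.id = o.customer_id
WHERE o.status <> 'cancelled'
GROUP BY c.email
ORDER BY SUM(o.total) DESC
LIMIT 1
""").fetchone()[0]

print(top_email)  # a@example.com
```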
count-stale-processing · easy
Stale orders stuck in processing
Today is 2026-06-15. How many orders have status 'processing' and were created more than 30 days ago? Call submit with the integer count as the answer.
kind: answer / reward: proximity-graded
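For the count-stale-processing task, the main subtlety is the cutoff: "more than 30 days ago" relative to 2026-06-15 means strictly before 2026-05-16. The sketch below assumes an `orders` table with a `created_at` column stored as ISO `YYYY-MM-DD` text; the real schema may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema; dates stored as ISO text so string comparison orders them correctly.
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, created_at TEXT);
INSERT INTO orders VALUES
  (1, 'processing', '2026-05-01'),
  (2, 'processing', '2026-05-16'),
  (3, 'processing', '2026-06-10'),
  (4, 'shipped',    '2026-04-01');
""")

# date('2026-06-15', '-30 days') evaluates to '2026-05-16'; the strict < excludes
# an order created exactly 30 days ago.
stale = conn.execute(
    "SELECT COUNT(*) FROM orders "
    "WHERE status = 'processing' AND created_at < date(?, '-30 days')",
    ("2026-06-15",),
).fetchone()[0]

print(stale)  # 1
```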
revenue-by-category · medium
Top category by net revenue
Which product category produced the most revenue? Sum qty * unit_price over order_items whose order is not cancelled, grouped by the product category. Call submit with the winning category name as the answer.
kind: answer / reward: proximity-graded
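A possible query for the revenue-by-category task is sketched below. The `qty` and `unit_price` columns and the `order_items` table come from the task text; the `products` and `orders` tables, the join keys, and the `'cancelled'` literal are assumptions. The sample data is arranged so that forgetting the cancelled-order filter changes the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and data for illustration.
conn.executescript("""
CREATE TABLE products (id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
CREATE TABLE order_items (order_id INTEGER, product_id INTEGER, qty INTEGER, unit_price REAL);
INSERT INTO products VALUES (1, 'books'), (2, 'toys');
INSERT INTO orders VALUES (1, 'shipped'), (2, 'cancelled');
INSERT INTO order_items VALUES
  (1, 1, 2, 10.0),
  (1, 2, 1, 15.0),
  (2, 2, 5, 15.0);
""")

top_category = conn.execute("""
SELECT p.category
FROM order_items AS oi
JOIN orders AS o ON o.id = oi.order_id
JOIN products AS p ON p.id = oi.product_id
WHERE o.status <> 'cancelled'
GROUP BY p.category
ORDER BY SUM(oi.qty * oi.unit_price) DESC
LIMIT 1
""").fetchone()[0]

print(top_category)  # books
```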
refund-order · medium
Issue a full refund for order 1007
Issue a full refund for order 1007. Insert one row into refunds with the order's total amount, then set that order's status to 'refunded'. Never refund the same order twice. Call submit with no answer when finished.
kind: state / reward: weighted state checks
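The refund-order task combines two writes with a "never twice" constraint, which suggests a guarded transaction. The sketch below is one way to do that; the `refunds` columns (`order_id`, `amount`), the `orders.total` column, and the helper name `refund_order` are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema; only the table names and the 'refunded' status come from the task text.
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL, status TEXT);
CREATE TABLE refunds (id INTEGER PRIMARY KEY, order_id INTEGER, amount REAL);
INSERT INTO orders VALUES (1007, 129.5, 'delivered');
""")


def refund_order(conn: sqlite3.Connection, order_id: int) -> bool:
    """Refund an order at most once; return True only if a refund row was written."""
    with conn:  # commits on success, rolls back if any statement raises
        if conn.execute("SELECT 1 FROM refunds WHERE order_id = ?", (order_id,)).fetchone():
            return False  # already refunded, do nothing
        row = conn.execute("SELECT total FROM orders WHERE id = ?", (order_id,)).fetchone()
        if row is None:
            return False  # unknown order
        conn.execute("INSERT INTO refunds (order_id, amount) VALUES (?, ?)", (order_id, row[0]))
        conn.execute("UPDATE orders SET status = 'refunded' WHERE id = ?", (order_id,))
        return True


first = refund_order(conn, 1007)
second = refund_order(conn, 1007)
print(first, second)  # True False
```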
fix-oversell · hard
Clamp negative stock from an oversell
An overselling bug left some products with negative stock. Set stock to 0 for every product whose stock is currently negative. Leave every product that already had non-negative stock unchanged. Call submit when finished.
kind: state / reward: weighted state checks
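The fix-oversell task reduces to a single filtered update, as the sketch below shows. The `products` table with a `stock` column follows the task wording, but the exact schema is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and data: two products with negative stock, two without.
conn.executescript("""
CREATE TABLE products (id INTEGER PRIMARY KEY, stock INTEGER);
INSERT INTO products VALUES (1, -3), (2, 0), (3, 12), (4, -1);
""")

with conn:
    # The WHERE clause is what leaves non-negative rows untouched.
    cur = conn.execute("UPDATE products SET stock = 0 WHERE stock < 0")

stocks = conn.execute("SELECT id, stock FROM products ORDER BY id").fetchall()
print(cur.rowcount)  # 2
print(stocks)        # [(1, 0), (2, 0), (3, 12), (4, 0)]
```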
dedup-customers · hard
Merge duplicate customer records
Customers 4 and 9 are the same person (they share the email rmoreno@example.com). Merge them: reassign every order belonging to customer 9 over to customer 4, then delete customer 9. Keep the lower id. Call submit when finished.
kind: state / reward: weighted state checks
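For the dedup-customers task, order of operations matters: reassign the orders first, then delete the duplicate, ideally in one transaction. The sketch below assumes `customers(id, email)` and `orders(id, customer_id)` tables; the real schema may include other tables that also reference the customer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and data; ids 4 and 9 and the email come from the task text.
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER REFERENCES customers(id));
INSERT INTO customers VALUES
  (4, 'rmoreno@example.com'), (5, 'other@example.com'), (9, 'rmoreno@example.com');
INSERT INTO orders VALUES (100, 4), (101, 9), (102, 9), (103, 5);
""")

with conn:  # both statements commit together or not at all
    conn.execute("UPDATE orders SET customer_id = 4 WHERE customer_id = 9")
    conn.execute("DELETE FROM customers WHERE id = 9")

merged_orders = conn.execute("SELECT COUNT(*) FROM orders WHERE customer_id = 4").fetchone()[0]
remaining_ids = [row[0] for row in conn.execute("SELECT id FROM customers ORDER BY id")]
print(merged_orders, remaining_ids)  # 3 [4, 5]
```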