podbench v0.5.0

Deterministic, resettable SQL task environments for LLM agents with a programmatic verifier. The overview below is the published reference corpus; head to demo runs to execute your own agents live.

runs recorded: 154 (6 environments)
pass rate: 80.5% (programmatic verifier)
spend: $5.23 (3.44M tokens)
cache hit rate: 92.5% (input served from cache)
rate-limit retries: 41 (backed off and recovered)
p50 latency: 10.6s (mean wall-clock per run)

This is the shared, historical benchmark corpus. To run agents yourself and watch live results accumulate, open the demo runs tab.

model behavior

policy quality, cost, and caching per model and task

by model: same environments, different policies

model              runs  pass rate  reward  cache  $/run
claude-opus-4-8      87  89%        0.930   93%    $0.042
claude-sonnet-4-6    50  72%        0.807   92%    $0.028
claude-haiku-4-5     17  65%        0.778   93%    $0.00926

reward distribution: all runs, 0.0 to 1.0

[histogram of run counts by reward bin; labeled bars: 0.0: 8, 0.3: 9, 0.4: 2, 0.5: 3, 0.6: 6, 0.7: 2, 1.0: 124]
bars at 0.0 are hard failures; the spike at 1.0 is clean passes; the middle is partial credit — proximity on answer tasks and weighted checks on state-mutation tasks.
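
The two partial-credit modes described above can be sketched roughly as follows. This is a minimal illustration, not podbench's verifier: the check name `planted_zeroed` appears in the event feed, but `others_untouched`, the equal weights, and the relative-error formula in `proximity_reward` are assumptions made for the example.

```python
def weighted_state_reward(checks, weights):
    """Sum the weights of the checks that passed, normalised to [0, 1]."""
    total = sum(weights.values())
    earned = sum(w for name, w in weights.items() if checks.get(name, False))
    return earned / total if total else 0.0


def proximity_reward(answer, expected):
    """Hypothetical proximity score for integer answers: 1.0 when exact,
    shrinking linearly with relative error and floored at 0.0."""
    if answer == expected:
        return 1.0
    scale = max(abs(expected), 1)
    return max(0.0, 1.0 - abs(answer - expected) / scale)


# Assumed checks and weights for a state-mutation task (illustrative only).
weights = {"planted_zeroed": 0.5, "others_untouched": 0.5}
checks = {"planted_zeroed": False, "others_untouched": True}
print(weighted_state_reward(checks, weights))  # 0.5
```

With equal weights, one failed check out of two yields 0.5, which is one way a mid-range reward on a state task could arise.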

spend over time: USD per day, all models

[chart: $/day, 06-05 to 06-15]

by environment: pass rate and average steps

environment             diff    runs  pass rate  steps
top-spender-email       easy      25  88%        3.1
count-stale-processing  easy      36  97%        2.8
dedup-customers         hard      22  64%        7.1
fix-oversell            hard      28  64%        7.5
refund-order            medium    20  70%        4.8
revenue-by-category     medium    23  91%        4.3

pod health

live cluster window, last 60 minutes, generated 36h ago
replicas: 7/8 (current / desired, HPA)
pods running: 5/8 (ready in the deployment)
queue peak: 142 (max backlog in window)
inflight now: 14 (episodes executing)

queue depth: Redis stream backlog; HPA scales workers off this signal

[chart: queue depth from 16:01 to 16:51; y-axis 0 to 142]
A burst arrives, the autoscaler adds workers, the backlog drains. The two scale events are in the event feed below.
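
The scaling arithmetic behind a queue-depth signal can be sketched as below. It assumes the standard average-value rule (replicas = backlog divided by the per-replica target, rounded up) and uses the target of 20 per replica quoted in the event feed; the `min_replicas` and `max_replicas` bounds are hypothetical. The real autoscaler also applies a tolerance band and stabilization windows, so recorded replica counts need not match this calculation.

```python
import math


def desired_replicas(queue_depth, target_per_replica, min_replicas=1, max_replicas=12):
    """Replica count that keeps backlog per worker at or below the target.

    The min/max bounds are illustrative defaults, not values from the deployment.
    """
    wanted = math.ceil(queue_depth / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


print(desired_replicas(142, 20))  # 142 / 20 = 7.1, rounded up to 8
```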

pods: cpu and memory per worker

pod                    node           phase             restarts  cpu        mem          runs
podbench-worker-152a6  ip-10-2-1-37   Running           1         481/1000m  696/1024Mi   8
podbench-worker-80287  ip-10-2-2-104  Running           1         566/1000m  510/1024Mi   28
podbench-worker-e1872  ip-10-2-3-58   Running           0         525/1000m  597/1024Mi   27
podbench-worker-f4b72  ip-10-2-1-37   Running           0         353/1000m  556/1024Mi   20
podbench-worker-fca89  ip-10-2-2-104  Running           0         476/1000m  592/1024Mi   23
podbench-worker-6851e  unscheduled    Pending           0         0/1000m    0/1024Mi     18
podbench-worker-9dba3  ip-10-2-1-37   CrashLoopBackOff  8         40/1000m   980/1024Mi   15
podbench-worker-9a28c  ip-10-2-2-104  Completed         0         120/1000m  300/1024Mi   15

cluster events: scheduling, scaling, OOM, and app-level signals

16:58  BackOff            Back-off restarting failed container worker
16:54  TaskCompleted      run_top-spender-email passed reward=1.000 in 3 steps
16:48  TaskFailed         run_fix-oversell failed reward=0.500: planted_zeroed=false
16:41  SuccessfulRescale  New size: 8; reason: queue depth below target, scaling in
16:36  TaskCompleted      run_revenue-by-category passed reward=1.000 in 4 steps
16:29  TaskCompleted      run_dedup-customers passed reward=1.000 in 6 steps
16:27  RateLimited        anthropic 429 on 7 in-flight requests; honoring retry-after, backing off
16:15  BackOff            Back-off restarting failed container worker
16:14  Pulled             Container image podbench/worker:0.4.2 already present on machine
16:14  Created            Created container worker
16:13  OOMKilling         Container worker exceeded memory limit (1024Mi); killed
16:06  FailedScheduling   0/3 nodes are available: 3 Insufficient cpu. preemption not helpful
16:05  ScalingReplicaSet  Scaled up replica set podbench-worker to 12 from 6
16:02  SuccessfulRescale  New size: 12; reason: queue depth above target of 20 per replica
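
The RateLimited event above describes honoring retry-after and backing off. A minimal sketch of that pattern follows; `RateLimited` and `call_with_backoff` are hypothetical names for illustration and are not taken from podbench or any client library. When the server supplies a retry-after value it is used as the delay; otherwise the sketch falls back to exponential backoff with jitter.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical error raised by a client on HTTP 429."""

    def __init__(self, retry_after=None):
        super().__init__("429 rate limited")
        self.retry_after = retry_after


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn, retrying on RateLimited until max_attempts is exhausted."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited as err:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            if err.retry_after is not None:
                delay = err.retry_after  # honor the server's hint
            else:
                delay = base_delay * 2 ** attempt + random.uniform(0, base_delay)
            sleep(delay)
```

Passing `sleep` as a parameter keeps the function testable without real waiting.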

recent runs

from the reference corpus · newest first
when    environment             diff    model              result  reward  steps  cost      cache  rt  pod
2d ago  refund-order            medium  claude-sonnet-4-6  failed  0.330   8      $0.032    92%    0   podbench-worker-80287
2d ago  dedup-customers         hard    claude-sonnet-4-6  failed  0.700   9      $0.053    89%    0   podbench-worker-e1872
2d ago  revenue-by-category     medium  claude-sonnet-4-6  passed  1.000   6      $0.026    93%    0   podbench-worker-80287
3d ago  revenue-by-category     medium  claude-sonnet-4-6  failed  0.000   5      $0.037    92%    2   podbench-worker-fca89
3d ago  top-spender-email       easy    claude-sonnet-4-6  failed  0.000   5      $0.022    94%    0   podbench-worker-6851e
3d ago  top-spender-email       easy    claude-opus-4-8    passed  1.000   2      $0.013    95%    0   podbench-worker-e1872
3d ago  fix-oversell            hard    claude-opus-4-8    passed  1.000   8      $0.086    90%    0   podbench-worker-e1872
4d ago  top-spender-email       easy    claude-opus-4-8    passed  1.000   3      $0.022    94%    1   podbench-worker-152a6
4d ago  revenue-by-category     medium  claude-opus-4-8    passed  1.000   3      $0.024    93%    0   podbench-worker-80287
4d ago  fix-oversell            hard    claude-haiku-4-5   passed  1.000   7      $0.00957  94%    0   podbench-worker-80287
4d ago  fix-oversell            hard    claude-opus-4-8    passed  1.000   6      $0.045    93%    0   podbench-worker-f4b72
4d ago  count-stale-processing  easy    claude-opus-4-8    passed  1.000   3      $0.048    92%    0   podbench-worker-80287

environments

deterministic SQL tasks with programmatic reward

top-spender-email · easy

Highest lifetime spend by email
Which email address has the highest total lifetime spend? Sum the total of all orders that are not cancelled, grouped by the customer email (note that the same email can appear under more than one customer id). Call submit with the winning email address as the answer.
kind: answer / reward: proximity-graded
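
One way an agent might answer this task is sketched below against an in-memory SQLite database. The schema is an assumption: the table and column names (`customers.email`, `orders.customer_id`, `orders.total`, `orders.status`) and the literal status value 'cancelled' are illustrative and may differ from the real environment. Grouping by email rather than customer id handles the case where one email appears under several ids.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and toy data; customers 1 and 3 share an email.
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL, status TEXT);
INSERT INTO customers VALUES (1, 'a@example.com'), (2, 'b@example.com'), (3, 'a@example.com');
INSERT INTO orders VALUES
  (1, 1, 40.0, 'shipped'),
  (2, 2, 70.0, 'shipped'),
  (3, 3, 50.0, 'delivered'),
  (4, 2, 100.0, 'cancelled');
""")

# Sum non-cancelled order totals per email and keep the largest.
top_email = conn.execute("""
SELECT c.email
FROM orders o JOIN customers c ON c.id = o.customer_id
WHERE o.status <> 'cancelled'
GROUP BY c.email
ORDER BY SUM(o.total) DESC
LIMIT 1
""").fetchone()[0]
print(top_email)  # a@example.com (40 + 50 = 90 beats 70)
```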

count-stale-processing · easy

Stale orders stuck in processing
Today is 2026-06-15. How many orders have status 'processing' and were created more than 30 days ago? Call submit with the integer count as the answer.
kind: answer / reward: proximity-graded
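
A sketch of the count, assuming a hypothetical `orders` table with a `status` column and an ISO-8601 `created_at` date stored as text. It reads "more than 30 days ago" as strictly earlier than today minus 30 days, so an order created exactly 30 days before 2026-06-15 is not counted; the real verifier's boundary handling is not stated in the task.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and toy data.
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, created_at TEXT);
INSERT INTO orders VALUES
  (1, 'processing', '2026-05-01'),
  (2, 'processing', '2026-05-16'),
  (3, 'processing', '2026-06-10'),
  (4, 'shipped',    '2026-04-01');
""")

# date(:today, '-30 days') evaluates to '2026-05-16'; ISO dates compare correctly as text.
stale = conn.execute("""
SELECT COUNT(*) FROM orders
WHERE status = 'processing'
  AND created_at < date(:today, '-30 days')
""", {"today": "2026-06-15"}).fetchone()[0]
print(stale)  # 1: only the 2026-05-01 order is older than 30 days
```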

revenue-by-category · medium

Top category by net revenue
Which product category produced the most revenue? Sum qty * unit_price over order_items whose order is not cancelled, grouped by the product category. Call submit with the winning category name as the answer.
kind: answer / reward: proximity-graded
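
The aggregation described above might look like the following sketch. `order_items`, `qty`, and `unit_price` come from the task text; the `products` and `orders` tables, their columns, and the status value 'cancelled' are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Partly hypothetical schema with toy data; order 11 is cancelled.
conn.executescript("""
CREATE TABLE products (id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
CREATE TABLE order_items (order_id INTEGER, product_id INTEGER, qty INTEGER, unit_price REAL);
INSERT INTO products VALUES (1, 'books'), (2, 'games');
INSERT INTO orders VALUES (10, 'shipped'), (11, 'cancelled');
INSERT INTO order_items VALUES (10, 1, 2, 15.0), (10, 2, 1, 20.0), (11, 2, 5, 20.0);
""")

# Revenue per category over items whose order is not cancelled.
top_category = conn.execute("""
SELECT p.category
FROM order_items oi
JOIN orders o ON o.id = oi.order_id
JOIN products p ON p.id = oi.product_id
WHERE o.status <> 'cancelled'
GROUP BY p.category
ORDER BY SUM(oi.qty * oi.unit_price) DESC
LIMIT 1
""").fetchone()[0]
print(top_category)  # books (30.0 versus 20.0 for games)
```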

refund-order · medium

Issue a full refund for order 1007
Issue a full refund for order 1007. Insert one row into refunds with the order's total amount, then set that order's status to 'refunded'. Never refund the same order twice. Call submit with no answer when finished.
kind: state / reward: weighted state checks
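
A sketch of an idempotent refund, assuming hypothetical `orders(id, total, status)` and `refunds(id, order_id, amount)` tables. The helper `refund_order` is illustrative; it checks for an existing refund first so that a second call leaves the database unchanged, and it runs both writes in one transaction.

```python
import sqlite3


def refund_order(conn, order_id):
    """Insert one refund row and mark the order refunded.

    Returns False without writing anything if the order was already refunded.
    """
    with conn:  # commits on success, rolls back on exception
        already = conn.execute(
            "SELECT 1 FROM refunds WHERE order_id = ?", (order_id,)
        ).fetchone()
        if already:
            return False
        conn.execute(
            "INSERT INTO refunds (order_id, amount) "
            "SELECT id, total FROM orders WHERE id = ?",
            (order_id,),
        )
        conn.execute("UPDATE orders SET status = 'refunded' WHERE id = ?", (order_id,))
        return True


conn = sqlite3.connect(":memory:")
# Hypothetical schema and a single toy order.
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL, status TEXT);
CREATE TABLE refunds (id INTEGER PRIMARY KEY, order_id INTEGER, amount REAL);
INSERT INTO orders VALUES (1007, 59.5, 'delivered');
""")
first = refund_order(conn, 1007)   # True: refund recorded
second = refund_order(conn, 1007)  # False: already refunded, nothing written
```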

fix-oversell · hard

Clamp negative stock from an oversell
An overselling bug left some products with negative stock. Set stock to 0 for every product whose stock is currently negative. Leave every product that already had non-negative stock unchanged. Call submit when finished.
kind: state / reward: weighted state checks
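
The clamp can be expressed as a single guarded update, sketched here on a hypothetical `products(id, stock)` table. The `WHERE stock < 0` predicate is what keeps rows with non-negative stock untouched.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema; products 2 and 4 carry negative stock.
conn.executescript("""
CREATE TABLE products (id INTEGER PRIMARY KEY, stock INTEGER);
INSERT INTO products VALUES (1, 12), (2, -3), (3, 0), (4, -1);
""")

with conn:  # single transaction
    cur = conn.execute("UPDATE products SET stock = 0 WHERE stock < 0")
changed = cur.rowcount  # number of rows that were negative

stocks = [row[0] for row in conn.execute("SELECT stock FROM products ORDER BY id")]
print(changed, stocks)  # 2 [12, 0, 0, 0]
```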

dedup-customers · hard

Merge duplicate customer records
Customers 4 and 9 are the same person (they share the email rmoreno@example.com). Merge them: reassign every order belonging to customer 9 over to customer 4, then delete customer 9. Keep the lower id. Call submit when finished.
kind: state / reward: weighted state checks
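
The merge steps above can be sketched as two statements in one transaction: reassign first, then delete, so no order is ever left pointing at a missing customer. The `customers(id, email)` and `orders(id, customer_id)` schema is an assumption for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema; customers 4 and 9 share one email.
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER REFERENCES customers(id));
INSERT INTO customers VALUES (4, 'rmoreno@example.com'), (9, 'rmoreno@example.com');
INSERT INTO orders VALUES (1, 4), (2, 9), (3, 9);
""")

with conn:  # both statements commit together or not at all
    conn.execute("UPDATE orders SET customer_id = 4 WHERE customer_id = 9")
    conn.execute("DELETE FROM customers WHERE id = 9")

owners = [row[0] for row in conn.execute("SELECT customer_id FROM orders ORDER BY id")]
remaining = [row[0] for row in conn.execute("SELECT id FROM customers")]
print(owners, remaining)  # [4, 4, 4] [4]
```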