podbench v0.5.0

Run agents live against the real model and the programmatic verifier. Every run is recorded in your own browser and feeds a dashboard generated from your session: model behavior and pod health build up as each episode lands, rather than being copied from the reference corpus.

run an agent

live, against the real model and verifier · needs ANTHROPIC_API_KEY

your session

your runs: this browser only
pass rate: programmatic verifier
spend: real API cost
cache hit rate: input from cache
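
As a rough illustration of how the four session numbers above could be derived from individual run records, here is a minimal sketch. It is not podbench's actual code: the `RunRecord` shape, its field names, and the `summarize` helper are all hypothetical. It also assumes that cache hit rate means cached input tokens divided by total input tokens, which the page does not state.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    # Hypothetical record shape; field names are illustrative, not podbench's schema.
    environment: str
    model: str
    passed: bool              # verdict from the programmatic verifier
    reward: float
    steps: int
    cost_usd: float           # API cost of the run
    input_tokens: int         # total input tokens, cached or not
    cached_input_tokens: int  # input tokens that were read from cache


def summarize(runs):
    """Aggregate a session's runs into the four headline numbers."""
    total_input = sum(r.input_tokens for r in runs)
    cached_input = sum(r.cached_input_tokens for r in runs)
    return {
        "runs": len(runs),
        # Fraction of runs the verifier accepted; 0.0 for an empty session.
        "pass_rate": sum(r.passed for r in runs) / len(runs) if runs else 0.0,
        "spend_usd": sum(r.cost_usd for r in runs),
        # Assumed definition: share of input tokens served from cache.
        "cache_hit_rate": cached_input / total_input if total_input else 0.0,
    }
```

Calling `summarize` on an empty list returns zeros rather than raising, which matches a dashboard that starts blank and fills in as episodes land.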

your recent runs

when · environment · model · result · reward · steps · cost · cache