podbench v0.5.0

Run the same deterministic environment head-to-head across models and trials, then read the efficiency frontier: the cheapest model that still clears your quality bar. Every run is live and metered; results also accumulate in your demo session.
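As a rough illustration of reading the efficiency frontier, the sketch below picks the cheapest model whose pass rate still meets a quality bar. It is a minimal sketch, not podbench's own code: the `ModelSummary` type, its fields, and the `cheapest_passing` function are hypothetical names, and it assumes the quality bar is a threshold on pass rate.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class ModelSummary:
    # Hypothetical per-model summary; field names are assumptions.
    model: str
    pass_rate: float     # fraction of runs that passed, 0.0 to 1.0
    cost_per_run: float  # average dollars per run


def cheapest_passing(
    summaries: Iterable[ModelSummary], quality_bar: float
) -> Optional[ModelSummary]:
    """Return the lowest-cost model whose pass rate meets the bar, or None."""
    eligible = [s for s in summaries if s.pass_rate >= quality_bar]
    if not eligible:
        return None  # no model clears the bar
    return min(eligible, key=lambda s: s.cost_per_run)
```

Returning `None` when nothing clears the bar keeps "no acceptable model" distinct from "the cheapest model", which matters when the bar is set high.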

benchmark sweep

same environment, head-to-head across models · real API cost
models
2 models × 2 trials = 4 live calls

head-to-head

aggregated over this sweep's completed runs
model · runs · pass rate · avg reward · $/run · latency · cache
Configure a sweep above and run it to compare models head-to-head.
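The head-to-head rows could be computed along these lines. This is a sketch under stated assumptions, not the tool's implementation: each completed run is assumed to be a dict with the hypothetical keys `model`, `passed`, `reward`, `cost_usd`, and `latency_s`, and every metric is assumed to be a plain mean over that model's runs. The cache column is left out because the page does not define it.

```python
from collections import defaultdict
from statistics import mean


def aggregate(runs):
    """Group completed runs by model and average each metric.

    The run keys used here are hypothetical, chosen for illustration.
    """
    by_model = defaultdict(list)
    for run in runs:
        by_model[run["model"]].append(run)

    rows = []
    for model, group in by_model.items():
        rows.append({
            "model": model,
            "runs": len(group),
            "pass_rate": mean(1.0 if r["passed"] else 0.0 for r in group),
            "avg_reward": mean(r["reward"] for r in group),
            "cost_per_run": mean(r["cost_usd"] for r in group),
            "latency_s": mean(r["latency_s"] for r in group),
        })
    return rows
```

Rows come back in the order each model first appears in the input, one row per model.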