podbench / agent environment fleet

benchmark sweep

same environment, head-to-head across models · real API cost

environment · effort · trials / model

models

2 models × 2 trials = 4 live calls

aggregated over this sweep's completed runs

model	runs	pass rate	avg reward	trust	flags	$/run	cache
Configure a sweep above and run it to compare models head-to-head.
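
For illustration only, here is a minimal sketch of how the call count and a few of the per-model columns could be computed from completed runs. The `Run` record, its field names, and both helper functions are hypothetical and not taken from podbench. Only `runs`, `pass rate`, `avg reward`, and `$/run` are covered, because the page does not define `trust`, `flags`, or `cache`.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Run:
    """One completed trial (hypothetical record; field names are assumptions)."""
    model: str
    passed: bool
    reward: float
    cost_usd: float


def planned_calls(n_models: int, trials_per_model: int) -> int:
    """Live calls in a sweep: every model runs every trial once."""
    return n_models * trials_per_model


def aggregate(runs: list[Run]) -> dict[str, dict[str, float]]:
    """Group completed runs by model and average the per-run values."""
    by_model: dict[str, list[Run]] = defaultdict(list)
    for run in runs:
        by_model[run.model].append(run)

    rows = {}
    for model, group in by_model.items():
        n = len(group)
        rows[model] = {
            "runs": n,
            "pass_rate": sum(r.passed for r in group) / n,
            "avg_reward": sum(r.reward for r in group) / n,
            "usd_per_run": sum(r.cost_usd for r in group) / n,
        }
    return rows
```

With 2 models and 2 trials each, `planned_calls(2, 2)` gives the 4 live calls shown above. Models with no completed runs do not appear in the result, which matches a table aggregated over completed runs only.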