benchmark sweep
same environment, head-to-head across models · real API cost
models
2 models × 2 trials = 4 live calls
head-to-head
aggregated over this sweep's completed runs

| model | runs | pass rate | avg reward | $/run | latency | cache |
|---|---|---|---|---|---|---|
| Configure a sweep above and run it to compare models head-to-head. | | | | | | |
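
The page does not say how each column is computed, so here is a minimal sketch of one plausible per-model aggregation over completed runs. Everything in it is an assumption made for illustration: the `Run` record and its field names, the `aggregate` helper, reading "cache" as the fraction of runs with a cache hit, measuring latency in seconds, and the sample values, which are invented and are not results from any real sweep.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class Run:
    """One completed run in a sweep (hypothetical record shape)."""
    model: str
    passed: bool
    reward: float
    cost_usd: float
    latency_s: float   # assumed unit: seconds
    cache_hit: bool    # assumed meaning of the "cache" column


def aggregate(runs):
    """Group completed runs by model and average each column."""
    by_model = defaultdict(list)
    for run in runs:
        by_model[run.model].append(run)

    rows = {}
    for model, group in by_model.items():
        rows[model] = {
            "runs": len(group),
            "pass_rate": mean(1.0 if r.passed else 0.0 for r in group),
            "avg_reward": mean(r.reward for r in group),
            "usd_per_run": mean(r.cost_usd for r in group),
            "latency_s": mean(r.latency_s for r in group),
            "cache_rate": mean(1.0 if r.cache_hit else 0.0 for r in group),
        }
    return rows


# Made-up sample: 2 placeholder models x 2 trials = 4 runs.
runs = [
    Run("model-a", True, 1.0, 0.25, 2.0, False),
    Run("model-a", False, 0.0, 0.75, 4.0, True),
    Run("model-b", True, 1.0, 0.50, 1.0, True),
    Run("model-b", True, 1.0, 0.50, 3.0, True),
]
rows = aggregate(runs)

for model, row in rows.items():
    print(model, row)
```

Only completed runs are passed in, matching the "aggregated over this sweep's completed runs" caption; how failed or still-pending calls should be counted is left open here.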