Design — TogetherBench

Design

From real sessions to verifiable tasks

Static benchmarks hand an agent a complete spec up front and grade only its final code. Real coding help is interactive — users clarify goals, add constraints, and correct mistakes across many turns. SWE-Together reconstructs that loop from real user–agent sessions and scores agents as collaborators, not one-shot solvers.

Reconstructed from real sessions

Every task comes from an actual user–agent coding session, not a synthetic prompt — the first-turn instruction is the user's verbatim initial request.

Curated for verifiability

From 11,260 recorded sessions we keep 109 with a recoverable repository state, a clear user goal, and an observable outcome such as submitted code changes.
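
The curation step can be read as a three-way filter. The sketch below is only an illustration: the `Session` fields are hypothetical names for the three criteria listed above, and the source states nothing about how each criterion is actually checked.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Session:
    # Hypothetical flags, one per criterion named in the text.
    recoverable_repo_state: bool
    clear_user_goal: bool
    observable_outcome: bool


def is_verifiable(session: Session) -> bool:
    """Keep a session only if it meets all three criteria."""
    return (
        session.recoverable_repo_state
        and session.clear_user_goal
        and session.observable_outcome
    )


def curate(sessions: Iterable[Session]) -> List[Session]:
    """Return the sessions that pass the verifiability filter."""
    return [s for s in sessions if is_verifiable(s)]


# Retention rate implied by the reported counts: 109 of 11,260 sessions.
retention = 109 / 11_260
print(f"{retention:.2%}")
```

By the reported counts, just under 1% of recorded sessions survive curation.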

Multi-turn by design

Intent is revealed incrementally — clarifications, added requirements, and corrections — so each task preserves the real interaction loop instead of one fixed instruction.

Anchored user simulator

A state-conditional LLM simulator keeps the original user's intents and intervention order, releasing feedback only when its trigger conditions arise in the agent's trajectory.
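
One way to picture "original order, released only on trigger" is a queue whose head is the only intervention allowed to fire. This is a rule-based stand-in under stated assumptions, not the benchmark's LLM simulator: `Intervention`, `AnchoredSimulator`, and the dictionary-shaped agent state are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Intervention:
    message: str                       # what the original user said
    trigger: Callable[[Dict], bool]    # hypothetical condition on the agent's state


class AnchoredSimulator:
    """Releases recorded interventions in their original order."""

    def __init__(self, interventions: List[Intervention]) -> None:
        self._pending = list(interventions)

    def next_feedback(self, state: Dict) -> Optional[str]:
        # Only the earliest unreleased intervention may fire, so a later
        # one can never overtake it even if its own trigger already holds.
        if self._pending and self._pending[0].trigger(state):
            return self._pending.pop(0).message
        return None
```

Checking only the head of the queue is what preserves the intervention order; a real simulator would replace the boolean triggers with a model's judgment of the trajectory.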

Scored by frozen rubrics

The final repository state is judged against task-specific, implementation-agnostic rubrics derived from repository inspection and original-session evidence.

Measures interaction, not just success

Beyond final correctness, we report two interaction metrics: User Correction, which measures how much corrective steering the user needed, and Intent Coverage, which measures whether the user's intents are conveyed consistently.

Composition

What's in the suite

FAQ

Frequently asked questions

What makes TogetherBench different from single-turn benchmarks like SWE-bench?

Single-turn benchmarks test whether an agent can solve a task from one prompt. In practice, developers steer agents through multiple turns — redirecting, clarifying, and reviewing. TogetherBench captures this loop: a user simulator replays the original human interaction pattern, and the headline metric is the gain from turn 0 (first attempt) to the final turn after corrections.
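
The turn-0-to-final gain can be written down in a few lines. This is a minimal sketch assuming each task yields one score per turn on a common scale; the function names and the choice of a plain mean across tasks are assumptions, not something the source specifies.

```python
from typing import List, Sequence


def interaction_gain(turn_scores: Sequence[float]) -> float:
    """Final-turn score minus turn-0 score for one task."""
    if not turn_scores:
        raise ValueError("need at least one turn score")
    return turn_scores[-1] - turn_scores[0]


def mean_gain(per_task_scores: List[Sequence[float]]) -> float:
    """Average gain across tasks (assumed aggregation)."""
    if not per_task_scores:
        raise ValueError("need at least one task")
    gains = [interaction_gain(scores) for scores in per_task_scores]
    return sum(gains) / len(gains)
```

A task that the agent already solves at turn 0 contributes a gain of zero, so this number reflects how much the interaction helped rather than how strong the first attempt was.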

How does the user simulator work?

Each turn, the coding agent's work is distilled into a structured summary: a compact digest of what changed, what's still broken, and which tests pass. A Gemini-based user simulator reads this summary alongside the ground-truth session notes, then decides what a human would say next: a correction, a follow-up question, a new requirement, or "looks good, stop." The simulator never sees the agent's raw code, only the summary.
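
To make the information boundary concrete, the sketch below assembles what the simulator would read on one turn. Everything here is illustrative: `TurnSummary`, `ReplyKind`, and `build_simulator_input` are hypothetical names, the text layout is invented, and no model call is shown.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class ReplyKind(Enum):
    # The four kinds of reply named in the text.
    CORRECTION = "correction"
    FOLLOW_UP = "follow-up question"
    NEW_REQUIREMENT = "new requirement"
    STOP = "looks good, stop"


@dataclass
class TurnSummary:
    changed: List[str]        # what changed this turn
    still_broken: List[str]   # what is still broken
    tests_passing: List[str]  # which tests pass


def build_simulator_input(summary: TurnSummary, session_notes: str) -> str:
    """Combine the digest with the session notes; raw code is never included."""
    options = ", ".join(kind.value for kind in ReplyKind)
    return "\n".join([
        "Session notes (ground truth):",
        session_notes,
        "What changed: " + "; ".join(summary.changed),
        "Still broken: " + "; ".join(summary.still_broken),
        "Tests passing: " + "; ".join(summary.tests_passing),
        "Reply with one of: " + options,
    ])
```

Because the function accepts only a `TurnSummary` and the notes, there is no path by which the agent's diff could reach the simulator.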

Why use an agentic judge instead of test suites?

Narrow test suites reject correct solutions that take a different implementation path; OpenAI found 35.5% of SWE-bench test failures were false negatives. Our agentic judge (Opus 4.6 in an E2B sandbox) reads the full diff, runs the tests, and scores the result against weighted completeness goals. Rankings are based on the judge score, not the raw test reward.
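
Scoring against weighted completeness goals can be sketched as a weighted average. The normalization by total weight and the per-goal scores in [0, 1] are assumptions made for illustration; the source does not state the judge's actual aggregation.

```python
from typing import List, Tuple


def weighted_goal_score(goals: List[Tuple[float, float]]) -> float:
    """Aggregate (weight, score) pairs into one score in [0, 1].

    Each score is assumed to lie in [0, 1]; weights must sum to a
    positive number.
    """
    total_weight = sum(weight for weight, _ in goals)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(weight * score for weight, score in goals) / total_weight


# Hypothetical rubric: one heavily weighted goal met, one missed, one half met.
example = weighted_goal_score([(2.0, 1.0), (1.0, 0.0), (1.0, 0.5)])
```

Weighting lets a core requirement count for more than a stylistic one, which a pass/fail test suite cannot express.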

Where do the tasks come from?

Every task is reconstructed from a real developer session on a public GitHub repository (20+ stars). The instruction is the verbatim first user message (PII-redacted). The Docker environment clones the repo at the exact pre-fix commit with all dependencies installed. We currently have 109 tasks spanning TypeScript, Go, Python, Rust, and more — covering bugfixes, features, and refactors.
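
Pinning a repository to a specific commit can be done with two standard git commands. The helper below only builds the command lists and runs nothing; the function name and any URL, commit, or path passed to it are placeholders, and the benchmark's actual Docker setup is not described in the source.

```python
from typing import List


def checkout_commands(repo_url: str, commit: str, dest: str) -> List[List[str]]:
    """Return git commands that clone a repo and pin it to one commit."""
    return [
        ["git", "clone", repo_url, dest],
        # --detach checks out the exact commit without creating a branch.
        ["git", "-C", dest, "checkout", "--detach", commit],
    ]
```

Each list can be passed to `subprocess.run(cmd, check=True)` inside an image build step; installing dependencies would follow and depends on the repository's language.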