
SWE-Together: Evaluating Coding Agents in Interactive User Sessions


Leaderboard

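| Model | pass@1 | pass² | judge ↑ | correction ↓ | tokens ↓ | min ↓ |
|---|---|---|---|---|---|---|
| Oracle reference | ~78% | – | 0.904 | – | – | – |
| claude-opus-4.8 | 63% | 52% | **0.801** | **1.38** | 74.0k | 23.3 |
| gpt-5.5 | 58% | 48% | 0.763 | 1.59 | **29.9k** | **10.7** |
| claude-opus-4.6 | 58% | 46% | 0.755 | 1.59 | 42.0k | 23.2 |
| glm-5.2 | 55% | 42% | 0.735 | 1.53 | 41.7k | 24.5 |
| glm-5.1 | 52% | 34% | 0.729 | 1.54 | 41.6k | 38.8 |
| deepseek-v4-pro | 48% | 29% | 0.679 | 1.76 | 49.8k | 21.0 |
| minimax-2.7 | 39% | 24% | 0.630 | 2.17 | 43.4k | 36.2 |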

SWE-Together · 109 tasks · opencode harness · k = 2. Models sorted by pass@1. On each bar, the dark number = pass@1 and the white number = pass² (both runs solve at judge ≥ 0.85); the hatched tail is unstable (pass@1 − pass²). Oracle is the gold-patch reference ceiling. The best value in each column is bold (judge ↑ higher is better; correction / tokens / min ↓ lower is better).
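The bar quantities in the caption can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own scoring code. It assumes that pass@1 is the mean per-run solve rate over the k runs of each task and that a run counts as solved when its judge score is at least 0.85, the threshold the caption gives for pass². The function name `summarize` and the input layout are hypothetical.

```python
from typing import Dict, Sequence

# Judge score at or above which a run counts as solved (threshold from the caption).
SOLVE_THRESHOLD = 0.85


def summarize(
    judge_scores: Sequence[Sequence[float]],
    threshold: float = SOLVE_THRESHOLD,
) -> Dict[str, float]:
    """Compute pass@1, pass^2 and the unstable tail.

    judge_scores[i] holds the judge scores of the k runs on task i
    (k = 2 on the leaderboard). Assumed definition: pass@1 is the
    average fraction of runs that solve a task.
    """
    n_tasks = len(judge_scores)
    if n_tasks == 0:
        raise ValueError("need at least one task")

    pass_at_1_sum = 0.0
    all_solved_count = 0
    for runs in judge_scores:
        solved = [score >= threshold for score in runs]
        pass_at_1_sum += sum(solved) / len(solved)  # per-task solve rate
        all_solved_count += all(solved)             # every run solved the task

    pass_at_1 = pass_at_1_sum / n_tasks
    pass_all = all_solved_count / n_tasks
    return {
        "pass@1": pass_at_1,
        "pass^2": pass_all,
        "unstable": pass_at_1 - pass_all,  # the hatched tail on each bar
    }


if __name__ == "__main__":
    # Four made-up tasks with two runs each (illustrative numbers only).
    scores = [[0.9, 0.9], [0.9, 0.5], [0.2, 0.3], [0.85, 0.86]]
    print(summarize(scores))
```

Under these assumptions pass² can never exceed pass@1, so the unstable tail is never negative. With k = 2 it equals half the fraction of tasks that exactly one of the two runs solved.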

Evaluation runs · provider · date (2026)

  • claude-opus-4.8 · openrouter · 6/17
  • gpt-5.5 · openrouter · 6/28
  • claude-opus-4.6 · openrouter · 6/8
  • glm-5.2 · openrouter · 6/17
  • glm-5.1 · openrouter · 6/8
  • deepseek-v4-pro · deepseek · 6/8
  • minimax-2.7 · openrouter · 6/8

Get in touch

We'd love to hear from you

Have a question, a session you think would make a great task, or feedback on the benchmark? Want to contribute or collaborate? Contact Yifan Wu and Shengzhi Li, join the Discord, or open an issue or pull request on GitHub.


Citation

Cite SWE-Together

@article{wu2026swetogether,
  title   = {SWE-Together: Evaluating Coding Agents in Interactive User Sessions},
  author  = {Wu, Yifan and Zhao, Zhuokai and Li, Songlin and Lee, Ho Hin and Zhu, Jiacheng and Wu, Shirley and Yu, Tianhe and Li, Serena and Zhang, Lizhu and Fan, Xiangjun and Li, Shengzhi},
  year    = {2026},
  journal = {arXiv preprint arXiv:2606.29957},
  url     = {https://arxiv.org/pdf/2606.29957}
}