Add CHI-Bench eval results — agent harness: OpenAI Agents SDK

#197
by hlnchen - opened

Adds CHI-Bench (actAVA) evaluation results for deepseek-ai/DeepSeek-V4-Pro.

  • Benchmark: actava/chi-bench (evaluation_framework: harbor)
  • Agent harness: OpenAI Agents SDK (best-performing harness for this model)
  • Protocol: 75 managed-care tasks x 3 trials each; metric: pass@1 (%)
  • Scores: Overall 14.2 | PA 10.7 | UM 28.0 | CM 4.0; reliability (pass^3): 9.3
  • Source: CHI-Bench paper, arXiv:2605.16679
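For readers unfamiliar with the two metrics, here is a minimal sketch of how pass@1 and pass^3 could be computed from per-task trial outcomes. It assumes pass@1 is the per-trial success rate averaged over tasks and pass^k is the chance that k trials of a task all succeed (the common pass^k estimator); the paper's exact definitions may differ. The function names and the toy data are hypothetical and are not taken from CHI-Bench or harbor.

```python
from math import comb
from typing import Dict, List


def pass_at_1(results: Dict[str, List[bool]]) -> float:
    """Per-trial success rate averaged over tasks, in percent (assumed definition)."""
    per_task = [sum(trials) / len(trials) for trials in results.values()]
    return 100.0 * sum(per_task) / len(per_task)


def pass_hat_k(results: Dict[str, List[bool]], k: int) -> float:
    """Percent chance that k trials of a task all succeed, averaged over tasks.

    Uses the estimator C(c, k) / C(n, k), where c is the number of successful
    trials out of n. With n == k == 3 this is 1 only if all 3 trials pass.
    """
    per_task = []
    for trials in results.values():
        n, c = len(trials), sum(trials)
        per_task.append(comb(c, k) / comb(n, k))
    return 100.0 * sum(per_task) / len(per_task)


# Hypothetical toy data: 4 tasks x 3 trials (not real CHI-Bench outcomes).
toy_results = {
    "task_a": [True, True, True],
    "task_b": [True, False, False],
    "task_c": [False, False, False],
    "task_d": [True, True, False],
}

print(f"pass@1 = {pass_at_1(toy_results):.1f}")
print(f"pass^3 = {pass_hat_k(toy_results, 3):.1f}")
```

Under these assumptions pass^3 can never exceed pass@1, which matches the reported gap between 14.2 and 9.3.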

Submitted as community-provided results; close the PR if disputed.

