R2E: Turning any GitHub Repository into a Programming Agent Environment

Introduction

R2E is a framework that turns any GitHub repository into a programming agent environment. It does so through a novel equivalence test generation technique that combines program analysis with LLMs. These environments can be used to benchmark both static LLMs and dynamic programming agents that interact with an interpreter on real-world (potentially unseen) codebases. R2E environments can also be used to improve LLMs themselves, by fine-tuning models with execution traces from such real-world codebases.

Test Harness Generation: The 🗝️ to Environments

The R2E framework generates test harnesses for any GitHub repository by following these key design principles:

Equivalence Tests, not Output Prediction. R2E decouples input generation from output prediction by generating equivalence tests, which use the ground-truth function implementation to produce the expected outputs. Generating inputs alone is a much simpler task than predicting the output of a function.
Harnesses, not I/O Pairs. R2E generates test harnesses that contain the test cases along with a setup that prepares additional information such as database connections, external files, and configurations. This is a departure from the I/O examples with primitive types used in traditional benchmarks.
Sliced Context, not Entire Repositories. R2E uses a novel prompt construction approach based on dependency slicing, which provides the minimal repository context required to understand the function implementation. Empirically, we find it to be both cheaper and more effective than using the entire repository context.
Execution and Coverage for Quality. The generated test harnesses are run in a self-equivalence mode to filter wrong or stochastic tests. Additionally, we evaluate the quality of the generated tests using branch coverage.
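The principles above can be illustrated with a minimal sketch of an equivalence test harness. This is not R2E's actual generated code: the functions `reference_word_count` and `candidate_word_count`, the helper names, and the use of `unittest` are all assumptions made for illustration. The harness takes expected outputs from the reference implementation, and the self-equivalence check runs the reference against itself to reject harnesses that are wrong or non-deterministic.

```python
import random
import unittest
from collections import Counter


def reference_word_count(text):
    """Stand-in for the ground-truth function from the repository (hypothetical)."""
    counts = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts


def candidate_word_count(text):
    """Stand-in for a generated implementation under test (hypothetical)."""
    return dict(Counter(text.lower().split()))


def make_harness(reference, candidate):
    """Build a test class that checks candidate against reference."""

    class EquivalenceTest(unittest.TestCase):
        def setUp(self):
            # Harness setup: prepare the inputs (a real harness might also
            # create files, configurations, or database connections here).
            rng = random.Random(0)  # fixed seed keeps input generation deterministic
            vocab = ["Repo", "test", "Harness", "code", "agent"]
            self.inputs = [
                " ".join(rng.choices(vocab, k=rng.randint(1, 8))) for _ in range(20)
            ]

        def test_equivalence(self):
            for text in self.inputs:
                # The expected value is computed by the reference, not predicted.
                self.assertEqual(candidate(text), reference(text))

    return EquivalenceTest


def run_harness(test_cls):
    """Run a harness and report whether every test passed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(test_cls)
    result = unittest.TestResult()
    suite.run(result)
    return result.wasSuccessful()


def passes_self_equivalence(reference, runs=3):
    """Self-equivalence mode: the reference must agree with itself on every run."""
    return all(run_harness(make_harness(reference, reference)) for _ in range(runs))
```

In this sketch, a harness whose reference function returns a fresh random number on each call would fail `passes_self_equivalence` and be filtered out, while a correct candidate passes `run_harness` without anyone having to predict an output by hand.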

The following figure depicts an example test harness generated by R2E.

We evaluate the quality of the generated test harnesses by measuring the validity and coverage of the generated tests. The following table compares a simple output prediction setting against our equivalence test generation setting. The equivalence tests achieve better validity because input generation is decoupled from output prediction. We also study different context creation approaches and find that sliced context is the most effective (see paper for details).

Strategy	In-File Validity	In-File Coverage	Out-File Validity	Out-File Coverage
Output Pred.	35.43%	87.59%	30.68%	82.54%
Equivalence	52.37%	88.18%	35.01%	79.65%
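To make the idea of sliced context more concrete, here is a rough sketch of dependency slicing within a single module. It is an assumption-laden simplification, not R2E's slicer: it only follows direct calls to top-level functions in one file, using Python's standard `ast` module, and the `SOURCE` snippet and `slice_context` name are hypothetical.

```python
import ast

# Hypothetical module contents used only for this illustration.
SOURCE = '''
def load(path):
    return open(path).read()

def tokenize(text):
    return text.split()

def count_tokens(path):
    return len(tokenize(load(path)))

def unrelated():
    return 42
'''


def slice_context(source, target):
    """Return {name: source} for target and the local functions it depends on."""
    tree = ast.parse(source)
    functions = {
        node.name: node for node in tree.body if isinstance(node, ast.FunctionDef)
    }
    seen, stack = set(), [target]
    while stack:
        name = stack.pop()
        if name in seen or name not in functions:
            continue  # already visited, or a builtin/external name
        seen.add(name)
        for node in ast.walk(functions[name]):
            # Follow only plain calls such as f(x); method calls are ignored here.
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                stack.append(node.func.id)
    # Preserve the original source order of the kept definitions.
    return {
        name: ast.get_source_segment(source, node)
        for name, node in functions.items()
        if name in seen
    }
```

Slicing for `count_tokens` keeps `load`, `tokenize`, and `count_tokens` but drops `unrelated`, so a prompt built from the slice is shorter than one built from the whole file.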

Evaluating Real-world Code Generation

R2E environments enable the evaluation of code generation systems. In particular, we can evaluate both single-round "static" code generation from LLMs and multi-round programming agents. We construct R2E-Eval1, a large-scale benchmark spanning 137 repositories that can be used to evaluate code generation systems. The figure on the bottom left compares the single-round performance of current LLMs, and the figure on the bottom right depicts the performance of self-repair agents over multiple rounds.

Model performance on R2E-Eval1 is considerably lower than on traditional benchmarks like HumanEval, highlighting the challenges of real-world codebases. We also find that self-repair agents can significantly improve the performance of LLMs by using oracle mistake feedback and compiler interactions over multiple turns. GPT-4 can fix more than 30% of the mistakes after 3 turns! We will release a larger and more challenging benchmark (R2E-Eval2) soon with more detailed results. Stay tuned!
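The multi-turn self-repair setting can be sketched as a simple loop. This is only an illustration under stated assumptions: `generate` stands in for an LLM call, `fake_model` is a toy replacement for it, and the feedback format (a list of failure messages from equivalence tests) is hypothetical rather than the exact feedback used in R2E-Eval1.

```python
def run_equivalence_tests(candidate, reference, inputs):
    """Return a list of failure messages; an empty list means all tests passed."""
    failures = []
    for x in inputs:
        expected = reference(x)
        try:
            actual = candidate(x)
        except Exception as exc:
            failures.append(f"input={x!r}: raised {type(exc).__name__}: {exc}")
            continue
        if actual != expected:
            failures.append(f"input={x!r}: expected {expected!r}, got {actual!r}")
    return failures


def self_repair(generate, reference, inputs, max_turns=3):
    """Ask for a candidate, test it, and feed failures back for up to max_turns."""
    feedback = []
    for turn in range(1, max_turns + 1):
        candidate = generate(feedback)  # hypothetical model call
        feedback = run_equivalence_tests(candidate, reference, inputs)
        if not feedback:
            return candidate, turn  # repaired (or correct) on this turn
    return None, max_turns  # still failing after the turn budget


def reference_abs(x):
    """Toy ground-truth function for the demonstration."""
    return x if x >= 0 else -x


def fake_model(feedback):
    """Toy stand-in for an LLM: wrong at first, correct once it sees feedback."""
    if not feedback:
        return lambda x: x  # first attempt mishandles negative inputs
    return lambda x: abs(x)


repaired, turns_used = self_repair(fake_model, reference_abs, [-2, 0, 3])
```

Here the first attempt fails on the negative input, the failure message is passed back, and the second attempt passes, so `turns_used` is 2.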

BibTeX

@inproceedings{
    jain2024r2e,
    title={R2E: Turning any Github Repository into a Programming Agent Environment},
    author={Naman Jain and Manish Shetty and Tianjun Zhang and King Han and Koushik Sen and Ion Stoica},
    booktitle={ICML 2024},
}
