GSR-Bench
Automating and Benchmarking LLM Agent with Tool-Use on Deploying Science Research Repositories
Overview
The increasing complexity of computer science research demands more effective tools for managing code repositories. Large Language Models (LLMs), such as Anthropic Claude and Meta Llama, have demonstrated significant advancements across various fields of computer science research, including the automation of diverse software engineering tasks.
We design a novel framework that utilizes LLM agents to automate the exploration of code repositories of computer science research projects. Specifically, by reading the instructions in markdown files and interpreting the repository structure, the model generates and iteratively refines example scripts/code that set up the experimental environment and carry out research tasks.
To evaluate the effectiveness of LLMs in handling complex scientific development tasks on GitHub, particularly for ML/CV/NLP topics, we introduce the GitHub Science Repository Benchmark (GSR-Bench). This benchmark assesses LLMs along several dimensions, including accuracy, efficiency, and deployment-script quality, aiming to explore their potential for conducting computer science research autonomously. Preliminary results from GSR-Bench indicate that LLMs can significantly streamline repository deployment, thereby augmenting developer productivity and improving the management of development workflows.
Framework Design
The system comprises five agents: Command Drafter, Script Executor, Log Analyzer, Issue Retriever, and Web Searcher. These agents collectively facilitate the deployment of a science repository. The deployment task consists of several sub-tasks: environment setup, data preparation, training, and test/demo. For reproducibility, and to safeguard against repository executions that require system permissions, we use a Docker container to isolate the experiment environment for each repository.
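For illustration, here is a minimal sketch of how such per-repository isolation could be set up through the Docker CLI; the image name, container naming scheme, and mount path are assumptions rather than the benchmark's actual configuration.

```python
import subprocess

def launch_container(repo_name: str, repo_path: str,
                     image: str = "gsr-bench-base:latest") -> str:
    """Start a long-lived container that isolates one repository's experiments.

    The image name and mount point are placeholders; an image with bash,
    Conda, GCC, and Make preinstalled is assumed.
    """
    container = f"gsr_{repo_name}"
    subprocess.run(
        [
            "docker", "run", "-d", "--name", container,
            "-v", f"{repo_path}:/workspace/{repo_name}",  # mount the repository
            image, "sleep", "infinity",                   # keep the container alive
        ],
        check=True,
    )
    return container
```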
- Command Drafter
This agent reads the deployment instructions and adjusts the script paths to ensure that the execution commands are correct. It divides the entire script into five sections, each corresponding to a step in the science repository deployment. This sectional division also serves as an evaluation standard later on.
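A rough sketch of what this drafting step could look like, assuming an OpenAI-style chat client; the prompt wording, section names, and model name are illustrative placeholders, not the benchmark's exact prompt.

```python
from openai import OpenAI  # any chat-completion client works; shown for concreteness

SECTIONS = ["setup", "download", "training", "inference", "evaluation"]

DRAFTER_PROMPT = (
    "You are given the README of a research repository and its file tree. "
    "Draft the shell commands needed to deploy it, grouped under these sections: "
    "{sections}. Adjust all paths so the commands run from the repository root."
)

def draft_commands(readme: str, file_tree: str, model: str = "gpt-4o") -> str:
    """Ask the LLM for a sectioned deployment script (sketch only)."""
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": DRAFTER_PROMPT.format(sections=", ".join(SECTIONS))},
            {"role": "user", "content": f"README:\n{readme}\n\nFile tree:\n{file_tree}"},
        ],
    )
    return response.choices[0].message.content
```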
- Script Executor
The Script Executor receives the drafted command from the Command Drafter and runs it in a Docker environment equipped with tools such as a bash terminal, Conda, GCC, and Make. After execution, it collects the return messages, including standard output and standard error. Since bash does not report return codes explicitly in its output, we first tried setting predefined prompts in bash and parsing return codes from the outputs and errors. However, many commands lack return codes, making bash feedback less informative. To address this, we integrated an LLM into the executor, instructing it to provide feedback based on the standard output and error messages. We then parse this feedback into a return code: if it is zero, the command executed successfully; otherwise, the command, output, and error messages are logged and sent to the Log Analyzer.
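The LLM-based success check could be sketched as follows; the judging prompt and the docker exec invocation are assumptions for illustration, and the container is assumed to have been started as in the earlier sketch.

```python
import subprocess
from openai import OpenAI

JUDGE_PROMPT = (
    "Given the command, stdout, and stderr below, reply with a single integer: "
    "0 if the command succeeded, 1 otherwise.\n\n"
    "Command: {cmd}\nStdout:\n{out}\nStderr:\n{err}"
)

def execute(container: str, cmd: str, model: str = "gpt-4o") -> tuple[int, str, str]:
    """Run a drafted command inside the repository's container and let an LLM
    judge success from the captured output."""
    proc = subprocess.run(
        ["docker", "exec", container, "bash", "-lc", cmd],
        capture_output=True, text=True,
    )
    out, err = proc.stdout, proc.stderr

    client = OpenAI()
    verdict = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(cmd=cmd, out=out[-4000:], err=err[-4000:])}],
    ).choices[0].message.content.strip()

    code = 0 if verdict.startswith("0") else 1  # parse the LLM feedback into a return code
    return code, out, err
```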
- Log Analyzer
The Log Analyzer takes the log and the command that produced it, combines the two, and checks for outdated packages, missing prerequisites, or script paths that need updating. It identifies any other missing components and returns a curated command.
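A compact sketch of how the analyzer might combine the failed command with its log; the prompt wording is hypothetical.

```python
from openai import OpenAI

ANALYZER_PROMPT = (
    "The following command failed inside a fresh Docker environment.\n\n"
    "Command: {cmd}\nLog (stdout + stderr):\n{log}\n\n"
    "Identify outdated packages, missing prerequisites, or incorrect script paths, "
    "and reply with a corrected command that addresses them."
)

def analyze(cmd: str, out: str, err: str, model: str = "gpt-4o") -> str:
    """Combine the failed command with its log and ask the LLM for a curated command."""
    client = OpenAI()
    return client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": ANALYZER_PROMPT.format(cmd=cmd, log=(out + "\n" + err)[-4000:])}],
    ).choices[0].message.content
```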
- Issue Retriever
The Issue Retriever takes in the command, standard output, and error message and searches them against the issue database we collected from the repository. A RAG pipeline requires a search algorithm to query the input against the database; in our case, the queries are the executed commands, standard outputs, and error messages, and the databases are the issue records of each repository. We experimented with BM25 and Contriever as the retrieval algorithm and chose BM25 for (1) its higher search speed and (2) the fact that error logs and issues generally share keywords, so semantic search does not offer much advantage over lexical search.
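A minimal sketch of the BM25 retrieval step using the rank_bm25 package; the whitespace tokenization and the flat list of issue records are simplifications.

```python
from rank_bm25 import BM25Okapi  # lexical retriever; Contriever was the dense alternative

def build_issue_index(issues: list[str]) -> BM25Okapi:
    """Index a repository's issue records with BM25."""
    return BM25Okapi([issue.lower().split() for issue in issues])

def retrieve_similar_issues(index: BM25Okapi, issues: list[str],
                            cmd: str, out: str, err: str, k: int = 3) -> list[str]:
    """Query the index with the failed command plus its output and error logs."""
    query = f"{cmd} {out} {err}".lower().split()
    return index.get_top_n(query, issues, n=k)
```

Because error logs and issue reports tend to share exact tokens such as package names and error codes, lexical matching is usually sufficient here, which is what motivates BM25 over a dense retriever.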
- Web Searcher
The Web Searcher utilizes the Perplexity API. If the pipeline reaches this stage, the execution has failed; a successful execution is never passed to the Log Analyzer, let alone the Web Searcher. The standard output, standard error, and the failed command are fed to Perplexity to search the web for solutions. The Web Searcher then analyzes the feedback from Perplexity and extracts scripts to resolve the issue.
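A sketch of this web-search step, assuming Perplexity's OpenAI-compatible chat endpoint; the model name and prompt below are placeholders, not the benchmark's actual settings.

```python
import os
from openai import OpenAI

SEARCH_PROMPT = (
    "The command below failed while deploying a GitHub research repository. "
    "Search for known fixes and reply with shell commands that resolve the error.\n\n"
    "Command: {cmd}\nStderr:\n{err}"
)

def web_search_fix(cmd: str, err: str) -> str:
    """Query Perplexity through its OpenAI-compatible endpoint; the model name
    is a placeholder and should be replaced with a current Perplexity model."""
    client = OpenAI(
        api_key=os.environ["PERPLEXITY_API_KEY"],
        base_url="https://api.perplexity.ai",
    )
    response = client.chat.completions.create(
        model="sonar",  # placeholder model name
        messages=[{"role": "user", "content": SEARCH_PROMPT.format(cmd=cmd, err=err[-4000:])}],
    )
    return response.choices[0].message.content
```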
Dataset Construction
We curate a diverse and comprehensive collection of science-related repositories from GitHub. We use GitHub tags to filter repositories relevant to our target science fields, categorizing them into five areas of computer science: computer vision, natural language processing, reinforcement learning, robotics, and data mining. We sort the repositories by star count and manually check the content of each repository to make sure it contains sufficient information to deploy.
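The automatic filtering step can be approximated with GitHub's search API, which exposes tags as topics; the topic names and result counts in this sketch are illustrative, and the manual content check is not reproduced here.

```python
import requests

GITHUB_SEARCH = "https://api.github.com/search/repositories"

def top_repos_by_topic(topic: str, n: int = 30) -> list[dict]:
    """Fetch the most-starred repositories for a GitHub topic, e.g. 'computer-vision'.

    This only reproduces the automatic filtering step; the manual check for
    deployable instructions still follows."""
    response = requests.get(
        GITHUB_SEARCH,
        params={"q": f"topic:{topic}", "sort": "stars", "order": "desc", "per_page": n},
        headers={"Accept": "application/vnd.github+json"},
        timeout=30,
    )
    response.raise_for_status()
    return [
        {"name": item["full_name"], "stars": item["stargazers_count"], "url": item["html_url"]}
        for item in response.json()["items"]
    ]
```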
We also collect repositories related to large language models, because LLM-related work now spans fields such as natural language processing, computer vision, and robotics. We manually select the most representative, research-oriented large language model repositories from the top repositories tagged with “large language model”.
Our benchmark currently includes 100 computer science research repositories to ensure diversity.
Evaluations on Mainstream LLMs
Generally, models showed higher success rates on Setup and Download tasks, with performance tapering off on more complex tasks such as Inference, Evaluation, and Training. This pattern highlights the challenges LLMs face in handling the full deployment process autonomously. The results across execution types demonstrate that while tool-equipped LLMs have made significant strides in automating repository deployment, their ability to manage complex tasks remains limited.
Evaluated models:
- GPT-4o
- Mistral-Large 2
- Llama3.1-70B
- Claude 3.5 Sonnet