Solve Issues in Large Code Repositories
Team
- E/19/236, Lahiru Menikdiwela, email
- E/19/163, Eshan Jayasundara, email
- E/19/007, Achsuthan T., email
Supervisors
Table of Contents
- Abstract
- Related Work
- Methodology
- Experiment Setup and Implementation
- Conclusion
- Publications
- Links
Abstract
The Software Engineering (SWE) Bench is a crucial benchmark for evaluating methods in software engineering problem-solving. Current open-source approaches, such as agentless solutions, SWE agents, AutoCodeRover, SWE Search, and OpenHands, have limitations in efficiency, cost, and accuracy. This research proposes a novel methodology that balances cost-effectiveness with high SWE Bench scores. Our approach integrates an iterative reasoning process, mimicking the learning process of software engineers, and employs a retrieval-augmented generation (RAG) framework leveraging Stack Overflow datasets. We introduce a graph-based repository representation to reduce the search space and improve retrieval accuracy by establishing structured connections between files. Additionally, we incorporate deep Large Language Model (LLM) thinking strategies, inspired by Google’s Mind Evolution research, to learn from incorrect patches through iterative recombination and refinement. Our proposed method aims to achieve higher SWE Bench scores while maintaining a cost-effective and practical approach for real-world software development.
Related Work
Several current approaches to SWE Bench optimization have informed this research:
- Agentless approaches (e.g., Agentless arXiv:2407.01489) are simple and cost-efficient but lack iterative reasoning.
- Agentic approaches (e.g., SWE Agent, AutoCodeRover, SWE Search, OpenHands) offer dynamic actions through tools but suffer from high cost, tool errors, and complexity.
- Graph-based retrieval systems like RepoHyper and RepoGraph represent repositories as graphs for semantic search, though often at function-level granularity which increases complexity.
- Retrieval-Augmented Generation (RAG) systems (e.g., StackRAG) leverage external data like Stack Overflow, improving retrieval accuracy. However, relying on keyword-based StackOverflow APIs has limitations. This work instead uses vector databases for semantic search.
- Recent LLM developments, including models like Claude, GPT, and DeepSeek R1, introduce advanced reasoning and multi-agent debate to improve patch generation accuracy and decision-making.
- Deeper LLM Thinking (arXiv:2501.09891) emphasizes learning from incorrect outputs, informing this project’s strategy for iterative refinement from failed patches.
Methodology
The proposed methodology includes the following key steps:
1. Analysis of the Dataset
- Analyze relationships between keywords (file names, class names, function names) and golden patches.
- Proceed with the retrieval-based implementation only if the golden-patch file falls within the first-level neighbors of the mentioned files at least 85% of the time.
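The first-level-neighbor check above can be sketched as a simple hit-rate computation. The graph and samples below are illustrative placeholders, not real SWE-bench data:

```python
# Sketch: estimate how often the golden-patch file lies within the
# first-level neighbours of the files mentioned in an issue.
# `graph` and `samples` are toy stand-ins for the repository graph
# and annotated SWE-bench instances.

def neighbour_hit_rate(graph, samples):
    """graph: {file: set of neighbour files};
    samples: list of (mentioned_files, golden_patch_file) pairs."""
    hits = 0
    for mentioned, golden in samples:
        candidates = set(mentioned)
        for f in mentioned:
            candidates |= graph.get(f, set())
        if golden in candidates:
            hits += 1
    return hits / len(samples)

graph = {
    "api/views.py": {"api/models.py", "api/serializers.py"},
    "api/models.py": {"core/db.py"},
}
samples = [
    (["api/views.py"], "api/serializers.py"),  # hit via a neighbour
    (["api/models.py"], "api/views.py"),       # miss: no such edge
]
rate = neighbour_hit_rate(graph, samples)
print(rate)  # 0.5 on this toy sample
```

If the measured rate on the real dataset clears the 85% threshold, first-level retrieval is worth pursuing; otherwise deeper traversal (or the artificial stack traces described later) is needed.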
2. Graph-Based Repository Representation
- Construct a graph capturing inter-file relationships (imports, function calls, dependencies).
- Reduces the search space by enabling targeted retrieval based on file connections rather than full-repo searches.
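A minimal sketch of the file-level graph construction, using only Python's standard-library `ast` module (the project itself uses NetworkX; module-to-file resolution is simplified here to a flat repository with top-level imports):

```python
# Sketch: build a file-level import graph for Python sources.
# Assumes a flat repo layout; a real pipeline would also resolve
# packages, function calls, and other dependency edges.
import ast

def import_graph(files):
    """files: {filename: source code}. Returns {filename: set of imported files}."""
    modules = {name.rsplit(".", 1)[0]: name for name in files}
    graph = {}
    for name, src in files.items():
        deps = set()
        for node in ast.walk(ast.parse(src)):
            if isinstance(node, ast.Import):
                for alias in node.names:
                    if alias.name in modules:
                        deps.add(modules[alias.name])
            elif isinstance(node, ast.ImportFrom) and node.module in modules:
                deps.add(modules[node.module])
        graph[name] = deps
    return graph

files = {
    "app.py": "import utils\nfrom models import User\n",
    "utils.py": "import os\n",
    "models.py": "import utils\n",
}
print(sorted(import_graph(files)["app.py"]))  # ['models.py', 'utils.py']
```

Retrieval then only needs to walk outgoing edges from candidate files instead of scanning the whole repository.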
3. Retrieval-Based Suspicious File Identification
- Use three retrieval methods:
  - LLM-Based: Uses a language model to predict relevant files.
  - Embedding-Based: Converts files/descriptions into vectors and computes similarity.
  - LLM + RAG: Incorporates Stack Overflow knowledge to refine results.
- Use DeepSeek R1 as a reasoning model to finalize file selection from multiple retrieval results.
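The embedding-based method reduces to ranking files by vector similarity to the issue description. In practice the vectors would come from a model such as Sentence-BERT; the tiny hand-made vectors below stand in for real embeddings:

```python
# Sketch of embedding-based suspicious-file ranking via cosine similarity.
# The three-dimensional vectors are illustrative placeholders.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def rank_files(issue_vec, file_vecs, k=2):
    """Return the k files whose embeddings are closest to the issue embedding."""
    scored = sorted(file_vecs.items(), key=lambda kv: cosine(issue_vec, kv[1]), reverse=True)
    return [name for name, _ in scored[:k]]

file_vecs = {
    "auth/login.py": [0.9, 0.1, 0.0],
    "db/session.py": [0.1, 0.8, 0.1],
    "ui/forms.py":   [0.0, 0.2, 0.9],
}
issue_vec = [0.8, 0.2, 0.0]  # an issue about a login failure
print(rank_files(issue_vec, file_vecs))  # ['auth/login.py', 'db/session.py']
```

The outputs of the three methods are then handed to the reasoning model, which arbitrates when the candidate lists disagree.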
4. Artificial Stack Trace Generation
- If patch files aren’t consistently found through graph traversal, generate artificial stack traces to guide retrieval.
- Expands the graph level-by-level, feeding context into the LLM to mimic logical execution paths.
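The level-by-level expansion can be sketched as a breadth-first traversal from the files mentioned in the issue, where each ring of newly reached files would be fed to the LLM as additional context. The graph below is a placeholder:

```python
# Sketch: expand the repository graph one BFS level at a time,
# mimicking how a stack trace walks through call sites.

def expand_levels(graph, seeds, max_level):
    """Yield (level, files newly reached at that level)."""
    seen = set(seeds)
    frontier = list(seeds)
    for level in range(1, max_level + 1):
        nxt = []
        for f in frontier:
            for n in graph.get(f, ()):  # follow outgoing edges
                if n not in seen:
                    seen.add(n)
                    nxt.append(n)
        if not nxt:
            break
        yield level, sorted(nxt)
        frontier = nxt

graph = {
    "cli.py": ["runner.py"],
    "runner.py": ["worker.py", "log.py"],
    "worker.py": ["db.py"],
}
for level, files in expand_levels(graph, ["cli.py"], 3):
    print(level, files)
# 1 ['runner.py']
# 2 ['log.py', 'worker.py']
# 3 ['db.py']
```

Each emitted level approximates one frame of an artificial stack trace, giving the LLM a plausible execution path from the entry point toward the buggy file.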
5. RAG with Stack Overflow Data
- Integrates Stack Overflow knowledge using a semantic search via vector databases.
- Enhances context during retrieval and patch generation, reducing hallucination and improving factuality.
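Once relevant snippets are retrieved (in the real system, from a vector store such as ChromaDB), they are folded into the prompt sent to the patch-generation LLM. A minimal sketch of that assembly step, with stubbed snippets:

```python
# Sketch of the RAG prompt-assembly step. The retrieved snippets are
# stand-ins; the prompt wording is illustrative, not the project's
# actual template.

def build_rag_prompt(issue, snippets, max_snippets=2):
    """Fold the top retrieved Stack Overflow snippets into the patch prompt."""
    context = "\n---\n".join(snippets[:max_snippets])
    return (
        "You are fixing a bug in a large repository.\n"
        f"Relevant Stack Overflow context:\n{context}\n\n"
        f"Issue:\n{issue}\n\nPropose a patch."
    )

snippets = [
    "Q: TypeError on None... A: guard with `if value is not None`.",
    "Q: Django queryset empty... A: check the filter kwargs.",
]
prompt = build_rag_prompt("Login view raises TypeError when user is None", snippets)
print(prompt.count("A:"))  # 2
```

Capping `max_snippets` keeps the prompt short, which controls token cost while still grounding the model in community-verified fixes.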
6. Patch Generation with Multiple LLMs and Iterative Learning
- Combine multiple LLMs (e.g., Claude, GPT, DeepSeek) for diverse patch generation.
- Learn from failed patches using iterative refinement and reasoning, improving overall patch quality.
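The refinement loop can be sketched as repeated generate-score-recombine rounds, loosely following the Mind Evolution idea: failed candidates feed back as hints for the next round. The generator and scorer below are toy stand-ins for LLM patch generation and test execution (a "patch" is an integer guess, the hidden bug is at 7):

```python
# Sketch of iterative learning from failed patches. `generate` and
# `score` stand in for an LLM call and a test run respectively.
import random

def refine(generate, score, rounds=3, pool=4, seed=0):
    rng = random.Random(seed)
    feedback = []  # failed candidates from the previous round
    best, best_score = None, float("-inf")
    for _ in range(rounds):
        candidates = [generate(feedback, rng) for _ in range(pool)]
        scored = sorted(((score(c), c) for c in candidates), reverse=True)
        top_score, top = scored[0]
        if top_score > best_score:
            best, best_score = top, top_score
        # everything below the round winner becomes feedback
        feedback = [c for s, c in scored if s < top_score]
    return best, best_score

TARGET = 7  # the "correct patch" in this toy setting
def toy_generate(feedback, rng):
    guess = rng.randint(0, 10)
    while guess in feedback:  # avoid repeating known failures
        guess = rng.randint(0, 10)
    return guess

def toy_score(patch):
    return -abs(patch - TARGET)  # 0 means all tests pass

best, best_score = refine(toy_generate, toy_score)
```

In the real pipeline, `score` would run the repository's test suite against the candidate patch, and `feedback` would carry the failing patches plus their error messages back into the next generation prompt.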
Architecture
Experiment Setup and Implementation
- A graph-based repository model is created using technologies like NetworkX and visualized using Gephi.
- Vector databases (e.g., ChromaDB) and semantic search tools (e.g., Sentence-BERT) are used to support embedding-based retrieval.
- LLMs used: GPT-4, Claude, DeepSeek R1 for both retrieval and patch generation.
- The decision-making model DeepSeek R1 is used to select the best suspicious file.
- Stack Overflow data is integrated into a RAG framework using LlamaIndex, BeautifulSoup, and StackAPI.
- Evaluation is based on the SWE-bench framework, focusing on:
  - Retrieval accuracy (targeting >82%)
  - Patch validity
  - Cost-efficiency
- Artificial stack traces are generated if retrieval fails to find correct files initially.
Conclusion
- Introduces two key innovations:
  - Graph-Based Retrieval: Efficient, accurate retrieval by modeling inter-file relationships in a repository.
  - Iterative Learning from Incorrect Patches: Enhances patch quality by analyzing and refining failed attempts.
- These innovations mimic real-world debugging processes and lead to:
  - Improved patch generation accuracy
  - Efficient bug localization
  - Reduced computational costs
- The proposed approach serves as a cost-effective, high-performance solution for automated debugging and patch generation in large codebases evaluated via the SWE Bench framework.