Fixing SWE-bench: A Smarter Way to Evaluate Coding AI

LLMs show strong coding abilities, but they have mostly been tested on isolated, small-scale problems, like the basic Python tasks in the well-known HumanEval and MBPP benchmarks. This leaves their proficiency on realistic, large-scale projects largely unexplored. As these models become more integrated into professional coding workflows, the big question is: Are they truly ready to assist in real-world software development?
SWE-bench was created to address this gap, serving as an industry-standard benchmark used by tech leaders like OpenAI, Anthropic, DeepSeek, and Amazon to test models in realistic coding environments. However, SWE-bench has significant limitations that skew evaluations, making it difficult to assess an LLM’s true capabilities.
In this article, we’ll break down the shortcomings of SWE-bench, explain how to clean and refine the dataset to ensure fair, reproducible assessments, and outline the next steps toward an even more comprehensive, accurate, and scalable benchmark.
What is SWE-bench?
SWE-bench was designed to answer a recurring question: Can LLMs solve real GitHub issues as effectively as human developers? Unlike coding competition tasks that focus on isolated problems, real-world software development entails debugging large codebases, analyzing issue reports, and making coordinated changes across multiple files. SWE-bench requires models to understand an issue within the broader context of a software project, implement fixes, and verify the solution through testing.
The SWE-bench dataset is built from real GitHub issues and their corresponding bug-fixing commits. Each sample includes a problem description or feature request, alongside the pull request containing the code changes and unit tests that address the issue. LLMs are tasked with using the issue description and the surrounding codebase to generate a patch that resolves the issue. If the unit tests pass after the patch is applied, the sample is marked as correctly solved. This data collection approach is a more realistic way to capture the complexities and nuances of maintaining large-scale software projects.
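To make the data format concrete, here is a minimal sketch of what a single sample looks like. The field names follow the public SWE-bench release (the FAIL_TO_PASS and PASS_TO_PASS test lists are discussed later in this article), but every value below is a placeholder rather than real data.

```python
# Illustrative shape of one SWE-bench sample; values are placeholders, not real data.
sample = {
    "instance_id": "example-org__example-repo-1234",   # "<owner>__<repo>-<PR number>"
    "repo": "example-org/example-repo",                # GitHub repository the issue comes from
    "base_commit": "<SHA of the commit the model starts from>",
    "problem_statement": "<text of the GitHub issue or feature request>",
    "patch": "<gold diff from the merged pull request that fixed the issue>",
    "test_patch": "<diff adding the unit tests that exercise the fix>",
    "FAIL_TO_PASS": ["<tests that fail before the fix and must pass after it>"],
    "PASS_TO_PASS": ["<tests that must keep passing after the fix>"],
}
```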
SWE-bench evaluations are in high demand among model developers. However, the benchmark needs to be cleaned up and extended with Dockerfiles before it’s ready to apply in auto-evals.
Where the current benchmark falls short
Although SWE-bench provides a strong foundation, it does have some shortcomings.
Unfit tests. Some tests are irrelevant or overly specific, leading to many false negatives. For example, a test might reject a correct solution because it differs slightly from the reference solution.
Vague or ambiguous problem descriptions. The tasks in this dataset aren't always clear, making it difficult for the model to understand the problem or determine what needs to be solved. This ambiguity causes models to generate incorrect solutions due to misinterpreting requirements.
Inconsistent development environments. A patch that works in one setting might fail in another—not due to flaws in the solution, but because of differences in setup (specific dependencies, configurations, and system settings). These inconsistencies can lead to misleading test results, where failures stem from mismatched environments rather than the model's actual performance.
Outdated dataset. The latest pull request in SWE-bench is dated 2023. For Python, this is ancient—leading to irrelevant evaluations or misleading claims about model performance based on older tasks that don’t reflect the modern coding landscape. The benchmark needs to be updated to keep up with evolving best practices, updates, and deprecations.

Lack of automated environment replication. The main problem with SWE-bench is that it can't be used for automatic evaluation without first manually setting up the environment for each repository, which is resource-intensive and time-consuming (research shows that manually creating functional environments for each repository can take an average of 10 hours). One possible solution is creating a Dockerfile for each data point, but this still requires significant time and effort, with obstacles like missing dependencies, unsupported package versions, and conflicts between old and new system libraries.
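As a rough illustration of the Dockerfile approach, the sketch below builds a per-sample image and runs the sample's tests inside it. The image tag, directory layout, and `run_tests.sh` script are assumptions made for this example; they are not part of SWE-bench itself.

```python
import subprocess
from pathlib import Path

def evaluate_sample(sample_dir: Path, patch_file: Path) -> bool:
    """Build one sample's environment image and run its tests against a candidate patch.

    Assumes sample_dir contains a hand-written Dockerfile whose WORKDIR is the
    repository checkout, plus a run_tests.sh script; both are illustrative
    conventions for this sketch, not SWE-bench requirements.
    """
    image_tag = f"swe-eval-{sample_dir.name}"

    # Build the isolated environment described by the per-sample Dockerfile.
    subprocess.run(["docker", "build", "-t", image_tag, str(sample_dir)], check=True)

    # Apply the candidate patch and run the tests inside the container.
    result = subprocess.run(
        ["docker", "run", "--rm",
         "-v", f"{patch_file.resolve()}:/patch.diff:ro",
         image_tag,
         "bash", "-c", "git apply /patch.diff && ./run_tests.sh"],
    )
    return result.returncode == 0
```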
To put things in perspective, the current best-performing agent, powered by Claude 3.5, can solve only 29% of the issues. Acknowledging the benchmark’s limitations, we must ask ourselves: Are the models really that far from independently handling compound coding tasks, or are the evaluation standards skewed?
The weaknesses in SWE-bench make model assessments unreliable. Yet, there are ways to address the shortcomings and yield a benchmark that truly mirrors the challenges of real-world software development. My team at Toloka took the first step to upgrade SWE-bench with a thorough cleanup, and laid out plans for future work that will lead to a new and improved benchmark.
Cleaning up the SWE-bench dataset
Developing benchmarks that evaluate code quality and problem-solving skills requires a deep understanding of how software works across different environments. Drawing on this expertise, the Toloka team collaborated with coding experts in our network of data annotators, giving them stringent guidelines to refine the dataset and create structured, multi-turn code-editing demonstrations. The coding experts assessed training samples based on problem specification, test coverage, and correctness, carefully annotating or rejecting samples to maintain data quality.
Our two-step approach is designed to filter out low-quality samples and enhance benchmark reliability.
Step 1. Evaluate the quality of the samples.
The first step is to identify samples that contain unsuitable tests or confusing descriptions. Annotators rate each sample on three criteria: clarity of the issue, test relevance, and difficulty.
Clarity looks at how well the problem is explained. Some issues are clearly defined, while others are so vague they're nearly impossible to understand.
Test relevance checks whether the tests are fair and adequate: do they include all valid solutions, or are they too specific or not specific enough?
Difficulty estimates how long it would take a professional to solve the problem. This ranges from quick fixes (under 15 minutes) to complex tasks (over 4 hours).
Based on the scores, problematic samples are removed from the dataset. Coding experts also provide feedback on aspects that may have been overlooked and flag other significant issues they see in the samples. The final step is to scan for false negatives caused by test misalignment and unclear descriptions. As a result, the final dataset only includes samples that meet a minimum quality threshold.
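Once the annotations are collected, the rubric above can be applied mechanically. The sketch below shows one way such a filter might look; the score scales, field names, and cutoff values are hypothetical, not Toloka's actual rubric.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    clarity: int               # 1 (nearly impossible to understand) to 5 (clearly defined)
    test_relevance: int        # 1 (rejects valid solutions) to 5 (fair and adequate)
    difficulty_hours: float    # estimated expert time to solve, in hours
    flagged_issues: list[str] = field(default_factory=list)  # other problems spotted by the expert

def keep_sample(annotation: Annotation) -> bool:
    """Return True if the sample clears the (hypothetical) quality threshold."""
    if annotation.flagged_issues:        # any expert-flagged problem rejects the sample
        return False
    return annotation.clarity >= 4 and annotation.test_relevance >= 4
```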
Step 2. Reject low-quality samples and complete missing fields for non-rejected samples.
After removing low-quality samples, the next step is to test the solution patches from the remaining samples. Annotators create a Dockerfile that sets up the environment corresponding to each issue. They then apply the solution patch to confirm it functions correctly, eliminating the possibility of tests failing due to a setup error. Once the environments are configured and running, annotators run the unit tests to check whether the proposed solution solves the problem.
A sample is considered correct only if it passes both types of unit tests:
FAIL_TO_PASS: These tests fail before the solution is implemented and pass afterward. Passing the FAIL_TO_PASS tests confirms that the issue has been fixed.
PASS_TO_PASS: These tests pass before and after applying the code changes, ensuring the proposed solution doesn't introduce any new bugs to the codebase.
In summary, the final dataset consists of high-quality samples that successfully pass both the FAIL_TO_PASS and PASS_TO_PASS tests.
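Put together, the two checks amount to a small verification routine like the sketch below, where `run_test` is a hypothetical helper that executes one named test inside the sample's prepared Docker environment, with or without the solution patch applied.

```python
def verify_sample(sample: dict, run_test) -> bool:
    """Accept a sample only if its gold patch flips FAIL_TO_PASS and preserves PASS_TO_PASS.

    run_test(test_name, with_patch) -> bool is a hypothetical callable that runs a
    single test inside the sample's Docker environment.
    """
    for test in sample["FAIL_TO_PASS"]:
        # Must fail on the unpatched code and pass once the fix is applied.
        if run_test(test, with_patch=False) or not run_test(test, with_patch=True):
            return False
    for test in sample["PASS_TO_PASS"]:
        # Must pass both before and after, so the fix introduces no regressions.
        if not run_test(test, with_patch=False) or not run_test(test, with_patch=True):
            return False
    return True
```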
The outcome and benefits of dataset cleanup
The resulting dataset has 5,000 clean samples that are more suitable for assessing LLM coding skills than the original SWE-bench. This thorough cleanup has several significant advantages that directly benefit model evaluation, particularly when testing advanced coding assistants like GitHub Copilot, Cursor AI, or open-source models available through Ollama.
First, it allows for more realistic model evaluation through clearly defined tasks and relevant scenarios. The LLMs are tested in real-world conditions, leading to more accurate assessments of their abilities.
This method introduces another key benefit: reproducibility. With standardized testing environments, results will be consistent no matter where or when we run the tests. This makes it easier to compare performance across different LLMs and monitor progress over time.
All these adjustments contribute to increased credibility. Benchmarks that are user-friendly, conducive to realistic evaluations, and reproducible are inherently more trustworthy within the research community. The cleaned SWE-bench is an invaluable tool for evaluating models in software engineering.
Further improvements and the prospect of a new benchmark
Although cleaning up the dataset is a big step forward, there is still more to be done. At Toloka, we are leveraging our extensive network of coding experts to improve SWE-bench's usability and expand it into an entirely new benchmark. Let's look at some of the proposed adaptations.
Add everyday coding tasks. The SWE-bench dataset doesn't represent the diverse real-world tasks enterprise developers face on a daily basis. Aiming to cover this gap, our research team is working on an extended version of SWE-bench that includes tasks from the enterprise world. After all, tasks should reflect developers’ typical needs when they use GitHub Copilot, Cursor AI, Codeium or other popular coding assistants.
Include diverse prompts. SWE-bench mainly focuses on one type of prompt, restricting the evaluation of crucial programming skills. Adding a wider array of prompts, such as code translations, analytical tasks, and bare code generation, will expand the evaluated skill set and provide a more rounded view of LLMs' coding abilities.
Expand to include more programming languages. Currently, there’s SWE-bench for Python, SWE-bench Multimodal for JavaScript, and SWE-bench-java… well, for Java. Broadening the range to other programming languages like C# and TypeScript will create a versatile benchmark that supports testing on diverse use cases. Later, industry-specific languages like Fortran, COBOL, Lisp and C could be added.
Increase dataset size. The current SWE-bench has 21,527 samples, with only 5,000 approved by human annotators. Augmenting the dataset with new data will make it more comprehensive.
Address the static nature of the dataset. Another concern is the dataset's static nature, which does not account for new coding practices and can quickly become outdated. The dataset will need to be continuously updated to ensure it stays relevant to modern developments in Python, JavaScript and Java.
Create more suitable data for assessing AI agents. Finally, creating new data designed explicitly for assessing AI agents will reflect the reality of modern AI development. This includes implementing feedback loops in which agents use output logs and error messages to refine their solutions, as well as capturing the agents' reasoning processes similar to the chain-of-thought approach used by Nebius.
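As a rough sketch of such a feedback loop, the snippet below lets an agent revise its patch using captured test output; `generate_patch` and `run_tests` are hypothetical stand-ins for an LLM agent and a sandboxed test runner.

```python
def solve_with_feedback(issue: str, generate_patch, run_tests, max_rounds: int = 3):
    """Iteratively refine a patch using test logs and error messages as feedback.

    generate_patch(issue, feedback) and run_tests(patch) are hypothetical callables
    standing in for an LLM agent and a sandboxed test runner, respectively.
    """
    feedback = ""
    for _ in range(max_rounds):
        patch = generate_patch(issue, feedback)       # agent proposes a fix
        passed, logs = run_tests(patch)               # logs: captured output and error messages
        if passed:
            return patch                              # issue resolved
        feedback = logs                               # feed failures into the next attempt
    return None                                       # unresolved within the round budget
```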
Why it matters
Refined benchmarks offer a clearer, more accurate picture of how software development tools and models perform. By making evaluations more accessible and reliable, they empower researchers and developers to make well-informed decisions. Our work enhances the quality of benchmarking data, ensuring stronger, more meaningful assessments of proprietary and open-source models, whether you’re considering the DeepSeek model family, Anthropic’s Claude, Google’s Gemini, or even Alibaba’s Qwen.
At Toloka, our deep expertise in data annotation and benchmarking allows us to extend these improvements beyond SWE-bench. We look forward to opportunities to refine other benchmarks as we deliver high-quality, trustworthy data across a broad range of use cases and industry domains.
Ready to take your LLM's coding skills to the next level?
Improve your model assessments with our specialized coding data solutions.
Updated: Mar 17, 2025