Results indicate that the real-world freelance work in our benchmark remains challenging for frontier language models. The best performing model, Claude 3.5 Sonnet, earns $208,050 on the SWE-Lancer Diamond set and resolves 26.2% of IC SWE issues; however, the majority of its solutions are incorrect, and higher reliability is needed for trustworthy deployment.
That’s still kind of crazy if it can actually do a meaningful portion of real-world software jobs by itself.
With the caveat that the majority of its “solutions” are wrong. So it generates output that looks plausible enough to be accepted as an answer but is not actually correct. That’s pretty much par for the course for LLMs.
The lack of precision may be acceptable for a chatbot or a summarizer. But for coding you need precision, and that’s something LLMs don’t offer.