Mid-last year, I had GPT (maybe 5.0 or 5.1) try to find the source of a bug. Naturally, this code didn’t have tests and git bisect wouldn’t work, and it was a UI interaction bug that I’m not even really qualified to write a test for, so I asked Codex to bisect between dates X and Y to find the commit that introduced this bug. Codex immediately told me the offending commit was after this date range (which couldn’t possibly be correct). When I told Codex this was wrong, it pointed, once or twice, to some other commit that was obviously also not the offending commit. When I told it those were wrong too, it pointed to a plausible-looking commit. When I asked it to prove or disprove its theory, it told me that it had written a test and confirmed that the alleged commit was the breaking commit.

I then asked it to show me by making a video with the full developer end-to-end stack in the normal browser test environment. It claimed that it didn’t have permissions to do that (which was a lie), but that it could make a video of the repro running before and after the commit in Playwright with the appropriate test code. The video was convincing and showed the feature working properly before the commit and failing after the commit. Something about this didn’t feel right, so I tried reproducing the issue by hand before and after the commit and found out that the whole thing was a fabrication. The video made it look like Codex had reproduced the bug, but it was recorded in an artificial browser environment designed to create a fake repro, not in the real environment.

Like I said, because this was unironically such a great experience, I immediately thought to myself, “how can I get more of this?”

  • x1gma
    14 hours ago

    I’m not assuming he’s not competent, and I’ve looked him up - he’s by no means incompetent. But he himself said he’s not qualified to write tests for that. If you cannot write tests for whatever you’re doing, you shouldn’t be doing that. Someone with his knowledge, or at least the knowledge he should have given his CV, should know that. In this specific case he is incompetent, because what he’s doing is simply wrong on every level.

    You don’t need to be an expert on what you’re doing to use LLMs efficiently. You can also use solid prompts and ideas to have an LLM cancel out your personal lack of knowledge in a specific domain. In any case, expecting LLMs to produce correct output when you’re actively guiding them to do something wrong is simply stupid.

    Any claim of actual intelligence in an LLM is simply not true. Never has been, never will be. Artificial intelligence is an umbrella term for ANI, AGI and ASI: artificial narrow, general and super intelligence, respectively. A narrow intelligence is not even close to human intelligence, and is hyper-specialized in a single task. Any and all LLMs are and always will be ANIs, and their hyper-specialization is basically stochastic word (well, token) completion on steroids. An AGI is mostly defined as “close to” or “approaching” human intelligence, as in having general knowledge and being able to transfer it into unrelated fields.

    None of this, neither reasoning nor capabilities, will help you when you guide it in the wrong direction. You need to keep in mind the absolutely mind-blowing amount of money involved in LLMs. The bubble is too big to fail. Any LLM is a product, and its first and foremost goal is to make you use it, so you pay for it - therefore the primary directive of the AI is to give you what you ordered, to glaze you, and to be your best, obedient buddy. You want a video of the bug? Of course! Here you have a video of what that bug looks like - stochastically, that’s the answer to the prompt.