• jj4211
    11 hours ago

    Not just automated testing. For CodeGen to really work in an ‘agentic’ way:

    • You need that automated test case to trigger the misbehavior 100% of the time. (Often, figuring out how to trigger the misbehavior means you already know the fix, but not always.)
    • That automated test case needs to be succinct and, as much as possible, feed only the problematic output back to the CodeGen. CodeGen can easily get distracted by irrelevant input.
    • That automated test needs to be very quick from code change to test case completion. Even with everything just right, expect the CodeGen to basically thrash around guessing things that sound right but to no avail. Most attempts can be summed up as: "Ok, the problem is absolutely caused by <plausible but useless prose>, and here is the definite fix <code changes that do nothing at all for the error> and it is complete but just double checking… <test fails> Ok, that didn’t quite fully fix it… see next attempt." So a long test case can make it take an eternity, as the CodeGen has to wait and run it over and over and over again, while a human might actually reason through it.
    • You need to let the token hose go. It’s guessing and it can take quite a few guesses to get right.
    • Be prepared for pointless code changes along the way. It makes guesses and often leaves the wrong guesses in, doing nothing at all to help the problem, but potentially having side effects. It decides that while it didn’t work, it must have been a part of the solution, and that it must be left in.
    • Consequently, you had better have an amazing test suite to catch the likely side effects of those spurious changes, or be prepared to unwind the progress and extricate the result manually.
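To make the "quick test, only the problematic output" points above concrete, here is a rough Python sketch of a reproducer wrapper. Everything in it is an assumption for illustration: the names `extract_failure` and `run_repro`, the failure markers, and the 30 second budget are made up, so tune them to whatever your test runner actually prints.

```python
import re
import subprocess

# Hypothetical failure markers; adjust to your own test runner's output.
FAILURE_PATTERN = re.compile(r"(FAIL|ERROR|Traceback|AssertionError)")


def extract_failure(output: str, context: int = 2, max_lines: int = 40) -> str:
    """Keep only lines near a failure marker, capped at max_lines."""
    lines = output.splitlines()
    keep = set()
    for i, line in enumerate(lines):
        if FAILURE_PATTERN.search(line):
            start = max(0, i - context)
            stop = min(len(lines), i + context + 1)
            keep.update(range(start, stop))
    return "\n".join(lines[i] for i in sorted(keep)[:max_lines])


def run_repro(cmd: list[str], timeout_s: float = 30.0) -> tuple[bool, str]:
    """Run the reproducer; return (passed, trimmed output for the agent)."""
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # A slow reproducer is itself a problem worth reporting.
        return False, f"reproducer exceeded {timeout_s}s; speed it up first"
    if proc.returncode == 0:
        return True, ""
    return False, extract_failure(proc.stdout + proc.stderr)
```

The hard timeout is the part that enforces "very quick": if the reproducer can't finish inside the budget, that is worth fixing before letting the agent loop on it.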
    • pixxelkick
      2 hours ago

      Absolutely 100% all of this, though with a lot of other tricks, like caveman mode, careful skill files, and helper scripts that let the agent quickly and surgically extract just the useful output, you can substantially reduce token burn and improve its memory.

      It also helps to have it carefully roll back changes every time a fix doesn’t work, and to have it keep a markdown log file of each fix it tried and the results, so it can review everything it tried previously.
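For what it's worth, the rollback-and-log loop can be sketched in a few lines of Python. This is only an illustration under assumptions: `attempt_fix`, `log_attempt`, and `git_rollback` are hypothetical names, the log layout is made up, and the rollback assumes the work happens in a git work tree with the last good state committed.

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable


def git_rollback() -> None:
    """Discard uncommitted edits to tracked files (assumes a git work tree).

    Note: untracked files created by the attempt are left in place.
    """
    subprocess.run(["git", "reset", "--hard", "HEAD"], check=True)


def log_attempt(log_path: Path, description: str, passed: bool, output: str) -> None:
    """Append one attempt to a markdown log the agent can re-read later."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%SZ")
    result = "fixed" if passed else "did not fix; rolled back"
    entry = f"## {stamp}: {description}\n\n- Result: {result}\n"
    if output:
        # Indent the output so markdown renders it as a literal block.
        indented = "\n".join("    " + line for line in output.splitlines())
        entry += f"- Test output:\n\n{indented}\n"
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(entry + "\n")


def attempt_fix(
    description: str,
    apply_fix: Callable[[], None],
    run_test: Callable[[], tuple[bool, str]],
    rollback: Callable[[], None],
    log_path: Path,
) -> bool:
    """Apply one candidate fix, test it, and undo it if the test still fails."""
    apply_fix()
    passed, output = run_test()
    if not passed:
        rollback()  # never leave a failed guess in the tree
    log_attempt(log_path, description, passed, output)
    return passed
```

Passing `rollback` in as a callable keeps the loop testable without a real repository; in practice you would pass `git_rollback` or whatever undo mechanism your setup uses.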