• @Stovetop
    English
    16
    9 months ago

    This is only going to be adding recent Reddit data.

    A growing share of that, I would wager, is already the product of LLMs simulating genuine content while selling something. It’s going to corrupt itself over time unless they figure out how to filter LLM-generated content out of the input.

    • @kromem
      English
      7
      9 months ago

      It’s not really. There is a potential issue of model collapse when training on only synthetic data, but the same research on model collapse found that a mix of organic and synthetic data performed better than either alone. Additionally, that research, for cost reasons, used weaker models than what’s typically being used today, and separate research has shown that you can enhance models significantly using synthetic data from SotA models.

      The actual impact on future models will be minimal, and given the research to date, at least a bit of a mixture is probably even a good thing for future training.