• @[email protected]
    link
    fedilink
    English
    149 months ago

    People who want to train AI models on Reddit content can just scrape the site, or use data from sites that archive Reddit content.

    • AnyOldName3
      7 points · 9 months ago

      The archive sites used to use the API, which is another reason Reddit wanted to get rid of it. I always found them a great moderation tool: users would edit their posts so they no longer broke the rules before claiming a rogue moderator had banned them for no reason, and there was no way within Reddit itself to prove them wrong.

        • AnyOldName3
          2 points · 9 months ago

          Yeah, the Wayback Machine doesn’t use Reddit’s API, but on the other hand, I’m pretty sure they don’t automatically archive literally everything that makes it onto Reddit - doing that would require the API to tell you about every new post, as just sorting /r/all by new and collecting every link misses stuff.

          • @[email protected]
            link
            fedilink
            English
            29 months ago

            You don’t need every post, just a collection big enough to train an AI on. I imagine it’s a lot easier to get data from the Internet Archive (whose entire mission is historical preservation) than from Reddit.

            The thing I’m not sure about is licensing, but it seems like that’d be the case for the whole AI industry at the moment.