Some of the world’s wealthiest companies, including Apple and Nvidia, are among the many parties that allegedly trained their AI models on data scraped from YouTube videos. The transcripts were reportedly collected through means that violate YouTube’s Terms of Service, and the practice has some creators seeing red. The news was first reported in a joint investigation by Proof News and Wired.

While major AI companies and producers often keep their training data secret, heavyweights like Apple, Nvidia, and Salesforce have revealed their use of “The Pile”, an 800GB training dataset created by EleutherAI, and the YouTube Subtitles dataset within it. The YouTube Subtitles training data is made up of 173,536 plaintext transcripts scraped from the site, including transcripts from more than 12,000 videos that have been removed since the dataset’s creation in 2020.

Affected parties whose work was purportedly scraped for the training data include education channels like Crash Course (1,862 videos taken for training) and Philosophy Tube (146 videos taken), YouTube megastars like MrBeast (two videos) and Pewdiepie (337 videos), and TechTubers like Marques Brownlee (seven videos) and Linus Tech Tips (90 videos). Proof News created a tool you can use to search the full list of YouTube videos allegedly used without consent.

  • @[email protected]
    link
    fedilink
    English
    314 months ago

    Because the techbros know that licensing is far more expensive than theft.

    Licensing the content that the AI model they’re trying to shit out needs would cost so much money that it’d literally never be profitable, so they’re doing that thing from Fight Club where they bet that the number of times they’ll get sued and lose will cost less than paying anyone reasonable license fees.

    The stupid thing is that, in the US at least, they’re not wrong: in a civil suit over this you have to pay your own lawyer fees, and since this would be a federal case, that ends up being pretty expensive.

    And even if you win, you’re likely just going to get statutory damages, since proving real actual losses is probably impossible. You’d be lucky, after a few years in court, to end up coming out ahead - while having to pay all the legal and other costs in the meantime - so why would you bother?

    It’s a pretty shitty situation that’s being exploited because the remedies are out of the reach of most people who’ve had their shit stolen so that OpenAI can suggest you cover your pizza with glue.

    • @BertramDitore · 4 months ago

      Thank you for the thoughtful answer. This is so frustrating, and is very similar to other situations where megacorps decide that paying fines is cheaper than following the law.

      Another terrible byproduct of all this is the false incentive structure it sets up. Rather than investing in people who are capable of producing unique and creative work, it incentivizes churning out a greater quantity of shitty content instead of high-quality stuff, and that will ultimately make the eventual consumer product that’s based on shitty stolen work, well, shitty.

      • @[email protected]
        link
        fedilink
        English
        64 months ago

        It makes people who WANT to make creative content decide maybe they shouldn’t, or they do things like disabling subtitles so AI won’t steal their content that way, which is a usability issue.

        If you know that your photos, stories, videos, and whatever else are going to be slurped up so someone else can make money on them, it makes sharing less attractive.

        • @rekorse · 4 months ago

          What shocks me is that creative folk aren’t abandoning Google in droves. Even a boycott would make more sense than mildly complaining and then posting more videos on YouTube and waiting for your Google check to arrive.

          At some point the artists need to stand up for themselves; it can’t just be all the tech bros on Lemmy shouting about it. Feels a lot like people who hate their jobs but do nothing to find a better one.

          • @[email protected]
            link
            fedilink
            English
            14 months ago

            It’s all about monetization. YouTube is the only credible game in town, and I’m not sure how you fix that.

            The technical hurdles are largely solved: something like PeerTube is good enough, except there’s no clear path to monetization and no clear path to growing an audience.

            If the money and discoverability problems are solved, then sure, I bet a lot of creators will happily leave Google’s services, since it’s been an abusive relationship for a lot of them for some time.

            • @rekorse · 4 months ago

              Like I said, they should strike. Has everyone forgotten that strikes are almost always short-term sacrifices? Otherwise people would strike for fun…

              There won’t be another place to go until artists go there and build it themselves or demand it be built.

              There’s also a good chance that the monetization scheme people are used to under Google is fiscally irresponsible. People posting on YouTube might need to come to terms with their art being worth less outside Google’s system.

      • subignition · 4 months ago

        We need a corporate death penalty for flagrant and repeated disregard of the law like this.

        Oh, you “moved fast and broke things”? Well, that included the law, so now we’re liquidating your assets, compensating the injured parties to the fullest extent, and spending whatever’s left over on putting homeless people in homes.

        • @BertramDitore · 4 months ago

          Hard agree. It’s the only kind of death penalty I could get behind.