Meta has quietly unleashed a new web crawler to scour the internet and collect data en masse to feed its AI model.

The crawler, named the Meta External Agent, was launched last month, according to three firms that track web scrapers and bots across the web. The automated bot essentially copies, or “scrapes,” all the data that is publicly displayed on websites, for example the text in news articles or the conversations in online discussion groups.

A representative of Dark Visitors, which offers a tool for website owners to automatically block all known scraper bots, said Meta External Agent is analogous to OpenAI’s GPTBot, which scrapes the web for AI training data. Two other entities involved in tracking web scrapers confirmed the bot’s existence and its use for gathering AI training data.

While close to 25% of the world’s most popular websites now block GPTBot, only 2% are blocking Meta’s new bot, data from Dark Visitors shows.

Earlier this year, Mark Zuckerberg, Meta’s cofounder and longtime CEO, boasted on an earnings call that his company’s social platforms had amassed a data set for AI training that was even “greater than the Common Crawl,” an entity that has scraped roughly 3 billion web pages each month since 2011.

  • Admiral Patrick
    3 months ago

    Ugh, fuck these and their tech bro creators so much. Not only is “AI” enshittifying everything it touches, it’s even passively fucking up things it can’t touch.

    With the line needlessly blurring between search engines and LLMs, and sites rightfully blocking AI scraper bots, I fully believe we’re on the cusp of a digital dark age. If you think search engines suck now, just wait until very little of the quality content on the internet is indexable because people don’t want it scraped for training data. Or if it is indexed, the actual content is locked up, requiring registration or otherwise no longer being easily accessible.

    These “AI” tech bros are basically strip mining the internet while shitting where they eat (and maybe also pissing in the pool if I haven’t mixed enough metaphors for your liking). They’re exploiting what makes the internet great while simultaneously ruining it for the future.

    For as long as search engines have existed, we had a deal going: search providers could crawl and index site data and show ads to support themselves and in exchange, sites gained visibility. Now they’re using those same scrapers to steal content for their own purposes while depriving the sources of traffic. They have broken the deal, and with it, the fundamental way the internet has worked for over 30 years.

    I say it again: Fuck these AI-pushing tech bros and the horses they rode in on.

    • @[email protected]
      3 months ago

      “strip mining the internet”

      That’s such a wonderfully succinct way to describe the arc of tech companies over the last decade and a half.

      And even earlier than that, I miss the days of actually “surfing” the net. Start on one page you know and get farther and farther down into webrings and personal pages linking to each other. Could really find some awesome things tucked away way back when.