Meta has quietly unleashed a new web crawler to scour the internet and collect data en masse to feed its AI model.
The crawler, named the Meta External Agent, was launched last month, according to three firms that track web scrapers and bots across the web. The automated bot essentially copies, or “scrapes,” all the data that is publicly displayed on websites, such as the text of news articles or the conversations in online discussion groups.
A representative of Dark Visitors, which offers a tool for website owners to automatically block all known scraper bots, said Meta External Agent is analogous to OpenAI’s GPTBot, which scrapes the web for AI training data. Two other entities involved in tracking web scrapers confirmed the bot’s existence and its use for gathering AI training data.
While close to 25% of the world’s most popular websites now block GPTBot, only 2% are blocking Meta’s new bot, data from Dark Visitors shows.
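For site owners wondering what such a block looks like in practice, the usual mechanism is a robots.txt file. The following is a minimal sketch, not Dark Visitors' tool or Meta's documented procedure: the user-agent tokens `meta-externalagent` and `GPTBot` are assumptions to verify against each company's crawler documentation, and the helper names `build_robots_txt` and `is_allowed` are hypothetical. Note that robots.txt is only a request, and honoring it is up to the crawler.

```python
from urllib.robotparser import RobotFileParser

# Assumed user-agent tokens; confirm them in each crawler's own documentation.
BLOCKED_AGENTS = ["meta-externalagent", "GPTBot"]


def build_robots_txt(agents):
    """Return robots.txt text that disallows every path for each listed agent."""
    groups = [f"User-agent: {agent}\nDisallow: /" for agent in agents]
    # A blank line separates one user-agent group from the next.
    return "\n\n".join(groups) + "\n"


def is_allowed(robots_txt, agent, url):
    """Check, using the standard-library parser, whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


robots_txt = build_robots_txt(BLOCKED_AGENTS)

if __name__ == "__main__":
    print(robots_txt)
    print(is_allowed(robots_txt, "meta-externalagent", "https://example.com/post/1"))
```

The generated text would be served at the site root as `/robots.txt`. Crawlers not named in the file remain unaffected, which is why a new bot goes unblocked until owners add its token.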
Earlier this year, Mark Zuckerberg, Meta’s cofounder and longtime CEO, boasted on an earnings call that his company’s social platforms had amassed a data set for AI training that was even “greater than the Common Crawl,” an entity that has scraped roughly 3 billion web pages each month since 2011.
The AI cat is out of the bag. How do they know they’re not feeding AI generated garbage into their models?
Actually, I think I’m gonna go into my personal website and add 200 pages of locally generated LLM garbage, with hidden links to those pages that only bots should follow.
They don’t. Any popular place on the internet that lets users type text for people to publicly view is now full of AI trash. They’ve fucked it, this shit is just gonna spiral into progressively worse garbage.
They screwed the artificial pooch, in a manner of speaking.