Meta has quietly unleashed a new web crawler to scour the internet and collect data en masse to feed its AI model.
The crawler, named the Meta External Agent, was launched last month, according to three firms that track web scrapers and bots across the web. The automated bot essentially copies, or “scrapes,” all the data that is publicly displayed on websites, for example the text in news articles or the conversations in online discussion groups.
A representative of Dark Visitors, which offers a tool for website owners to automatically block all known scraper bots, said Meta External Agent is analogous to OpenAI’s GPTBot, which scrapes the web for AI training data. Two other entities involved in tracking web scrapers confirmed the bot’s existence and its use for gathering AI training data.
While close to 25% of the world’s most popular websites now block GPTBot, only 2% are blocking Meta’s new bot, data from Dark Visitors shows.
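As a rough illustration of what "blocking" a crawler means here, the sketch below uses Python's standard `urllib.robotparser` to check which user agents a given robots.txt disallows. The robots.txt body and the `example.com` URL are made up for the example, and the `meta-externalagent` token is an assumption about how Meta's crawler identifies itself; confirm it against Meta's own crawler documentation before relying on it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body. "GPTBot" is the token OpenAI uses;
# "meta-externalagent" is ASSUMED to be the token for Meta External Agent.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: meta-externalagent
Disallow: /

User-agent: *
Allow: /
"""


def blocked_agents(robots_txt, agents, url="https://example.com/"):
    """Return {agent: True if the rules disallow `url` for that agent}."""
    parser = RobotFileParser()
    # parse() accepts the file's lines directly, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return {agent: not parser.can_fetch(agent, url) for agent in agents}


result = blocked_agents(ROBOTS_TXT, ["GPTBot", "meta-externalagent", "SomeOtherBot"])
for agent, blocked in result.items():
    print(f"{agent}: {'blocked' if blocked else 'allowed'}")
```

This only reports what the rules request. robots.txt is a voluntary convention, so a rule has no effect on a crawler that chooses to ignore it, or on one the site owner does not yet know to name.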
Earlier this year, Mark Zuckerberg, Meta’s cofounder and longtime CEO, boasted on an earnings call that his company’s social platforms had amassed a data set for AI training that was even “greater than the Common Crawl,” an entity that has scraped roughly 3 billion web pages each month since 2011.
My take is that if never-ending opt-in requests are a pest, perhaps people should stop doing the pesky activities.
Let’s move this from the digital world (where people seem to get easily confused on topics of consent) into the physical. Remember the good old days of door-to-door salesmen? (Probably not. I only barely remember them and I’m likely far older than you.) In any case you had some twat interrupting your daily/evening tasks, your family time, your sleep, etc. all so they could sell you some shit you didn’t want. They got so obnoxious that regulations had to be put in place to control them: what time they could arrive, what things they could say, what tactics they could or could not use (the old “foot in the door” shit), etc. Finally, over time, people would put up aggressive signs about sales (which salesmen would cheerily ignore, rather like this robots.txt thing), buy dogs to frighten them off, etc.
And this was being done by people selling the products of “established corporations”. When taken to task for it, the corporations would throw the salesmen under the bus, claiming that they did not countenance the tactics used (while the fact that their sales targets practically mandated those tactics was quietly left unspoken). “Established corporations” are no more prone to ethical behaviour, or indeed even to basic social behaviour, than small agents are. It’s just that in this day and age when they commit an ethical breach (like Google’s camera trucks siphoning personal data that time) it’s an ‘accident’ or ‘just some bad apples’ and so on.
The reality is that Meta can be trusted as far as you can throw it. Which is to say zero distance. As can Google, Microsoft, anything Elon Musk foists on us at any point, etc. etc. etc. And this whole “opt out” bullshit is how they get away with being antisocial shits.
Uh … Meta is being pretty upfront about what they’re doing after they ran it a while and siphoned off the stuff they wanted. This is not the pass you seem to think it is.
Oh, no, that wasn’t excusing Meta in general. Just giving them a pass on the fact that they have, to my knowledge, a history of respecting robots.txt, which makes this piece of software better than outright malware. Starting it secretly and not giving site hosts a chance to make sure they had their privacy configured the way they liked first was a shady-as-hell move, no argument there.
I don’t know that I’d call it “respecting robots.txt” if you don’t tell people that your robot even exists. Basically, unless you automatically block any and all robots (and then watch many of them cheerfully ignore you), this is an end-run around user desires.
Which is why I give these institutions the same moral regard I give door-to-door salesmen, telemarketers, slug slime, and other moral vacuums.