A new web crawler launched by Meta last month is quietly scraping the web for AI training data

lemme in · 6 months ago

@GarrulousBrevity · edit-2 6 months ago

Have you used a search engine? Crawlers are not generative AI.

Aniki 🌱🌿 · edit-2 6 months ago

The internet is not a search engine, and no, search engines are not generative AI. That’s new.

Do you have any idea how many content bot crawlers there are? Most of the corporate sites I host at work are serving content to bots more than half the time.

Do you know AltaVista still has bots??

When was the last time you used that search engine?

@GarrulousBrevity · 6 months ago

I guess I don’t really see the problem with that though. There are configuration levers you could be pulling on those sites you’re hosting, but you aren’t. There are lots of shady questions about how these models are getting training data, but crawlers have a well-defined opt-out mechanism.

The web would not be what we know it as without them, because it’s how you find sites. Why shouldn’t AltaVista have one? I don’t object to what AltaVista does with the data.

Aniki 🌱🌿 · 6 months ago

Mate, we have an absurdly restrictive robots.txt, including a custom WordPress plugin that automatically generates the file, and the bots don’t give a fuck.

@GarrulousBrevity · 6 months ago

But Meta’s will, and AltaVista’s. I’m not angry at them when a script kiddie makes a bad crawler.

🇨🇦🇩🇪🇨🇳张殿李🇨🇳🇩🇪🇨🇦 · 6 months ago

Anybody who thinks “well defined opt out mechanisms” are good has no clue how “consent” works.

@GarrulousBrevity · 6 months ago

I know what you’re trying to say, but that phrasing though. Being able to opt out is an important part of consent. No means no, man.

🇨🇦🇩🇪🇨🇳张殿李🇨🇳🇩🇪🇨🇦 · 6 months ago

Even more important than being able to opt out, however, is not having to opt out in the first place.

Otherwise you get this script:

Wanna fuck?

No.

How about now?

No.

It’s five minutes later. Have you changed your mind?

No.

. . .

Which is exactly what techbrodudes have been doing to us by having “opt out” features a dozen times a day.

@GarrulousBrevity · 6 months ago

I think of this as a problem with opt-in-only systems. Think of how sites ask you to opt in to allow tracking cookies every goddamn time a page loads. A rule-based system like robots.txt, one that lets you opt in or opt out once, so you could opt out of cookie requests and tell all sites to fuck off, would be great. @[email protected] is complaining about malicious crawlers that ignore those rules (assuming they’re right and that the rules are set up correctly), and lumping that malware in with software made by established corporations.

However, Meta and other big tech companies have not historically ignored configurations like robots.txt. They have had an issue with using the data they scrape in ways that are different from what they claimed they would, or with scraping data from a site that does not allow scraping by coming at it via a URL on a page that they legitimately scraped, but that’s not the kind of shenanigans this article is about, as Meta is being pretty upfront about what they’re doing with the data. At least after they announced it existed.

An opt-in-only solution would just lead to a world where all hosts were being constantly bombarded with requests to opt in. My major takeaway from how Meta handled this is that you should configure any site you own to disallow any action from bots you don’t recognize. As much as Reddit can fuck off, I don’t disagree with their move to change their configuration to:

User-agent: *
Disallow: /
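
If you want to sanity-check what a rule set like that actually tells a well-behaved crawler, here is a rough sketch using Python's standard `urllib.robotparser`. The bot names (`KnownSearchBot`, `SomeNewCrawler`) and the allowlist variant are made up for illustration, not anything Reddit or Meta publishes. It only shows what a crawler that honours robots.txt would conclude; it does nothing about bots that ignore the file.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text directly, without fetching anything."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# The blanket rule quoted above: every crawler is asked to stay out.
BLANKET_DENY = """\
User-agent: *
Disallow: /
"""

# Hypothetical allowlist variant: "KnownSearchBot" is an invented name
# standing in for whichever crawlers you recognize and want to let in.
ALLOWLIST = """\
User-agent: KnownSearchBot
Allow: /

User-agent: *
Disallow: /
"""

deny_all = build_parser(BLANKET_DENY)
allowlist = build_parser(ALLOWLIST)

url = "https://example.com/some/page"  # placeholder URL

# What a compliant crawler would decide under each rule set.
print(deny_all.can_fetch("SomeNewCrawler", url))
print(allowlist.can_fetch("KnownSearchBot", url))
print(allowlist.can_fetch("SomeNewCrawler", url))
```

The allowlist version is the "disallow anything you don't recognize" idea: named bots get their own entry, and everything else falls through to the `*` rule.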

🇨🇦🇩🇪🇨🇳张殿李🇨🇳🇩🇪🇨🇦 · 6 months ago

My take is that if never-ending opt-in requests are a pest, perhaps people should stop doing the pesky activities.

Let’s move this from the digital world (where people seem to get easily confused on topics of consent) into the physical. Remember the good old days of door-to-door salesmen? (Probably not. I only barely remember them and I’m likely far older than you.) In any case you had some twat interrupting your daily/evening tasks, your family time, your sleep, etc. all so they could sell you some shit you didn’t want. They got so obnoxious that regulations had to be put in place to control them: what time they could arrive, what things they could say, what tactics they could or could not use (the old “foot in the door” shit), etc. Finally, over time, people would put up aggressive signs about sales (which salesmen would cheerily ignore, rather like this robots.txt thing), buy dogs to frighten them off, etc.

And this was being done by people selling the products of “established corporations”. When taken to task for it they’d throw the salesmen under the bus, claiming that the tactics used were not countenanced by them (but the fact that their sales targets practically mandated this was quietly left unspoken). “Established corporations” are no more prone to ethical behaviour and, indeed, even basically social behaviour than are small agents. It’s just that in this day and age when they commit an ethical breach (like Google’s camera trucks siphoning personal data that time) it’s an ‘accident’ or ‘just some bad apples’ and so on.

The reality is that Meta can be trusted as far as you can throw it. Which is to say zero distance. As can Google, Microsoft, anything Elon Musk foists on us at any point, etc. etc. etc. And this whole “opt out” bullshit is how they get away with being antisocial shits.

… but that’s not the kind of shenanigans this article is about, as Meta is being pretty upfront about what they’re doing with the data. At least after they announced it existed.

Uh … Meta is being pretty upfront about what they’re doing after they ran it a while and siphoned off the stuff they wanted. This is not the pass you seem to think it is.

@GarrulousBrevity · 6 months ago

Oh, no, that wasn’t excusing Meta in general. Just giving them credit for having, to my knowledge, a history of respecting robots.txt, which makes this piece of software better than outright malware. Starting it secretly and not giving site hosts a chance to make sure they had their privacy configured the way they liked first was a shady as hell move, no argument there.