Largest Dataset Powering AI Images Removed After Discovery of ‘Suspected’ Child Sexual Abuse Material

BlackEco · 11 months ago

FaceDeer · edit-2 11 months ago

Sounds like nothing particularly unusual or alarming. Researchers found a few thousand images that could be illegal that were referenced by it, told LAION about it, and LAION pulled the database down temporarily while checking and removing them. A few thousand images out of five billion is not significant.

There’s also the persistent misunderstanding of what the LAION database is, which is even perpetuated by the paper itself (making me suspicious of the researchers’ motivations since they surely know better). The paper says: “We find that having possession of a LAION‐5B dataset populated even in late 2023 implies the possession of thousands of illegal images—not including all of the intimate imagery published and gathered non‐consensually, the legality of which is more variable by jurisdiction,” when the LAION-5B dataset doesn’t actually have any pictures in it at all. It’s purely a list of URLs pointing at images that are on the Internet, each with text describing them. Possessing the dataset doesn’t put you in possession of any of those images.

Edit: Yeah, down at the bottom of the article I see the researcher state that in his opinion LAION-5B shouldn’t even exist and use inaccurate emotionally-charged language about how AI training data is “stolen.” So there’s the motivation I was suspicious of.

@Zarxrax · 11 months ago

While I get what you are saying, it’s pretty clear that what he was saying was that if you actually populate the dataset by downloading the images contained in the links (which anyone who is actually using the dataset to train a model would need to do), then you have inadvertently downloaded illegal images.

It is mentioned repeatedly in the article that the dataset itself is simply a list of urls to the images.

@General_Effort · 11 months ago

Makes one wonder if there is some lobby org behind this. The benefits to major corporate interests are obvious, and it feels a little campaigny.

@[email protected] · edit-2 10 months ago

deleted

@General_Effort · 11 months ago

What?

@[email protected] · 11 months ago

He’s (correctly) taking the piss

@General_Effort · 11 months ago

I don’t get it. What’s the joke?

@[email protected] · edit-2 10 months ago

deleted

SineSwiper · 11 months ago

This new “journalism” site is not doing itself any favors with bullshit headlines like this. And this is not the first wildly inaccurate article I’ve seen from 404 Media.

@[email protected] · edit-2 10 months ago

deleted

SineSwiper · 11 months ago

LAION is a database of URLs, gathered from publicly-available data on the Web. Who is “taking” anything?

@[email protected] · 11 months ago

“Taking” is doing a lot of work there, and fundamentally the issue at heart.

@[email protected] · edit-2 10 months ago

deleted

FaceDeer · 11 months ago

“Copyright violation” is probably the wording you’re looking for. Copyright violation is not taking or theft or stealing or any of those other words - it’s copyright violation.

Whether training an AI on a copyrighted work without permission of the copyright holder is a violation of copyright is something that is debatable. But it most definitely is not stealing or theft. Theft is covered by completely different laws.

@[email protected] · edit-2 10 months ago

deleted

@[email protected] · 11 months ago

Unless you feel like being a pedant, copyright infringement is also known as content theft.

https://www.deviantart.com/team/journal/Calling-All-Creator-Platforms-to-Fight-Art-Theft-901238948

@General_Effort · edit-2 11 months ago

It occurs to me that a lot of people don’t know the background here.

LAION is a German Verein (a club). It’s mainly a German physics/comp sci teacher who does this in his spare time. (German teachers have the equivalent of a Master’s degree.)

He took data collected by an American non-profit called Common Crawl. “Crawl” means that they have a computer program that automatically follows all links on a page, and then all links on those pages, and so on. In this way, Common Crawl basically downloads the internet (or rather the publicly reachable parts of it).
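
For anyone who wants to see the "follow all links, then all links on those pages" idea spelled out, here is a minimal sketch. It is not Common Crawl's actual code; the `crawl` function and the toy `TOY_WEB` link graph are made up for illustration, and nothing here touches the network.

```python
from collections import deque

def crawl(start, get_links):
    """Breadth-first crawl: visit a page, then every page it links to, and so on."""
    seen = {start}           # pages already queued, so loops don't repeat forever
    queue = deque([start])
    order = []               # pages in the order they were visited
    while queue:
        page = queue.popleft()
        order.append(page)
        for link in get_links(page):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

# Hypothetical toy "web": each page maps to the pages it links to.
TOY_WEB = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example", "d.example"],
    "c.example": ["a.example"],
    "d.example": [],
}

pages = crawl("a.example", lambda p: TOY_WEB.get(p, []))
print(pages)
```

A real crawler would fetch each page over HTTP and parse the links out of the HTML, but the visiting logic is the same idea.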

Search engines, like Google or Microsoft’s Bing, crawl the internet to create the databases that power their search. But these and other for-profit businesses aren’t sharing the data. Common Crawl exists so that independent researchers also have some data to study the internet and its history.

Obviously, these data sets include illegal content. It’s not feasible to detect all of it. Even if you could manually look at all of it, that would be illegal in a lot of jurisdictions. Besides, which standards of illegal content should one apply? If a Chinese researcher downloads some data and learns things about Tiananmen Square in 1989, what should the US do about that?

Well, that data is somehow not the issue here, for some reason. Interesting, no?

The German physics teacher wrote a program that extracted links to images, as well as their accompanying text descriptions, from Common Crawl. These links and descriptions were put into a list - a spreadsheet, basically. The list also contains metadata like the image size. On top of that, he used AI to guess if they are “NSFW” (i.e., porn), and if people would think they are beautiful. This list, with 5 billion entries, is LAION-5B.
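
To make that concrete, here is a rough sketch of what "extract image links and their text descriptions" can look like. This is not LAION's actual pipeline; it assumes the description is the `alt` attribute of an `<img>` tag, and `ImageLinkExtractor`, `extract_image_rows`, and the sample HTML are all hypothetical names for illustration. Note that the output is only rows of URL plus text, with no image data.

```python
from html.parser import HTMLParser

class ImageLinkExtractor(HTMLParser):
    """Collects (url, text) rows from <img> tags that have both src and alt."""

    def __init__(self):
        super().__init__()
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        src, alt = attributes.get("src"), attributes.get("alt")
        if src and alt:  # skip images without a description
            self.rows.append({"url": src, "text": alt})

def extract_image_rows(html):
    parser = ImageLinkExtractor()
    parser.feed(html)
    return parser.rows

# Made-up page content; only the first image has a description.
SAMPLE = (
    '<p>Hi</p>'
    '<img src="https://img.example/cat.jpg" alt="a cat on a sofa">'
    '<img src="https://img.example/x.png">'
)

rows = extract_image_rows(SAMPLE)
print(rows)
```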

Sifting through petabytes of data to do all that is not something you can do on your home computer. The funding that Stability AI provided is a few thousand USD for supercomputer time in “the cloud”.

German researchers at the LMU - a government funded university in Munich - had developed a new image AI, which is especially efficient and can be run on normal gaming PCs. (The main people now work on a start-up in New York.) The AI was trained on that open source data set and named Stable Diffusion in honor of Stability AI, which had provided the several 100k USD needed to pay for the supercomputer time.

These supposed issues are only an issue for free and open source AI. The for-profit AI companies keep their data sets secret. They are fairly safe from accusations.

Maybe one should use PhotoDNA to search for illegal content? The for-profit company PhotoDNA, which so kindly provided its services for free to this study, is owned by Microsoft, which is also behind OpenAI.

Or maybe one should only use data that has been manually checked by humans? That would be outsourced to a low wage country for pennies, but no need: Luckily, billion-dollar corporations exist that offer just such data sets.

This article solely attacks non-profit endeavors. The only for-profit companies mentioned (PhotoDNA, Getty), stand to gain from these attacks.

@[email protected] · 11 months ago

On a different note, how do these big companies train AIs to detect CSAM without using a bunch of illegal CSAM to train them?

FaceDeer · 11 months ago

It’s perverse how the laws are so ultra-strict that you can break them by making an attempt to comply with them. The article describes how at several points the researchers had to “outsource” part of their work to people in less-strict jurisdictions. And LAION itself is based in Germany, which adds yet another jurisdiction to the situation.

CSAM always turns into a ridiculous minefield. So many different jurisdictions and different definitions, and everyone is ultra adamant about theirs being the one that must be enforced globally.

@fishos · edit-2 11 months ago

I’ve heard there are specific data sets you can download that have the training data, but not the images themselves. Someone else already ran the images through a training model and you’re just grabbing the processed data and plugging it into your model. I’m sure I’m missing some nuance and haven’t looked into it myself, but I’ve seen that given as the answer when someone asked before.

@piecat · 11 months ago

IIRC from a previous thread, different law enforcement agencies will release hashes or similar so the image can be detected without distributing the original.
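
A rough sketch of how that kind of hash matching works, with a big caveat: this uses a plain SHA-256 of the file bytes, which only catches exact copies. As far as I know, real tools such as PhotoDNA use perceptual hashes that still match after resizing or re-encoding, so treat this as a simplified stand-in. The function names and the placeholder bytes are made up for illustration.

```python
import hashlib

def sha256_hex(data):
    """Hex digest of the raw file bytes (exact-match fingerprint)."""
    return hashlib.sha256(data).hexdigest()

def is_flagged(data, blocklist):
    """True if the file's hash appears in the distributed list of hashes."""
    return sha256_hex(data) in blocklist

# The agency shares only hashes, never the files themselves.
# Placeholder bytes stand in for a known flagged file.
known_bad = b"placeholder bytes standing in for a flagged file"
blocklist = {sha256_hex(known_bad)}

print(is_flagged(known_bad, blocklist))
print(is_flagged(b"some unrelated file", blocklist))
```

The point is that whoever runs the check only ever needs the list of hashes, not the original material.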