Child sex abuse images found in dataset training image generators, report says

Sapphire Velvet · 1 year ago

AutoTL;DR · 1 year ago

This is the best summary I could come up with:

More than 1,000 known instances of child sexual abuse material (CSAM) were found in LAION-5B, a large open dataset that was used to train popular text-to-image generators such as Stable Diffusion, Stanford Internet Observatory (SIO) researcher David Thiel revealed on Wednesday.

His goal was to find out what role CSAM may play in the training process of AI models powering the image generators spouting this illicit content.

“Our new investigation reveals that these models are trained directly on CSAM present in a public dataset of billions of images, known as LAION-5B,” Thiel’s report said.

But because users were dissatisfied with these later, more filtered versions, Stable Diffusion 1.5 remains “the most popular model for generating explicit imagery,” Thiel’s report said.

While a YCombinator thread linking to a blog—titled “Why we chose not to release Stable Diffusion 1.5 as quickly”—from Stability AI’s former chief information officer, Daniel Jeffries, may have provided some clarity on this, it has since been deleted.

Thiel’s report warned that both figures are “inherently a significant undercount” due to researchers’ limited ability to detect and flag all the CSAM in the datasets.

The original article contains 837 words, the summary contains 182 words. Saved 78%. I’m a bot and I’m open source!