I just wonder: in the not-so-distant future, the majority of content being produced online will be AI-generated and hence of lower quality. Wouldn’t this lead to a gradual decrease in the quality of the AI models themselves?
There’s a theory that databases of pre-AI content will become extremely valuable for precisely this reason, which is part of why the whole Reddit API change happened.
No. You only get model collapse (the academic term for this) if literally all the content is synthetic.
In fact, a mix of synthetic and human-generated data performs better than either alone.
Which makes sense: collapse is the result of distribution edges eroding, so keeping human content prevents that. Meanwhile, the synthetic content from more modern models is increasingly biased towards excellence, so the overall dataset has a better median than the human-only set. Best of both worlds.
Using Gab as an example, you can see from other comments that, in spite of these instructions, the model’s answers are more nuanced and correct than Gab posts. So if you only had Gab posts, you’d have answers from morons, and the synthetic data is better. But if you only had synthetic data, the model wouldn’t know what morons look like, so it couldn’t avoid those answers or develop nuance around them.
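The edge-erosion mechanism above can be sketched with a toy simulation. This is purely illustrative (a resampling chain standing in for "training on your own outputs", not how any real model is trained): each generation draws its training set from the previous generation's output, and you watch the spread of the distribution. A fully synthetic chain loses its tails and eventually fixates on a narrow range, while re-injecting a fraction of the original "human" data keeps the spread intact.

```python
# Toy model-collapse sketch (illustrative assumption, not a real training loop):
# each "generation" samples from the previous one's output. Purely synthetic
# chains erode their distribution's edges; mixing in real data anchors them.
import numpy as np

rng = np.random.default_rng(0)
real = rng.normal(loc=0.0, scale=1.0, size=100)  # stand-in for human data

def run_generations(mix_real: float, generations: int = 2000, n: int = 100) -> float:
    """Return the final std after self-training.

    mix_real -- fraction of each generation's data re-drawn from the
                original real set (0.0 = fully synthetic).
    """
    data = real.copy()
    for _ in range(generations):
        n_real = int(n * mix_real)
        synthetic = rng.choice(data, size=n - n_real)  # sample the "model"
        fresh = rng.choice(real, size=n_real)          # re-inject human data
        data = np.concatenate([synthetic, fresh])
    return float(data.std())

collapsed = run_generations(mix_real=0.0)  # spread erodes generation by generation
anchored = run_generations(mix_real=0.5)   # spread stays near the real data's ~1.0
print(f"fully synthetic std: {collapsed:.4f}, 50% human std: {anchored:.4f}")
```

The fully synthetic chain shrinks towards zero spread because rare (tail) values are occasionally not resampled and are then gone forever, which is exactly the "distribution edges eroding" point.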
that ain’t a theory, that’s a take at best