• @kromem
    8 months ago

    No. Model collapse (the academic term for this) only happens if literally all of the training content is synthetic.

    In fact, a mix of synthetic and human-generated data performs better than either one alone.

    Which makes sense: the collapse is a result of the edges of the distribution eroding, and keeping human content in the mix prevents that. Meanwhile, the synthetic content from more modern models is increasingly biased towards excellence, so the overall data set has an improved median over the human-only set. Best of both worlds.
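
To make the "edges eroding" part concrete, here is a toy sketch in Python. It is not how LLM training works and it says nothing about answer quality. It assumes "human" data is a standard normal distribution and that each "model" is just a Gaussian fitted to the previous generation's output. The function name `final_spread` and all of its parameters are made up for illustration.

```python
import random
import statistics


def final_spread(human_fraction, generations=1000, n=50, seed=0):
    """Return the spread (std dev) of the training data after many generations.

    Toy assumptions: human data ~ N(0, 1); each generation fits a Gaussian
    to its training set and the next training set is sampled from that fit,
    with `human_fraction` of it replaced by fresh human samples.
    """
    rng = random.Random(seed)
    n_human = round(n * human_fraction)
    # Generation 0 trains on human data only.
    data = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(generations):
        # "Train" the model: estimate mean and spread from the current data.
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
        # Build the next training set from model output plus fresh human data.
        synthetic = [rng.gauss(mu, sigma) for _ in range(n - n_human)]
        human = [rng.gauss(0.0, 1.0) for _ in range(n_human)]
        data = synthetic + human
    return statistics.pstdev(data)


if __name__ == "__main__":
    print("synthetic only:", final_spread(0.0))
    print("half human:    ", final_spread(0.5))
```

With synthetic-only data, each refit loses a little of the tails and the spread shrinks toward zero over the generations. With half the data coming from humans every round, the spread stays close to the original.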

    Using Gab as an example, you can see from other comments that in spite of these instructions the model's answers are more nuanced and correct than Gab posts. So if you only had Gab posts, you'd have answers from morons, and the synthetic data would be better. But if you only had synthetic data, the model wouldn't know what morons look like, so it couldn't learn to avoid those answers and develop nuance around them.