• @General_Effort
    English
    44 months ago

    This has all been tested and is being continuously retested. Start here, for example: https://en.wikipedia.org/wiki/Neural_scaling_law

    I know, on Lemmy you will get the impression that engineers and scientists are all just bumbling fools who are intellectually outclassed by any high schooler with internet access. But how likely is that, really?

    • @[email protected]
      English
      14 months ago

      Scaling laws are disputed, but if an effort has in fact already been undertaken to train a general-purpose LLM using only permissively licensed data, great! Can you send me the checkpoint on Hugging Face, a GitHub page hosting the relevant code, or even a paper or blog post about it? I’ve been looking and haven’t found anything like that yet.

      • @General_Effort
        English
        24 months ago

        “Scaling laws are disputed”

        Not in general.

        There is not enough permissively licensed text to train models of any significant size, and what there is lacks diversity: Wikipedia, government documents, Stack Overflow, century-old material, … An LLM trained on that is not likely to be called “general purpose”, because scaling laws tie a model’s capability to the amount of training data. Such small models are sometimes trained for research purposes, but I don’t have a link ready. They are not something you’d actually use. Perhaps you could look at Microsoft’s Phi series of models. They are trained on synthetic data, though that’s probably not what you are looking for.
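
        To make the "because scaling laws" point concrete, here is a rough sketch of a Chinchilla-style parametric loss, L(N, D) = E + A/N^alpha + B/D^beta. The constants are the fitted values as I recall them from Hoffmann et al. (2022), so treat them as illustrative rather than authoritative. The model size and token counts are hypothetical and say nothing about how much permissively licensed text actually exists.

```python
# Sketch of a Chinchilla-style parametric scaling law:
#   L(N, D) = E + A / N**alpha + B / D**beta
# N = parameter count, D = training tokens.
# Constants are assumed from Hoffmann et al. (2022); illustrative only.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28


def predicted_loss(n_params: float, n_tokens: float) -> float:
    """Predicted pretraining loss for a model of n_params trained on n_tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


if __name__ == "__main__":
    n_params = 7e9  # hypothetical model size
    for n_tokens in (1e10, 1e11, 1e12):  # hypothetical corpus sizes
        loss = predicted_loss(n_params, n_tokens)
        print(f"{n_tokens:.0e} tokens -> predicted loss {loss:.3f}")
```

        Under this form, the data term B/D^beta sets a floor that adding parameters cannot lower, which is why a small corpus caps what a model of any size can reach.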