Office Space meme:

“If y’all could stop calling an LLM “open source” just because they published the weights… that would be great.”

  • KillingTimeItself
    English
    302 days ago

    I mean, if it’s not directly factually inaccurate, then it is open source. It’s just that the specific block of data they trained on and operate on isn’t published or released, which is pretty common even among open source projects.

    AI just happens to be in a fairly unique spot where that data is actually pretty important. Still, nothing stops other groups from creating an openly accessible dataset through something like distributed computing, which seems to be the fancy new kid on the block for AI right now.

    • Fushuan [he/him]
      English
      14
      edit-2
      1 day ago

      The running engine and the training engine are open source. The service that uses the model trained with the open source engine and runs it with the open source runner is not, because a biiiig big part of what makes AI work is the trained model, and a big part of the source of a trained model is training data.

      When they say open source, 99.99% of people will take that to mean everything is verifiable, and it just is not. That is misleading.

      As others have stated, a big part of open source development is providing everything so that other users can get the exact same results. This has always been the case in open source ML development: people provide links to their training data for reproducibility. It has been the case with most of the papers I have read on natural language processing (the overarching field that LLMs belong to). Both code and training data are provided.

      An example from the computer vision world, darknet and YOLO: https://github.com/AlexeyAB/darknet

      This is the repo with the code to train and run the darknet models, and they also provide pretrained models, called YOLO. On top of that, they provide links to the original datasets the YOLO models were trained on. THIS is open source.

    • @FooBarrington
      10
      edit-2
      2 days ago

      But it is factually inaccurate. We don’t call binaries open-source, we don’t even call visible-source open-source. An AI model is an artifact just like a binary is.

      An “open-source” project that doesn’t publish everything needed to rebuild it isn’t open-source.

    • @[email protected]
      2
      1 day ago

      Is it common? Many fields have standard, open datasets. That’s not the case here, and this data is the most important part of training an LLM.

    • @Treczoks
      22 days ago

      That “specific block of data” is more than 99% of such a project. Hardly insignificant.
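
The "everything needed to rebuild" test argued for in the replies can be made concrete with a rough sketch. Everything here is an assumption made for illustration: the artifact names, the `Release` type, and the two example releases are hypothetical and do not describe any particular project's actual contents.

```python
from dataclasses import dataclass

# Hypothetical artifact categories, chosen only for illustration.
REQUIRED = ("training_code", "inference_code", "weights", "training_data")


@dataclass(frozen=True)
class Release:
    name: str
    published: frozenset  # artifact categories the project actually ships


def missing_artifacts(release):
    """Return the required artifacts the release does not publish, in REQUIRED order."""
    return [item for item in REQUIRED if item not in release.published]


def is_rebuildable(release):
    """A release passes only if nothing needed to rebuild it is missing."""
    return not missing_artifacts(release)


# Assumed example: code and weights are published, training data is not.
open_weights = Release(
    "open-weights LLM",
    frozenset({"training_code", "inference_code", "weights"}),
)
# Assumed example: code, pretrained weights, and dataset links are all published.
darknet_style = Release("darknet-style project", frozenset(REQUIRED))

for release in (open_weights, darknet_style):
    gaps = missing_artifacts(release)
    status = "rebuildable" if not gaps else "missing: " + ", ".join(gaps)
    print(release.name + " -> " + status)
```

Under these assumptions, the weights-only release fails the test on exactly one item, the training data, which is the item the thread is arguing about.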