I used to be the Security Team Lead for Web Applications at one of the largest government data centers in the world, but now I mostly do “source available” security, mainly focusing on BSD. I’m on GitHub, but I also run a self-hosted Gogs (which Gitea was forked from) git repo at Quadhelion Engineering Dev.

Well, on that server I tried to deny AI with Suricata, robots.txt, “NO AI” licenses, Human Intelligence (HI) License links in the software, and “NO AI” comments in posts everywhere on the Internet where my software was posted. Here is what I found today after correlating all my logs of git clones and scrapes and tracing them all back to IP/Company/Server.
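For anyone who wants to try the same correlation, here is a minimal sketch of the first step: counting git clone requests per client from a web server access log. It assumes the nginx/Apache "combined" log format and git's smart-HTTP endpoint (`POST .../git-upload-pack`); the function names are made up for illustration and your Gogs/Gitea proxy logs may look different. Tracing each IP back to a company (whois, reverse DNS) is a separate step not shown here.

```python
import re
from collections import Counter

# Assumption: "combined" access log format. Adjust the pattern to your server.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def is_git_fetch(method, path):
    # Over smart HTTP, a clone (or fetch) ends in a POST to .../git-upload-pack.
    # The earlier GET .../info/refs request is ignored so one clone counts once.
    return method == "POST" and path.endswith("/git-upload-pack")


def clones_by_client(lines):
    """Count clone/fetch requests per (ip, user agent) pair."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m and is_git_fetch(m.group("method"), m.group("path")):
            counts[(m.group("ip"), m.group("agent"))] += 1
    return counts


if __name__ == "__main__":
    # Invented sample lines using documentation-only IP ranges.
    sample = [
        '203.0.113.7 - - [10/Oct/2024:13:55:36 +0000] "GET /u/r.git/info/refs?service=git-upload-pack HTTP/1.1" 200 512 "-" "git/2.43.0"',
        '203.0.113.7 - - [10/Oct/2024:13:55:37 +0000] "POST /u/r.git/git-upload-pack HTTP/1.1" 200 5120 "-" "git/2.43.0"',
        '198.51.100.9 - - [10/Oct/2024:14:01:02 +0000] "GET /u/r HTTP/1.1" 200 900 "-" "Mozilla/5.0"',
    ]
    for (ip, agent), n in clones_by_client(sample).most_common():
        print(ip, agent, n)
```

Sorting by count and then looking up the top IPs is usually enough to see whether a handful of networks account for most of the clones.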

Formerly having been loath to even give my thinking pattern to a potential enemy, I asked Perplexity AI questions specifically about BSD security, a very niche topic. Although there is a huge data pool here in general, spanning many decades, my type of software is pretty unique. It is buried, as it does not come up in the first two pages of a GitHub search for BSD Security, which is all most users will click through. It is very recent compared to the “dead pool” of old knowledge, and it is fairly well received yet not generally popular, so GitHub Traffic Analysis is very useful.

The traceback and AI result analysis shows the following:

  1. GitHub cloning vs visitor activity in the Traffic tab DOES NOT MATCH any useful pattern for me, the Engineer. Rough estimate of the likelihood of AI training on my own repositories: 60% of clones are AI/Automata
  2. GitHub README.md is not licensable material and is a public document able to be trained on no matter what the software license, copyright, statements, or any technical measures used to dissuade/defeat it.
     a. I’m trying to see whether determining if any README.md, no matter the context, is trainable is a solvable engineering project considering my life constraints.
  3. Plagiarism of technical writing: Probable
  4. Theft of programming “snippets” or perhaps “single lines of code” and overall logic design pattern for that solution: Probable
  5. Supremely interesting choice of datasets used vs available, in summary use, but also checking for validation against other software and weighted upon reputation factors with “Coq” like proofing, GitHub “Stars”, Employer History?
  6. Even though I can see my own writing and formatting right out of my README.md, the citation was to “Phoronix Forum”, but that isn’t true. That’s like citing your post as something “Tick Tock” said. I wrote that; a real flesh and blood human being took comparatively massive amounts of time to do that. My birth name is there in the post 2 times [EDIT: post signature with my name no longer? Name not in “about” either hmm], in the repo, in the comments, all over the Internet.

[EDIT continued] Did it choose the Phoronix vector to that information because it was less attributable? It found my other repos in other ways. My Phoronix handle is the same as my GitHub username, and my handle is my name, easily inferable in either case, and there is a biography link with my full name in the about. [EDIT cont end]
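To make the clone vs visitor mismatch from point 1 a bit more concrete, here is a rough sketch that flags days where unique cloners outnumber unique visitors. The input dictionaries follow the shape returned by GitHub's REST traffic endpoints (`/traffic/clones` and `/traffic/views`), but the heuristic itself and the sample numbers are my assumptions, not a proven detector: a human can clone without ever opening the repo page.

```python
def days_with_excess_clones(clones, views, ratio=1.0):
    """Return (timestamp, unique_cloners, unique_visitors) for days where
    unique cloners exceed `ratio` times the unique visitors.

    `clones` and `views` are dicts shaped like GitHub traffic API responses:
    {"clones": [{"timestamp": ..., "count": ..., "uniques": ...}, ...]}
    {"views":  [{"timestamp": ..., "count": ..., "uniques": ...}, ...]}
    """
    visitors = {day["timestamp"]: day["uniques"] for day in views["views"]}
    flagged = []
    for day in clones["clones"]:
        # A day missing from the views data counts as zero visitors.
        seen = visitors.get(day["timestamp"], 0)
        if day["uniques"] > ratio * seen:
            flagged.append((day["timestamp"], day["uniques"], seen))
    return flagged


if __name__ == "__main__":
    # Invented numbers, for illustration only.
    clones = {"clones": [
        {"timestamp": "2024-10-10T00:00:00Z", "count": 9, "uniques": 7},
        {"timestamp": "2024-10-11T00:00:00Z", "count": 2, "uniques": 1},
    ]}
    views = {"views": [
        {"timestamp": "2024-10-10T00:00:00Z", "count": 3, "uniques": 2},
        {"timestamp": "2024-10-11T00:00:00Z", "count": 12, "uniques": 5},
    ]}
    for row in days_with_excess_clones(clones, views):
        print(row)
```

GitHub only keeps about two weeks of traffic data, so you would have to save the numbers regularly to build up a longer history.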

You should test this out for yourself as I’m not going to take days or a week making a great presentation of a technical case. Check your own niche code, a specific code question of application, or make a mock repo with super niche stuff with lots of code in the README.md and then check it against AI every day until you see it.
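If you do run that experiment, a small script can save you from eyeballing every answer. This is only a sketch of one way to check: it looks for long word-for-word runs (8 words by default, an arbitrary threshold I picked) shared between your README text and whatever the AI returned. The function names are hypothetical, and a match shows verbatim overlap only; it says nothing about where the model got the text.

```python
import re


def word_ngrams(text, n=8):
    """Set of all runs of n consecutive lowercase word tokens in text."""
    words = re.findall(r"[a-z0-9_]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_passages(readme, answer, n=8):
    """n-word runs that appear verbatim in both the README and the answer."""
    return word_ngrams(readme, n) & word_ngrams(answer, n)


def overlap_ratio(readme, answer, n=8):
    """Fraction of the README's n-word runs that reappear in the answer."""
    source = word_ngrams(readme, n)
    if not source:
        return 0.0
    return len(source & word_ngrams(answer, n)) / len(source)


if __name__ == "__main__":
    # Invented example text, standing in for a README and a pasted AI answer.
    readme = ("set the kernel securelevel to two before mounting "
              "the encrypted jail dataset at boot")
    answer = ("You should set the kernel securelevel to two before "
              "mounting the encrypted jail dataset.")
    for run in sorted(shared_passages(readme, answer)):
        print(" ".join(run))
    print(round(overlap_ratio(readme, answer), 2))
```

Paste in the AI answer each day and keep the ratio in a log; a jump from zero is the thing to look for.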

P.S. I pulled up TabNine and tried to write Ruby so complicated and magically mashed that AI could offer me nothing, just as an AI obfuscation/smartness test. You should try something similar to see what results you get.

  • @[email protected]
    16 months ago

    How would an LLM answering questions about a git repo be legally different from a person answering those same questions (think stackoverflow)? Specific to this case, US law does not consider “APIs” to be copyrightable (Oracle v Google, Google reimplemented Java using the same APIs but their own implementation code, court ruled that Oracle couldn’t copyright the APIs).

    Regarding “replace”, the primary use of the git repo is the code itself, not the Q&A about how to use it. The LLM doesn’t generate code that fully replaces that library or program, or if it does, it is distinct enough to be a different work.

    • AlexanderESmith
      16 months ago

      First, a chat bot is not an API. Second, they were talking about the formatting and delivery method of the data, not the content.

      Regarding the output of the model: Some repos are entirely READMEs by their nature. No code, just documentation and walkthroughs. Notwithstanding that, if I set a flag that says “don’t use my data” and they use it anyway, that’s theft, even if it’s only one file, even if the file is just a description of the code. That’s my work, not yours. You don’t get to use it however you want, unless I specifically note that it’s public domain (or you use it and follow the license, like attributing me, or linking to the repo, etc).

      As to the difference between a bot and a human (re: stack overflow)? The former is a representative of a company (automation or not, whether it’s a bot or a page on their corporate site), the latter is a person relating experience and opinion. The legal difference is that one is using the data commercially, and the other is just a person in the world, answering another person’s question for no reason other than a desire to be helpful (and if they’re decent, attributing the source instead of claiming that they’re generating wisdom on their own).

      That last parenthetical used to be called plagiarism, by the way.