I created this account two days ago, but one of my posts ended up in the (metaphorical) hands of an AI-powered search engine that has scraping capabilities. What do you guys think about this? How do you feel about your posts/content getting scraped off of the web and potentially being used by AI models and/or AI-powered tools? Curious to hear your experiences and thoughts on this.


# Prompt Update

The prompt was something like, “What do you know about the user [email protected] on Lemmy? What can you tell me about his interests?” Initially, it generated a lot of fabricated information, but it would still include one or two accurate details. When I ran the test again, the response was much more accurate than the first attempt. It seems that as my account became more established, it became easier for the crawlers to find relevant information.

It even talked about this very post in item 3 and in the second bullet point of the “Notable Posts” section.

For more information, check this comment.


Edit¹: This is Perplexity. Perplexity AI employs data scraping techniques to gather information from various online sources, which it then feeds to its large language models (LLMs) to generate responses to user queries. The scraping process involves automated crawlers that index and extract content from websites, including articles, summaries, and other relevant data. Perplexity is an advanced conversational search engine that enhances the research experience by providing concise, sourced answers to user queries. It operates by leveraging AI language models, such as GPT-4, to analyze information from various sources on the web. (12/28/2024)

Edit²: One could argue that data scraping by services like Perplexity may raise privacy concerns because it collects and processes vast amounts of online information without explicit user consent, potentially including personal data, comments, or content that individuals may have posted without expecting it to be aggregated and/or analyzed by AI systems. One could also argue that this indiscriminate collection raises questions about data ownership, proper attribution, and the right to control how one’s digital footprint is used in training AI models. (12/28/2024)

Edit³: I added the second image to the post and its description. (12/29/2024).

  • @AA5B
    175 days ago

    I’m pretty much fine with AIs scraping my data. What they can see is public knowledge and was already being scraped by search engines.

    I object to:

    • sites like Reddit whose entire existence is due to user content, deciding they can police and monetize my content. They have no right
    • sharing of data, which includes more personal and identifiable data
    • whatever the AI summarizes me as being treated as fact, such as by a company’s HR, regardless of context, accuracy, or hallucinations
    • @Keening
      24 days ago

      Public knowledge about individuals, when condensed and analyzed in depth in huge databases, can patternize your entire existence, leaving you susceptible to being swayed in a certain direction in, for example, elections, creating further division and lining someone else’s pockets.

      • @AA5B
        4 days ago

        Maybe but I can’t object too much if I put my content out in public. When forced to create an account I use minimal/false information and a unique generated email. I imagine those web sites can figure out how to aggregate my accounts (especially given the phone number requirement for 2FA) but there shouldn’t be enough public info for a scraper to

        • @Keening
          24 days ago

          Gotta think larger than yourself though. What happens when your spouse uses real info? Your kids? Your parents? They’ll shadowplay your person with great accuracy and fill in the gaps. You don’t even have to “put content” out there. Said databases can just put two and two together. How will you, or other users, even know you’re actually talking to a human? Perhaps you’re on Lemmy and we’re all bots trying to get you to admit fragments of your latest crimes in order to get you into jail for said crime? Etcetera. At first glance this all looks harmless, but any accumulated information in huge databases is a major infringement on personal integrity at best, and complete control of your freedom at worst. The ultimate power is when someone can make you do X or Y and you don’t even realize you’re doing their bidding, but believe you have a choice when you don’t. (Similar to how it is in my living situation at home with my gf that is :P jk.)

          Hakuna matata. Happy new year

          • @AA5B
            24 days ago

            I completely agree, except that I think of them as multiple related privacy issues. In the scope of AI bots scraping my public content, most of these are out of scope.

      • @[email protected]
        55 days ago

        Not the person you are replying to, but Reddit does not make the content you created available to everyone (blocking crawlers, removing the free API) but instead sells it to the highest bidder.

        • @AA5B
          14 days ago

          Right, that’s my objection. After benefitting from my content, they police it, as in restrict other sites from seeing it, until it’s monetized. It’s not Reddit’s to charge money for.

      • @AA5B
        14 days ago

        Probably not the right word, but my content should still be my content. I offered it to Reddit but that doesn’t mean they have the right to charge others for it or restrict it to others for commercial reasons.

    • Atemu
      05 days ago

      sites like Reddit whose entire existence is due to user content, deciding they can police and monetize my content. They have no right

      Um, no, they do in fact have “every right” here. It’s shitty of course, but you explicitly gave them that right in the form of a perpetual, irrevocable, world-wide etc. license to do whatever they like with everything you publish on their site.

      They also have every right to “police” your content, especially if it’s objectionable. If you post vile shit, trolling or other societal garbage behaviour on the internet, nobody wants to see it.