• @mholiv
    9 months ago

    I hate to break it to you, but federated services are basically impossible to protect from scraping. The whole idea is openness and federation.

    The only reason why places like Twitter and Reddit try to prevent scraping is so they can sell the data for profit.

    If you post stuff publicly anywhere, it will be scraped. On the fediverse, it will be scraped via the open, federated APIs. On proprietary platforms, it will be scraped via the proprietary paid APIs.

    • @[email protected]
      9 months ago

      Another question related to your answer: how can I guarantee that the content I create (comments) is available for scraping?

      The issue I have with Reddit and the like is that we can’t freely access the content, especially past content. I don’t want instances to be sold in, say, 10 years, compromising access to old content (or filling it with advertising). I would like to be able to replicate a rogue instance into a new free instance.

      • originalucifer
        9 months ago

        it's the wild west right now in the fediverse.

        a multitude of products are being created right now. most haven't hit version 1.0 yet. there are no guarantees other than the assurances you get from your community instance/implementation.

        the only solid guarantee you will ever get would be by creating your own instance so you can curate your own content (as well as the content pulled in from the 'verse).

        it took reddit nearly 20 years to get where it is. let's give the fediverse a little time.

      • @mholiv
        9 months ago

        I want to make a distinction between scraping and archiving here.

        You don’t need to do anything to ensure your content is “scrapeable”. Just post it on the fediverse and it is available to scrape; anyone can do it. That said, unless someone goes out of their way to save what they scrape, eventually, as your content ages, the only copy will be on the server it originates from. I believe all posts are stored on the instance where the community lives. I believe comments work the same way, the difference being that your own instance also stores a local copy of your comment. I could be wrong there, though.

        Archiving is different. Archiving means providing a long-term store of your content, and that is harder. If you run your own instance, the comments you post in communities that live on your instance are safe. Anywhere else, you are subject to that instance just dying or selling out. You would need a specialized tool to take a “snapshot” or something similar. Maybe adding the post thread to archive.org could work. It’s messy in any case.
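
To make the scraping vs. archiving distinction concrete, here is a rough Python sketch of what a “snapshot” tool might do. It is only an illustration under assumptions: it assumes the instance returns the ActivityPub JSON form of a public post when asked with the `application/activity+json` media type, and it assumes the Wayback Machine's `https://web.archive.org/save/` URL prefix is available for requesting a capture. The function names and the file layout are made up for this example and are not part of any fediverse software.

```python
import json
import urllib.request
from datetime import datetime, timezone
from pathlib import Path

# Media type that ActivityPub servers are expected to answer for object lookups.
ACTIVITY_JSON = "application/activity+json"


def build_request(post_url: str) -> urllib.request.Request:
    """Prepare a GET that asks for the ActivityPub JSON form of a public post."""
    return urllib.request.Request(post_url, headers={"Accept": ACTIVITY_JSON})


def fetch_object(post_url: str, timeout: float = 10.0) -> dict:
    """Scraping step: download the post as JSON.

    Needs network access and a server that honors the Accept header.
    """
    with urllib.request.urlopen(build_request(post_url), timeout=timeout) as resp:
        return json.load(resp)


def save_snapshot(obj: dict, folder: Path, fetched_at: datetime) -> Path:
    """Archiving step: keep a local copy, named by fetch time.

    The file layout is hypothetical; fetched_at is assumed to be in UTC.
    """
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"snapshot-{fetched_at.strftime('%Y%m%dT%H%M%SZ')}.json"
    record = {"fetched_at": fetched_at.isoformat(), "object": obj}
    path.write_text(json.dumps(record, indent=2, ensure_ascii=False), encoding="utf-8")
    return path


def wayback_save_url(page_url: str) -> str:
    """Build the URL that asks the Wayback Machine to capture a page."""
    return "https://web.archive.org/save/" + page_url
```

Something like `save_snapshot(fetch_object(url), Path("archive"), datetime.now(timezone.utc))` would then keep one dated copy per run. The fetch is the easy part; the archive only exists if someone keeps running the tool and keeps the files around.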