Is there an open source package that the Internet Archive runs? What is it? I assume sites like archive.is run the same. I’d like to know if I can also run it for self-hosted archiving.

    • Possibly linux
      11 months ago

      ArchiveBox is a piece of software, whereas the Internet Archive is an organization that is focused on predicting the content on the internet.

      The Internet Archive holds petabytes of data. I doubt any home user could manage that.

      • @z00s
        11 months ago (edited)

        archive

        predicting

        ?

      • max
        11 months ago

        I don't think OP is looking to mirror archive.org. My take was that they wanted something like archive.org, but self-hosted and for personal / small-scale use.

        • Avid Amoeba (OP)
          11 months ago

          Exactly. I’m already running a local wiki, but I don’t want stuff I link to in my wiki to result in a 404 in a few years. Or worse, in some AI-ridden, ad-infested dumpster fire.
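For the wiki use case described above, a first step could be collecting the external links that need archiving. This is only a rough sketch: the function name, the regex, and the `wiki.local` host are made up for illustration, and a real wiki export may need proper markup parsing instead of a regex.

```python
import re
from urllib.parse import urlsplit

# Deliberately simple pattern: stops at whitespace and common closing delimiters.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_external_links(wiki_text, own_host):
    """Return unique external URLs found in wiki_text, in order of first appearance.

    own_host is the wiki's own hostname (hypothetical here); links to it are skipped
    because they do not need third-party archiving.
    """
    seen = set()
    links = []
    for match in URL_PATTERN.finditer(wiki_text):
        # Trim punctuation that usually belongs to the sentence, not the URL.
        url = match.group(0).rstrip(".,;:!?")
        host = urlsplit(url).hostname or ""
        if host == own_host or url in seen:
            continue
        seen.add(url)
        links.append(url)
    return links


if __name__ == "__main__":
    sample = (
        "See [docs](https://example.com/a). Also https://example.com/a "
        "and https://wiki.local/Page, plus <https://example.org/b?x=1>."
    )
    # Print one URL per line so the output can be piped into an archiving tool.
    print("\n".join(extract_external_links(sample, "wiki.local")))
```

The one-URL-per-line output is meant to be handed to whichever archiver ends up being used; how it is imported depends on that tool.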

          • layzerjeyt
            11 months ago

            You can use something as simple as a browser extension like SingleFile, which can automatically download complete, self-contained copies of everything you bookmark, or only of certain URLs.

    • Avid Amoeba (OP)
      11 months ago (edited)

      Oh yes, this looks like a winner. Thanks!

      It seems like it’s written in Python too, which means I can maintain it if need be.

      Oh boy I wish I had set this up many years ago. I wouldn’t have to resort to scouring [email protected] for the top quality memes of the past when I need them…

      On a far side of the moon note, I wonder if ActivityPub could be used to federate multiple ArchiveBox instances to create a more resilient Internet Archive alternative. 🤔 Then integrate that with Lemmy to auto-archive links from posts. Aaand lemmy.world ran out of disk space. 🤣

      • density
        11 months ago

        A network between networks to make them more resilient? I think you’ve just invented the ARPANET.

  • @[email protected]
    11 months ago (edited)

    I believe they used Heritrix at one point. The important bit is that they use a special archive format which is a standard. Several tools support it, both for capturing to it and for viewing it. It allows for capturing a website in a ‘working’ condition, with history or something. I’m a bit fuzzy on it since it’s been some time since I looked into it.
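The comment above does not name the format. Assuming it means WARC (Web ARChive, standardized as ISO 28500), which is what Heritrix writes, a record is a version line, a block of named header fields, a blank line, and then the captured bytes. Below is a minimal sketch using only the Python standard library. The function names are made up for illustration, and real tools handle much more (other record types, digests, compression).

```python
import io
import uuid
from datetime import datetime, timezone


def build_warc_resource_record(target_uri, payload, content_type="text/html"):
    """Serialize one WARC/1.0 'resource' record: header block plus payload bytes."""
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),
    ]
    out = io.BytesIO()
    out.write(b"WARC/1.0\r\n")
    for name, value in headers:
        out.write(f"{name}: {value}\r\n".encode("utf-8"))
    out.write(b"\r\n")          # blank line ends the header block
    out.write(payload)
    out.write(b"\r\n\r\n")      # two CRLFs terminate the record
    return out.getvalue()


def parse_warc_record(record):
    """Split a single serialized record back into (version, headers, payload)."""
    head, _, rest = record.partition(b"\r\n\r\n")
    version, *field_lines = head.decode("utf-8").split("\r\n")
    headers = dict(line.split(": ", 1) for line in field_lines)
    length = int(headers["Content-Length"])
    return version, headers, rest[:length]


if __name__ == "__main__":
    record = build_warc_resource_record("https://example.com/", b"<html>hi</html>")
    version, headers, payload = parse_warc_record(record)
    print(version, headers["WARC-Target-URI"], len(payload))
```

Because the payload is stored as raw bytes next to its URL and capture date, a viewer can replay the page later, and several captures of the same URL can sit side by side as history.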

    • Avid Amoeba (OP)
      11 months ago

      Kind of. Linkwarden seems to save as PDF. That’s better than nothing; however, preserving a functional copy of the pages would be better. ArchiveBox seems to do this.

  • Possibly linux
    11 months ago

    I don’t know for certain but I’m sure they run lots of different software. They have PBs of data.