I was scouring the indie web earlier and found a pretty useful list of bots to add to your robots.txt. Since I’m not convinced that this is enough to keep them away, I also figured out a simple way that could potentially block them from accessing your websites entirely.

  • @[email protected]
    156 days ago

    Again, it must always be stressed that this gives a false sense of security. You can only block crawlers that identify themselves, or pull an IP blocklist of known offenders. The latter means they’ve already offended in order to be identified, and they can just change their IP address.
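
To illustrate the limitation, here is a minimal Python sketch of name-based blocking. The function `is_declared_bot`, the token list, and the simplified User-Agent strings are assumptions made up for this example; GPTBot and CCBot are just two crawler names that do announce themselves.

```python
# Illustrative tokens only; a real list would come from a maintained blocklist.
BLOCKED_TOKENS = ["GPTBot", "CCBot"]


def is_declared_bot(user_agent: str, tokens=BLOCKED_TOKENS) -> bool:
    """Return True if the User-Agent header contains a blocked crawler token."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in tokens)


# A crawler that announces itself is caught (simplified User-Agent string).
honest = "Mozilla/5.0 (compatible; GPTBot/1.0)"
# The same crawler pretending to be a browser is not.
spoofed = "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0"

print(is_declared_bot(honest))
print(is_declared_bot(spoofed))
```

The check only ever sees what the client chooses to send, which is why it catches nothing once a bot stops using its own name.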

    You can’t block them, but you can make their life harder. Return 200 OK on 404 Not Found so malicious bots trying to drive-by you for random URLs like /admin or whatever will think they found something. Make honeypots that redirect and loop, filled with bait wordlists and forms that go nowhere. Poison the well. Deliberately serve incorrect, broken or AI-generated data to known bots.

    Waste their time, instead of wasting your own time.
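
One way the looping honeypot idea could look, as a rough Python sketch rather than anyone's actual setup: every page links only to more generated pages, so a crawler that follows links never reaches an end. `maze_page`, the `/trap/` prefix, and the tiny wordlist are all hypothetical names chosen for this example.

```python
import hashlib
import random

# Hypothetical bait wordlist; a real one would be far longer.
BAIT_WORDS = ["admin", "backup", "config", "login", "private", "upload"]


def maze_page(path: str, links: int = 5) -> str:
    """Return an HTML page whose links all lead deeper into the honeypot."""
    # Seed from the path so the same URL always renders the same page.
    seed = int.from_bytes(hashlib.sha256(path.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    base = path.rstrip("/")
    hrefs = [
        f"{base}/{rng.choice(BAIT_WORDS)}-{rng.randrange(10**6)}/"
        for _ in range(links)
    ]
    anchors = "\n".join(f'<a href="{h}">{h}</a>' for h in hrefs)
    # A form that posts back into the maze, so submissions go nowhere.
    form = (
        f'<form method="post" action="{base}/login/">'
        '<input name="user"><input type="submit"></form>'
    )
    return f"<html><body>\n{anchors}\n{form}\n</body></html>"


print(maze_page("/trap/"))
```

Seeding from the path keeps the maze stateless: nothing is stored, yet a revisited URL looks like a stable, real page.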

    • Net_Runner :~$ (OP)
      66 days ago

      Yeah, that’s true. If they’re not using names, then there’s not a whole lot you can do. And blocking IPs is impossible, because they use different IPs constantly.

      But!

      With my post and your suggestions, this is the “something” that’s better than doing absolutely nothing.

      • @[email protected]
        36 days ago

        I suppose it comes down to being offensive or defensive. I don’t think being defensive is worth my time. I’m not paying for bandwidth, and compute time is so cheap it’s irrelevant, so I’m on the offensive. You can do both if you want. There are definitely more ready-to-go defensive solutions than offensive ones (your own article, for example), but I think tinkering with and adapting my own solution is fun. It’s like a game of cat and mouse, but they have money to lose and I don’t.

    • UltraHamster64
      36 days ago

      Hmm, how would one attempt to actually do this in practice?

      • @[email protected]
        6 days ago

        Eventually I’m gonna make a proper article about it, but what I’m doing right now boils down to this:

        • Intercept 404
        • Redirect to error-hole.php
        • error-hole.php returns 200 and spits out a bunch of bot-targets

        The next iteration of this will include a lot of uncompressed filler data, so hopefully the bots have to download half a gigabyte of data every time they do this. I’m not paying for bandwidth, so it doesn’t matter to me.
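
For anyone who wants to experiment, the steps above can be approximated in Python. This is a sketch under assumptions, not the setup described in the comment: that one uses a redirect to a PHP script, while this version answers in place from a single handler. `REAL_PAGES`, `BAIT_PATHS`, `error_hole`, `respond`, and `serve` are hypothetical names, and the filler size is a parameter you would tune yourself.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical site content: the only paths that really exist.
REAL_PAGES = {"/": b"<h1>Home</h1>"}

# Hypothetical bait paths of the kind drive-by scanners probe for.
BAIT_PATHS = ["/admin/", "/wp-login.php", "/backup.zip", "/.env"]


def error_hole(filler_bytes: int = 0) -> bytes:
    """Build the decoy body: links to bait paths plus optional filler."""
    links = "".join(f'<a href="{p}">{p}</a><br>\n' for p in BAIT_PATHS)
    # Filler is padding inside an HTML comment; its size is an assumption.
    filler = "<!--" + "x" * filler_bytes + "-->" if filler_bytes else ""
    return f"<html><body>\n{links}{filler}</body></html>".encode()


def respond(path: str, filler_bytes: int = 0) -> tuple:
    """Return (status, body); unknown paths get 200 and the decoy, never 404."""
    if path in REAL_PAGES:
        return 200, REAL_PAGES[path]
    return 200, error_hole(filler_bytes)


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = respond(self.path)
        self.send_response(status)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Run the sketch locally; call this yourself to try it out."""
    HTTPServer(("127.0.0.1", port), Handler).serve_forever()
```

Calling `serve()` and requesting any made-up path on `127.0.0.1:8080` should come back as 200 with the bait links. Keeping the decision in `respond` makes it easy to check without starting a server.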

        See for yourself: https://drkt.eu/fdhasklfh

        I can see that it works by just looking at my access logs.

  • UltraHamster64
    26 days ago

    This is cool! Thank you so much for the guide!