This post was submitted on 30 Mar 2025
129 points (98.5% liked)

Selfhosted


Lemmy newb here, not sure if this is right for this /c.

An article I found from someone who hosts their own website and micro social network, about their experience with web-scraping robots that refuse to respect robots.txt, and how they deal with them.

top 28 comments
[–] [email protected] 9 points 2 days ago

I've found that many of these solutions/hacks block legitimate users who browse with the Tor Browser, as well as Internet Archive scrapers. That may be a dealbreaker for some, but it's probably acceptable to most users and website owners.

[–] [email protected] 13 points 2 days ago (3 children)

This is signal detection theory combined with an arms race that keeps the problem hard. You cannot block scrapers without blocking people, and you cannot inconvenience bots without also inconveniencing readers. You might figure something clever out temporarily, but eventually this truism will resurface. Excuse me while I solve a few more captchas.

[–] [email protected] 1 points 2 days ago

Excuse me while I solve a few more captchas.

Buster for captcha.

[–] [email protected] 0 points 2 days ago

Time to start hosting Trojans on your website

[–] [email protected] 3 points 2 days ago

The internet as we know it is dead; we just need a few more years to realise it. And I'm afraid that telecommunications will be going the same way, when no one can trust that anyone is who they say they are anymore.

[–] [email protected] 3 points 2 days ago (1 children)
[–] [email protected] 3 points 1 day ago

You're welcome.

I believe I found it originally via the "distribuverse"... specifically, ZeroNet.

[–] [email protected] 11 points 2 days ago (3 children)

They block VPN exit nodes. Why bother hosting a web site if you don't want anyone to read your content?

Fuck that noise. My privacy is more important to me than your blog.

[–] [email protected] 1 points 1 day ago

A problem with this approach was that many readers use VPNs and other proxies that change IP addresses virtually every time they use them. For that reason, and because I believe in protecting every Internet user's privacy as much as possible, I wanted a way of immediately unblocking visitors to my website without them having to reveal personal information like names and email addresses.

I recently spent a few weeks on a new idea for solving this problem. With some help from two knowledgeable users on Blue Dwarf, I came up with a workable approach two weeks ago. So far, it looks like it works well enough.

To summarize the method: when a blocked visitor reaches my custom 403 error page, he is asked whether he would like to be unblocked by having his IP address added to the website's whitelist. If he follows that hypertext link, he is sent to the robot test page. If he answers the robot test question correctly, his IP address is automatically added to the whitelist; he doesn't need to enter it or even know what it is. If he fails the test, he is told to click the back button in his browser and try again. Once he has passed the robot test, Nginx is commanded to reload its configuration file (PHP command: shell_exec("sudo nginx -s reload");), which causes it to accept the new whitelist entry, and he is granted immediate access.

He is then allowed to visit cheapskatesguide as often as he likes for as long as he continues to use the same IP address. If he switches IP addresses, he has about a one in twenty chance of needing to pass the robot test again each time he switches. My hope is that visitors who use proxies will only have to pass the test a few times a year. As the whitelist grows, I suppose that frequency may decrease. Of course, it will reach a non-zero equilibrium point that depends on the churn in the IP addresses used by commercial web-hosting companies. In a few years, I may have a better idea of where that equilibrium point is.
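
For illustration, here's a minimal sketch (in PHP, matching the shell_exec call quoted above) of how such an unblock handler could be wired up. It assumes Nginx includes a whitelist file of "allow" directives and that a sudoers entry permits the reload; the file path and robot-test answer are hypothetical, not the author's actual code:

    <?php
    // Hypothetical unblock handler for the robot-test page described above.
    // Assumes the relevant Nginx server block contains
    //     include /etc/nginx/whitelist.conf;
    // where that file holds one "allow <ip>;" line per whitelisted visitor,
    // and that a sudoers rule lets the web user run "nginx -s reload".

    $answer = trim($_POST['answer'] ?? '');

    if (strcasecmp($answer, 'seven') === 0) {         // made-up test answer
        $ip = $_SERVER['REMOTE_ADDR'];                // visitor never has to know it
        if (filter_var($ip, FILTER_VALIDATE_IP)) {    // sanity check before writing
            file_put_contents('/etc/nginx/whitelist.conf',
                              "allow {$ip};\n", FILE_APPEND | LOCK_EX);
            shell_exec('sudo nginx -s reload');       // pick up the new entry immediately
            echo 'Your IP address has been added to the whitelist. Welcome!';
        }
    } else {
        echo 'Wrong answer. Click your browser\'s back button and try again.';
    }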

[–] [email protected] 3 points 2 days ago (2 children)

They block VPN exit nodes. Why bother hosting a web site if you don’t want anyone to read your content?

Fuck that noise. My privacy is more important to me than your blog.

It's a minimalist private blog that sets no 3rd party cookies and loads no 3rd party resources. I presume that alleviates your concerns? 😜

[–] [email protected] 6 points 2 days ago

That's not what I'm complaining about. I'm unable to access the site because they're blocking anyone coming through a VPN. I would need to lower my security and turn off my VPN to read their blog. That's my issue.

[–] [email protected] 2 points 2 days ago (1 children)

The admin could use a CDN and not worry about it, if it's just static content.

[–] [email protected] 2 points 1 day ago

I believe using a CDN would defeat the author's goal of not being reliant on third-party service providers.

[–] [email protected] 3 points 2 days ago (1 children)

and filtering malicious traffic is more important to me than you visiting my services, so I guess that makes us even :-)

[–] [email protected] 1 points 2 days ago (2 children)

You know how popular VPNs are, right? And how they improve privacy and security for the people who use them? And you're blocking anyone who's exercising a basic privacy right?

It's not an ethically sound position.

[–] [email protected] 2 points 2 days ago (1 children)

You had me until the "ethically sound position" part.

You're saying that Joe Blogger is acting unethically because he doesn't allow VPN users to visit his site. C'mon, brother.

[–] [email protected] 0 points 1 day ago

You're saying targeting people who are taking steps to improve their privacy and security is ethical? Or do you just believe that there's no such thing as ethics in CIS?

[–] [email protected] 3 points 2 days ago (1 children)

Absolutely; if I were a company, or hosting something important, or something intended for the general public, then I'd agree.

But I'm just an idiot hosting whimsical stuff from my basement, and 99% of it is only of interest to my friends. I know ~everyone in my target audience, and I know that none of them use a VPN for general-purpose browsing.

As it is, I don't mind keeping the door open to the general public, but nothing of value will be lost if I need to pull the plug on some more ASNs to preserve my bandwidth. For example, a week ago a guy hopping through a VPN in Sweden decided to download the same zip file thousands of times, wasting terabytes of traffic over a few hours.

[–] [email protected] 1 points 1 day ago

I know that none of them use a VPN for general-purpose browsing.

Interesting. The most common setup I encounter is when the VPN is implemented in the home router - that's the way it is in my house. If you're connected to my WiFi, you're going through my VPN.

I have a second VPN, which is how my private servers are connected; that's a bespoke peer-to-peer subnet set up in each machine, but it handles almost no outbound traffic.

My phone detects when it isn't connected to my home WiFi and automatically turns on the VPN service for all phone data; that's probably less common. I used to just leave it on all the time, but VPN over VPN seemed a little excessive.

It sounds like you were the victim of a DoS attack, though not a distributed one. It could have just been done directly; what about it coming through a VPN made it worse?

[–] [email protected] 18 points 3 days ago (2 children)

I have plenty of spare bandwidth and babysitting resources, so my approach is largely to waste their time. If they poke my honeypot, they get poked back and have to escape a tarpit specifically designed to waste their bandwidth above all. It costs me nothing because of my circumstances, but I know it costs them because their connections are metered. I also know it works, because they largely stop crawling the domains I employ this on. I am essentially making my domains appear hostile.
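
For the curious, here's a rough sketch of what such a tarpit endpoint might look like, written as a hypothetical PHP handler behind the honeypot URL (a real setup might instead use something like iptables' TARPIT target or a server module). The idea is just to drip meaningless bytes very slowly so the bot's metered connection stays tied up:

    <?php
    // Hypothetical tarpit: serve a never-ending trickle of junk to whatever
    // poked the honeypot URL, wasting the scraper's time and bandwidth.
    set_time_limit(0);                      // never time out on our side
    while (ob_get_level() > 0) {
        ob_end_clean();                     // disable output buffering
    }
    header('Content-Type: text/html');

    while (!connection_aborted()) {         // stop as soon as the bot gives up
        echo bin2hex(random_bytes(32));     // a few meaningless bytes...
        flush();                            // ...pushed out immediately...
        sleep(10);                          // ...then a long, boring pause
    }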

It does mean that my residential IP ends up on various blocklists but I'm just at a point in my life where I don't give an unwiped asshole about it. I can't access your site? I'm not going to your site, then. Fuck you. I'm not even gonna email you about the false-positive.

It is also fun to keep a log of which IPs that have poked the honeypot also have open ports, and to automate a process of siphoning information out of those ports. I've been finding a lot of hacked NVRs recently that I think are part of some IoT botnet being used to scrape the internet.

[–] [email protected] 10 points 2 days ago (1 children)

I found a very large botnet, mainly in Brazil but also in several other countries, and abuseipdb.com is not marking those IPs as a threat. We need a better solution.

I think a honeypot is a good way. Another is to require proof of work on the client side. Or we need a better place to share all of these stupid web-scraping bot IPs.
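
As a sketch of the client-side proof-of-work idea (a generic scheme, not any particular tool's protocol): the server hands out a random challenge, and the client must find a nonce whose hash has enough leading zeros before being let in. Verification costs the server one hash, while finding the nonce costs the client many:

    <?php
    // Hypothetical proof-of-work scheme: the client must find a nonce such
    // that sha256(challenge . nonce) starts with $difficulty zero hex digits.
    function make_challenge(): string {
        return bin2hex(random_bytes(16));   // fresh random challenge
    }

    function verify_pow(string $challenge, string $nonce, int $difficulty = 4): bool {
        $hash = hash('sha256', $challenge . $nonce);
        return strncmp($hash, str_repeat('0', $difficulty), $difficulty) === 0;
    }

    // The search a client (normally JavaScript in the browser) has to do:
    $challenge = make_challenge();
    for ($nonce = 0; !verify_pow($challenge, (string) $nonce); $nonce++);
    echo "nonce {$nonce} passes\n";         // ~65k hashes on average at difficulty 4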

[–] [email protected] 5 points 2 days ago (1 children)

I love the idea of abuseipdb, and I even contributed to it briefly. Unfortunately, even as a contributor, I don't get enough API resources to actually use it for my own purposes without having to pay. I think the problem is simply that if you created a good enough database of abusive IPs, you'd be overwhelmed by the traffic of everyone trying to pull that data out.

[–] [email protected] 7 points 2 days ago

Not really... We do have this wonderful collection of lists: https://github.com/firehol/blocklist-ipsets

My firewall, for example, uses the Spamhaus DROP list from this source: https://raw.githubusercontent.com/firehol/blocklist-ipsets/refs/heads/master/spamhaus_drop.netset

So I know it's possible. And hosting the lists in a git repo like that scales well, since tons of people are already consuming them that way.

[–] [email protected] 3 points 3 days ago (1 children)

That last bit looks like something you should send off to a place like 404 Media.

[–] [email protected] 2 points 2 days ago

I wouldn't even know where to begin, but I also don't think that what I'm doing is anything special. These NVR IPs are hurling abuse at the whole internet. Anyone listening will have seen them, and anyone paying attention would've seen the pattern.

The NVRs I get the most traffic from have been a known hacked IoT device for a decade, and there's even a GitHub page explaining how to bypass their authentication and pull out arbitrary files like passwd.

[–] [email protected] 24 points 3 days ago (1 children)

Interesting approach, but it looks like this ultimately ends up:

  • being a lot of babysitting / manual work
  • blocking a lot of humans
  • not being robust against scrapers

Anubis seems like a much better option, for those wanting to block bots without relying on Cloudflare:

https://anubis.techaro.lol/

[–] [email protected] 6 points 2 days ago (1 children)

Are there any guides to using it with reverse proxies like Traefik? I've been wanting to try it out but haven't had time to do the research yet.

[–] [email protected] 7 points 2 days ago* (last edited 2 days ago)

https://github.com/TecharoHQ/anubis/issues/92

note: this is pretty much temporary until we get first class support for Traefik

They seem to be working on a Traefik middleware, but in the meantime there is a guide for setting it up manually with Traefik.