Incoherent rant.
I’ve, once again, noticed Amazon and Anthropic absolutely hammering my Lemmy instance to the point of the lemmy-ui container crashing. Multiple IPs all over the US.
So I’ve decided to do some restructuring of how I run things. I ditched Fedora on my VPS in favour of Alpine, just to start with a clean slate, and started looking into different options on how to combat this better.
Behold, Anubis.
“Weighs the soul of incoming HTTP requests to stop AI crawlers”
From how I understand it, it works as a reverse proxy in front of each service. It took me a while to actually understand how it’s supposed to integrate, but once I figured it out, all bot activity instantly stopped. Not a single bot has gotten through yet.
My setup is basically just: home server -> Tailscale tunnel (not Funnel) -> VPS -> Caddy reverse proxy, now with Anubis integrated.
I’m not really sure why I’m posting this, but I hope at least one other goober trying to find a possible solution to these things finds this post.
Edit: Further elaboration for those who care, since I realized that might be important.
- You don’t have to use Caddy/nginx/whatever as your reverse proxy in the first place; it’s just how my setup works.
- Anubis sits inside my Caddy reverse proxy Docker Compose stack, between Caddy and my local server. When a request comes in, Caddy hands it off to Anubis per the Caddyfile, and Anubis decides whether to forward it to the service or stop it in its tracks (see the Compose sketch after this list).
- There are some minor issues, like it requiring JavaScript, which might get a bit annoying for NoScript/LibreWolf/whatever users, but considering most crawlbots don’t run JS at all, I believe this is a great tradeoff.
- The most confusing part was the docs and understanding what it’s supposed to do in the first place.
- There’s an option to apply your own rules via JSON/YAML, but I haven’t figured out how to do that properly in Docker yet. As in, there’s a main configuration file you can override, but there’s apparently also a way to add additional bots to block in separate files in a subdirectory. I’m sure I’ll figure that out eventually (rough sketch after this list).
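To make the wiring concrete, here’s a minimal sketch of what that Compose stack can look like. The image tag and the `BIND`/`TARGET`/`DIFFICULTY` environment variables are from Anubis’s docs as I understand them; the service names, ports, and the Tailscale address are placeholders for my setup, not something to copy verbatim:

```yaml
# docker-compose.yml (sketch) -- Caddy terminates TLS, hands requests
# to Anubis, and Anubis forwards the survivors to the actual service.
services:
  caddy:
    image: caddy:2
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./Caddyfile:/etc/caddy/Caddyfile:ro
    # The Caddyfile site block boils down to:
    #   example.com {
    #       reverse_proxy anubis:8923
    #   }

  anubis:
    image: ghcr.io/techarohq/anubis:latest
    environment:
      BIND: ":8923"                    # the port Caddy proxies to
      # where Anubis sends requests that pass the challenge; in my
      # case the home server's Tailscale address (placeholder here)
      TARGET: "http://100.64.0.10:1234"
      DIFFICULTY: "4"                  # proof-of-work difficulty
```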
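And for the custom rules bit: as far as I can tell from the docs, the main policy file is YAML (or JSON) that you mount into the container and point at via `POLICY_FNAME`, and the “separate files” mechanism appears to be `import` entries. Treat this as a hedged sketch; the bot names, regexes, and paths are made up:

```yaml
# botPolicies.yaml (sketch) -- mounted into the anubis container and
# referenced via POLICY_FNAME (path is a placeholder)
bots:
  # let well-known endpoints through untouched
  - name: well-known
    path_regex: ^/\.well-known/.*$
    action: ALLOW
  # hard-deny a crawler by User-Agent (regex is just an example)
  - name: amazonbot
    user_agent_regex: Amazonbot
    action: DENY
  # the "separate files" mechanism seems to be imports like this
  # (path is hypothetical):
  - import: /data/cfg/extra-bots.yaml
  # everyone else has to solve the proof-of-work challenge
  - name: everyone-else
    user_agent_regex: .*
    action: CHALLENGE
```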
Edit 2 for those who care: Well crap, turns out lemmy-ui crashing wasn’t due to crawlbots, but something else entirely.
I’ve just spent maybe 14 hours troubleshooting this thing: after a couple of minutes of running, the lemmy-ui container healthcheck would show “unhealthy” and my instance couldn’t be accessed from anywhere (lemmy-ui, Photon, Jerboa, probably the API as well).
After some digging, I disabled Anubis to check if it had anything to do with it; it didn’t. But I also noticed my host’s ulimit -n was set to something like 1000… (I’ve been on the same install for years and swear an update must have changed it.)
After raising the nofile ulimit and setting shm_size to 2G in Docker Compose, it hasn’t crashed since. fingerscrossed
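For anyone hitting the same wall, the per-container overrides look something like this in Compose. Which services actually need them depends on your stack, and the numbers here are just illustrative:

```yaml
# docker-compose.yml overrides (sketch)
services:
  lemmy-ui:
    ulimits:
      nofile:          # the container-side "ulimit -n"
        soft: 65536    # placeholder values; anything well above ~1000
        hard: 65536
    shm_size: 2g       # Docker's default /dev/shm is only 64 MB
```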
Boss, I’m tired and I want to get off Mr. Bones’ wild ride.
I’m very sorry for not being able to reply to you all, but it’s been hectic.
Cheers and I really hope someone finds this as useful as I did.
I am planning to try it out, but after being bombarded by AI crawlers for weeks I came up with a solution for Caddy users that works.
It is a custom Caddy CEL expression matcher coupled with caddy-ratelimit and caddy-defender.
Now here’s the fun part: the defender plugin can produce garbage as the response, so when a request matches as an AI crawler, it poisons their training dataset.
Originally I only relied on the rate limiter and noticed that AI bots kept trying whenever the limit was reset. Once I introduced data poisoning they all stopped :)
```caddyfile
git.blob42.xyz {
    @bot <<CEL
        header({'Accept-Language': 'zh-CN'}) ||
        header_regexp('User-Agent', '(?i:(.*bot.*|.*crawler.*|.*meta.*|.*google.*|.*microsoft.*|.*spider.*))')
    CEL

    abort @bot

    defender garbage {
        ranges aws azurepubliccloud deepseek gcloud githubcopilot openai 47.0.0.0/8
    }

    rate_limit {
        zone dynamic_botstop {
            match {
                method GET
                # to use with defender
                #header X-RateLimit-Apply true
                #not header LetMeThrough 1
            }
            key {remote_ip}
            events 1500
            window 30s
            #events 10
            #window 1m
        }
    }

    reverse_proxy upstream.server:4242

    handle_errors 429 {
        respond "429: Rate limit exceeded."
    }
}
```
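Worth noting for anyone copying this: `rate_limit` and `defender` are not in stock Caddy, so you need a custom build that includes the caddy-ratelimit and caddy-defender plugins (xcaddy is the usual route), and I believe the `<<CEL` heredoc syntax needs a reasonably recent Caddy (2.7+).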
If I am not mistaken, the 47.0.0.0/8 IP block is for Alibaba Cloud.
That’s an ARIN block according to Wikipedia, so North America, under Northern Telecom until 2010. It does look like Alibaba operates many networks under that /8, but I very much doubt it’s the whole /8, which would be worth a lot; a /16 is apparently worth around $3-4M, so a /8 can be extrapolated to be worth upwards of a billion dollars! I doubt they put all their eggs into that particular basket. So you’re probably matching a lot of innocent North American IPs with this.

Right, I must have just blanket banned the whole /8 to be sure Alibaba Cloud is included. I did that some time ago, so I forgot.
When I blocked Alibaba, the AI crawlers immediately started coming from a different cloud provider (Huawei, I believe), and when I blocked that, it happened again. Eventually the crawlers started coming from North American and then European cloud providers.
Due to lack of time to change my setup to accommodate Anubis, I had to temporarily move my site behind Cloudflare (where it sadly still is).