I am planning to try it out, but for Caddy users I came up with a solution that works, after being bombarded by AI crawlers for weeks.
It is a custom Caddy CEL expression filter coupled with caddy-ratelimit and caddy-defender.
Now here's the fun part: the defender plugin can serve garbage as a response, so when an AI crawler matches, it will poison their training dataset.
Originally I relied only on the rate limiter, and I noticed the AI bots kept retrying whenever the limit was reset. Once I introduced data poisoning they all stopped :)
git.blob42.xyz {
    @bot <<CEL
        header({'Accept-Language': 'zh-CN'}) || header_regexp('User-Agent', '(?i:(.*bot.*|.*crawler.*|.*meta.*|.*google.*|.*microsoft.*|.*spider.*))')
    CEL

    abort @bot

    defender garbage {
        ranges aws azurepubliccloud deepseek gcloud githubcopilot openai 47.0.0.0/8
    }

    rate_limit {
        zone dynamic_botstop {
            match {
                method GET
                # to use with defender
                #header X-RateLimit-Apply true
                #not header LetMeThrough 1
            }
            key {remote_ip}
            events 1500
            window 30s
            #events 10
            #window 1m
        }
    }

    reverse_proxy upstream.server:4242

    handle_errors 429 {
        respond "429: Rate limit exceeded."
    }
}
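For anyone wondering how broad that User-Agent pattern is: it catches any UA containing "bot", "crawler", "meta", "google", "microsoft", or "spider", case-insensitively. A quick Python check of the same regex (my own test harness, not part of the config; the example UA strings are just illustrations):

```python
import re

# Same pattern as in the CEL matcher above; re supports the scoped (?i:...) flag
pattern = re.compile(r'(?i:(.*bot.*|.*crawler.*|.*meta.*|.*google.*|.*microsoft.*|.*spider.*))')

agents = [
    "Mozilla/5.0 (compatible; GPTBot/1.0)",           # contains "bot" -> matched
    "meta-externalagent/1.1",                          # contains "meta" -> matched
    "Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0",   # ordinary browser -> not matched
]
for ua in agents:
    print(f"{ua!r} -> {bool(pattern.search(ua))}")
```

Note the pattern is unanchored substring matching, so it will also hit legitimate browsers that happen to contain one of those words in the UA string.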
If I am not mistaken, the 47.0.0.0/8 IP block is for Alibaba Cloud.
Right, I must have blanket-banned the whole /8 to be sure Alibaba Cloud was included. I did that some time ago, so I forgot.
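For scale: a /8 covers every address sharing the first octet, i.e. 2^24 = 16,777,216 addresses, so this blocks all of 47.x.x.x regardless of who actually owns each sub-range. A quick stdlib sanity check (47.74.0.1 is just an arbitrary address inside the block):

```python
import ipaddress

net = ipaddress.ip_network("47.0.0.0/8")
print(net.num_addresses)                          # 16777216 (2**24)
print(ipaddress.ip_address("47.74.0.1") in net)   # True: any 47.x.x.x falls inside
print(ipaddress.ip_address("48.0.0.1") in net)    # False: first octet differs
```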