Incoherent rant.
I’ve, once again, noticed Amazon and Anthropic absolutely hammering my Lemmy instance to the point of the lemmy-ui container crashing. Multiple IPs all over the US.
So I’ve decided to do some restructuring of how I run things. I ditched Fedora on my VPS in favour of Alpine, just to start with a clean slate, and started looking into different options on how to combat this better.
Behold, Anubis.
“Weighs the soul of incoming HTTP requests to stop AI crawlers”
From how I understand it, it works as a reverse proxy in front of each service. It took me a while to actually understand how it’s supposed to integrate, but once I figured it out, all bot activity instantly stopped. Not a single bot has gotten through yet.
My setup is basically just: home server -> Tailscale tunnel (not Funnel) -> VPS -> Caddy reverse proxy, now with Anubis integrated.
I’m not really sure why I’m posting this, but I hope at least one other goober trying to find a possible solution to these things finds this post.
Edit: Further elaboration for those who care, since I realized that might be important.
- You don’t have to use Caddy/nginx/whatever as your reverse proxy in the first place; it’s just how my setup works.
- Anubis sits inside my Caddy reverse proxy Docker Compose stack, between Caddy and my local server. When a request comes in, Caddy hands it off to Anubis per the Caddyfile, and Anubis decides whether to forward it to the service or stop it in its tracks (see the Compose sketch after this list).
- There are some minor issues, like it requiring JavaScript, which might get a bit annoying for NoScript/LibreWolf/whatever users, but considering most crawlbots don’t run JS at all, I believe this is a great tradeoff.
- The most confusing part was the docs and understanding what it’s supposed to do in the first place.
- There’s an option to apply your own rules via JSON/YAML, but I haven’t figured out how to do that properly in Docker yet. As in, there’s a main configuration file you can override, but there’s apparently also a way to add additional bots to block in separate files in a subdirectory. I’m sure I’ll figure that out eventually (rough sketch after this list).
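To make the wiring concrete, here’s a minimal sketch of what that Compose stack can look like. The image tag and the `BIND`/`TARGET`/`DIFFICULTY` environment variables are from Anubis’s docs as I understand them; the service names, ports, and the Tailscale address are placeholders for my setup, not something to copy verbatim:

```yaml
# docker-compose.yml (sketch) -- Caddy terminates TLS, hands requests
# to Anubis, and Anubis forwards the survivors to the actual service.
services:
  caddy:
    image: caddy:2
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./Caddyfile:/etc/caddy/Caddyfile:ro
    # The Caddyfile site block boils down to:
    #   example.com {
    #       reverse_proxy anubis:8923
    #   }

  anubis:
    image: ghcr.io/techarohq/anubis:latest
    environment:
      BIND: ":8923"                    # the port Caddy proxies to
      # where Anubis sends requests that pass the challenge; in my
      # case the home server's Tailscale address (placeholder here)
      TARGET: "http://100.64.0.10:1234"
      DIFFICULTY: "4"                  # proof-of-work difficulty
```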
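And for the custom rules bit: as far as I can tell from the docs, the main policy file is YAML (or JSON) that you mount into the container and point at via `POLICY_FNAME`, and the “separate files” mechanism appears to be `import` entries. Treat this as a hedged sketch; the bot names, regexes, and paths are made up:

```yaml
# botPolicies.yaml (sketch) -- mounted into the anubis container and
# referenced via POLICY_FNAME (path is a placeholder)
bots:
  # let well-known endpoints through untouched
  - name: well-known
    path_regex: ^/\.well-known/.*$
    action: ALLOW
  # hard-deny a crawler by User-Agent (regex is just an example)
  - name: amazonbot
    user_agent_regex: Amazonbot
    action: DENY
  # the "separate files" mechanism seems to be imports like this
  # (path is hypothetical):
  - import: /data/cfg/extra-bots.yaml
  # everyone else has to solve the proof-of-work challenge
  - name: everyone-else
    user_agent_regex: .*
    action: CHALLENGE
```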
Edit 2 for those who care: Well crap, turns out lemmy-ui crashing wasn’t due to crawlbots, but something else entirely.
I’ve just spent maybe 14 hours troubleshooting this thing: after a couple of minutes of running, the lemmy-ui container healthcheck would show “unhealthy” and my instance couldn’t be accessed from anywhere (lemmy-ui, Photon, Jerboa, probably the API as well).
After some digging, I disabled Anubis to check if it had anything to do with it; it didn’t. But I also noticed my host’s ulimit -n was set to something like 1000… (I’ve been on the same install for years and swear an update must have changed it.)
After raising the nofile ulimit and setting shm_size to 2G in Docker Compose, it hasn’t crashed since. fingerscrossed
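For anyone hitting the same wall, the per-container overrides look something like this in Compose. Which services actually need them depends on your stack, and the numbers here are just illustrative:

```yaml
# docker-compose.yml overrides (sketch)
services:
  lemmy-ui:
    ulimits:
      nofile:          # the container-side "ulimit -n"
        soft: 65536    # placeholder values; anything well above ~1000
        hard: 65536
    shm_size: 2g       # Docker's default /dev/shm is only 64 MB
```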
Boss, I’m tired and I want to get off Mr. Bones’ wild ride.
I’m very sorry for not being able to reply to you all, but it’s been hectic.
Cheers and I really hope someone finds this as useful as I did.
I am planning to try it out, but after being bombarded by AI crawlers for weeks I came up with a solution for Caddy users that works.
It is a custom Caddy CEL expression matcher coupled with caddy-ratelimit and caddy-defender.
Now here’s the fun part: the defender plugin can produce garbage as the response, so when a request matches as an AI crawler, it poisons their training dataset.
Originally I only relied on the rate limiter and noticed that AI bots kept trying whenever the limit was reset. Once I introduced data poisoning they all stopped :)
```caddyfile
git.blob42.xyz {
    @bot <<CEL
        header({'Accept-Language': 'zh-CN'}) ||
        header_regexp('User-Agent', '(?i:(.*bot.*|.*crawler.*|.*meta.*|.*google.*|.*microsoft.*|.*spider.*))')
    CEL

    abort @bot

    defender garbage {
        ranges aws azurepubliccloud deepseek gcloud githubcopilot openai 47.0.0.0/8
    }

    rate_limit {
        zone dynamic_botstop {
            match {
                method GET
                # to use with defender
                #header X-RateLimit-Apply true
                #not header LetMeThrough 1
            }
            key {remote_ip}
            events 1500
            window 30s
            #events 10
            #window 1m
        }
    }

    reverse_proxy upstream.server:4242

    handle_errors 429 {
        respond "429: Rate limit exceeded."
    }
}
```
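Worth noting for anyone copying this: `rate_limit` and `defender` are not in stock Caddy, so you need a custom build that includes the caddy-ratelimit and caddy-defender plugins (xcaddy is the usual route), and I believe the `<<CEL` heredoc syntax needs a reasonably recent Caddy (2.7+).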
If I am not mistaken, the 47.0.0.0/8 IP block is for Alibaba Cloud.
That’s an ARIN block according to Wikipedia, so North America, under Northern Telecom until 2010. It does look like Alibaba operates many networks under that /8, but I very much doubt it’s the whole /8, which would be worth a lot; a /16 is apparently worth around $3-4M, so a /8 can be extrapolated to be worth upwards of a billion dollars! I doubt they put all their eggs into that particular basket. So you’re probably matching a lot of innocent North American IPs with this.

Right, I must have just blanket banned the whole /8 to be sure Alibaba Cloud is included. I did that some time ago, so I forgot.
When I blocked Alibaba, the AI crawlers immediately started coming from a different cloud provider (Huawei, I believe), and when I blocked that, it happened again. Eventually the crawlers started coming from North American and then European cloud providers.
Due to lack of time to change my setup to accommodate Anubis, I had to temporarily move my site behind Cloudflare (where it sadly still is).