Blocking AI Scrapers and LLM Training Bots
The rapid growth of artificial intelligence and large language models has created a new challenge for website owners: preventing AI training bots from scraping content without permission. Unlike traditional search crawlers (Googlebot, Bingbot) that benefit a site's visibility, AI bots such as OpenAI's GPTBot, Google-Extended, and many others collect content to train machine learning models. Website owners have legitimate concerns about their content being used to train competing AI services, about copyright violations, and about the diminished value of proprietary content.
While robots.txt provides a first line of defense, it's important to understand that it's advisory only—compliant bots respect it, but malicious scrapers ignore it entirely. A comprehensive defense strategy combines multiple technical measures with legal/contractual protections.
The Challenge with robots.txt
Why robots.txt Alone Isn't Enough
robots.txt is a simple text file that tells crawlers what they should and shouldn't access. However:
It's Voluntary: Following robots.txt rules is entirely optional. Well-behaved bots follow it, but poorly-designed or malicious scrapers often ignore it.
It's Not Secure: The file is world-readable, making your site structure publicly visible.
It's Not Legally Binding: The file carries no legal force on its own; you can't sue someone merely for ignoring your robots.txt.
It's Easy to Circumvent: Scrapers can simply ignore the file, spoof their User-Agent, or fetch pages the same way an ordinary browser would.
However, robots.txt still serves a purpose as a first defense against compliant bots and reduces unnecessary server load from casual scrapers.
Adding AI Scrapers to robots.txt
Major AI Bots to Block
OpenAI's GPTBot:
User-agent: GPTBot
Disallow: /
Anthropic's Claude Bot:
User-agent: ClaudeBot
Disallow: /
Google-Extended (controls use of content for Google's AI products):
User-agent: Google-Extended
Disallow: /
Perplexity Bot:
User-agent: PerplexityBot
Disallow: /
Common Crawl:
User-agent: CCBot
Disallow: /
Microsoft Copilot:
User-agent: Copilot
Disallow: /
Other AI Bots:
User-agent: Bytespider
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: Baiduspider-render
Disallow: /
Comprehensive robots.txt Blocking Example
# Block AI training bots
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-Web
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Baiduspider-render
Disallow: /
User-agent: Copilot
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: Google-Extended
Disallow: /
# Allow legitimate search engines
User-agent: Googlebot
Disallow:
User-agent: Bingbot
Disallow:
User-agent: Slurp
Disallow:
# Default for all other bots: allow, but slow them down (Crawl-delay does not block)
User-agent: *
Crawl-delay: 10
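If your site runs on Node.js, one way to keep this list maintainable is to generate robots.txt from a single array of bot names. The following is a minimal Express sketch, not a canonical implementation; the bot list is an example and should be refreshed from a source such as Dark Visitors.
// Minimal sketch: serve robots.txt dynamically from one list of AI bot tokens
const express = require('express');
const app = express();

// Example list only; keep it up to date
const blockedBots = [
  'GPTBot', 'ChatGPT-User', 'CCBot', 'anthropic-ai', 'ClaudeBot',
  'Claude-Web', 'PerplexityBot', 'Bytespider', 'Meta-ExternalAgent', 'Google-Extended'
];

app.get('/robots.txt', (req, res) => {
  const rules = blockedBots
    .map(bot => `User-agent: ${bot}\nDisallow: /`)
    .join('\n\n');
  // Default group for everything else, mirroring the static file above
  res.type('text/plain').send(rules + '\n\nUser-agent: *\nCrawl-delay: 10\n');
});

app.listen(3000);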
Advanced Technical Defenses
Meta Robots Tag
Add to your HTML <head>:
<!-- Prevent archiving/caching by AI services -->
<meta name="robots" content="noarchive,nocache">
<!-- Prevent images on the page from being indexed -->
<meta name="robots" content="noimageindex">
X-Robots-Tag Header
For APIs or multiple resources, use HTTP headers:
X-Robots-Tag: noarchive, nocache
X-Robots-Tag: noimageindex
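If your site is served from Node.js/Express, a minimal middleware sketch for attaching these headers to every response might look like this (the directive values simply mirror the examples above):
// Sketch: attach X-Robots-Tag headers to every response
const express = require('express');
const app = express();

app.use((req, res, next) => {
  res.set('X-Robots-Tag', 'noarchive, nocache, noimageindex');
  next();
});

app.get('/', (req, res) => res.send('Hello'));
app.listen(3000);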
Robots.txt with Rate Limiting
User-agent: GPTBot
Disallow: /
# Slow down other bots (Crawl-delay and Request-rate are nonstandard; major crawlers such as Googlebot ignore them)
User-agent: *
Crawl-delay: 10
Request-rate: 1/10s
Content Security Policy (CSP)
Content-Security-Policy: script-src 'self'; style-src 'self' fonts.googleapis.com;
CSP controls which resources a visitor's browser may load on your pages; it is worthwhile security hygiene but adds only marginal friction for scrapers and should not be relied on as an anti-scraping measure.
Firewall and Server-Level Blocking
Cloudflare Rules
If using Cloudflare:
- Go to Security > WAF and create a custom rule (formerly Firewall Rules)
- Block requests from known bot User-Agents
- Example expression (the bot score field requires Cloudflare's Bot Management add-on):
(cf.bot_management.score < 30) or (http.user_agent contains "GPTBot")
Server-Level Blocking (.htaccess)
# Apache .htaccess
RewriteEngine On
# Block GPTBot
RewriteCond %{HTTP_USER_AGENT} "GPTBot" [NC]
RewriteRule .* - [F]
# Block Claude bot
RewriteCond %{HTTP_USER_AGENT} "Claude-Web" [NC]
RewriteRule .* - [F]
# Block Perplexity
RewriteCond %{HTTP_USER_AGENT} "PerplexityBot" [NC]
RewriteRule .* - [F]
Nginx Blocking
# nginx configuration
if ($http_user_agent ~* (GPTBot|Claude-Web|PerplexityBot)) {
return 403;
}
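If your site is served directly from Node.js/Express rather than behind Apache or Nginx, a comparable User-Agent block can be sketched as middleware. The bot names below are the same ones used in the examples above; adjust the list to your needs.
// Sketch: return 403 for requests whose User-Agent matches known AI bots
const express = require('express');
const app = express();

const blockedUA = /(GPTBot|ClaudeBot|Claude-Web|PerplexityBot|CCBot|Bytespider)/i;

app.use((req, res, next) => {
  const ua = req.get('User-Agent') || '';
  if (blockedUA.test(ua)) {
    return res.status(403).send('Forbidden');
  }
  next();
});

app.listen(3000);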
Monitoring and Detection
Check for Known Scrapers
Install monitoring to see what bots are visiting:
Cloudflare Analytics:
- Bot Management
- See traffic by bot type
- Block specific bots
Server Logs Analysis:
grep "GPTBot" /var/log/apache2/access.log
grep "CCBot" /var/log/apache2/access.log
Identify Unknown Bots
Look for:
- Unusual or generic User-Agent strings (for example, bare HTTP-library defaults)
- Rapid sequences of requests from a single IP address
- Crawling patterns that ignore your robots.txt rules
- Requests to feed or API endpoints (a common sign of content scraping)
Reverse IP Lookup:
- A reverse DNS (PTR) lookup can reveal who operates an IP address
- Add confirmed malicious sources to your block list (a verification sketch follows below)
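Reverse DNS also works in the other direction: it lets you confirm that a visitor claiming to be Googlebot really comes from Google. Below is a minimal Node sketch of the usual check (reverse lookup, then a forward lookup to confirm the hostname maps back to the same IP); the example IP is illustrative.
// Sketch: verify a claimed Googlebot visitor via forward-confirmed reverse DNS
const dns = require('dns').promises;

async function isGenuineGooglebot(ip) {
  try {
    // Genuine Googlebot IPs reverse-resolve to *.googlebot.com or *.google.com
    const hostnames = await dns.reverse(ip);
    for (const host of hostnames) {
      if (/\.(googlebot|google)\.com$/.test(host)) {
        // Forward-confirm: the hostname must resolve back to the original IP
        const addresses = await dns.resolve4(host);
        if (addresses.includes(ip)) return true;
      }
    }
  } catch {
    // Lookup failed: treat the visitor as unverified
  }
  return false;
}

// Example usage (illustrative IP)
isGenuineGooglebot('66.249.66.1').then(ok => console.log('verified Googlebot:', ok));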
Legal and Policy Approaches
Terms of Service Update
Add to your site's ToS:
Automated Access Restrictions:
- Automated access, scraping, or data mining
of our website is prohibited
- This includes training machine learning models
- Violators will be pursued legally
- Use our API if available for approved access
Copyright and Attribution
Add copyright notices:
<!-- HTML head -->
<meta name="copyright" content="© 2024 Company Name">
<meta name="rights-holder" content="Company Name">
Licensing Statements
<!-- Creative Commons license declaration -->
<link rel="license" href="https://creativecommons.org/licenses/by-nc/4.0/">
<meta name="license" content="CC BY-NC 4.0">
DMCA Takedown Notices
If you find your content republished without permission:
- Document the infringement (URLs, screenshots, dates)
- Send a DMCA takedown notice to the infringing site
- Report the infringement to its hosting provider
Content-Level Protections
Hide Content from Bots
Hiding content with CSS does not actually hide it from bots: the markup below is still delivered in the HTML source, where any scraper can read it, while your human visitors cannot see it.
<!-- Still present in the HTML, so bots can read it; users cannot see it -->
<div style="display:none" aria-hidden="true">
protected content
</div>
To genuinely keep content away from bots, don't send it to unverified clients at all: require a login or serve it from a gated endpoint, as in the next section.
JavaScript-Rendered Content
Render important content with JavaScript so that it is absent from the initial HTML. This raises the bar for simple scrapers, though headless browsers can still execute it, and it can hurt indexing by legitimate search engines:
// Load sensitive content only after the page has loaded in a real browser
document.addEventListener('DOMContentLoaded', function() {
  fetch('/api/protected-content')
    .then(r => r.json())
    .then(data => {
      // Inject the fetched content into a placeholder element
      // (the element id and response shape are examples)
      document.getElementById('article-body').innerHTML = data.html;
    });
});
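On the server side, the /api/protected-content endpoint only helps if it does its own gatekeeping; otherwise a scraper can call it directly. The Express sketch below is illustrative: the cookie check stands in for whatever real session or authentication check your site uses, and the response shape matches the fetch example above.
// Sketch: gate the protected-content endpoint so bots cannot fetch it directly
const express = require('express');
const app = express();

const blockedUA = /(GPTBot|ClaudeBot|Claude-Web|PerplexityBot|CCBot|Bytespider)/i;

app.get('/api/protected-content', (req, res) => {
  const ua = req.get('User-Agent') || '';

  // Illustrative checks only: reject known bot UAs and requests that carry
  // no cookie at all (a real implementation would validate a session)
  if (blockedUA.test(ua) || !req.headers.cookie) {
    return res.status(403).json({ error: 'Forbidden' });
  }

  res.json({ html: '<p>Protected article body</p>' });
});

app.listen(3000);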
Rate Limiting
Limit requests per IP/bot:
// Server-side rate limiting with express-rate-limit
const express = require('express');
const rateLimit = require('express-rate-limit');

const app = express();
const limiter = rateLimit({
  windowMs: 15 * 60 * 1000, // 15-minute window
  max: 100,                 // limit each IP to 100 requests per window
  message: 'Too many requests'
});
app.use('/api/', limiter);
Legitimate Bot Allowlisting
Allowlist Search Engines
# Allowlist legitimate search engine bots
User-agent: Googlebot
Disallow:
User-agent: Bingbot
Disallow:
User-agent: Slurp
Disallow:
User-agent: DuckDuckBot
Disallow:
Tools to Monitor
- Dark Visitors (darkvisitors.com):
  - Tracks AI crawlers and their policies
  - Lets you monitor what is visiting your site
  - Provides updated lists of AI bots for your robots.txt
Important Limitations to Understand
Bots Can Ignore Rules
- Compliant bots: follow robots.txt (roughly 70% of bots)
- Partially compliant bots: follow it inconsistently (roughly 20%)
- Non-compliant bots: ignore it entirely (roughly 10%)
User-Agent Spoofing
Scrapers can send any User-Agent header they like, so a script can present itself as an ordinary browser:
Real client: a plain HTTP library or custom crawler
Claimed User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64), indistinguishable from a normal browser
Defending against spoofing requires deeper analysis of request patterns, header fingerprints, IP reputation, and so on, as in the sketch below.
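As one illustration of what that deeper analysis can look like, the sketch below flags requests that claim a browser User-Agent but lack headers real browsers normally send, and tracks per-IP request rates in memory. These heuristics are examples only and will produce false positives; production setups usually lean on dedicated bot-management services.
// Sketch: heuristic detection of scrapers that spoof a browser User-Agent
const express = require('express');
const app = express();

const hits = new Map(); // ip -> array of recent request timestamps

app.use((req, res, next) => {
  const ua = req.get('User-Agent') || '';
  const ip = req.ip;

  // Heuristic 1: claims to be a browser but sends no Accept-Language header
  const looksLikeBrowser = /Mozilla\/5\.0/.test(ua);
  const missingBrowserHeaders = !req.get('Accept-Language');

  // Heuristic 2: unusually high request rate from one IP (sliding 60s window)
  const now = Date.now();
  const recent = (hits.get(ip) || []).filter(t => now - t < 60000);
  recent.push(now);
  hits.set(ip, recent);

  if ((looksLikeBrowser && missingBrowserHeaders) || recent.length > 120) {
    return res.status(429).send('Suspicious traffic detected');
  }
  next();
});

app.listen(3000);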
Legal Gray Areas
- Scraping publicly accessible content (treated as lawful in some jurisdictions)
- robots.txt violations (the file is not legally binding on its own)
- Terms of Service violations (actionable only if you actually enforce your terms)
- Copyright infringement (a valid legal claim when your content is copied)
Comprehensive Strategy
For maximum protection:
- robots.txt: Block known AI bots (minimal impact, better than nothing)
- Firewall Rules: Use Cloudflare or server-level rules to block known bot User-Agents
- Rate Limiting: Limit requests per IP address
- Monitoring: Track what's accessing your site
- Terms of Service: Explicitly prohibit scraping
- Copyright Notices: Claim copyright on content
- Legal Action: DMCA/cease-and-desist for persistent violators
- Server Config: Block at Apache/.htaccess or Nginx level
- Licensing: Consider CC licenses with restrictions
- Content Protection: JavaScript rendering, access controls for sensitive data
Important Note
Even with all protections, determined actors can circumvent defenses. The most important protections are:
- Legal frameworks (ToS, copyright)
- Technical hurdles (making scraping difficult)
- Monitoring (catching violators)
- Enforcement (legal action when needed)
robots.txt and User-Agent blocking are helpful but incomplete solutions.
Conclusion
Blocking AI scraper bots requires a multi-layered approach. Start with robots.txt to discourage compliant bots and reduce unnecessary load. Add firewall rules to actively block known AI bots. Implement rate limiting to prevent rapid scraping. Update your terms of service to explicitly prohibit automated access. For serious violations, pursue legal remedies. Understanding that robots.txt is advisory only ensures you don't rely exclusively on it but rather use it as part of a comprehensive protection strategy. Regular monitoring via tools like Dark Visitors helps you stay informed about new bots and adapt your defenses accordingly.


