Blocking AI Scrapers and LLM Training Bots
The rapid growth of artificial intelligence and large language models has created a new challenge for website owners: preventing AI training bots from scraping content without permission. Unlike traditional search crawlers (Googlebot, Bingbot) that benefit a site's visibility, AI bots such as OpenAI's GPTBot, Google-Extended, and many others collect content to train machine learning models. Website owners have legitimate concerns about their content being used to train competing AI services, about copyright violations, and about the diminished value of proprietary content.
While robots.txt provides a first line of defense, it's important to understand that it's advisory only—compliant bots respect it, but malicious scrapers ignore it entirely. A comprehensive defense strategy combines multiple technical measures with legal/contractual protections.
The Challenge with robots.txt
Why robots.txt Alone Isn't Enough
robots.txt is a simple text file that tells crawlers what they should and shouldn't access. However:
It's Voluntary: Following robots.txt rules is entirely optional. Well-behaved bots follow it, but poorly-designed or malicious scrapers often ignore it.
It's Not Secure: The file is world-readable, making your site structure publicly visible.
It's Not Legally Binding: The file carries no legal force on its own; you can't sue someone merely for ignoring your robots.txt.
It's Easy to Circumvent: Scrapers can simply ignore the file, spoof their User-Agent, or fetch pages the same way an ordinary browser would.
However, robots.txt still serves a purpose as a first defense against compliant bots and reduces unnecessary server load from casual scrapers.
Adding AI Scrapers to robots.txt
Major AI Bots to Block
OpenAI's GPTBot:
User-agent: GPTBot
Disallow: /
Anthropic's Claude Bot:
User-agent: ClaudeBot
Disallow: /
Google-Extended (controls use of content for Google's AI products):
User-agent: Google-Extended
Disallow: /
Perplexity Bot:
User-agent: PerplexityBot
Disallow: /
Common Crawl:
User-agent: CCBot
Disallow: /
Microsoft Copilot:
User-agent: Copilot
Disallow: /
Other AI Bots:
User-agent: Bytespider
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: Baiduspider-render
Disallow: /
Comprehensive robots.txt Blocking Example
# Block AI training bots
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-Web
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Baiduspider-render
Disallow: /
User-agent: Copilot
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: Google-Extended
Disallow: /
# Allow legitimate search engines
User-agent: Googlebot
Disallow:
User-agent: Bingbot
Disallow:
User-agent: Slurp
Disallow:
# Default for all other bots: allow, but slow them down (Crawl-delay does not block)
User-agent: *
Crawl-delay: 10
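If your site runs on Node.js, one way to keep this list maintainable is to generate robots.txt from a single array of bot names. The following is a minimal Express sketch, not a canonical implementation; the bot list is an example and should be refreshed from a source such as Dark Visitors.
// Minimal sketch: serve robots.txt dynamically from one list of AI bot tokens
const express = require('express');
const app = express();

// Example list only; keep it up to date
const blockedBots = [
  'GPTBot', 'ChatGPT-User', 'CCBot', 'anthropic-ai', 'ClaudeBot',
  'Claude-Web', 'PerplexityBot', 'Bytespider', 'Meta-ExternalAgent', 'Google-Extended'
];

app.get('/robots.txt', (req, res) => {
  const rules = blockedBots
    .map(bot => `User-agent: ${bot}\nDisallow: /`)
    .join('\n\n');
  // Default group for everything else, mirroring the static file above
  res.type('text/plain').send(rules + '\n\nUser-agent: *\nCrawl-delay: 10\n');
});

app.listen(3000);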
Advanced Technical Defenses
Meta Robots Tag
Add to your HTML <head>:
<!-- Prevent archiving/caching by AI services -->
<meta name="robots" content="noarchive,nocache">
<!-- Prevent images on the page from being indexed -->
<meta name="robots" content="noimageindex">
X-Robots-Tag Header
For APIs or multiple resources, use HTTP headers:
X-Robots-Tag: noarchive, nocache
X-Robots-Tag: noimageindex
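If your site is served from Node.js/Express, a minimal middleware sketch for attaching these headers to every response might look like this (the directive values simply mirror the examples above):
// Sketch: attach X-Robots-Tag headers to every response
const express = require('express');
const app = express();

app.use((req, res, next) => {
  res.set('X-Robots-Tag', 'noarchive, nocache, noimageindex');
  next();
});

app.get('/', (req, res) => res.send('Hello'));
app.listen(3000);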
Robots.txt with Rate Limiting
User-agent: GPTBot
Disallow: /
# Slow down other bots (Crawl-delay and Request-rate are nonstandard; major crawlers such as Googlebot ignore them)
User-agent: *
Crawl-delay: 10
Request-rate: 1/10s
Content Security Policy (CSP)
Content-Security-Policy: script-src 'self'; style-src 'self' fonts.googleapis.com;
CSP controls which resources a visitor's browser may load on your pages; it is worthwhile security hygiene but adds only marginal friction for scrapers and should not be relied on as an anti-scraping measure.
Firewall and Server-Level Blocking
Cloudflare Rules
If using Cloudflare:
- Go to Security > WAF and create a custom rule (formerly Firewall Rules)
- Block requests from known bot User-Agents
- Example expression (the bot score field requires Cloudflare's Bot Management add-on):
(cf.bot_management.score < 30) or (http.user_agent contains "GPTBot")
Server-Level Blocking (.htaccess)
# Apache .htaccess
RewriteEngine On
# Block GPTBot
RewriteCond %{HTTP_USER_AGENT} "GPTBot" [NC]
RewriteRule .* - [F]
# Block Claude bot
RewriteCond %{HTTP_USER_AGENT} "Claude-Web" [NC]
RewriteRule .* - [F]
# Block Perplexity
RewriteCond %{HTTP_USER_AGENT} "PerplexityBot" [NC]
RewriteRule .* - [F]
Nginx Blocking
# nginx configuration
if ($http_user_agent ~* (GPTBot|Claude-Web|PerplexityBot)) {
return 403;
}
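If your site is served directly from Node.js/Express rather than behind Apache or Nginx, a comparable User-Agent block can be sketched as middleware. The bot names below are the same ones used in the examples above; adjust the list to your needs.
// Sketch: return 403 for requests whose User-Agent matches known AI bots
const express = require('express');
const app = express();

const blockedUA = /(GPTBot|ClaudeBot|Claude-Web|PerplexityBot|CCBot|Bytespider)/i;

app.use((req, res, next) => {
  const ua = req.get('User-Agent') || '';
  if (blockedUA.test(ua)) {
    return res.status(403).send('Forbidden');
  }
  next();
});

app.listen(3000);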
Monitoring and Detection
Check for Known Scrapers
Install monitoring to see what bots are visiting:
Cloudflare Analytics:
- Bot Management
- See traffic by bot type
- Block specific bots
Server Logs Analysis:
grep "GPTBot" /var/log/apache2/access.log
grep "CCBot" /var/log/apache2/access.log
Identify Unknown Bots
Look for:
- Unusual or generic User-Agent strings (for example, bare HTTP-library defaults)
- Rapid sequences of requests from a single IP address
- Crawling patterns that ignore your robots.txt rules
- Requests to feed or API endpoints (a common sign of content scraping)
Reverse IP Lookup:
- A reverse DNS (PTR) lookup can reveal who operates an IP address
- Add confirmed malicious sources to your block list (a verification sketch follows below)
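Reverse DNS also works in the other direction: it lets you confirm that a visitor claiming to be Googlebot really comes from Google. Below is a minimal Node sketch of the usual check (reverse lookup, then a forward lookup to confirm the hostname maps back to the same IP); the example IP is illustrative.
// Sketch: verify a claimed Googlebot visitor via forward-confirmed reverse DNS
const dns = require('dns').promises;

async function isGenuineGooglebot(ip) {
  try {
    // Genuine Googlebot IPs reverse-resolve to *.googlebot.com or *.google.com
    const hostnames = await dns.reverse(ip);
    for (const host of hostnames) {
      if (/\.(googlebot|google)\.com$/.test(host)) {
        // Forward-confirm: the hostname must resolve back to the original IP
        const addresses = await dns.resolve4(host);
        if (addresses.includes(ip)) return true;
      }
    }
  } catch {
    // Lookup failed: treat the visitor as unverified
  }
  return false;
}

// Example usage (illustrative IP)
isGenuineGooglebot('66.249.66.1').then(ok => console.log('verified Googlebot:', ok));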
Legal and Policy Approaches
Terms of Service Update
Add to your site's ToS:
Automated Access Restrictions:
- Automated access, scraping, or data mining
of our website is prohibited
- This includes training machine learning models
- Violators will be pursued legally
- Use our API if available for approved access
Copyright and Attribution
Add copyright notices:
<!-- HTML head -->
<meta name="copyright" content="© 2024 Company Name">
<meta name="rights-holder" content="Company Name">
Licensing Statements
<!-- Creative Commons license declaration -->
<link rel="license" href="https://creativecommons.org/licenses/by-nc/4.0/">
<meta name="license" content="CC BY-NC 4.0">
DMCA Takedown Notices
If you find your content republished without permission:
- Document the infringement (URLs, screenshots, dates)
- Send a DMCA takedown notice to the infringing site
- Report the infringement to its hosting provider
Content-Level Protections
Hide Content from Bots
Hiding content with CSS does not actually hide it from bots: the markup below is still delivered in the HTML source, where any scraper can read it, while your human visitors cannot see it.
<!-- Still present in the HTML, so bots can read it; users cannot see it -->
<div style="display:none" aria-hidden="true">
protected content
</div>
To genuinely keep content away from bots, don't send it to unverified clients at all: require a login or serve it from a gated endpoint, as in the next section.
JavaScript-Rendered Content
Render important content with JavaScript so that it is absent from the initial HTML. This raises the bar for simple scrapers, though headless browsers can still execute it, and it can hurt indexing by legitimate search engines:
// Load sensitive content only after the page has loaded in a real browser
document.addEventListener('DOMContentLoaded', function() {
  fetch('/api/protected-content')
    .then(r => r.json())
    .then(data => {
      // Inject the fetched content into a placeholder element
      // (the element id and response shape are examples)
      document.getElementById('article-body').innerHTML = data.html;
    });
});
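On the server side, the /api/protected-content endpoint only helps if it does its own gatekeeping; otherwise a scraper can call it directly. The Express sketch below is illustrative: the cookie check stands in for whatever real session or authentication check your site uses, and the response shape matches the fetch example above.
// Sketch: gate the protected-content endpoint so bots cannot fetch it directly
const express = require('express');
const app = express();

const blockedUA = /(GPTBot|ClaudeBot|Claude-Web|PerplexityBot|CCBot|Bytespider)/i;

app.get('/api/protected-content', (req, res) => {
  const ua = req.get('User-Agent') || '';

  // Illustrative checks only: reject known bot UAs and requests that carry
  // no cookie at all (a real implementation would validate a session)
  if (blockedUA.test(ua) || !req.headers.cookie) {
    return res.status(403).json({ error: 'Forbidden' });
  }

  res.json({ html: '<p>Protected article body</p>' });
});

app.listen(3000);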
Rate Limiting
Limit requests per IP/bot:
// Server-side rate limiting with express-rate-limit
const express = require('express');
const rateLimit = require('express-rate-limit');

const app = express();
const limiter = rateLimit({
  windowMs: 15 * 60 * 1000, // 15-minute window
  max: 100,                 // limit each IP to 100 requests per window
  message: 'Too many requests'
});
app.use('/api/', limiter);
Legitimate Bot Allowlisting
Allowlist Search Engines
# Allowlist legitimate search engine bots
User-agent: Googlebot
Disallow:
User-agent: Bingbot
Disallow:
User-agent: Slurp
Disallow:
User-agent: DuckDuckBot
Disallow:
Tools to Monitor
- Dark Visitors (darkvisitors.com):
  - Tracks AI crawlers and their policies
  - Lets you monitor what is visiting your site
  - Provides updated lists of AI bots for your robots.txt
Important Limitations to Understand
Bots Can Ignore Rules
- Compliant bots: follow robots.txt (roughly 70% of bots)
- Partially compliant bots: follow it inconsistently (roughly 20%)
- Non-compliant bots: ignore it entirely (roughly 10%)
User-Agent Spoofing
Scrapers can send any User-Agent header they like, so a script can present itself as an ordinary browser:
Real client: a plain HTTP library or custom crawler
Claimed User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64), indistinguishable from a normal browser
Defending against spoofing requires deeper analysis of request patterns, header fingerprints, IP reputation, and so on, as in the sketch below.
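As one illustration of what that deeper analysis can look like, the sketch below flags requests that claim a browser User-Agent but lack headers real browsers normally send, and tracks per-IP request rates in memory. These heuristics are examples only and will produce false positives; production setups usually lean on dedicated bot-management services.
// Sketch: heuristic detection of scrapers that spoof a browser User-Agent
const express = require('express');
const app = express();

const hits = new Map(); // ip -> array of recent request timestamps

app.use((req, res, next) => {
  const ua = req.get('User-Agent') || '';
  const ip = req.ip;

  // Heuristic 1: claims to be a browser but sends no Accept-Language header
  const looksLikeBrowser = /Mozilla\/5\.0/.test(ua);
  const missingBrowserHeaders = !req.get('Accept-Language');

  // Heuristic 2: unusually high request rate from one IP (sliding 60s window)
  const now = Date.now();
  const recent = (hits.get(ip) || []).filter(t => now - t < 60000);
  recent.push(now);
  hits.set(ip, recent);

  if ((looksLikeBrowser && missingBrowserHeaders) || recent.length > 120) {
    return res.status(429).send('Suspicious traffic detected');
  }
  next();
});

app.listen(3000);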
Legal Gray Areas
- Scraping publicly accessible content (treated as lawful in some jurisdictions)
- robots.txt violations (the file is not legally binding on its own)
- Terms of Service violations (actionable only if you actually enforce your terms)
- Copyright infringement (a valid legal claim when your content is copied)
Comprehensive Strategy
For maximum protection:
- robots.txt: Block known AI bots (minimal impact, better than nothing)
- Firewall Rules: Use Cloudflare or server-level rules to block known bot User-Agents
- Rate Limiting: Limit requests per IP address
- Monitoring: Track what's accessing your site
- Terms of Service: Explicitly prohibit scraping
- Copyright Notices: Claim copyright on content
- Legal Action: DMCA/cease-and-desist for persistent violators
- Server Config: Block at Apache/.htaccess or Nginx level
- Licensing: Consider CC licenses with restrictions
- Content Protection: JavaScript rendering, access controls for sensitive data
Important Note
Even with all protections, determined actors can circumvent defenses. The most important protections are:
- Legal frameworks (ToS, copyright)
- Technical hurdles (making scraping difficult)
- Monitoring (catching violators)
- Enforcement (legal action when needed)
robots.txt and User-Agent blocking are helpful but incomplete solutions.
Conclusion
Blocking AI scraper bots requires a multi-layered approach. Start with robots.txt to discourage compliant bots and reduce unnecessary load. Add firewall rules to actively block known AI bots. Implement rate limiting to prevent rapid scraping. Update your terms of service to explicitly prohibit automated access. For serious violations, pursue legal remedies. Understanding that robots.txt is advisory only ensures you don't rely exclusively on it but rather use it as part of a comprehensive protection strategy. Regular monitoring via tools like Dark Visitors helps you stay informed about new bots and adapt your defenses accordingly.


