SEO Best Practices for robots.txt
The robots.txt file is a critical component of your technical SEO foundation. While often overlooked, a well-configured robots.txt shapes how search engines crawl your site, improves crawl efficiency, and keeps crawlers away from areas that add no search value. Conversely, a misconfigured robots.txt can accidentally block search engines from important content, damaging your SEO. Keep in mind that robots.txt controls crawling, not indexing: a blocked URL can still be indexed without content if other sites link to it, so use noindex for pages that must stay out of search results. Understanding robots.txt best practices ensures search engines can crawl and index your site effectively without wasting crawl budget.
robots.txt Basics for SEO
File Location and Syntax
The robots.txt file must be:
- Located at the root of your domain: example.com/robots.txt
- Named exactly robots.txt (lowercase)
- Plain text encoded as UTF-8 (never HTML or XML)
- Accessible without authentication
Basic Structure
User-agent: [bot name or *]
Disallow: [path to block]
Allow: [path to allow]
Crawl-delay: [seconds between requests] (nonstandard; ignored by Google)
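To see how a crawler interprets these directives, you can feed a small file into Python's standard-library urllib.robotparser and ask which paths a given agent may fetch. A minimal sketch, with a made-up bot name and illustrative rules:

from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt following the basic structure above
rules = """
User-agent: *
Disallow: /private/
Allow: /public/
Crawl-delay: 2
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)  # parse() accepts an iterable of lines

print(rp.can_fetch("ExampleBot", "/public/page.html"))   # True - allowed
print(rp.can_fetch("ExampleBot", "/private/data.html"))  # False - disallowed
print(rp.crawl_delay("ExampleBot"))                      # 2 - seconds between requests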
Group Rules by User-Agent
Crawlers follow the most specific user-agent group that matches them, regardless of where it appears in the file; listing specific agents before the wildcard default simply keeps the file readable:
User-agent: Googlebot # Most specific
Disallow: /admin/
User-agent: * # Least specific (default)
Disallow: /private/
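The standard-library parser follows the same group-selection rule, which you can verify directly; the bot names and paths here are illustrative:

from urllib.robotparser import RobotFileParser

rules = """
User-agent: Googlebot
Disallow: /admin/

User-agent: *
Disallow: /private/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Googlebot matches its own group, so only /admin/ is off limits to it
print(rp.can_fetch("Googlebot", "/admin/page"))    # False
print(rp.can_fetch("Googlebot", "/private/page"))  # True - the * group does not apply

# Any other crawler falls back to the * group
print(rp.can_fetch("OtherBot", "/private/page"))   # False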
SEO Best Practices
Practice 1: Allow Search Engine Crawlers
Why: You want Google, Bing, and other legitimate search engines to crawl and index your content.
# Allow Google
User-agent: Googlebot
Allow: /
# Allow Bing
User-agent: Bingbot
Allow: /
# Default - allow all good bots
User-agent: *
Allow: /
Best Practice: Don't block legitimate search engines unless you have specific reasons.
Practice 2: Block Only What Needs Blocking
Why: Leaving junk URLs crawlable wastes crawl budget, while blocking too much hides content you want ranked and confuses search engines.
Common Things to Block:
# Admin areas (not meant for public)
Disallow: /admin/
Disallow: /wp-admin/
# Private user areas
Disallow: /account/
Disallow: /profile/
Disallow: /settings/
# Duplicate content filters
Disallow: /*?
Disallow: /search?
Disallow: /filter?
# Temporary pages
Disallow: /temp/
Disallow: /draft/
# Session IDs and parameters
Disallow: /*session=
Disallow: /*utm_
Don't Block:
- Your main content pages
- Categories and taxonomies
- Blog posts and articles
- Product pages
- Contact pages
- CSS and JavaScript files (search engines need them to render your pages)
Practice 3: Allow Search-Related Pages
Important pages should be explicitly allowed:
User-agent: Googlebot
Disallow: /private/
Allow: / # Allow main site
Allow: /public/* # Allow specific sections
Practice 4: Disallow Duplicate Content Intelligently
Parameter-Based Duplicates:
# Block URL parameters that create duplicates
Disallow: /*?
Disallow: /*&
Session Variables:
Disallow: /*session=
Disallow: /*sessionid=
Disallow: /*s=
Tracking Parameters (blocking is optional; canonical tags usually consolidate these URLs anyway):
# Block UTM parameters
Disallow: /*utm_
Practice 5: Optimize Crawl Budget with Crawl-Delay
What It Does: Tells search engines how long to wait between requests. Note that Googlebot ignores Crawl-delay; crawlers such as Bingbot and Yandex honor it.
For Most Sites (general rule):
# Reasonable delay - don't stress server
User-agent: *
Crawl-delay: 1
For Slow Servers:
User-agent: *
Crawl-delay: 5 # Wait 5 seconds between requests
For Fast Servers with Large Sites (omit the directive so crawlers are not throttled):
User-agent: *
# No Crawl-delay line - let crawlers set their own pace
Better Alternative for Googlebot: Googlebot ignores Crawl-delay and manages its own crawl rate:
- It adjusts automatically based on how quickly your server responds
- If crawling overloads the server, temporarily return 503 or 429 responses
- Persistent problems can be reported through Google Search Console
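For crawlers that do honor Crawl-delay, such as Bingbot, you can check which value your file exposes to a given agent using the standard-library parser's crawl_delay() method. A small sketch with a hypothetical file:

from urllib.robotparser import RobotFileParser

rules = """
User-agent: Bingbot
Crawl-delay: 1
Disallow: /admin/

User-agent: *
Crawl-delay: 5
Disallow: /admin/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.crawl_delay("Bingbot"))       # 1 - Bingbot's own group
print(rp.crawl_delay("SomeOtherBot"))  # 5 - falls back to the * group
print(rp.crawl_delay("Googlebot"))     # 5 per the file, but Googlebot ignores Crawl-delay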
Practice 6: Use Sitemap Directive
What It Does: Tells search engines where to find your sitemap.
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml
Best Practice:
- Always use absolute URLs in the Sitemap directive
- Include sitemap.xml for regular pages
- Include sitemap-news.xml if you publish news or blog content
- Include sitemap-images.xml for image-heavy sites
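To confirm that the Sitemap lines are being picked up, urllib.robotparser (Python 3.8+) exposes them through site_maps(); the URLs below are placeholders:

from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Returns the Sitemap URLs listed anywhere in the file, or None if there are none
print(rp.site_maps())
# ['https://example.com/sitemap.xml', 'https://example.com/sitemap-news.xml']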
Practice 7: Be Consistent with Canonical Tags
Relationship:
- robots.txt controls crawling
- Canonical tags control indexing
- Both should point to same "main" version
robots.txt: Disallow: /duplicate-page
HTML: <link rel="canonical" href="https://example.com/main-page">
Caution: if robots.txt blocks a page, crawlers never fetch it and therefore never see its canonical tag. For duplicates you want consolidated, leave them crawlable and let the canonical tag do the work; reserve Disallow rules for pages that should not be crawled at all.
Practice 8: Test Your robots.txt
Google Search Console:
- Open Search Console and go to Settings
- Open the robots.txt report to confirm the file was fetched and parsed without errors
- Use the URL Inspection tool to check whether a specific URL is blocked by robots.txt
Other Testers:
- seotesting.com and similar SEO tool suites
- robotstxt.org (reference documentation for the protocol)
- Google's open-source robots.txt parser (github.com/google/robotstxt)
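You can also run a quick check locally with the same standard-library parser pointed at a live file. The domain and URLs below are placeholders; keep in mind that urllib.robotparser implements only plain prefix rules, so results for Google-style * and $ wildcards may differ from Google's own matcher:

from urllib.robotparser import RobotFileParser

# Replace with your own domain and the crawler you care about
ROBOTS_URL = "https://example.com/robots.txt"
USER_AGENT = "Googlebot"

rp = RobotFileParser()
rp.set_url(ROBOTS_URL)
rp.read()  # fetches and parses the live file

for url in [
    "https://example.com/",
    "https://example.com/admin/",
    "https://example.com/blog/some-post/",
]:
    status = "allowed" if rp.can_fetch(USER_AGENT, url) else "BLOCKED"
    print(f"{status:8} {url}")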
Practice 9: Monitor robots.txt Performance
Google Search Console Reports:
- The Page indexing (Coverage) report lists pages blocked by robots.txt
- The Crawl stats report (under Settings) shows crawl activity and errors
- The URL Inspection tool shows whether a specific URL is blocked
Check Regularly:
- After changes, monitor crawl rate
- Watch for unexpected increases/decreases
- Verify blocked pages aren't important
Practice 10: Include All Versions and Subdomains
Subdomains:
# example.com/robots.txt
User-agent: *
Disallow: /private/
# subdomain.example.com/robots.txt
User-agent: *
Allow: /
Each subdomain needs its own robots.txt.
HTTPS and WWW:
# Both should be available
https://example.com/robots.txt
https://www.example.com/robots.txt
http://example.com/robots.txt # Redirect to HTTPS
http://www.example.com/robots.txt # Redirect to HTTPS
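A quick way to verify that every protocol and host variant serves (or redirects to) a robots.txt is to request each one and print the final status. A sketch with placeholder hostnames:

import urllib.request
import urllib.error

# Replace with your own host variants
VARIANTS = [
    "https://example.com/robots.txt",
    "https://www.example.com/robots.txt",
    "http://example.com/robots.txt",
    "http://www.example.com/robots.txt",
]

for url in VARIANTS:
    try:
        # urlopen follows redirects automatically; geturl() shows where they ended up
        with urllib.request.urlopen(url, timeout=10) as resp:
            print(f"{url} -> {resp.status} (final URL: {resp.geturl()})")
    except urllib.error.URLError as exc:
        print(f"{url} -> ERROR: {exc}")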
Common robots.txt Examples
Blog
User-agent: *
Allow: / # Allow all
Disallow: /admin/ # Block admin
Disallow: /draft/ # Block drafts
Disallow: /*? # Block pages with query parameters
Allow: /?s= # Allow on-site search only if you want those result pages crawled
Allow: /?page= # But allow pagination
Disallow: /wp-admin/ # Block WordPress admin
Sitemap: https://example.com/sitemap.xml
E-commerce
User-agent: *
Allow: / # Allow all products
Disallow: /admin/ # Block admin
Disallow: /checkout/ # Block checkout process
Disallow: /*? # Block parameter-based duplicates
Allow: /?sort= # Allow sorting (intentional variants)
Allow: /?filter= # Allow filtering
Disallow: /account/ # Block user accounts
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-products.xml
Large Site with Many Subfolders
User-agent: *
# Allow main content
Allow: /blog/
Allow: /articles/
Allow: /products/
# Block administrative areas
Disallow: /admin/
Disallow: /internal/
Disallow: /private/
# Block duplicate versions
Disallow: /old-version/
Disallow: /staging/
Disallow: /test/
# Block parameters that create duplicates
Disallow: /*?
Allow: /?page=
Allow: /?sort=
Sitemap: https://example.com/sitemap.xml
Common robots.txt Mistakes
Mistake 1: Blocking Everything
WRONG:
User-agent: *
Disallow: /
# Site won't appear in search results!
Mistake 2: Blocking Important Content
WRONG:
Disallow: /blog/
Disallow: /articles/
# Hides your main content from search engines
Mistake 3: Inconsistent with HTML
robots.txt says: Disallow: /private/
HTML has: <meta name="robots" content="index, follow">
# Conflicting signals - and because the page is blocked, crawlers never see the meta tag at all
Fix: Use robots.txt for broad crawl control and meta robots tags for page-level indexing control; never rely on a meta tag (including noindex) on a page that robots.txt blocks.
Mistake 4: Overly Complex Blocking
WRONG:
Disallow: /products/*?*?
Disallow: /search?*&*=
# Too complex, may not work as intended
Better: Use simple, clear rules.
Mistake 5: No Sitemap
WRONG:
User-agent: *
Disallow: /private/
# No Sitemap directive - search engines must discover your pages on their own
Better: Include a Sitemap directive pointing to your sitemap.
Advanced robots.txt Patterns
Disallow by File Type
User-agent: *
Disallow: /*.pdf$
Disallow: /*.zip$
Disallow: /*.exe$
Disallow Session IDs
User-agent: *
Disallow: /*?sid=
Disallow: /*?sessionid=
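The * and $ wildcards used in these patterns are extensions supported by Google and Bing rather than part of the original standard, and Python's urllib.robotparser ignores them. For a rough sanity check you can approximate a pattern as a regular expression; this is only an illustration of the matching rules, not Google's actual matcher (Google's reference parser lives at github.com/google/robotstxt):

import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Approximate a Google-style robots.txt path pattern as a regex.

    '*' matches any run of characters, '$' anchors the end of the URL;
    everything else is treated literally.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in pattern)
    return re.compile("^" + regex + ("$" if anchored else ""))

rule = pattern_to_regex("/*.pdf$")

print(bool(rule.match("/docs/report.pdf")))       # True - matched, so blocked
print(bool(rule.match("/docs/report.pdf?dl=1")))  # False - '$' requires the URL to end in .pdf
print(bool(rule.match("/docs/report.html")))      # False - not a PDF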
Block Unwanted Bots
User-agent: AhrefsBot
Disallow: /
User-agent: MJ12bot
Disallow: /
User-agent: *
Allow: /
Note: this only works for crawlers such as AhrefsBot and MJ12bot that respect robots.txt; genuinely abusive bots ignore it and must be blocked at the server or firewall level.
Specific Googlebot Rules
User-agent: Googlebot
Disallow: /private/
# No Crawl-delay - Googlebot ignores the directive entirely
User-agent: *
Disallow: /private/
Crawl-delay: 2 # Slower pace for crawlers that honor it
robots.txt and SEO Strategy
Preserve Crawl Budget
Search engines allocate each site a crawl budget; spend it on important pages rather than parameter noise:
# Don't waste budget on parameters
Disallow: /*?
# But allow intentional sorting/filtering
Allow: /?sort=price
Allow: /?filter=category
Prevent Content Duplication
Canonical tags handle this, but robots.txt can assist:
# Block obvious duplicates
Disallow: /duplicate/
Disallow: /old/
# Let search engines find canonical versions
Sitemap: https://example.com/sitemap.xml
Focus Crawling on Essential Sections
An allowlist approach concentrates crawling on the sections that matter most, but apply it carefully:
# Only crawl essential sections
User-agent: *
Allow: /$ # Keep the homepage crawlable
Allow: /products/
Allow: /blog/
Disallow: /*
Caution: the blanket Disallow blocks everything not explicitly allowed, and blocked URLs can still be indexed without content if other sites link to them.
Monitoring and Maintenance
Quarterly Review
- Check robots.txt in Search Console
- Verify Sitemap URLs still valid
- Review blocked pages (are they still needed?)
- Check crawl stats (increased or decreased?)
After Site Changes
- Update robots.txt if URLs change
- Update sitemap if structure changes
- Test in Search Console
- Monitor for crawl errors
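A lightweight way to catch regressions after edits is to keep a short list of URLs with their expected crawl status and re-run it against the live file whenever robots.txt changes. A sketch with placeholder URLs and expectations:

from urllib.robotparser import RobotFileParser

# Replace with your own domain and the URLs you never want to get wrong
EXPECTATIONS = [
    ("Googlebot", "https://example.com/",       True),   # homepage must stay crawlable
    ("Googlebot", "https://example.com/blog/",  True),   # main content must stay crawlable
    ("Googlebot", "https://example.com/admin/", False),  # admin must stay blocked
]

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

failures = 0
for agent, url, expected in EXPECTATIONS:
    actual = rp.can_fetch(agent, url)
    if actual != expected:
        failures += 1
        print(f"FAIL: {agent} on {url}: expected {expected}, got {actual}")

print("All checks passed" if failures == 0 else f"{failures} check(s) failed")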
Conclusion
A well-configured robots.txt improves SEO by helping search engines crawl your most important pages efficiently while protecting admin areas and reducing duplicate content. Follow these best practices: allow search engine bots, block only necessary content, optimize crawl delay, include sitemaps, and test regularly. Monitor your robots.txt effectiveness through Google Search Console and adjust as your site evolves. Remember that robots.txt is advisory only—it guides good bots but doesn't guarantee search engine behavior. Combine it with canonical tags, noindex directives, and proper site structure for comprehensive crawlability control.


