Robots.txt for SEO: The Complete Guide for 2026
Learn how robots.txt works, what to block (and what never to block), and the syntax mistakes that silently deindex websites. Includes copy-paste templates for every common platform.
One line in robots.txt can deindex your entire website. It has happened to billion-dollar companies: a developer ships a staging configuration to production, Disallow: / goes live, and organic traffic falls off a cliff over the following weeks.
This guide covers how robots.txt actually works (including the parts most tutorials get wrong), what to block and what never to block, and the exact templates to use for common platforms.
What Robots.txt Does (and Doesn't Do)
Robots.txt is a plain text file at the root of your domain — https://example.com/robots.txt — that implements the Robots Exclusion Protocol. Before a well-behaved crawler requests any page on your site, it fetches this file and checks whether the URL it wants is allowed.
Three things robots.txt does:
- Controls crawling — which URLs search engine bots may request
- Preserves crawl budget — keeps bots out of infinite URL spaces (filters, calendars, search results)
- Declares your sitemap — points crawlers to your XML sitemap location
Three things robots.txt does not do:
- It does not remove pages from the index. A disallowed URL can still rank if other pages link to it — Google just shows it without a snippet.
- It does not protect private content. The file is public, and bad bots ignore it. Anyone can read example.com/robots.txt and see exactly which paths you tried to hide.
- It does not pass or block link equity. Crawl directives and indexing signals are separate systems.
The golden rule: robots.txt controls crawling. The noindex meta tag controls indexing. Mixing them up is the source of nearly every robots.txt disaster, including the classic mistake of disallowing a page and adding noindex to it. If Google can't crawl the page, it never sees the noindex.
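To make the crawl/index split concrete, here is a minimal Python sketch that flags the disallow-plus-noindex conflict for a single page. The names `RobotsMetaFinder` and `diagnose` are made up for this example, and the check only looks at `<meta name="robots">` tags in the HTML you hand it. It does not fetch anything or inspect HTTP headers.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Records whether the page carries a robots meta tag containing noindex."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        content = (attributes.get("content") or "").lower()
        if name == "robots" and "noindex" in content:
            self.noindex = True


def diagnose(html: str, disallowed_in_robots_txt: bool) -> str:
    """Combine the two signals into one verdict (hypothetical helper)."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    if finder.noindex and disallowed_in_robots_txt:
        return "conflict: crawler is blocked, so it never sees the noindex"
    if finder.noindex:
        return "ok: page is crawlable, so the noindex can be read"
    if disallowed_in_robots_txt:
        return "blocked from crawling; the URL can still be indexed via links"
    return "crawlable and indexable"


page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(diagnose(page, disallowed_in_robots_txt=True))   # conflict
print(diagnose(page, disallowed_in_robots_txt=False))  # ok
```

The first call is the disaster case: the page asks to be dropped from the index, but the crawler is never allowed to read that request.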
Robots.txt Syntax, Line by Line
A robots.txt file is a set of rule groups. Each group starts with one or more User-agent lines and is followed by Allow and Disallow rules:
# Group 1: rules for every crawler
User-agent: *
Disallow: /admin/
Disallow: /cart
Allow: /admin/public-docs/
# Group 2: rules only for Googlebot
User-agent: Googlebot
Disallow: /experiments/
Sitemap: https://example.com/sitemap.xml

| Directive | What it does | Notes |
|---|---|---|
| User-agent | Names the crawler the group applies to | * matches all bots; a bot uses the most specific group that matches it, not all groups |
| Disallow | Blocks URLs starting with this path | Empty value (Disallow:) means allow everything |
| Allow | Re-allows a sub-path inside a disallowed area | The more specific (longer) rule wins |
| Sitemap | Declares your XML sitemap URL | Must be an absolute URL; can appear multiple times |
| Crawl-delay | Asks bots to wait between requests | Ignored by Google; respected by some other bots |
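The group-selection note in the table is easy to misread, so here is a minimal Python sketch of it. It assumes a simplified reading of the protocol: the crawler name is compared case-insensitively as a substring, the longest matching User-agent token wins, and the `*` group is only a fallback. The function names `parse_groups` and `select_group` are hypothetical.

```python
def parse_groups(robots_txt: str) -> list[tuple[list[str], list[tuple[str, str]]]]:
    """Split a robots.txt body into (user_agents, rules) groups."""
    groups = []
    agents, rules = [], []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()      # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if rules:                            # rules already seen: a new group starts
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())
        elif field in ("allow", "disallow") and agents:
            rules.append((field, value))
    if agents:
        groups.append((agents, rules))
    return groups


def select_group(groups, crawler: str) -> list[tuple[str, str]]:
    """Return the rules of the most specific matching group, else the * rules."""
    crawler = crawler.lower()
    best_token, best_rules, fallback = "", [], []
    for agents, rules in groups:
        for agent in agents:
            if agent == "*":
                fallback = fallback + rules
            elif agent in crawler and len(agent) > len(best_token):
                best_token, best_rules = agent, rules
    return best_rules if best_token else fallback


ROBOTS = """\
# Group 1: rules for every crawler
User-agent: *
Disallow: /admin/
Disallow: /cart
Allow: /admin/public-docs/

# Group 2: rules only for Googlebot
User-agent: Googlebot
Disallow: /experiments/

Sitemap: https://example.com/sitemap.xml
"""

groups = parse_groups(ROBOTS)
print(select_group(groups, "Googlebot"))  # [('disallow', '/experiments/')]
print(select_group(groups, "Bingbot"))    # the three rules from Group 1
```

Note the consequence: Googlebot follows only Group 2, so /admin/ and /cart are not disallowed for it unless you repeat those rules inside the Googlebot group.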
Wildcards and Anchors
Two pattern characters are supported by all major search engines:
- * matches any sequence of characters
- $ anchors the pattern to the end of the URL
# Block every URL containing a query string
Disallow: /*?
# Block all PDFs
Disallow: /*.pdf$
# Block paginated archive pages beyond page 1
Disallow: /blog/page/
# Block internal search results
Disallow: /search
Disallow: /*?s=
The Specificity Trap
When Allow and Disallow rules conflict, the longest matching rule wins — not the first one:
User-agent: *
Disallow: /downloads/
Allow: /downloads/whitepaper.pdf
Here /downloads/whitepaper.pdf is crawlable because the Allow rule (25 characters) is longer than the Disallow rule (11 characters). This is also why Allow: / never overrides a more specific Disallow.
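Wildcards and longest-match precedence fit in a few lines of Python. Treat this as a sketch, not Google's actual matcher: it measures specificity by pattern length in characters, prefers Allow on an exact tie (which, as far as we can tell, is what RFC 9309 recommends), and the names `pattern_to_regex` and `is_allowed` are hypothetical.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern: * is a wildcard, a trailing $ anchors the end."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """Apply longest-match precedence; a path with no matching rule is allowed."""
    best_length, allowed = -1, True
    for kind, pattern in rules:
        if not pattern:                      # an empty Disallow: matches nothing
            continue
        if pattern_to_regex(pattern).match(path):
            is_allow = kind == "allow"
            # Longer pattern wins; on a tie the Allow rule is preferred (assumption).
            if len(pattern) > best_length or (len(pattern) == best_length and is_allow):
                best_length, allowed = len(pattern), is_allow
    return allowed


downloads = [("disallow", "/downloads/"), ("allow", "/downloads/whitepaper.pdf")]
print(is_allowed(downloads, "/downloads/whitepaper.pdf"))  # True: 25 characters beat 11
print(is_allowed(downloads, "/downloads/other.zip"))       # False

# Allow: / (1 character) loses to the longer Disallow.
print(is_allowed([("allow", "/"), ("disallow", "/admin/")], "/admin/users"))  # False

pdfs = [("disallow", "/*.pdf$")]
print(is_allowed(pdfs, "/files/report.pdf"))             # False
print(is_allowed(pdfs, "/files/report.pdf?download=1"))  # True: $ requires the URL to end in .pdf
```

Rule order never enters the decision. You can shuffle the Allow and Disallow lines in a group and the result stays the same.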
What You Should (Usually) Block
Every site is different, but these URL spaces are almost always safe — and beneficial — to disallow:
- Internal search results (/search, /*?s=): infinite, thin, duplicate content
- Faceted navigation and filters (/*?color=, /*?sort=): the #1 crawl budget killer on e-commerce sites
- Cart, checkout, and account pages (/cart, /checkout, /account): no search value, often session-specific
- Admin and login paths (/wp-admin/, /admin/): though remember this is visibility control, not security
- Tracking and campaign URLs (/*?utm_): duplicate content under infinite parameter variations
- Staging or preview paths that share the production domain (/preview/, /draft/)
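One caveat with the query-string patterns above: a pattern like /*?utm_ only matches when the parameter comes directly after the ?. The Python sketch below shows the gap, using a hypothetical `matches` helper that follows the wildcard rules described earlier and two made-up URLs. The broader /*?*utm_ form is offered as one possible fix, not as the only correct pattern.

```python
import re


def matches(pattern: str, url_path: str) -> bool:
    """Robots.txt-style match: * is a wildcard, a trailing $ anchors the end."""
    anchored = pattern.endswith("$")
    pieces = pattern.rstrip("$").split("*")
    regex = ".*".join(re.escape(piece) for piece in pieces) + (r"\Z" if anchored else "")
    return re.match(regex, url_path) is not None


urls = [
    "/shoes?utm_source=news",            # tracking parameter comes first
    "/shoes?color=red&utm_source=news",  # tracking parameter comes second
]

for pattern in ("/*?utm_", "/*?*utm_"):
    print(pattern, [matches(pattern, url) for url in urls])
# /*?utm_  [True, False]
# /*?*utm_ [True, True]
```

If your filters and tracking parameters can appear in any order, check your patterns against real URLs from your logs before relying on them.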
What You Must Never Block
These mistakes are common and expensive:
- CSS and JavaScript files. Google renders pages like a browser. If it can't fetch your CSS/JS, it may see a broken page and rank you accordingly. Blocking /wp-includes/ or /assets/ was standard advice in 2012; today it actively hurts you.
- Pages you want deindexed. Counterintuitive, but as covered above: Google has to crawl a page to see its noindex tag.
- Your entire site with Disallow: /. Verify this isn't in production right now. Seriously, go check. We'll wait.
- Image, font, or media folders your visible pages depend on. Blocking them degrades how Google renders and understands your pages and removes you from image search.
Copy-Paste Templates
A Sensible Default for Most Websites
User-agent: *
Disallow: /admin/
Disallow: /cart
Disallow: /checkout
Disallow: /account
Disallow: /search
Disallow: /*?s=
Disallow: /*?utm_
Sitemap: https://example.com/sitemap.xml
WordPress
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /?s=
Disallow: /search/
Sitemap: https://example.com/sitemap_index.xml
Note the Allow for admin-ajax.php: many themes and plugins load front-end content through it.
Next.js / Modern JS Frameworks
User-agent: *
Disallow: /api/
Disallow: /_next/static/chunks/pages/admin
Sitemap: https://example.com/sitemap.xml
Don't block /_next/static/ wholesale: that's where your CSS and JS live. In Next.js you can generate this file dynamically with a robots.ts in your app/ directory.
Handling AI Crawlers
# Block AI training crawlers
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
# Allow AI search crawlers that cite and link sources
User-agent: OAI-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
Whether to block AI crawlers is a business decision; see the FAQ below for the trade-offs.
How to Test Your Robots.txt
Never deploy robots.txt changes blind:
- Google Search Console → Settings → robots.txt shows the last fetched version and any parse errors, and lets you request a recrawl after fixing issues.
- The URL Inspection tool tells you whether a specific URL is blocked by robots.txt — test your money pages after every change.
- Run a full site scan. A technical SEO audit catches robots.txt problems alongside the issues that usually travel with them — missing sitemaps, noindex conflicts, and orphaned pages. WebScore's SEO module checks your robots.txt configuration on every scan, free.
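You can also run a quick local smoke test before deploying. The sketch below uses Python's standard-library urllib.robotparser to confirm that a candidate file does not block pages you care about. The paths in `MONEY_PAGES` and the helper `blocked_pages` are hypothetical. One important assumption: as far as we know, this parser applies rules in file order and does not interpret * and $ the way Google does, so it is only a reasonable check for plain prefix rules without Allow overrides.

```python
from urllib.robotparser import RobotFileParser

# Candidate robots.txt, limited to plain prefix rules on purpose.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /cart
Disallow: /checkout
Disallow: /account
Disallow: /search
"""

MONEY_PAGES = ["/", "/pricing", "/blog/robots-txt-guide"]  # hypothetical paths


def blocked_pages(robots_txt: str, paths: list[str], crawler: str = "Googlebot") -> list[str]:
    """Return the subset of paths the given crawler may not fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        path for path in paths
        if not parser.can_fetch(crawler, "https://example.com" + path)
    ]


# An empty list means none of the pages are blocked.
print(blocked_pages(ROBOTS_TXT, MONEY_PAGES))

# The staging-config accident blocks every page in the list.
print(blocked_pages("User-agent: *\nDisallow: /\n", MONEY_PAGES))
```

A check like this fits well in a deployment pipeline, but it complements the Search Console tools above and does not replace them.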
Robots.txt Debugging Checklist
When organic traffic drops and you suspect robots.txt:
- Fetch yourdomain.com/robots.txt directly: does it return 200 with the content you expect?
- Check for Disallow: / under User-agent: * (the staging-config-in-production classic)
- Confirm CSS/JS paths aren't blocked (test a page in Search Console's URL Inspection → View crawled page)
- Verify the file is under 500 KB (Google's limit; rules beyond it are ignored)
- Check the file returns 200, not 5xx: if robots.txt returns a server error, Google may stop crawling your site entirely
- Look for conflicting noindex + Disallow combinations on pages you're trying to remove
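Several of these checks are mechanical, so they can be scripted. Here is a minimal Python sketch that takes the HTTP status and raw body of a robots.txt response and reports the status, size, and blanket-disallow problems from the list. The function name `audit_robots` is made up, the 500 KB limit is interpreted as 500 × 1024 bytes, and fetching the file is left to whatever HTTP client you already use.

```python
MAX_BYTES = 500 * 1024  # the 500 KB limit from the checklist, read as 500 KiB (assumption)


def audit_robots(status: int, body: bytes) -> list[str]:
    """Return human-readable problems for one robots.txt fetch result."""
    problems = []
    if status >= 500:
        problems.append(f"server error {status}: crawling may stop entirely")
    elif status != 200:
        problems.append(f"unexpected status {status}")
    if len(body) > MAX_BYTES:
        problems.append("file exceeds 500 KB: rules past the limit are ignored")

    agents, seen_rule = set(), False
    for raw in body.decode("utf-8", errors="replace").splitlines():
        line = raw.split("#", 1)[0].strip()      # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if seen_rule:                        # previous group ended, start a new one
                agents, seen_rule = set(), False
            agents.add(value)
        elif field in ("allow", "disallow"):
            seen_rule = True
            if field == "disallow" and value == "/" and "*" in agents:
                problems.append("Disallow: / under User-agent: * blocks the whole site")
    return problems


print(audit_robots(200, b"User-agent: *\nDisallow: /\n"))
print(audit_robots(503, b""))
print(audit_robots(200, b"User-agent: *\nDisallow: /cart\n"))  # [] means no problems found
```

An empty result only means these particular checks passed. Blocked CSS/JS and noindex conflicts still need the manual steps above.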
Key Takeaways
- Robots.txt controls crawling, not indexing; use noindex to remove pages from search
- The longest matching rule wins, not the first one
- Never block CSS/JS; almost always block internal search, filters, and cart/checkout paths
- A broken robots.txt fails silently — traffic erodes over weeks, not overnight
- Test every change in Search Console before and after deploying
Not sure what your robots.txt is doing right now? Run a free WebScore scan — it checks your robots.txt, sitemap, crawlability, and 100+ other SEO factors in under 60 seconds, and shows you exactly what to fix.